Selenium WebDriver history and today


Address: http://www.aosabook.org/en/selenium.html

Selenium is a browser automation tool, commonly used for writing end-to-end tests of web applications. A browser automation tool does exactly what you would expect: automate the control of a browser so that repetitive tasks can be automated. It sounds like a simple problem to solve, but as we will see, a lot has to happen behind the scenes to make it work.

Before describing the architecture of Selenium it helps to understand how the various related pieces of the project fit together. At a very high level, Selenium is a suite of three tools. The first of these tools, Selenium IDE, is an extension for Firefox that allows users to record and playback tests. The record/playback paradigm can be limiting and isn't suitable for many users, so the second tool in the suite, Selenium WebDriver, provides APIs in a variety of languages to allow for more control and the application of standard software development practices. The final tool, Selenium Grid, makes it possible to use the Selenium APIs to control browser instances distributed over a grid of machines, allowing more tests to run in parallel. Within the project, they are referred to as "IDE", "WebDriver" and "Grid". This chapter explores the architecture of Selenium WebDriver.

This chapter was written during the betas of Selenium 2.0 in late 2010. If you're reading the book after then, things will have moved forward, and you'll be able to see how the architectural choices described here have unfolded. If you're reading before that date: congratulations! You have a time machine. Can I have some winning lottery numbers?

16.1. History

Jason Huggins started the Selenium project in 2004 while working at ThoughtWorks on their in-house Time and Expenses (T&E) system, which made extensive use of Javascript. Although Internet Explorer was the dominant browser at the time, ThoughtWorks used a number of alternative browsers (in particular Mozilla variants) and would file bug reports when the T&E app wouldn't work on their browser of choice. Open Source testing tools at the time were either focused on a single browser (typically IE) or were simulations of a browser (like HttpUnit). The cost of a license for a commercial tool would have exhausted the limited budget for a small in-house project, so they weren't even considered as viable testing choices.

Where automation is difficult, it's common to rely on manual testing. This approach doesn't scale when the team is very small or when releases are extremely frequent. It's also a waste of humanity to ask people to step through a script that could be automated. More prosaically, people are slower and more error prone than a machine for dull repetitive tasks. Manual testing wasn't an option.

Fortunately, all the browsers being tested supported Javascript. It made sense to Jason and the team he was working with to write a testing tool in that language which could be used to verify the behavior of the application. Inspired by work being done on FIT [1], a table-based syntax was placed over the raw Javascript and this allowed tests to be written by people with limited programming experience using a keyword-driven approach in HTML files. This tool, originally called "Selenium" but later referred to as "Selenium Core", was released under the Apache 2 license in 2004.

The table format of Selenium is structured similarly to the ActionFixture from FIT. Each row of the table is split into three columns. The first column gives the name of the command to execute, the second column typically contains an element identifier and the third column contains an optional value. For example, this is how to type the string "Selenium WebDriver" into an element identified with the name "q":

type       name=q       Selenium WebDriver

Because Selenium was written in pure Javascript, its initial design required developers to host Core and their tests on the same server as the application under test (AUT) in order to avoid falling foul of the browser's security policies and the Javascript sandbox. This was not always practical or possible. Worse, although a developer's IDE gives them the ability to swiftly manipulate code and navigate a large codebase, there is no such tool for HTML. It rapidly became clear that maintaining even a medium-sized suite of tests was an unwieldy and painful proposition. [2]

To resolve this and other issues, an HTTP proxy was written so that every HTTP request could be intercepted by Selenium. Using this proxy made it possible to side-step many of the constraints of the "same host origin" policy, where a browser won't allow Javascript to make calls to anything other than the server from which the current page has been served, allowing the first weakness to be mitigated. The design opened up the possibility of writing Selenium bindings in multiple languages: they just needed to be able to send HTTP requests to a particular URL. The wire format was closely modeled on the table-based syntax of Selenium Core and it, along with the table-based syntax, became known as "Selenese". Because the language bindings were controlling the browser at a distance, the tool was called "Selenium Remote Control", or "Selenium RC".
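To make the "just send HTTP requests to a particular URL" point concrete, here is a minimal sketch of how an RC-style language binding could encode a Selenese command as a request URL. The layout shown (a cmd parameter plus numbered arguments against a /selenium-server/driver/ endpoint) follows the historical RC wire format, but the class and method names are illustrative, not part of any Selenium release.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SeleneseRequest {
    // Build the request URL for a Selenese command such as:
    //   type | name=q | Selenium WebDriver
    // The command name goes in "cmd"; arguments are simply numbered 1, 2, ...
    public static String buildUrl(String host, int port,
                                  String command, String... args) {
        try {
            StringBuilder url = new StringBuilder();
            url.append("http://").append(host).append(':').append(port)
               .append("/selenium-server/driver/?cmd=")
               .append(URLEncoder.encode(command, "UTF-8"));
            for (int i = 0; i < args.length; i++) {
                url.append('&').append(i + 1).append('=')
                   .append(URLEncoder.encode(args[i], "UTF-8"));
            }
            return url.toString();
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("UTF-8 is always available", e);
        }
    }
}
```

A binding in any language needs little more than this plus an HTTP client, which is what made polyglot support cheap.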

While Selenium was being developed, another browser automation framework was brewing at ThoughtWorks: WebDriver. The initial code for this was released early in 2007. WebDriver was derived from work on projects which wanted to isolate their end-to-end tests from the underlying test tool. Typically, the way that this isolation is done is via the Adapter pattern. WebDriver grew out of insight developed by applying this approach consistently over numerous projects, and initially was a wrapper around HtmlUnit. Internet Explorer and Firefox support followed rapidly after release.

When WebDriver was released there were significant differences between it and Selenium RC, though they sat in the same software niche of an API for browser automation. The most obvious difference to a user was that Selenium RC had a dictionary-based API, with all methods exposed on a single class, whereas WebDriver had a more object-oriented API. In addition, WebDriver only supported Java, whereas Selenium RC offered support for a wide range of languages. There were also strong technical differences: Selenium Core (on which RC was based) was essentially a Javascript application, running inside the browser's security sandbox. WebDriver attempted to bind natively to the browser, side-stepping the browser's security model at the cost of significantly increased development effort for the framework itself.

In August 2009, it was announced that the two projects would merge, and Selenium WebDriver is the result of those merged projects. As I write this, WebDriver supports language bindings for Java, C#, Python and Ruby. It offers support for Chrome, Firefox, Internet Explorer, Opera, and the Android and iPhone browsers. There are sister projects, not kept in the same source code repository but working closely with the main project, that provide Perl bindings, an implementation for the BlackBerry browser, and for "headless" WebKit—useful for those times where tests need to run on a continuous integration server without a proper display. The original Selenium RC mechanism is still maintained and allows WebDriver to provide support for browsers that would otherwise be unsupported.

16.2. A Digression About Jargon

Unfortunately, the Selenium project uses a lot of jargon. To recap what we've already come across:

  • Selenium Core is the heart of the original Selenium implementation, and is a set of Javascript scripts that control the browser. This is sometimes referred to as "Selenium" and sometimes as "Core".
  • Selenium RC was the name given to the language bindings for Selenium Core, and is commonly, and confusingly, referred to as just "Selenium" or "RC". It has now been replaced by Selenium WebDriver, where RC's API is referred to as the "Selenium 1.x API".
  • Selenium WebDriver fits in the same niche as RC did, and has subsumed the original 1.x bindings. It refers to both the language bindings and the implementations of the individual browser controlling code. This is commonly referred to as just "WebDriver" or sometimes as Selenium 2. Doubtless, this will be contracted to "Selenium" over time.

The astute reader will have noticed that "Selenium" is used in a fairly general sense. Fortunately, context normally makes it clear which particular Selenium people are referring to.

Finally, there's one more phrase which I'll be using, and there's no graceful way of introducing it: "driver" is the name given to a particular implementation of the WebDriver API. For example, there is a Firefox driver, and an Internet Explorer driver.

16.3. Architectural Themes

Before we start looking at the individual pieces to understand how they're wired together, it's useful to understand the overarching themes of the architecture and development of the project. Succinctly put, these are:

  • Keep the costs down.
  • Emulate the user.
  • Prove the drivers work…
  • …but you shouldn't need to understand how everything works.
  • Lower the bus factor.
  • Have sympathy for a Javascript implementation.
  • Every method call is an RPC call.
  • We are an Open Source project.

16.3.1. Keep the Costs Down

Supporting X browsers on Y platforms is inherently an expensive proposition, both in terms of initial development and maintenance. If we can find some way to keep the quality of the product high without violating too many of the other principles, then that's the route we favor. This is most clearly seen in our adoption of Javascript where possible, as you'll read about shortly.

16.3.2. Emulate the User

WebDriver is designed to accurately simulate the way that a user will interact with a web application. A common approach for simulating user input is to make use of Javascript to synthesize and fire the series of events that an app would see if a real user were to perform the same interaction. This "synthesized events" approach is fraught with difficulties as each browser, and sometimes different versions of the same browser, fire slightly different events with slightly different values. To complicate matters, most browsers won't allow a user to interact in this way with form elements such as file input elements for security reasons.

Where possible WebDriver uses the alternative approach of firing events at the OS level. As these "native events" aren't generated by the browser this approach circumvents the security restrictions placed on synthesized events and, because they are OS specific, once they are working for one browser on a particular platform reusing the code in another browser is relatively easy. Sadly, this approach is only possible where WebDriver can bind closely with the browser and where the development team have determined how best to send native events without requiring the browser window to be focused (as Selenium tests take a long time to run, and it's useful to be able to use the machine for other tasks as they run). At the time of writing, this means that native events can be used on Linux and Windows, but not Mac OS X.

No matter how WebDriver is emulating user input, we try hard to mimic user behavior as closely as possible. This is in contrast to RC, which provided APIs that operated at a level far lower than that at which a user works.

16.3.3. Prove the Drivers Work

It may be an idealistic, "motherhood and apple pie" thing, but I believe there's no point in writing code if it doesn't work. The way we prove the drivers work on the Selenium project is to have an extensive set of automated test cases. These are typically "integration tests", requiring the code to be compiled and making use of a browser interacting with a web server, but where possible we write "unit tests", which, unlike an integration test, can be run without a full recompilation. At the time of writing, there are about 500 integration tests and about 250 unit tests that could be run across each and every browser. We add more as we fix issues and write new code, and our focus is shifting to writing more unit tests.

Not every test is run against every browser. Some test specific capabilities that some browsers don't support, or which are handled in different ways on different browsers. Examples would include the tests for new HTML5 features which aren't supported on all browsers. Despite this, each of the major desktop browsers has a significant subset of tests run against it. Understandably, finding a way to run 500+ tests per browser on multiple platforms is a significant challenge, and it's one that the project continues to wrestle with.

16.3.4. You Shouldn't Need to Understand How Everything Works

Very few developers are proficient and comfortable in every language and technology we use. Consequently, our architecture needs to allow developers to focus their talents where they can do the most good, without needing them to work on pieces of the codebase where they are uncomfortable.

16.3.5. Lower the Bus Factor

There's a (not entirely serious) concept in software development called the "bus factor". It refers to the number of key developers who would need to meet some grisly end—presumably by being hit by a bus—to leave the project in a state where it couldn't continue. Something as complex as browser automation could be especially prone to this, so a lot of our architectural decisions are made to raise this number as high as possible.

16.3.6. Have Sympathy for a Javascript Implementation

WebDriver falls back to using pure Javascript to drive the browser if there is no other way of controlling it. This means that any API we add should be "sympathetic" to a Javascript implementation. As a concrete example, HTML5 introduces LocalStorage, an API for storing structured data on the client-side. This is typically implemented in the browser using SQLite. A natural implementation would have been to provide a database connection to the underlying data store, using something like JDBC. Eventually, we settled on an API that closely models the underlying Javascript implementation because something that modeled typical database access APIs wasn't sympathetic to a Javascript implementation.

16.3.7. Every Call Is an RPC Call

WebDriver controls browsers that are running in other processes. Although it's easy to overlook it, this means that every call that is made through its API is an RPC call and therefore the performance of the framework is at the mercy of network latency. In normal operation, this may not be terribly noticeable—most OSes optimize routing to localhost—but as the network latency between the browser and the test code increases, what may have seemed efficient becomes less so to both API designers and users of that API.

This introduces some tension into the design of APIs. A larger API, with coarser functions, would help reduce latency by collapsing multiple calls, but this must be balanced by keeping the API expressive and easy to use. For example, there are several checks that need to be made to determine whether an element is visible to an end-user. Not only do we need to take into account various CSS properties, which may need to be inferred by looking at parent elements, but we should probably also check the dimensions of the element. A minimalist API would require each of these checks to be made individually. WebDriver collapses all of them into a single isDisplayed method.
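The kind of work isDisplayed bundles into one RPC can be sketched as plain Java. This is a hypothetical model, not WebDriver's internal code: the Style class and its fields are invented for illustration, but the checks (display inherited from ancestors, visibility, opacity, non-zero dimensions) mirror the ones the paragraph describes.

```java
public class VisibilityCheck {
    // Illustrative stand-in for the computed style of an element.
    static class Style {
        String display;    // e.g. "none", "block"
        String visibility; // e.g. "hidden", "visible"
        double opacity;    // 0.0 to 1.0
        int width, height;
        Style parent;      // some properties are effectively inherited
    }

    // A minimalist API would force the client to make each of these
    // checks as a separate remote call; bundling them here means one
    // round trip instead of many.
    public static boolean isDisplayed(Style s) {
        // display:none on the element or any ancestor hides it.
        for (Style cur = s; cur != null; cur = cur.parent) {
            if ("none".equals(cur.display)) return false;
        }
        if ("hidden".equals(s.visibility)) return false;
        if (s.opacity == 0) return false;
        // Zero-sized elements are not visible to an end user.
        return s.width > 0 && s.height > 0;
    }
}
```

Five logical checks, one method call: exactly the coarse-grained trade the text describes.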

16.3.8. Final Thought: This Is Open Source

Although it's not strictly an architectural point, Selenium is an Open Source project. The theme that ties all the above points together is that we'd like to make it as easy as possible for a new developer to contribute. By keeping the depth of knowledge required as shallow as possible, using as few languages as necessary and by relying on automated tests to verify that nothing has broken, we hopefully enable this ease of contribution.

Originally the project was split into a series of modules, with each module representing a particular browser with additional modules for common code and for support and utility code. Source trees for each binding were stored under these modules. This approach made a lot of sense for languages such as Java and C#, but was painful to work with for Rubyists and Pythonistas. This translated almost directly into relative contributor numbers, with only a handful of people able and interested to work on the Python and Ruby bindings. To address this, in October and November of 2010 the source code was reorganized with the Ruby and Python code stored under a single top-level directory per language. This more closely matched the expectations of Open Source developers in those languages, and the effect on contributions from the community was noticeable almost immediately.

16.4. Coping with Complexity

Software is a lumpy construct. The lumps are complexity, and as designers of an API we have a choice about where to push that complexity. At one extreme we could spread the complexity as evenly as possible, meaning that every consumer of the API needs to be party to it. The other extreme suggests taking as much of the complexity as possible and isolating it in a single place. That single place would be a place of darkness and terror for many if they have to venture there, but the trade-off is that users of the API, who need not delve into the implementation, have that cost of complexity paid up-front for them.

The WebDriver developers lean more towards finding and isolating the complexity in a few places rather than spreading it out. One reason for this is our users. They're exceptionally good at finding problems and issues, as a glance at our bug list shows, but because many of them are not developers a complex API isn't going to work well. We sought to provide an API that guides people in the right direction. As an example, consider the following methods from the original Selenium API, each of which can be used to set the value of an input element:

  • type
  • typeKeys
  • typeKeysNative
  • keydown
  • keypress
  • keyup
  • keydownNative
  • keypressNative
  • keyupNative
  • attachFile

Here's the equivalent in the WebDriver API:

  • sendKeys

As discussed earlier, this highlights one of the major philosophical differences between RC and WebDriver in that WebDriver is striving to emulate the user, whereas RC offers APIs that deal at a lower level that a user would find hard or impossible to reach. The distinction between typeKeys and typeKeysNative is that the former always uses synthetic events, whereas the latter attempts to use the AWT Robot to type the keys. Disappointingly, the AWT Robot sends the key presses to whichever window has focus, which may not necessarily be the browser. WebDriver's native events, by contrast, are sent directly to the window handle, avoiding the requirement that the browser window have focus.

16.4.1. The WebDriver Design

The team refers to WebDriver's API as being "object-based". The interfaces are clearly defined and try to adhere to having only a single role or responsibility, but rather than modeling every single possible HTML tag as its own class we only have a single WebElement interface. By following this approach developers who are using an IDE which supports auto-completion can be led towards the next step to take. The result is that coding sessions may look like this (in Java):

WebDriver driver = new FirefoxDriver();
driver.<user hits space>

At this point, a relatively short list of 13 methods to pick from appears. The user selects one:

driver.findElement(<user hits space>)

Most IDEs will now drop a hint about the type of the argument expected, in this case a "By". There are a number of preconfigured factory methods for "By" objects declared as static methods on By itself. Our user will quickly end up with a line of code that looks like:

driver.findElement(By.id("some_id"));

Role-based Interfaces

Think of a simplified Shop class. Every day, it needs to be restocked, and it collaborates with a Stockist to deliver this new stock. Every month, it needs to pay staff and taxes. For the sake of argument, let's assume that it does this using an Accountant. One way of modeling this looks like:

public interface Shop {
    void addStock(StockItem item, int quantity);
    Money getSalesTotal(Date startDate, Date endDate);
}

We have two choices about where to draw the boundaries when defining the interface between the Shop, the Accountant and the Stockist. We could draw a theoretical line as shown in Figure 16.1.

This would mean that both Accountant and Stockist would accept a Shop as an argument to their respective methods. The drawback here, though, is that it's unlikely that the Accountant really wants to stack shelves, and it's probably not a great idea for the Stockist to realize the vast mark-up on prices that the Shop is adding. So, a better place to draw the line is shown in Figure 16.2.

We'll need two interfaces that the Shop needs to implement, but theseinterfaces clearly define the role that the Shop fulfills for both theAccountant and the Stockist. They are role-based interfaces:

public interface HasBalance {
    Money getSalesTotal(Date startDate, Date endDate);
}

public interface Stockable {
    void addStock(StockItem item, int quantity);
}

public interface Shop extends HasBalance, Stockable {
}

I find UnsupportedOperationExceptions and their ilk deeply displeasing, but there needs to be something that allows functionality to be exposed for the subset of users who might need it without cluttering the rest of the APIs for the majority of users. To this end, WebDriver makes extensive use of role-based interfaces. For example, there is a JavascriptExecutor interface that provides the ability to execute arbitrary chunks of Javascript in the context of the current page. A successful cast of a WebDriver instance to that interface indicates that you can expect the methods on it to work.
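The cast-based capability check can be sketched in a self-contained way using the Shop types from the example above rather than WebDriver itself. This is a simplified model (Money and the date arguments are reduced to an int running total for illustration); the point is the idiom: a consumer asks for the role interface it needs, and a successful cast means the capability is present, just as casting a WebDriver instance to JavascriptExecutor does.

```java
public class RoleCheck {
    // Role-based interfaces, simplified from the Shop example.
    interface HasBalance { int getSalesTotal(); }
    interface Stockable { void addStock(String item, int quantity); }

    // A concrete shop fulfills both roles.
    static class CornerShop implements HasBalance, Stockable {
        private int total = 0;
        public void addStock(String item, int quantity) { total += quantity; }
        public int getSalesTotal() { return total; }
    }

    // The Accountant only needs the HasBalance role. A successful cast
    // signals the capability exists; anything else about the object is
    // invisible to it.
    static int audit(Object shop) {
        if (shop instanceof HasBalance) {
            return ((HasBalance) shop).getSalesTotal();
        }
        throw new IllegalArgumentException("this object has no balance to audit");
    }
}
```

The same shape appears in WebDriver client code: check for the role interface, cast, and use it, rather than calling a method that might throw UnsupportedOperationException.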


Figure 16.1: Accountant and Stockist Depend on Shop


Figure 16.2: Shop Implements HasBalance and Stockable

16.4.2. Dealing with the Combinatorial Explosion

One of the first things that is apparent from a moment's thought about the wide range of browsers and languages that WebDriver supports is that unless care is taken it would quickly face an escalating cost of maintenance. With X browsers and Y languages, it would be very easy to fall into the trap of maintaining X×Y implementations.

Reducing the number of languages that WebDriver supports would be one way to reduce this cost, but we don't want to go down this route for two reasons. Firstly, there's a cognitive load to be paid when switching from one language to another, so it's advantageous to users of the framework to be able to write their tests in the same language that they do the majority of their development work in. Secondly, mixing several languages on a single project is something that teams may not be comfortable with, and corporate coding standards and requirements often seem to demand a technology monoculture (although, pleasingly, I think that this second point is becoming less true over time), therefore reducing the number of supported languages isn't an available option.

Reducing the number of supported browsers also isn't an option—there were vociferous arguments when we phased out support for Firefox 2 in WebDriver, despite the fact that when we made this choice it represented less than 1% of the browser market.

The only choice we have left is to try and make all the browsers look identical to the language bindings: they should offer a uniform interface that can be addressed easily in a wide variety of languages. What is more, we want the language bindings themselves to be as easy to write as possible, which suggests that we want to keep them as slim as possible. We push as much logic as we can into the underlying driver in order to support this: every piece of functionality we fail to push into the driver is something that needs to be implemented in every language we support, and this can represent a significant amount of work.

As an example, the IE driver has successfully pushed the responsibility for locating and starting IE into the main driver logic. Although this has resulted in a surprising number of lines of code being in the driver, the language binding for creating a new instance boils down to a single method call into that driver. For comparison, the Firefox driver has failed to make this change. In the Java world alone, this means that we have three major classes that handle configuring and starting Firefox weighing in at around 1300 lines of code. These classes are duplicated in every language binding that wants to support the FirefoxDriver without relying on starting a Java server. That's a lot of additional code to maintain.

16.4.3. Flaws in the WebDriver Design

The downside of the decision to expose capabilities in this way is that until someone knows that a particular interface exists they may not realize that WebDriver supports that type of functionality; there's a loss of explorability in the API. Certainly when WebDriver was new we seemed to spend a lot of time just pointing people towards particular interfaces. We've now put a lot more effort into our documentation and as the API gets more widely used it becomes easier and easier for users to find the information they need.

There is one place where I think our API is particularly poor. We have an interface called RenderedWebElement which has a strange mish-mash of methods to do with querying the rendered state of the element (isDisplayed, getSize and getLocation), performing operations on it (hover and drag and drop methods), and a handy method for getting the value of a particular CSS property. It was created because the HtmlUnit driver didn't expose the required information, but the Firefox and IE drivers did. It originally only had the first set of methods but we added the other methods before I'd done hard thinking about how I wanted the API to evolve. The interface is well known now, and the tough choice is whether we keep this unsightly corner of the API given that it's widely used, or whether we attempt to delete it. My preference is not to leave a "broken window" behind, so fixing this before we release Selenium 2.0 is important. As a result, by the time you read this chapter, RenderedWebElement may well be gone.

From an implementor's point of view, binding tightly to a browser is also a design flaw, albeit an inescapable one. It takes significant effort to support a new browser, and often several attempts need to be made in order to get it right. As a concrete example, the Chrome driver has gone through four complete rewrites, and the IE driver has had three major rewrites too. The advantage of binding tightly to a browser is that it offers more control.

16.5. Layers and Javascript

A browser automation tool is essentially built of three moving parts:

  • A way of interrogating the DOM.
  • A mechanism for executing Javascript.
  • Some means of emulating user input.

This section focuses on the first part: providing a mechanism to interrogate the DOM. The lingua franca of the browser is Javascript, and this seems like the ideal language to use when interrogating the DOM. Although this choice seems obvious, making it leads to some interesting challenges and competing requirements that need balancing when thinking about Javascript.

Like most large projects, Selenium makes use of a layered set of libraries. The bottom layer is Google's Closure Library, which supplies primitives and a modularization mechanism allowing source files to be kept focused and as small as possible. Above this, there is a utility library providing functions that range from simple tasks such as getting the value of an attribute, through determining whether an element would be visible to an end user, to far more complex actions such as simulating a click using synthesized events. Within the project, these are viewed as offering the smallest units of browser automation, and so are called Browser Automation Atoms or atoms. Finally, there are adapter layers that compose atoms in order to meet the API contracts of both WebDriver and Core.


Figure 16.3: Layers of Selenium Javascript Library

The Closure Library was chosen for several reasons. The main one was that the Closure Compiler understands the modularization technique the Library uses. The Closure Compiler is a compiler targeting Javascript as the output language. "Compilation" can be as simple as ordering input files in dependency order, concatenating and pretty printing them, or as complex as doing advanced minification and dead code removal. Another undeniable advantage was that several members of the team doing the work on the Javascript code were very familiar with Closure Library.

This "atomic" library of code is used pervasively throughout theproject when there is a requirement to interrogate the DOM. For RCand those drivers largely composed of Javascript, the library is useddirectly, typically compiled as a monolithic script. For driverswritten in Java, individual functions from the WebDriver adapter layerare compiled with full optimization enabled, and the generatedJavascript included as resources in the JARs. For drivers written in Cvariants, such as the iPhone and IE drivers, not only are theindividual functions compiled with full optimization, but thegenerated output is converted to a constant defined in a header whichis executed via the driver's normal Javascript execution mechanism ondemand. Although this seems like a strange thing to do, it allows theJavascript to be pushed into the underlying driver without needing toexpose the raw source in multiple places.

Because the atoms are used pervasively it's possible to ensure consistent behavior between the different browsers, and because the library is written in Javascript and doesn't require elevated privileges to execute, the development cycle is easy and fast. The Closure Library can load dependencies dynamically, so the Selenium developer need only write a test and load it in a browser, modifying code and hitting the refresh button as required. Once the test is passing in one browser, it's easy to load it in another browser and confirm that it passes there. Because the Closure Library does a good job of abstracting away the differences between browsers, this is often enough, though it's reassuring to know that there are continuous builds that will run the test suite in every supported browser.

Originally Core and WebDriver had many areas of congruent code—code that performed the same function in slightly different ways. When we started work on the atoms, this code was combed through to try and find the "best of breed" functionality. After all, both projects had been used extensively and their code was very robust, so throwing away everything and starting from scratch would not only have been wasteful but foolish. As each atom was extracted, the sites at which it would be used were identified and switched to using the atom. For example, the Firefox driver's getAttribute method shrank from approximately 50 lines of code to 6 lines, including blank lines:

FirefoxDriver.prototype.getElementAttribute =
    function(respond, parameters) {
  var element = Utils.getElementAt(parameters.id,
                                   respond.session.getDocument());
  var attributeName = parameters.name;

  respond.value = webdriver.element.getAttribute(element, attributeName);
  respond.send();
};

That second-to-last line, where respond.value is assigned to, is using the atomic WebDriver library.

The atoms are a practical demonstration of several of the architectural themes of the project. Naturally they enforce the requirement that an implementation of an API be sympathetic to a Javascript implementation. What's even better is that the same library is shared throughout the codebase; where once a bug had to be verified and fixed across multiple implementations, it is now enough to fix the bug in one place, which reduces the cost of change while improving stability and effectiveness. The atoms also make the bus factor of the project more favorable. Since a normal Javascript unit test can be used to check that a fix works, the barrier to joining the Open Source project is considerably lower than it was when knowledge of how each driver was implemented was required.

There is another benefit to using the atoms. A layer emulating the existing RC implementation but backed by WebDriver is an important tool for teams looking to migrate in a controlled fashion to the newer WebDriver APIs. As Selenium Core is atomized it becomes possible to compile each function from it individually, making the task of writing this emulating layer both easier to implement and more accurate.

It goes without saying that there are downsides to the approach taken. Most importantly, compiling Javascript to a C constant is a very strange thing to do, and it always baffles new contributors to the project who want to work on the C code. It is also a rare developer who has every version of every browser and is dedicated enough to run every test in all of those browsers: it is possible for someone to inadvertently cause a regression in an unexpected place, and it can take some time to identify the problem, particularly if the continuous builds are being flaky.

Because the atoms normalize return values between browsers, there can also be unexpected return values. For example, consider this HTML:

<input name="example" checked>

The value of the checked attribute will depend on the browser being used. The atoms normalize this, and other Boolean attributes defined in the HTML5 spec, to be "true" or "false". When this atom was introduced to the code base, we discovered many places where people were making browser-dependent assumptions about what the return value should be. While the value was now consistent, there was an extended period where we explained to the community what had happened and why.
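To make the normalization concrete, here is a hedged sketch of the idea, not the real Closure-based atom: the function name and the attribute list are illustrative, and a plain `attrs` map stands in for the element's parsed attributes.

```javascript
// Illustrative sketch only: the real atom is built on the Closure Library
// and handles many more cases. `attrs` maps attribute names to raw values;
// a boolean attribute is present with an empty string, as in the HTML above.
var BOOLEAN_ATTRIBUTES = ['checked', 'disabled', 'readonly', 'selected'];

function normalizeAttribute(attrs, name) {
  var lowered = name.toLowerCase();
  if (BOOLEAN_ATTRIBUTES.indexOf(lowered) !== -1) {
    // Browsers disagree on the raw value; presence alone decides the result.
    return Object.prototype.hasOwnProperty.call(attrs, lowered) ? 'true' : 'false';
  }
  return Object.prototype.hasOwnProperty.call(attrs, lowered) ? attrs[lowered] : null;
}

// <input name="example" checked> parses to:
var input = { name: 'example', checked: '' };
console.log(normalizeAttribute(input, 'checked'));   // -> 'true'
console.log(normalizeAttribute(input, 'disabled'));  // -> 'false'
```

Whatever the browser reports as the raw value of checked, callers always see the string "true" or "false".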

16.6. The Remote Driver, and the Firefox Driver in Particular

The remote WebDriver was originally a glorified RPC mechanism. It has since evolved into one of the key mechanisms we use to reduce the cost of maintaining WebDriver by providing a uniform interface that language bindings can code against. Even though we've pushed as much of the logic as we can out of the language bindings and into the driver, if each driver needed to communicate via a unique protocol we would still have an enormous amount of code to repeat across all the language bindings.

The remote WebDriver protocol is used wherever we need to communicate with a browser instance that's running out of process. Designing this protocol meant taking into consideration a number of concerns. Most of these were technical, but, this being open source, there was also the social aspect to consider.

Any RPC mechanism is split into two pieces: the transport and the encoding. We knew that however we implemented the remote WebDriver protocol, we would need support for both pieces in the languages we wanted to use as clients. The first iteration of the design was developed as part of the Firefox driver.

Mozilla, and therefore Firefox, was always seen as being a multi-platform application by its developers. In order to facilitate the development, Mozilla created a framework inspired by Microsoft's COM that allowed components to be built and bolted together, called XPCOM (cross-platform COM). An XPCOM interface is declared using IDL, and there are language bindings for C and Javascript as well as other languages. Because XPCOM is used to construct Firefox, and because XPCOM has Javascript bindings, it's possible to make use of XPCOM objects in Firefox extensions.

Normal Win32 COM allows interfaces to be accessed remotely. There were plans to add the same ability to XPCOM too, and Darin Fisher added an XPCOM ServerSocket implementation to facilitate this. Although the plans for D-XPCOM never came to fruition, like an appendix, the vestigial infrastructure is still there. We took advantage of this to create a very basic server within a custom Firefox extension containing all the logic for controlling Firefox. The protocol used was originally text-based and line-oriented, encoding all strings as UTF-16. Each request or response began with a number, indicating how many newlines to count before concluding that the request or reply had been sent. Crucially, this scheme was easy to implement in Javascript as SpiderMonkey (Firefox's Javascript engine) stores Javascript strings internally as 16 bit unsigned integers.
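As an illustration of that framing scheme, a minimal sketch (not the original extension code; function names are invented) might prefix each message with the number of newlines it contains:

```javascript
// Sketch of the line-oriented framing described above: each message is
// prefixed with a count telling the reader how many newlines to consume
// before the payload is complete.
function encodeMessage(payload) {
  var newlines = (payload.match(/\n/g) || []).length;
  return newlines + '\n' + payload;
}

function decodeMessage(framed) {
  var firstNewline = framed.indexOf('\n');
  var expected = parseInt(framed.substring(0, firstNewline), 10);
  var payload = framed.substring(firstNewline + 1);
  // A real reader would keep consuming from the socket until it had seen
  // `expected` newlines; here we simply verify the count.
  if ((payload.match(/\n/g) || []).length !== expected) {
    throw new Error('Incomplete message');
  }
  return payload;
}

console.log(encodeMessage('getTitle'));  // -> '0\ngetTitle'
```

The leading count is what let the Javascript side know when a multi-line request or reply was complete.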

Although futzing with custom encoding protocols over raw sockets is a fun way to pass the time, it has several drawbacks. There were no widely available libraries for the custom protocol, so it needed to be implemented from the ground up for every language that we wanted to support. This requirement to implement more code would make it less likely that generous Open Source contributors would participate in the development of new language bindings. Also, although a line-oriented protocol was fine when we were only sending text-based data around, it brought problems when we wanted to send images (such as screenshots) around.

It became very obvious, very quickly, that this original RPC mechanism wasn't practical. Fortunately, there was a well-known transport that has widespread adoption and support in almost every language and that would allow us to do what we wanted: HTTP.

Once we had decided to use HTTP as the transport mechanism, the next choice that needed to be made was whether to use a single end-point (à la SOAP) or multiple end-points (in the style of REST). The original Selenese protocol used a single end-point and encoded commands and arguments in the query string. While this approach worked well, it didn't "feel" right: we had visions of being able to connect to a remote WebDriver instance in a browser to view the state of the server. We ended up choosing an approach we call "REST-ish": multiple end-point URLs using the verbs of HTTP to help provide meaning, but breaking a number of the constraints required for a truly RESTful system, notably around the location of state and cacheability, largely because there is only one location for the application state to meaningfully exist.

Although HTTP makes it easy to support multiple ways of encoding data based on content type negotiation, we decided that we needed a canonical form that all implementations of the remote WebDriver protocol could work with. There were a handful of obvious choices: HTML, XML or JSON. We quickly ruled out XML: although it's a perfectly reasonable data format and there are libraries that support it for almost every language, my perception of how well-liked it is in the Open Source community was that people don't enjoy working with it. In addition, it was entirely possible that although the returned data would share a common "shape", it would be easy for additional fields to be added. Although these extensions could be modeled using XML namespaces, this would start to introduce Yet More Complexity into the client code: something I was keen to avoid. XML was discarded as an option. HTML wasn't really a good choice, as we needed to be able to define our own data format, and though an embedded micro-format could have been devised and used, that seemed like using a hammer to crack an egg.

The final possibility considered was Javascript Object Notation (JSON). Browsers can transform a string into an object using either a straight call to eval or, on more recent browsers, with primitives designed to transform a Javascript object to and from a string securely and without side-effects. From a practical perspective, JSON is a popular data format with libraries for handling it available for almost every language, and all the cool kids like it. An easy choice.

The second iteration of the remote WebDriver protocol therefore used HTTP as the transport mechanism and UTF-8 encoded JSON as the default encoding scheme. UTF-8 was picked as the default encoding so that clients could easily be written in languages with limited support for Unicode, as UTF-8 is backwardly compatible with ASCII. Commands sent to the server used the URL to determine which command was being sent, and encoded the parameters for the command in an array.

For example, a call to WebDriver.get("http://www.example.com") mapped to a POST request to a URL encoding the session ID and ending with "/url", with the array of parameters looking like ['http://www.example.com']. The returned result was a little more structured, and had place-holders for a returned value and an error code. It wasn't long until the third iteration of the remote protocol, which replaced the request's array of parameters with a dictionary of named parameters. This had the benefit of making debugging requests significantly easier, and removed the possibility of clients mistakenly mis-ordering parameters, making the system as a whole more robust. Naturally, it was decided to use normal HTTP error codes to indicate certain return values and responses where they were the most appropriate way to do so; for example, if a user attempts to call a URL with nothing mapped to it, or when we want to indicate the "empty response".
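The difference between the second and third iterations can be sketched as follows. This is illustrative only: the URL shape follows the example above, and the dictionary key `url` plus the helper names are assumptions, not the driver's actual code.

```javascript
// Second iteration: positional parameters in an array.
function encodeGetV2(sessionId, url) {
  return {
    method: 'POST',
    path: '/session/' + sessionId + '/url',
    body: JSON.stringify([url])
  };
}

// Third iteration: named parameters in a dictionary, which is easier to
// debug and immune to accidental mis-ordering.
function encodeGetV3(sessionId, url) {
  return {
    method: 'POST',
    path: '/session/' + sessionId + '/url',
    body: JSON.stringify({ url: url })
  };
}

console.log(encodeGetV2('XXX', 'http://www.example.com').body);
// -> ["http://www.example.com"]
console.log(encodeGetV3('XXX', 'http://www.example.com').body);
// -> {"url":"http://www.example.com"}
```

With named parameters, a mis-ordered call fails loudly at the server instead of silently sending the wrong value to the wrong slot.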

The remote WebDriver protocol has two levels of error handling, one for invalid requests, and one for failed commands. An example of an invalid request is a request for a resource that doesn't exist on the server, or perhaps for a verb that the resource doesn't understand (such as sending a DELETE command to the resource used for dealing with the URL of the current page). In those cases, a normal HTTP 4xx response is sent. For a failed command, the response's error code is set to 500 ("Internal Server Error") and the returned data contains a more detailed breakdown of what went wrong.

When a response containing data is sent from the server, it takes theform of a JSON object:

Key        Description
---------  -----------
sessionId  An opaque handle used by the server to determine where to
           route session-specific commands.
status     A numeric status code summarizing the result of the command.
           A non-zero value indicates that the command failed.
value      The response JSON value.

An example response would be:

{
  sessionId: 'BD204170-1A52-49C2-A6F8-872D127E7AE8',
  status: 7,
  value: 'Unable to locate element with id: foo'
}

As can be seen, we encode status codes in the response, with a non-zero value indicating that something has gone horribly awry. The IE driver was the first to use status codes, and the values used in the wire protocol mirror these. Because all error codes are consistent between drivers, it is possible to share error handling code between all the drivers written in a particular language, making the job of the client-side implementors easier.

The Remote WebDriver Server is simply a Java servlet that acts as a multiplexer, routing any commands it receives to an appropriate WebDriver instance. It's the sort of thing that a second year graduate student could write. The Firefox driver also implements the remote WebDriver protocol, and its architecture is far more interesting, so let's follow a request through from the call in the language bindings to that back-end until it returns to the user.

Assuming that we're using Java, and that "element" is an instance of WebElement, it all starts here:

element.getAttribute("row");

Internally, the element has an opaque "id" that the server-side uses to identify which element we're talking about. For the sake of this discussion, we'll imagine it has the value "some_opaque_id". This is encoded into a Java Command object with a Map holding the (now named) parameters: id for the element ID and name for the name of the attribute being queried.

A quick look up in a table indicates that the correct URL is:

/session/:sessionId/element/:id/attribute/:name

Any section of the URL that begins with a colon is assumed to be a variable that requires substitution. We've been given the id and name parameters already, and the sessionId is another opaque handle that is used for routing when a server can handle more than one session at a time (which the Firefox driver cannot). This URL therefore typically expands to something like:

http://localhost:7055/hub/session/XXX/element/some_opaque_id/attribute/row
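The substitution itself is mechanical. A minimal sketch (not the actual client code; the function name is illustrative) might look like:

```javascript
// Expand a WebDriver-style URL template: any section beginning with a
// colon is a variable to be replaced from the named parameters.
function expandUrlTemplate(template, params) {
  return template.replace(/:(\w+)/g, function(match, name) {
    if (!(name in params)) {
      throw new Error('Missing parameter: ' + name);
    }
    return encodeURIComponent(params[name]);
  });
}

var url = expandUrlTemplate('/session/:sessionId/element/:id/attribute/:name',
    { sessionId: 'XXX', id: 'some_opaque_id', name: 'row' });
console.log(url);  // -> '/session/XXX/element/some_opaque_id/attribute/row'
```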

As an aside, WebDriver's remote wire protocol was originally developed at the same time as URL Templates were proposed as a draft RFC. Both our scheme for specifying URLs and URL Templates allow variables to be expanded (and therefore derived) within a URL. Sadly, although URL Templates were proposed at the same time, we only became aware of them relatively late in the day, and therefore they are not used to describe the wire protocol.

Because the method we're executing is idempotent, the correct HTTP method to use is a GET. We delegate down to a Java library that can handle HTTP (the Apache HTTP Client) to call the server.

[Overview of the Firefox Driver Architecture]

Figure 16.4: Overview of the Firefox Driver Architecture

The Firefox driver is implemented as a Firefox extension, the basic design of which is shown in Figure 16.4. Somewhat unusually, it has an embedded HTTP server. Although originally we used one that we had built ourselves, writing HTTP servers in XPCOM wasn't one of our core competencies, so when the opportunity arose we replaced it with a basic HTTPD written by Mozilla themselves. Requests are received by the HTTPD and almost straight away passed to a dispatcher object.

The dispatcher takes the request and iterates over a known list of supported URLs, attempting to find one that matches the request. This matching is done with knowledge of the variable interpolation that went on on the client side. Once an exact match is found, including the verb being used, a JSON object representing the command to execute is constructed. In our case it looks like:

{
  'name': 'getElementAttribute',
  'sessionId': { 'value': 'XXX' },
  'parameters': {
    'id': 'some_opaque_id',
    'name': 'row'
  }
}
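The dispatcher's matching step can be sketched like this. It is a simplified stand-in for the extension's actual code, but it shows how the server-side recovers the variables the client interpolated:

```javascript
// Match a request path against a known URL pattern, recovering the
// variables that the client interpolated. Returns null on no match.
function matchRoute(pattern, path) {
  var patternParts = pattern.split('/');
  var pathParts = path.split('/');
  if (patternParts.length !== pathParts.length) {
    return null;
  }
  var params = {};
  for (var i = 0; i < patternParts.length; i++) {
    if (patternParts[i].charAt(0) === ':') {
      params[patternParts[i].substring(1)] = pathParts[i];
    } else if (patternParts[i] !== pathParts[i]) {
      return null;
    }
  }
  return params;
}

var params = matchRoute('/session/:sessionId/element/:id/attribute/:name',
    '/session/XXX/element/some_opaque_id/attribute/row');
// params -> { sessionId: 'XXX', id: 'some_opaque_id', name: 'row' }
```

A real dispatcher also checks the HTTP verb before accepting a match, as the text notes.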

This JSON object is then passed as a string to a custom XPCOM component we've written called the CommandProcessor. Here's the code:

var jsonString = JSON.stringify(json);

var callback = function(jsonResponseString) {
  var jsonResponse = JSON.parse(jsonResponseString);

  if (jsonResponse.status != ErrorCode.SUCCESS) {
    response.setStatus(Response.INTERNAL_ERROR);
  }

  response.setContentType('application/json');
  response.setBody(jsonResponseString);
  response.commit();
};

// Dispatch the command.
Components.classes['@googlecode.com/webdriver/command-processor;1'].
    getService(Components.interfaces.nsICommandProcessor).
    execute(jsonString, callback);

There's quite a lot of code here, but there are two key points. First, we converted the object above to a JSON string. Secondly, we pass a callback to the execute method that causes the HTTP response to be sent.

The execute method of the command processor looks up the "name" to determine which function to call, which it then does. The first parameter given to this implementing function is a "respond" object (so called because it was originally just the function used to send the response back to the user), which encapsulates not only the possible values that might be sent, but also has a method that allows the response to be dispatched back to the user and mechanisms to find out information about the DOM. The second parameter is the value of the parameters object seen above (in this case, id and name). The advantage of this scheme is that each function has a uniform interface that mirrors the structure used on the client side. This means that the mental models used for thinking about the code on each side are similar. Here's the underlying implementation of getAttribute, which you've seen before in Section 16.5:

FirefoxDriver.prototype.getElementAttribute = function(respond, parameters) {
  var element = Utils.getElementAt(parameters.id,
                                   respond.session.getDocument());
  var attributeName = parameters.name;

  respond.value = webdriver.element.getAttribute(element, attributeName);
  respond.send();
};

In order to make element references consistent, the first line simply looks up the element referred to by the opaque ID in a cache. In the Firefox driver, that opaque ID is a UUID and the "cache" is simply a map. The getElementAt method also checks to see if the referred-to element is both known and attached to the DOM. If either check fails, the ID is removed from the cache (if necessary) and an exception is thrown and returned to the user.

The second line from the end makes use of the browser automation atoms discussed earlier, this time compiled as a monolithic script and loaded as part of the extension.

In the final line, the send method is called. This does a simple check to ensure that we only send a response once before it calls the callback given to the execute method. The response is sent back to the user in the form of a JSON string, which is decanted into an object that looks like (assuming that getAttribute returned "7"):

{
  'value': '7',
  'status': 0,
  'sessionId': 'XXX'
}

The Java client then checks the value of the status field. If that value is non-zero, it converts the numeric status code into an exception of the correct type and throws that, using the "value" field to help set the message sent to the user. If the status is zero, the value of the "value" field is returned to the user.
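That client-side check can be sketched as follows. The status-to-name table here is a hypothetical subset for illustration; only status 7 ("unable to locate element") is taken from the examples above.

```javascript
// Convert a remote WebDriver response into a return value or a thrown
// error, so that all drivers in a language share one error-handling path.
var ERROR_NAMES = {
  7: 'NoSuchElementError'  // illustrative mapping, not the full table
};

function unwrapResponse(response) {
  if (response.status !== 0) {
    var error = new Error(String(response.value));
    error.name = ERROR_NAMES[response.status] || 'WebDriverError';
    throw error;
  }
  return response.value;
}

console.log(unwrapResponse({ value: '7', status: 0, sessionId: 'XXX' }));  // -> '7'
```

Because every driver emits the same status codes, this one function serves them all.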

Most of this makes a certain amount of sense, but there was one piece that an astute reader will raise questions about: why did the dispatcher convert the object it had into a string before calling the execute method?

The reason for this is that the Firefox Driver also supports running tests written in pure Javascript. Normally, this would be an extremely difficult thing to support: the tests are running in the context of the browser's Javascript security sandbox, and so may not do a range of things that are useful in tests, such as traveling between domains or uploading files. The WebDriver Firefox extension, however, provides an escape hatch from the sandbox. It announces its presence by adding a webdriver property to the document element. The WebDriver Javascript API uses this as an indicator that it can add JSON serialized command objects as the value of a command property on the document element, fire a custom webdriverCommand event, and then listen for a webdriverResponse event on the same element to be notified that the response property has been set.

This suggests that browsing the web in a copy of Firefox with the WebDriver extension installed is a seriously bad idea as it makes it trivially easy for someone to remotely control the browser.

Behind the scenes, there is a DOM messenger waiting for the webdriverCommand event. This reads the serialized JSON object and calls the execute method on the command processor. This time, the callback is one that simply sets the response attribute on the document element and then fires the expected webdriverResponse event.

16.7. The IE Driver

Internet Explorer is an interesting browser. It's constructed of a number of COM interfaces working in concert. This extends all the way into the Javascript engine, where the familiar Javascript variables actually refer to underlying COM instances. That Javascript window is an IHTMLWindow. document is an instance of the COM interface IHTMLDocument. Microsoft have done an excellent job in maintaining existing behavior as they enhanced their browser. This means that if an application worked with the COM classes exposed by IE6 it will still continue to work with IE9.

The Internet Explorer driver has an architecture that's evolved over time. One of the major forces upon its design has been a requirement to avoid an installer. This is a slightly unusual requirement, so perhaps needs some explanation. The first reason is that an installer makes it harder for WebDriver to pass the "5 minute test", where a developer downloads a package and tries it out for a brief period of time. More importantly, it is relatively common for users of WebDriver to not be able to install software on their own machines. It also means that no-one needs to remember to log on to the continuous integration servers to run an installer when a project wants to start testing with IE. Finally, running installers just isn't in the culture of some languages. The common Java idiom is to simply drop JAR files on to the CLASSPATH, and, in my experience, those libraries that require installers tend not to be as well-liked or used.

So, no installer. There are consequences to this choice.

The natural language to use for programming on Windows would be something that ran on .Net, probably C#. The IE driver integrates tightly with IE by making use of the IE COM Automation interfaces that ship with every version of Windows. In particular, we use COM interfaces from the native MSHTML and ShDocVw DLLs, which form part of IE. Prior to C# 4, CLR/COM interoperability was achieved via the use of separate Primary Interop Assemblies (PIAs). A PIA is essentially a generated bridge between the managed world of the CLR and that of COM.

Sadly, using C# 4 would mean using a very modern version of the .Net runtime, and many companies avoid living on the leading edge, preferring the stability and known issues of older releases. By using C# 4 we would automatically exclude a reasonable percentage of our user-base. There are also other disadvantages to using a PIA. Consider licensing restrictions. After consultation with Microsoft, it became clear that the Selenium project would not have the rights to distribute the PIAs of either the MSHTML or ShDocVw libraries. Even if those rights had been granted, each installation of Windows and IE has a unique combination of these libraries, which means that we would have needed to ship a vast number of these things. Building the PIAs on the client machine on demand is also a non-starter, as they require developer tools that may not exist on a normal user's machine.

So, although C# would have been an attractive language to do the bulk of the coding in, it wasn't an option. We needed to use something native, at least for the communication with IE. The next natural choice for this is C++, and this is the language that we chose in the end. Using C++ has the advantage that we don't need to use PIAs, but it does mean that we need to redistribute the Visual Studio C++ runtime DLL unless we statically link against it. Since we'd need to run an installer in order to make that DLL available, we statically link our library for communicating with IE.

That's a fairly high cost to pay for a requirement not to use an installer. However, going back to the theme of where complexity should live, it is worth the investment as it makes our users' lives considerably easier. It is a decision we re-evaluate on an ongoing basis, as the benefit to the user is a trade-off with the fact that the pool of people able to contribute to an advanced C++ Open Source project seems significantly smaller than those able to contribute to an equivalent C# project.

The initial design of the IE driver is shown in Figure 16.5.

[Original IE Driver]

Figure 16.5: Original IE Driver

Starting from the bottom of that stack, you can see that we're using IE's COM Automation interfaces. In order to make these easier to deal with on a conceptual level, we wrapped those raw interfaces with a set of C++ classes that closely mirrored the main WebDriver API. In order to get the Java classes communicating with the C++, we made use of JNI, with the implementations of the JNI methods using the C++ abstractions of the COM interfaces.

This approach worked reasonably well while Java was the only client language, but it would have been a source of pain and complexity if each language we supported needed us to alter the underlying library. Thus, although JNI worked, it didn't provide the correct level of abstraction.

What was the correct level of abstraction? Every language that we wanted to support had a mechanism for calling down to straight C code. In C#, this takes the form of PInvoke. In Ruby there is FFI, and Python has ctypes. In the Java world, there is an excellent library called JNA (Java Native Access). We needed to expose our API using this lowest common denominator. This was done by taking our object model and flattening it, using a simple two or three letter prefix to indicate the "home interface" of the method: "wd" for "WebDriver" and "wde" for "WebDriver Element". Thus WebDriver.get became wdGet, and WebElement.getText became wdeGetText. Each method returns an integer representing a status code, with "out" parameters being used to allow functions to return more meaningful data. Thus we ended up with method signatures such as:

int wdeGetAttribute(WebDriver*, WebElement*, const wchar_t*, StringWrapper**)

To calling code, the WebDriver, WebElement and StringWrapper are opaque types: we expressed the difference in the API to make it clear what value should be used as that parameter, though they could just as easily have been "void *". You can also see that we were using wide characters for text, since we wanted to deal with internationalized text properly.

On the Java side, we exposed this library of functions via an interface, which we then adapted to make it look like the normal object-oriented interface presented by WebDriver. For example, the Java definition of the getAttribute method looks like:

public String getAttribute(String name) {
  PointerByReference wrapper = new PointerByReference();
  int result = lib.wdeGetAttribute(
      parent.getDriverPointer(), element, new WString(name), wrapper);

  errors.verifyErrorCode(result, "get attribute of");

  return wrapper.getValue() == null ? null : new StringWrapper(lib, wrapper).toString();
}

This led to the design shown in Figure 16.6.

[Modified IE Driver]

Figure 16.6: Modified IE Driver

While all the tests were running on the local machine, this worked out well, but once we started using the IE driver in the remote WebDriver we started running into random lock ups. We traced this problem back to a constraint on the IE COM Automation interfaces. They are designed to be used in a "Single Thread Apartment" model. Essentially, this boils down to a requirement that we call the interface from the same thread every time. While running locally, this happens by default. Java app servers, however, spin up multiple threads to handle the expected load. The end result? We had no way of being sure that the same thread would be used to access the IE driver in all cases.

One solution to this problem would have been to run the IE driver in a single-threaded executor and serialize all access via Futures in the app server, and for a while this was the design we chose. However, it seemed unfair to push this complexity up to the calling code, and it's all too easy to imagine instances where people accidentally make use of the IE driver from multiple threads. We decided to sink the complexity down into the driver itself. We did this by holding the IE instance in a separate thread and using the PostThreadMessage Win32 API to communicate across the thread boundary. Thus, at the time of writing, the design of the IE driver looks like Figure 16.7.

[IE Driver as of Selenium 2.0 alpha 7]

Figure 16.7: IE Driver as of Selenium 2.0 alpha 7

This isn't the sort of design that I would have chosen voluntarily, but it has the advantage of working and surviving the horrors that our users may choose to inflict upon it.

One drawback to this design is that it can be hard to determine whether the IE instance has locked itself solid. This may happen if a modal dialog opens while we're interacting with the DOM, or it may happen if there's a catastrophic failure on the far side of the thread boundary. We therefore have a timeout associated with every thread message we post, and this is set to what we thought was a relatively generous 2 minutes. From user feedback on the mailing lists, this assumption, while generally true, isn't always correct, and later versions of the IE driver may well make the timeout configurable.

Another drawback is that debugging the internals can be deeply problematic, requiring a combination of speed (after all, you've got two minutes to trace the code through as far as possible), the judicious use of break points and an understanding of the expected code path that will be followed across the thread boundary. Needless to say, in an Open Source project with so many other interesting problems to solve, there is little appetite for this sort of grungy work. This significantly reduces the bus factor of the system, and as a project maintainer, this worries me.

To address this, more and more of the IE driver is being moved to sit upon the same Automation Atoms as the Firefox driver and Selenium Core. We do this by compiling each of the atoms we plan to use and preparing it as a C++ header file, exposing each function as a constant. At runtime, we prepare the Javascript to execute from these constants. This approach means that we can develop and test a reasonable percentage of code for the IE driver without needing a C compiler involved, allowing far more people to contribute to finding and resolving bugs. In the end, the goal is to leave only the interaction APIs in native code, and rely on the atoms as much as possible.

Another approach we're exploring is to rewrite the IE driver to make use of a lightweight HTTP server, allowing us to treat it as a remote WebDriver. If this occurs, we can remove a lot of the complexity introduced by the thread boundary, reducing the total amount of code required and making the flow of control significantly easier to follow.

16.8. Selenium RC

It's not always possible to bind tightly to a particular browser. In those cases, WebDriver falls back to the original mechanism used by Selenium. This means using Selenium Core, a pure Javascript framework, which introduces a number of drawbacks as it executes firmly in the context of the Javascript sandbox. From a user of WebDriver's APIs this means that the list of supported browsers falls into tiers, with some being tightly integrated with and offering exceptional control, and others being driven via Javascript and offering the same level of control as the original Selenium RC.

Conceptually, the design used is pretty simple, as you can see in Figure 16.8.

[Outline of Selenium RC's Architecture]

Figure 16.8: Outline of Selenium RC's Architecture

As you can see, there are three moving pieces here: the client code, the intermediate server and the Javascript code of Selenium Core running in the browser. The client side is just an HTTP client that serializes commands to the server-side piece. Unlike the remote WebDriver, there is just a single end-point, and the HTTP verb used is largely irrelevant. This is partly because the Selenium RC protocol is derived from the table-based API offered by Selenium Core, and this means that the entire API can be described using three URL query parameters.

When the client starts a new session, the Selenium server looks up the requested "browser string" to identify a matching browser launcher. The launcher is responsible for configuring and starting an instance of the requested browser. In the case of Firefox, this is as simple as expanding a pre-built profile with a handful of extensions pre-installed (one for handling a "quit" command, and another for modeling "document.readyState", which wasn't present on older Firefox releases that we still support). The key piece of configuration that's done is that the server configures itself as a proxy for the browser, meaning that at least some requests (those for "/selenium-server") are routed through it. Selenium RC can operate in one of three modes: controlling a frame in a single window ("singlewindow" mode), in a separate window controlling the AUT in a second window ("multiwindow" mode) or by injecting itself into the page via a proxy ("proxyinjection" mode). Depending on the mode of operation, all requests may be proxied.

Once the browser is configured, it is started, with an initial URL pointing to a page hosted on the Selenium server: RemoteRunner.html. This page is responsible for bootstrapping the process by loading all the required Javascript files for Selenium Core. Once complete, the "runSeleniumTest" function is called. This uses reflection of the Selenium object to initialize the list of available commands before kicking off the main command processing loop.

The Javascript executing in the browser opens an XMLHttpRequest to a URL on the waiting server (/selenium-server/driver), relying on the fact that the server is proxying all requests to ensure that the request actually goes somewhere valid. Rather than making a request, the first thing this does is send the response from the previously executed command, or "OK" in the case where the browser is just starting up. The server then keeps the request open until a new command is received from the user's test via the client, which is then sent as the response to the waiting Javascript. This mechanism was originally dubbed "Response/Request", but would now more likely be called "Comet with AJAX long polling".
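The flow of that loop can be simulated in a few lines. This toy sketch stands in for the real Javascript/XMLHttpRequest plumbing: two blocking queues play the role of the long-poll requests, and the browser side always posts the previous result before blocking for the next command. All names here are illustrative, not taken from the Selenium source.

```python
import queue
import threading

commands = queue.Queue()   # server -> browser (the held-open response)
results = queue.Queue()    # browser -> server (the next request's body)

def browser_loop():
    response = "OK"                      # on start-up there is no prior result
    while True:
        results.put(response)            # first, report the previous result...
        command = commands.get()         # ...then block until a command arrives
        if command == "quit":
            break
        response = f"executed {command}"

t = threading.Thread(target=browser_loop)
t.start()

assert results.get() == "OK"             # browser announces it is ready
commands.put("open /index.html")         # a test sends a command...
assert results.get() == "executed open /index.html"  # ...and collects the result
commands.put("quit")
t.join()
```

The key property the sketch preserves is that the browser is never polling or spinning: it blocks inside `commands.get()`, exactly as the real Javascript sits idle inside a held-open XMLHttpRequest.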

Why does RC work this way? The server needs to be configured as a proxy so that it can intercept any requests made to it without causing the calling Javascript to fall foul of the "Single Host Origin" policy, which states that only resources from the same server the script was served from can be requested via Javascript. This is in place as a security measure, but from the point of view of a browser automation framework developer it's pretty frustrating and requires a hack such as this.

The reason for making an XMLHttpRequest call to the server is two-fold. Firstly, and most importantly, until WebSockets, a part of HTML5, become available in the majority of browsers, there is no way to start up a server process reliably within a browser. That means that the server had to live elsewhere. Secondly, an XMLHttpRequest calls its response callback asynchronously, which means that while we're waiting for the next command the normal execution of the browser is unaffected. The other two ways to wait for the next command would have been to poll the server on a regular basis to see if there was another command to execute, which would have introduced latency to the user's tests, or to put the Javascript into a busy loop, which would have pushed CPU usage through the roof and prevented other Javascript from executing in the browser (since there is only ever one Javascript thread executing in the context of a single window).

Inside Selenium Core there are two major moving pieces. The first is the main selenium object, which acts as the host for all available commands and mirrors the API offered to users. The second is the browserbot, which the selenium object uses to abstract away the differences present in each browser and to present an idealized view of commonly used browser functionality. This means that the functions in selenium are clearer and easier to maintain, whilst the browserbot is tightly focused.
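The division of labour between the two pieces is an ordinary delegation pattern, sketched below in Python rather than the real Javascript. The class and method names are hypothetical; the point is only that selenium's command implementations stay simple because browserbot absorbs the per-browser quirks.

```python
class BrowserBot:
    """Presents an idealized view of the browser; quirks are handled here."""

    def __init__(self, browser_name):
        self.browser_name = browser_name

    def find_element(self, locator):
        # A real browserbot would branch on browser differences at this
        # point (e.g. divergent DOM APIs); this sketch just records the fact.
        return f"<element {locator} via {self.browser_name}>"

class Selenium:
    """Mirrors the user-facing API and hosts the available commands."""

    def __init__(self, browserbot):
        self.browserbot = browserbot

    def click(self, locator):
        # The command itself is a thin wrapper: locate, then act.
        element = self.browserbot.find_element(locator)
        return f"clicked {element}"

selenium = Selenium(BrowserBot("firefox"))
print(selenium.click("id=submit"))
```

Swapping in a browserbot for a different browser leaves every command in the selenium layer untouched, which is what keeps that layer easy to maintain.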

Increasingly, Core is being converted to make use of the Automation Atoms. Both selenium and browserbot will probably need to remain, as there is an extensive amount of code that relies on the APIs they expose, but it is expected that they will ultimately be shell classes, delegating to the atoms as quickly as possible.

16.9. Looking Back

Building a browser automation framework is a lot like painting a room; at first glance, it looks like something that should be pretty easy to do. All it takes is a few coats of paint, and the job's done. The problem is, the closer you get, the more tasks and details emerge, and the longer the task becomes. With a room, it's things like working around light fittings, radiators and the skirting boards that start to consume time. For a browser automation framework, it's the quirks and differing capabilities of browsers that make the situation more complex. The extreme case of this was expressed by Daniel Wagner-Hall as he sat next to me working on the Chrome driver; he banged his hands on the desk and in frustration muttered, "It's all edge cases!" It would be nice to be able to go back and tell myself that, and that the project is going to take a lot longer than I expected.

I also can't help but wonder where the project would be if we'd identified and acted upon the need for a layer like the automation atoms sooner than we did. It would certainly have made some of the challenges the project faced, internal and external, technical and social, easier to deal with. Core and RC were implemented in a focused set of languages—essentially just Javascript and Java. Jason Huggins used to refer to this as providing Selenium with a level of "hackability", which made it easy for people to get involved with the project. It's only with the atoms that this level of hackability has become widely available in WebDriver. Balanced against this, the reason the atoms can be so widely applied is the Closure compiler, which we adopted almost as soon as it was released as Open Source.

It's also interesting to reflect on the things that we got right. The decision to write the framework from the viewpoint of the user is something that I still feel is correct. Initially, this paid off as early adopters highlighted areas for improvement, allowing the utility of the tool to increase rapidly. Later, as WebDriver gets asked to do more and harder things and the number of developers using it increases, it means that new APIs are added with care and attention, keeping the focus of the project tight. Given the scope of what we're trying to do, this focus is vital.

Binding tightly to the browser is something that is both right and wrong. It's right, as it has allowed us to emulate the user with extreme fidelity, and to control the browser extremely well. It's wrong because this approach is extremely technically demanding, particularly when finding the necessary hook points into the browser. The constant evolution of the IE driver is a demonstration of this in action, and, although it's not covered here, the same is true of the Chrome driver, which has a long and storied history. At some point, we'll need to find a way to deal with this complexity.

16.10. Looking to the Future

There will always be browsers that WebDriver can't integrate tightly with, so there will always be a need for Selenium Core. Migrating Core from its current traditional design to a more modular design based on the same Closure Library that the atoms use is underway. We also expect to embed the atoms more deeply within the existing WebDriver implementations.

One of the initial goals of WebDriver was to act as a building block for other APIs and tools. Of course, Selenium doesn't live in a vacuum: there are plenty of other Open Source browser automation tools. One of these is Watir (Web Application Testing In Ruby), and work has begun, as a joint effort by the Selenium and Watir developers, to place the Watir API over the WebDriver core. We're keen to work with other projects too, as successfully driving all the browsers out there is hard work. It would be nice to have a solid kernel that others could build on. Our hope is that the kernel is WebDriver.

A glimpse of this future is offered by Opera Software, who have independently implemented the WebDriver API, using the WebDriver test suites to verify the behavior of their code, and who will be releasing their own OperaDriver. Members of the Selenium team are also working with members of the Chromium team to add better hooks and support for WebDriver to that browser, and by extension to Chrome too. We have a friendly relationship with Mozilla, who have contributed code for the FirefoxDriver, and with the developers of the popular HtmlUnit Java browser emulator.

One view of the future sees this trend continue, with automation hooks being exposed in a uniform way across many different browsers. The advantages for people keen to write tests for web applications are clear, and the advantages for browser manufacturers are also obvious. For example, given the relative expense of manual testing, many large projects rely heavily on automated testing. If it's not possible, or even if it's "only" extremely taxing, to test with a particular browser, then tests just aren't run for it, with knock-on effects for how well complex applications work with that browser. Whether those automation hooks are going to be based on WebDriver is an open question, but we can hope!

The next few years are going to be very interesting. As we're an open source project, you'd be welcome to join us for the journey at http://selenium.googlecode.com/.

Footnotes

  1. http://fit.c2.com
  2. This is very similar to FIT, and James Shore, one of that project's coordinators, helps explain some of the drawbacks at http://jamesshore.com/Blog/The-Problems-With-Acceptance-Testing.html.
  3. For example, the remote server returns a base64-encoded screen grab with every exception as a debugging aid, but the Firefox driver doesn't.
  4. I.e., always returns the same result.