Crawl your website including login form with PhantomJS



Sep 27th, 2013

With PhantomJS, we start a headless WebKit and pilot it with our own scripts. In other words, we write a script in JavaScript or CoffeeScript that controls a web browser and manipulates the webpage loaded inside it. In the past, I’ve used a similar solution called Selenium. PhantomJS is much faster: it doesn’t start a graphical browser (that’s what headless stands for) and you can inject your own JavaScript into the page (I can’t remember being able to do that with Selenium).

PhantomJS is commonly used for testing websites and HTML-based applications whose content is dynamically updated by JavaScript events and Ajax requests. It is also popular for generating screenshots of webpages and building website previews, a usage illustrated below.

The official website presents PhantomJS as:

* Headless Website Testing: Run functional tests with frameworks such as Jasmine, QUnit, Mocha, Capybara, WebDriver, and many others.
* Screen Capture: Programmatically capture web contents, including SVG and Canvas. Create web site screenshots with thumbnail preview.
* Page Automation: Access and manipulate webpages with the standard DOM API, or with usual libraries like jQuery.
* Network Monitoring: Monitor page loading and export as standard HAR files. Automate performance analysis using YSlow and Jenkins.

In my case, I’ve used it to simulate user behavior under high load, in order to create user logs and populate a system like Google Analytics. More specifically, I will introduce a project architecture composed of 3 components:

1. User-written PhantomJS scripts that I later call “actions”. An action simulates user interactions and can be chained with other actions. For example, a first action could log a user in and a second one could update the user’s personal information.
2. A generic PhantomJS script that sequentially runs multiple actions passed as arguments.
3. A Node.js script to pilot PhantomJS and simulate concurrent user loads.

To make things more interesting, the user-written scripts will show you how to simulate a user login, or any other form submission. Please don’t use them as a basis to log in to your (boy|girl)friend’s Gmail account.

1. The user-written scripts

I will write 2 scripts for illustration purposes. The first logs the user in on a fake website and the second visits two user information pages. Those scripts are written in CoffeeScript and interact with the PhantomJS API, which borrows a lot from the CommonJS specification. Keep in mind that even though it looks a lot like Node.js, and it is JavaScript after all, it runs in a completely different environment.

The login action


    webpage = require 'webpage'

    module.exports = (callback) ->
      page = webpage.create()
      url = 'https://mywebsite.com/login'
      # Number of pages loaded so far
      count = 0
      page.onLoadFinished = ->
        console.log '** login', count
        page.render "login_#{count}.png"
        if count is 0
          # First load: fill in and submit the login form
          page.evaluate ->
            jQuery('#login').val('IDTMAASP15')
            jQuery('#pass').val('azerty1')
            jQuery('[name="loginForm"] [name="submit"]').click()
        else if count is 1
          # Second load: the form was submitted, hand over to the next action
          callback()
        count++
      page.open url, (status) ->
        # Report the failure through the callback
        return callback new Error "Invalid webpage" if status isnt 'success'

The information action


    webpage = require 'webpage'

    module.exports = (callback) ->
      page = webpage.create()
      # Number of pages loaded so far
      count = 0
      page.onLoadFinished = ->
        console.log 'info', count
        page.render "donnees_perso_#{count}.png"
        if count is 0
          # Follow the first link pointing to the information section
          page.evaluate ->
            window.location = jQuery('.boxSection [href*=info]')[0].href
        else if count is 1
          page.evaluate ->
            window.location = jQuery('.services [href*=info_perso]')[0].href
        else if count is 2
          # Return to the previous page before visiting the next link
          page.goBack()
        else if count is 3
          page.evaluate ->
            window.location = jQuery('.services [href*=info_login]')[0].href
        else if count is 4
          # Last page loaded: hand over to the next action
          callback()
        count++
      page.open 'https://domain/path/to/login', (status) ->
        return callback new Error "Invalid webpage" if status isnt 'success'

A few things in this code are worth commenting on.

The call to page.render generates a screenshot of the webpage at the time of the call. Generating website screen captures is a common use of PhantomJS.

The code runs inside the PhantomJS execution engine, with the exception of the function passed to page.evaluate, which runs inside the loaded webpage. This simplifies the writing of your PhantomJS script but is a little awkward in the sense that you won’t be able to share context between those two sections. It is as if the function passed to page.evaluate were converted to a string with toString() and run inside a separate engine.
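The isolation described above can be imitated in plain Node.js. The sketch below is not PhantomJS code: evaluateLikeSandbox is a hypothetical helper that rebuilds a function from its source text, so the function loses access to the surrounding variables, much like the function given to page.evaluate. Copying values through JSON is an assumption made for the illustration. In PhantomJS itself, page.evaluate accepts extra JSON-serializable arguments after the function for the same purpose.

```javascript
// Hypothetical helper, not part of PhantomJS: it imitates the page boundary
// by rebuilding the function from its source text.
function evaluateLikeSandbox(fn) {
  // Extra arguments are copied through JSON, so only plain data survives.
  var args = JSON.parse(JSON.stringify(Array.prototype.slice.call(arguments, 1)));
  // new Function compiles in the global scope: outer local variables are gone.
  var rebuilt = new Function('return (' + fn.toString() + ').apply(null, arguments);');
  var result = rebuilt.apply(null, args);
  return result === undefined ? undefined : JSON.parse(JSON.stringify(result));
}

function sandboxDemo() {
  var login = 'IDTMAASP15'; // local variable, invisible to the rebuilt function
  var outcome = {};
  try {
    // Relies on the closure: fails once the function is rebuilt elsewhere.
    evaluateLikeSandbox(function () { return login; });
  } catch (err) {
    outcome.closureError = err.name;
  }
  // Passing the value explicitly works.
  outcome.passedAsArgument = evaluateLikeSandbox(function (value) {
    return value.toLowerCase();
  }, login);
  return outcome;
}

console.log(sandboxDemo());
```

The first call fails with a ReferenceError because the rebuilt function no longer sees the local variable, while the second call succeeds because the value is handed over as an argument.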

Finally, the page object represents all the pages we will load. It is more appropriate to think of it as a browser tab in which multiple pages are loaded one after the other. The function page.onLoadFinished is called every time a page has finished loading.
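The counter and the if/else chain used in both actions encode a sequence of steps, one per page load. As a minimal sketch in plain JavaScript, the hypothetical helper createStepper below turns a list of step functions into a single handler that could be assigned to page.onLoadFinished. The fakePage object is only a stand-in for a PhantomJS page so that the example runs under Node.js.

```javascript
// Hypothetical helper: returns a handler that runs one step per call,
// then calls done() on the load that follows the last step.
function createStepper(steps, done) {
  var count = 0;
  return function onLoadFinished(status) {
    var index = count++;
    if (index < steps.length) {
      steps[index](status);
    } else if (index === steps.length) {
      done();
    }
    // Any later load is ignored, as in the counter-based actions.
  };
}

function stepperDemo() {
  var log = [];
  // Stand-in for a PhantomJS page: "loading" simply triggers the handler.
  var fakePage = {
    onLoadFinished: null,
    load: function () { this.onLoadFinished('success'); }
  };
  fakePage.onLoadFinished = createStepper([
    function () { log.push('fill and submit the login form'); },
    function () { log.push('follow the info link'); }
  ], function () { log.push('done'); });
  fakePage.load(); // first page load: first step
  fakePage.load(); // second page load: second step
  fakePage.load(); // third page load: no step left, done() is called
  fakePage.load(); // extra load: ignored
  return log;
}

console.log(stepperDemo());
```

Compared with the counter, the list makes the order of the steps explicit and removes the need to renumber the branches when a step is inserted.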

2. The action runner

This script is also run inside PhantomJS. Its purpose is to run multiple actions sequentially (one after the other) in a generic manner.

The action runner takes a list of actions provided as arguments, loads the JavaScript scripts named after the actions and runs those scripts sequentially.


    # Grab arguments
    args = require('system').args
    # Convert to an array
    args = Array.prototype.slice.call(args, 0)
    # Remove the script filename
    args.shift()
    # Callback when all actions have been run
    done = (err) ->
      phantom.exit if err then 1 else 0
    # Run the next action
    next = (err) ->
      n = args.shift()
      return done err if err or not n
      n = require "./#{n}"
      n next
    next()
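For readers who prefer plain JavaScript, the same chaining logic can be sketched outside PhantomJS. In the version below, runSequentially and the registry of named actions are hypothetical stand-ins for the require call and for phantom.exit used above, so that the example runs under Node.js.

```javascript
// Hypothetical helper: runs named actions one after the other and stops
// at the first error, mirroring the next/done pair of the runner.
function runSequentially(names, registry, done) {
  var queue = names.slice(); // copy so the caller's array is left untouched
  function next(err) {
    var name = queue.shift();
    if (err || !name) { return done(err || null); }
    var action = registry[name];
    if (typeof action !== 'function') {
      return done(new Error('Unknown action ' + name));
    }
    action(next);
  }
  next();
}

function runnerDemo() {
  var trace = [];
  // Stub actions: each one records its name, the second one fails.
  var registry = {
    login: function (callback) { trace.push('login'); callback(); },
    information: function (callback) {
      trace.push('information');
      callback(new Error('Invalid webpage'));
    },
    never: function (callback) { trace.push('never'); callback(); }
  };
  runSequentially(['login', 'information', 'never'], registry, function (err) {
    // Same convention as the runner: exit code 1 on error, 0 otherwise.
    trace.push('exit ' + (err ? 1 : 0));
  });
  return trace;
}

console.log(runnerDemo());
```

Because the second stub reports an error, the third action is never started and the final entry records the exit code 1.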

3. The pilot

The pilot is a Node.js application responsible for managing and monitoring PhantomJS. It is able to simulate concurrent load by running multiple instances of PhantomJS in parallel. To achieve concurrency, I used the Node.js each module. The each.prototype.parallel method indicates how many instances of PhantomJS will run at the same time. The each.prototype.repeat method indicates how many times each action will run.


    fs = require 'fs'
    phantomjs = require 'phantomjs'
    each = require 'each'
    child = require 'child_process'

    cookies = "#{__dirname}/cookies.txt"
    # process.stdout.write replaces util.print, which recent Node.js versions no longer provide
    print = (text) -> process.stdout.write text

    # Spawn one PhantomJS process running the given list of actions
    run = (actions, callback) ->
      args = [
        "--ignore-ssl-errors=yes"
        "--cookies-file=#{cookies}"
        "#{__dirname}/run.js"
      ]
      for action in actions then args.push action
      print "\x1b[36m..#{actions.join(' ')} start..\x1b[39m\n"
      web = child.spawn phantomjs.path, args
      # Relay PhantomJS output in color: cyan for stdout, magenta for stderr
      web.stdout.on 'data', (data) ->
        print "\x1b[36m#{data.toString()}\x1b[39m"
      web.stderr.on 'data', (data) ->
        print "\x1b[35m#{data.toString()}\x1b[39m"
      web.on 'close', (code) ->
        print "\x1b[36m..#{actions.join(' ')} done..\x1b[39m\n"
        if callback
          err = if code isnt 0 then new Error "Invalid exit code #{code}" else null
          callback err

    each([
      ['login','information']
      ['login','another_action']
    ]).parallel(2).repeat(20).on 'item', (scripts, next) ->
      # Start each run without the cookies of the previous one
      fs.unlink cookies, (err) ->
        run scripts, next
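The each module hides the scheduling. As a rough sketch of what parallel(2) and repeat(20) amount to, the hypothetical runWithLimit helper below expands the repetitions into a queue and keeps at most a given number of jobs in flight. It is written from scratch for illustration and does not reproduce the internals or the exact ordering of the each module. The worker is a stub that records its input instead of spawning PhantomJS.

```javascript
// Hypothetical helper: runs worker(item, callback) for every item, `repeat`
// times over, with at most `limit` jobs running at the same time.
function runWithLimit(items, limit, repeat, worker, done) {
  var queue = [];
  for (var r = 0; r < repeat; r++) {
    queue = queue.concat(items);
  }
  var active = 0;
  var failed = null;
  var finished = false;
  function launch() {
    // Fill the free slots unless an error was already reported.
    while (!failed && active < limit && queue.length > 0) {
      active++;
      worker(queue.shift(), function (err) {
        active--;
        if (err && !failed) { failed = err; }
        launch();
      });
    }
    // Finish once nothing is running and nothing more will be started.
    if (!finished && active === 0 && (failed || queue.length === 0)) {
      finished = true;
      done(failed);
    }
  }
  launch();
}

function limitDemo() {
  var pending = []; // callbacks of the jobs currently "running"
  var runs = 0;
  var maxInFlight = 0;
  var result = null;
  function stubWorker(scripts, callback) {
    runs++;
    pending.push(callback);
    maxInFlight = Math.max(maxInFlight, pending.length);
  }
  var lists = [['login', 'information'], ['login', 'another_action']];
  runWithLimit(lists, 2, 20, stubWorker, function (err) {
    result = { error: err, runs: runs, maxInFlight: maxInFlight };
  });
  // Complete the running jobs one by one; each completion frees a slot.
  while (pending.length > 0) {
    pending.shift()();
  }
  return result;
}

console.log(limitDemo());
```

With two action lists repeated 20 times, the stub worker is called 40 times and never has more than 2 jobs running at once.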

Put it all together

In the end, you might create a Node.js project (simply a directory with a package.json file inside), place all the files described above inside the new directory, declare your “phantomjs” and “each” module dependencies (inside the package.json file), install them with npm install and run your “run.js” script with the command node run.js.

Note about PhantomJS cookies

This is a personal section covering my experience with the cookies support. PhantomJS accepts a “cookies-file” argument with a file path as a value. Basically, a PhantomJS command looks like phantomjs --cookies-file=#{cookies} {more_arguments} {script_path} {script_arguments}.

After a few trials, I wasn’t able to use the cookies file effectively. Running a second script does not honor the persisted session. However, if I don’t exit PhantomJS with phantom.exit() and force quit the application instead, then the cookies file works as expected.

This is one of the two reasons why I came up with an architecture in which I can chain multiple actions. The other reason is speed, since the headless WebKit instance is started fewer times. I don’t blame PhantomJS; it could be something I overlooked in the documentation.


0 0