Crawl your website including login form with PhantomJS
With PhantomJS, we start a headless WebKit browser and pilot it with our own scripts. In other words, we write a script in JavaScript or CoffeeScript that controls a web browser and manipulates the webpage loaded inside it. In the past, I’ve used a similar solution called [Selenium]. PhantomJS is much faster: it doesn’t start a graphical browser (that’s what headless stands for), and you can inject your own JavaScript into the page (I can’t remember being able to do that with Selenium).
PhantomJS is commonly used for testing websites and HTML-based applications whose content is dynamically updated with JavaScript events and Ajax requests. It is also popular for generating screenshots of webpages and building website previews, a usage illustrated below.
The official website presents PhantomJS as:
* Headless Website Testing: Run functional tests with frameworks such as Jasmine, QUnit, Mocha, Capybara, WebDriver, and many others.
* Screen Capture: Programmatically capture web contents, including SVG and Canvas. Create web site screenshots with thumbnail preview.
* Page Automation: Access and manipulate webpages with the standard DOM API, or with usual libraries like jQuery.
* Network Monitoring: Monitor page loading and export as standard HAR files. Automate performance analysis using YSlow and Jenkins.
In my case, I’ve used it to simulate user behavior under high load to create user logs and populate a system like Google Analytics. More specifically, I will introduce a project architecture composed of 3 components:
1. User-written PhantomJS scripts that I later call “actions”. An action simulates user interactions and can be chained with other actions. For example, a first action could log a user in and a second one could update their personal information.
2. A generic PhantomJS script that sequentially runs multiple actions passed as arguments.
3. A Node.js script to pilot PhantomJS and simulate concurrent user loads.
To make things more interesting, the user-written scripts will show you how to simulate a user login, or any form submission. Please don’t use it as a basis to log into your (boy|girl)friend’s Gmail account.
1. The user-written scripts
I will write 2 scripts for illustration purposes. The first logs the user in on a fake website and the second visits two user information pages. Those scripts are written in CoffeeScript and interact with the PhantomJS API, which borrows a lot from the CommonJS specification. Keep in mind that even if it looks a lot like Node.js (it’s JavaScript after all), it runs in a completely different environment.
The login action
The information action
A few things in this code are interesting and worth commenting on.
On line 9, the call to `page.render` generates a screenshot of the webpage at the time of the call. Generating website screen captures is a common use of PhantomJS.
The code runs inside the PhantomJS execution engine, with the exception of the function passed to `page.evaluate`, which runs inside the loaded webpage. This simplifies the writing of your PhantomJS script but is a little awkward, in the sense that you won’t be able to share context between those two sections. It is as if the function passed to `page.evaluate` were converted to a string with `toString` and run inside a separate engine.
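To make the lack of shared context more concrete, here is a small analogy that runs in Node.js rather than PhantomJS. It is only a sketch of the idea, not how PhantomJS is implemented: the helper `evaluateInSeparateContext` and the variable names are hypothetical, and Node's `vm` module stands in for the webpage's engine.

```javascript
var vm = require('vm');

// Hypothetical stand-in for page.evaluate: turn the function into source
// text, then run that text in a fresh context that knows nothing about
// the variables of the calling script.
function evaluateInSeparateContext(fn, args) {
  var source = '(' + fn.toString() + ').apply(null, ' +
    JSON.stringify(args || []) + ')';
  return vm.runInNewContext(source, {});
}

var secret = 'only visible in the outer script';

// Values passed explicitly as arguments do reach the other side.
var sum = evaluateInSeparateContext(function (a, b) {
  return a + b;
}, [2, 3]);

// Variables captured from the outer scope do not: `secret` is unknown there.
var seen = evaluateInSeparateContext(function () {
  return typeof secret;
}, []);

console.log(sum, seen);
```

The practical consequence is the same as in PhantomJS: anything the in-page function needs has to be passed as an argument, and anything it produces has to come back as a return value.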
Finally, the `page` object represents all the pages we will load. It is more appropriate to think of it as a tab in your browser in which multiple pages are loaded. The function `page.onLoadFinished` is called every time a page is loaded.
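The points above can be sketched together in one hypothetical login action. This is not the original listing: the form field names, the `options` object and the screenshot file name are all assumptions, and it is written in plain JavaScript rather than CoffeeScript. The `page` object is passed in as a parameter; in PhantomJS it would come from `require('webpage').create()`.

```javascript
// Hypothetical login action. `page` is a PhantomJS webpage object,
// `options` carries the assumed URL, credentials and screenshot path,
// and `done` is called once the action is over.
function loginAction(page, options, done) {
  var loads = 0;

  // Called once for every page loaded in this "tab".
  page.onLoadFinished = function (status) {
    if (status !== 'success') {
      return done(new Error('Page failed to load'));
    }
    loads += 1;
    if (loads === 1) {
      // First load: the login form. The function below runs inside the
      // webpage, so it only sees the arguments passed after it.
      page.evaluate(function (username, password) {
        document.querySelector('input[name=username]').value = username;
        document.querySelector('input[name=password]').value = password;
        document.querySelector('form').submit();
      }, options.username, options.password);
    } else {
      // Second load: the page reached after the form submission.
      page.render(options.screenshot);
      done(null);
    }
  };

  page.open(options.url);
}
```

Counting loads inside `page.onLoadFinished` is only one way to tell the form page from the page that follows it; checking the current URL would work as well.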
2. The action runner
This script is also run inside PhantomJS. Its purpose is to run multiple actions sequentially (one after the other) in a generic manner.
The action runner takes a list of actions provided as arguments, loads the JavaScript scripts named after the actions and runs those scripts sequentially.
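The core of such a runner can be sketched in a few lines. This is a minimal sketch under assumptions, not the original script: `runActions` and `loadAction` are hypothetical names, and each action is assumed to be a function that receives a `done` callback. Inside PhantomJS, the names would typically come from `require('system').args` and `loadAction` would wrap `require`.

```javascript
// Run the named actions one after the other.
// `loadAction(name)` must return a function of the form action(done).
function runActions(names, loadAction, finished) {
  var index = 0;

  function next(err) {
    if (err) {
      return finished(err); // stop at the first failing action
    }
    if (index === names.length) {
      return finished(null); // every action has run
    }
    var action = loadAction(names[index]);
    index += 1;
    action(next); // the action calls next() when it is over
  }

  next();
}
```

Because each action only starts when the previous one calls back, a login action can safely be followed by an action that depends on the session it created.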
3. The pilot
The pilot is a Node.js application responsible for managing and monitoring PhantomJS. It is able to simulate concurrent load by running multiple instances of PhantomJS in parallel. To achieve concurrency, I used the Node.js `each` module. The `each.prototype.parallel` function indicates how many instances of PhantomJS will run at the same time. The `each.prototype.repeat` function indicates how many times each action will run.
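The two settings, how many runs in total and how many at the same time, can be illustrated without the `each` module. The sketch below is a plain JavaScript stand-in, not the `each` API: `runPool`, `total`, `parallel` and `job` are hypothetical names.

```javascript
// Run `job` `total` times, with at most `parallel` jobs in flight.
// `job(n, done)` receives the run number and must call done() when over.
function runPool(total, parallel, job, finished) {
  var started = 0;
  var completed = 0;
  var running = 0;

  if (total === 0) {
    return finished();
  }

  function launch() {
    // Start jobs until the concurrency limit or the total is reached.
    while (running < parallel && started < total) {
      started += 1;
      running += 1;
      job(started, onDone);
    }
  }

  function onDone() {
    running -= 1;
    completed += 1;
    if (completed === total) {
      return finished();
    }
    launch(); // a slot is free again
  }

  launch();
}
```

In a pilot, `job` would typically start one PhantomJS process, for example with Node's `child_process.spawn`, and call `done` from the child's `exit` event.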
Put it all together
In the end, you might create a Node.js project (simply a directory with a `package.json` file inside), place all the files described above inside the new directory, declare your “phantomjs” and “each” module dependencies (inside the `package.json` file), install them with `npm install` and run your “run.js” script with the command `node run.js`.
Note about PhantomJS cookies
This is a personal section covering my experience with the cookies support. PhantomJS accepts a “cookies-file” argument with a file path as a value. Basically, a PhantomJS command would look like `phantomjs --cookies-file=#{cookies} {more_arguments} {script_path} {script_arguments}`.
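When the command is launched from Node.js, the argument list could be assembled as follows. This is a sketch only: `buildPhantomArgs` and its parameter names are hypothetical, and the argument order simply mirrors the command shown above.

```javascript
// Build the argument list for a phantomjs invocation that uses a cookies file.
// Order: --cookies-file option, other options, script path, script arguments.
function buildPhantomArgs(cookiesFile, scriptPath, scriptArgs, moreOptions) {
  var args = ['--cookies-file=' + cookiesFile];
  return args.concat(moreOptions || [], [scriptPath], scriptArgs || []);
}

var example = buildPhantomArgs('/tmp/cookies.txt', 'runner.js', ['login', 'info']);
console.log(example.join(' '));
```

The resulting array can be handed as-is to `child_process.spawn('phantomjs', example)`.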
After a few trials, I wasn’t able to use the cookies file effectively. Running a second script did not honor the persisted session. However, if I don’t exit PhantomJS with `phantom.exit()` and force quit the application instead, then the cookie file works as expected.
This is one of the two reasons why I came up with an architecture in which I can chain multiple actions. The other reason is speed, since the headless WebKit instance is started fewer times. I don’t blame PhantomJS; it could be something I overlooked in the documentation.