PhantomJS: Rendering Web Pages on the Server and Recording Each Request

Source: Internet. Published by: 协同过滤算法实现. Editor: 程序博客网. Date: 2024/05/22 10:40

from: http://phantomjs.org

from: http://bluehua.org/tag/phantomjs

Applies to: PhantomJS 1.6.

These instructions assume that PhantomJS is built and that its executable is placed somewhere in the PATH.

All of the examples given here are available in the code repository under the examples/ sub-directory. Each example comes in two versions, one in JavaScript and one in CoffeeScript.

Consult also the API reference.

Hello, world!

Create a new text file that contains the following two lines:

console.log('Hello, world!');
phantom.exit();

Save it as hello.js and then run it:

phantomjs hello.js

The output is:

Hello, world!

In the first line, console.log prints the passed string to the terminal. In the second line, phantom.exit terminates the execution.

It is very important to call phantom.exit at some point in the script; otherwise PhantomJS will never terminate.

Delay

To have an asynchronous (non-blocking) delay, use the usual window.setTimeout or window.setInterval function. This fibo.js example prints the Fibonacci sequence with each new number showing up every 300 ms.

var fibs = [0, 1];
var ticker = window.setInterval(function () {
    console.log(fibs[fibs.length - 1]);
    fibs.push(fibs[fibs.length - 1] + fibs[fibs.length - 2]);
    if (fibs.length > 10) {
        window.clearInterval(ticker);
        phantom.exit();
    }
}, 300);

Because setTimeout/setInterval is non-blocking, do not call phantom.exit right after that. Doing so will cause the script to terminate immediately.
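The same rule applies to a one-shot delay. The sketch below is only an illustration: runLater and quit are hypothetical names that do not appear in the PhantomJS examples, and quit falls back to a no-op so the snippet can also be tried outside PhantomJS. The point is that the exit call sits inside the timer callback, not after the call that schedules it.

```javascript
// Use PhantomJS's exit when available; otherwise a no-op
// (an assumption made so the sketch also runs outside PhantomJS).
var quit = (typeof phantom !== 'undefined')
    ? function () { phantom.exit(); }
    : function () {};

// Hypothetical helper: run `task` after `ms` milliseconds, then call `done`.
function runLater(task, ms, done) {
    setTimeout(function () {
        task();
        done(); // exit only after the delayed work has finished
    }, ms);
}

runLater(function () {
    console.log('Half a second has passed.');
}, 500, quit);
// Do NOT call phantom.exit() here: the timer has not fired yet.
```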

Script Arguments

Using the args array from the System module, the script can obtain the list of command-line arguments.

Consider the following arguments.js example:

var system = require('system');
if (system.args.length === 1) {
    console.log('Try to pass some args when invoking this script!');
} else {
    system.args.forEach(function (arg, i) {
        console.log(i + ': ' + arg);
    });
}
phantom.exit();

If it is invoked using the following command:

phantomjs arguments.js The quick brown fox

then the output will be:

0: arguments.js
1: The
2: quick
3: brown
4: fox
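As the output shows, the first entry is the script name and the remaining entries are the actual parameters. A minimal sketch of separating the two follows; splitArgs is a hypothetical helper, and the sample array is only used when the snippet runs outside PhantomJS.

```javascript
// Hypothetical helper: split an args array (script name first) into its parts.
function splitArgs(args) {
    return {
        script: args[0],
        params: Array.prototype.slice.call(args, 1)
    };
}

// Inside PhantomJS, read the real arguments; elsewhere, fall back to sample data.
var args = (typeof phantom !== 'undefined')
    ? require('system').args
    : ['arguments.js', 'The', 'quick', 'brown', 'fox'];

var parsed = splitArgs(args);
console.log('script: ' + parsed.script);
console.log('params: ' + parsed.params.join(' '));

if (typeof phantom !== 'undefined') {
    phantom.exit();
}
```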

Loading

The script invoked by PhantomJS initially runs on an empty web page. That is not very useful by itself, so PhantomJS can load an arbitrary URL into a web page. To encapsulate a web page, instantiate a WebPage object.

A specific URL can be loaded using its open() function. A typical usage is:

var page = require('webpage').create();
page.open(url, function (status) {
  // do something
});

The callback passed to open() is executed when page loading completes, with status equal to "success" if there was no error and "fail" if an error occurred.

The above construct is a convenient version of the following:

var page = require('webpage').create();
page.onLoadFinished = function (status) {
  // do something
};
page.open(url);

Besides onLoadFinished, there is also onLoadStarted, which is invoked when page loading starts for the first time:

var page = require('webpage').create();
page.onLoadStarted = function () {
    console.log('Start loading...');
};
page.onLoadFinished = function (status) {
    console.log('Loading finished.');
};
page.open(url);

The following loadspeed.js script loads a specified URL (do not forget the http protocol) and measures the time it takes to load it.

var page = require('webpage').create(),
    t, address;
if (phantom.args.length === 0) {
    console.log('Usage: loadspeed.js <some URL>');
    phantom.exit();
} else {
    t = Date.now();
    address = phantom.args[0];
    // Report which address is being loaded before starting
    console.log('Loading ' + address);
    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('FAIL to load the address');
        } else {
            t = Date.now() - t;
            console.log('Loading time ' + t + ' msec');
        }
        phantom.exit();
    });
}

Run the script with the command:

phantomjs loadspeed.js http://www.google.com

It outputs something like:

Loading http://www.google.com
Loading time 719 msec

Page settings

The behavior of the web page can be set via its settings object which can contain properties such as:

loadImages defines whether to load inline images or not (defaults to true)
userAgent defines the user agent string passed to the server

For the full list, refer to the complete page settings reference.

The initial values for the settings are taken from the command-line options specified when invoking the script.

As an example, here is how to change the user agent:

var page = require('webpage').create();
page.settings.userAgent = 'Dragonless Phantom';
page.open(url, function (status) {
  // do something
});
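When several settings need changing at once, they can be copied onto the settings object in one pass. The following is a sketch only: applySettings is a hypothetical helper, not part of PhantomJS, and a plain stand-in object replaces the real page when the snippet runs outside PhantomJS.

```javascript
// Hypothetical helper: copy each override onto page.settings.
function applySettings(page, overrides) {
    var key;
    for (key in overrides) {
        if (Object.prototype.hasOwnProperty.call(overrides, key)) {
            page.settings[key] = overrides[key];
        }
    }
    return page;
}

// Stand-in for a real page so the sketch also runs outside PhantomJS.
var page = (typeof phantom !== 'undefined')
    ? require('webpage').create()
    : { settings: { loadImages: true, userAgent: 'default' } };

applySettings(page, { loadImages: false, userAgent: 'Dragonless Phantom' });
console.log(page.settings.loadImages);
console.log(page.settings.userAgent);

if (typeof phantom !== 'undefined') {
    phantom.exit();
}
```

Settings should be applied before page.open() is called so that they take effect for the initial request.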

Rendering

A web page can be rasterized to an image or a PDF file using the `render()` function.

This rasterize.js is all it takes to capture a web site.

var page = require('webpage').create(),
    address, output, size;
if (phantom.args.length < 2 || phantom.args.length > 3) {
    console.log('Usage: rasterize.js URL filename');
    phantom.exit();
} else {
    address = phantom.args[0];
    output = phantom.args[1];
    page.viewportSize = { width: 600, height: 600 };
    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
        } else {
            // Give the page a moment to finish painting before capturing it
            window.setTimeout(function () {
                page.render(output);
                phantom.exit();
            }, 200);
        }
    });
}

An example to produce the rendering of the famous Tiger (from SVG):

phantomjs rasterize.js http://ariya.github.com/svg/tiger.svg tiger.png

which produces the image tiger.png.

Another example: show the polar clock (from RaphaelJS):

phantomjs rasterize.js http://raphaeljs.com/polar-clock.html clock.png

Producing PDF output is also possible, e.g. from a Wikipedia article:

phantomjs rasterize.js 'http://en.wikipedia.org/w/index.php?title=Jakarta&printable=yes' jakarta.pdf

or when creating a printer-ready cheat sheet:

phantomjs rasterize.js http://www.nihilogic.dk/labs/webgl_cheat_sheet/WebGL_Cheat_Sheet.htm webgl.pdf
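The output format in the commands above is chosen by the file extension. For PDF targets a script may also want to set a paper size before rendering. The sketch below assumes the page object exposes a paperSize property accepting format and orientation, as described in the PhantomJS API reference; paperSizeFor is a hypothetical helper, 'A4' and 'portrait' are illustrative defaults, and a stand-in object replaces the real page.

```javascript
// Hypothetical helper: pick a paperSize for PDF targets, or null for images.
function paperSizeFor(output) {
    if (/\.pdf$/i.test(output)) {
        // 'A4' and 'portrait' are illustrative defaults, not taken from the article.
        return { format: 'A4', orientation: 'portrait' };
    }
    return null;
}

// Stand-in page object so the sketch also runs outside PhantomJS.
var page = { viewportSize: { width: 600, height: 600 } };

var size = paperSizeFor('jakarta.pdf');
if (size) {
    page.paperSize = size; // in a real script, set this before page.render(output)
}
console.log(JSON.stringify(page.paperSize));
console.log(paperSizeFor('tiger.png'));
```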

Code Evaluation

To evaluate JavaScript or CoffeeScript code in the context of the web page, use the evaluate() function. The execution is sandboxed: the code cannot access any JavaScript objects or variables outside its own page context. An object can be returned from evaluate(), but it is limited to simple objects and cannot contain functions or closures.

Here is an example to show the title of a web page:

var page = require('webpage').create();
page.open(url, function (status) {
    var title = page.evaluate(function () {
        return document.title;
    });
    console.log('Page title is ' + title);
});

Any console message from a web page, including from the code inside evaluate(), is not displayed by default. To override this behavior, use the onConsoleMessage callback. The previous example can be rewritten as:

var page = require('webpage').create();
page.onConsoleMessage = function (msg) {
    console.log('Page title is ' + msg);
};
page.open(url, function (status) {
    page.evaluate(function () {
        console.log(document.title);
    });
});
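The restriction on what evaluate() can return is easier to see with a concrete value. The sketch below models the boundary between the page and the script as a JSON round trip. That is an approximation chosen for illustration, not a description of how PhantomJS serializes values internally, and crossBoundary is a hypothetical name.

```javascript
// Approximation (assumption): treat the evaluate() boundary as a JSON round trip.
function crossBoundary(value) {
    return JSON.parse(JSON.stringify(value));
}

// What code inside the page might try to return.
var inPage = {
    title: 'Example',
    links: ['a', 'b'],
    onClick: function () {}   // functions do not survive the trip
};

var received = crossBoundary(inPage);
console.log(received.title);
console.log(received.links.length);
console.log(typeof received.onClick);
```

Strings, numbers, arrays, and plain nested objects come through intact, while the function-valued property is dropped. A practical consequence is to return plain data from evaluate() and keep all behavior on the calling side.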

Canvas

Canvas can be easily constructed and utilized. The following colorwheel.js produces the color wheel.

var page = require('webpage').create();
page.viewportSize = { width: 400, height : 400 };
page.content = '<html><body><canvas id="surface"></canvas></body></html>';
page.evaluate(function() {
    var el = document.getElementById('surface'),
        context = el.getContext('2d'),
        width = window.innerWidth,
        height = window.innerHeight,
        cx = width / 2,
        cy = height / 2,
        radius = width  / 2.3,
        imageData,
        pixels,
        hue, sat, value,
        i = 0, x, y, rx, ry, d,
        f, g, p, u, v, w, rgb;
    el.width = width;
    el.height = height;
    imageData = context.createImageData(width, height);
    pixels = imageData.data;
    // Walk every pixel; i indexes the RGBA quadruple of the current pixel
    for (y = 0; y < height; y = y + 1) {
        for (x = 0; x < width; x = x + 1, i = i + 4) {
            rx = x - cx;
            ry = y - cy;
            d = rx * rx + ry * ry;
            // Only color pixels that fall inside the wheel
            if (d < radius * radius) {
                // Hue from the angle (0..6), saturation from the distance to the center
                hue = 6 * (Math.atan2(ry, rx) + Math.PI) / (2 * Math.PI);
                sat = Math.sqrt(d) / radius;
                g = Math.floor(hue);
                f = hue - g;
                u = 255 * (1 - sat);
                v = 255 * (1 - sat * f);
                w = 255 * (1 - sat * (1 - f));
                pixels[i] = [255, v, u, u, w, 255, 255][g];
                pixels[i + 1] = [w, 255, 255, v, u, u, w][g];
                pixels[i + 2] = [u, u, w, 255, 255, v, u][g];
                pixels[i + 3] = 255;
            }
        }
    }
    context.putImageData(imageData, 0, 0);
    document.body.style.backgroundColor = 'white';
    document.body.style.margin = '0px';
});
page.render('colorwheel.png');
phantom.exit();

DOM Manipulation

Since the script is executed as if it were running in a web browser, standard DOM scripting and CSS selectors work just fine.

The following useragent.js example demonstrates reading the innerText property of the element whose id is myagent:

var page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            return document.getElementById('myagent').innerText;
        });
        console.log(ua);
    }
    phantom.exit();
});

The above example also demonstrates the use of page.settings.userAgent to customize the user agent sent to the web server.

Here is another example: finding pizza in Mountain View.

var page = require('webpage').create(),
    url = 'http://lite.yelp.com/search?find_desc=pizza&find_loc=94040&find_submit=Search';
page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var results = page.evaluate(function() {
            var list = document.querySelectorAll('span.address'), pizza = [], i;
            for (i = 0; i < list.length; i++) {
                pizza.push(list[i].innerText);
            }
            return pizza;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});

Another illustration of document.querySelectorAll: showing recent Twitter statuses:

var page = require('webpage').create();
page.onConsoleMessage = function(msg) {
    console.log(msg);
};
page.open(encodeURI("http://mobile.twitter.com/Sencha"), function (status) {
    if (status !== "success") {
        console.log("Unable to access network");
    } else {
        page.evaluate(function() {
            var list = document.querySelectorAll('span.status');
            for (var i = 0; i < list.length; ++i) {
                // Strip markup from each status before printing it
                console.log((i + 1) + ": " + list[i].innerHTML.replace(/<.*?>/g, ''));
            }
        });
    }
    phantom.exit();
});
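The tag-stripping expression used in the last example can be tried in isolation. In the sketch below, stripTags and numberLines are hypothetical helper names and the sample strings are made up; the regular expression is the one from the example above.

```javascript
// Remove anything that looks like an HTML tag (same regex as the example above).
function stripTags(html) {
    return html.replace(/<.*?>/g, '');
}

// Number each item starting from 1, after stripping its markup.
function numberLines(items) {
    return items.map(function (item, i) {
        return (i + 1) + ': ' + stripTags(item);
    });
}

var sample = ['Hello <a href="#">world</a>!', '<b>Second</b> status'];
console.log(numberLines(sample).join('\n'));
```

A regular expression like this is adequate for simple inline markup but is not a general HTML parser. Because evaluate() is sandboxed, helpers such as these would have to be defined inside the function passed to evaluate() if they are to run in the page context.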

Network traffic

All resource requests and responses can be sniffed using the onResourceRequested and onResourceReceived callbacks. An example that dumps everything:

var page = require('webpage').create();
page.onResourceRequested = function (request) {
    console.log('Request ' + JSON.stringify(request, undefined, 4));
};
page.onResourceReceived = function (response) {
    console.log('Receive ' + JSON.stringify(response, undefined, 4));
};
page.open(url);

The included examples/netsniff.js shows how to capture and process all the resource requests and responses and export the result in HAR format.
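Short of a full HAR export, the two callbacks can feed a small in-memory record of each request. The sketch below is not taken from netsniff.js: createRecorder is a hypothetical helper, and it assumes that request and response objects carry id, url, method, time, status, and stage fields, with stage equal to 'end' on the final chunk of a response. The sample events are fabricated so the snippet can run outside PhantomJS.

```javascript
// Hypothetical recorder: pairs each request with its final response by id.
function createRecorder() {
    var entries = {};
    return {
        onRequested: function (request) {
            entries[request.id] = {
                url: request.url,
                method: request.method,
                started: new Date(request.time).getTime(),
                status: null,
                duration: null
            };
        },
        onReceived: function (response) {
            var entry = entries[response.id];
            // A response may arrive in several chunks; only 'end' closes it.
            if (!entry || response.stage !== 'end') {
                return;
            }
            entry.status = response.status;
            entry.duration = new Date(response.time).getTime() - entry.started;
        },
        list: function () {
            return Object.keys(entries).map(function (id) {
                return entries[id];
            });
        }
    };
}

// Fabricated events standing in for what PhantomJS would deliver.
var recorder = createRecorder();
recorder.onRequested({ id: 1, url: 'http://example.com/', method: 'GET', time: new Date(0) });
recorder.onReceived({ id: 1, stage: 'start', status: 200, time: new Date(100) });
recorder.onReceived({ id: 1, stage: 'end', status: 200, time: new Date(250) });
console.log(JSON.stringify(recorder.list(), undefined, 4));

// In a PhantomJS script the handlers would be attached like this:
//   page.onResourceRequested = recorder.onRequested;
//   page.onResourceReceived = recorder.onReceived;
```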

[Figure: waterfall diagram obtained from the BBC website]