HtmlUnit: getting only the resulting URL when executing JavaScript, without downloading the whole page

Source: Internet · Editor: 程序博客网 · Time: 2024/06/07 07:06

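Introduction to HtmlUnit:
HtmlUnit is an open-source Java tool for analyzing web pages. Once started, it runs a headless (GUI-less) browser under the hood. You can specify which browser to emulate, such as Firefox or IE; if you do not, it defaults to INTERNET_EXPLORER_7: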
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);

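With a simple call:
HtmlPage page = webClient.getPage(url);
you get the HtmlPage representation of the page. Then:
InputStream is = page.getWebResponse().getContentAsStream();
gives you the page's input stream, and from it the page source, which is very useful for web-crawler projects.
Of course, you can also get many other page elements from page.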
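Turning that input stream into the page source takes only a small helper. The following is a minimal sketch using the JDK alone; the class name PageSourceReader and the method readAll are hypothetical names introduced for illustration, not part of HtmlUnit, and the caller is assumed to know the page's charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

/** Hypothetical helper (not part of HtmlUnit): drains a stream into a String. */
final class PageSourceReader {
    private PageSourceReader() {}

    /** Reads the stream to its end and decodes the bytes with the given charset. */
    static String readAll(InputStream in, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try {
            int n;
            // read() returns -1 once the stream is exhausted
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers need no throws clause
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), charset);
    }
}
```

With the snippet above, the call would look like PageSourceReader.readAll(is, StandardCharsets.UTF_8), where UTF-8 is only an assumed encoding for the page.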

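An important point is that HtmlUnit supports executing JavaScript:
page.executeJavaScript(javascript)
Executing JS returns a ScriptResult object, from which you can get the page that results from the execution, among other information. By default, after executing the JS the internal browser navigates to the new page that the JS produced; if the JS fails, no navigation takes place.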

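The rough process by which HtmlUnit executes JS is as follows:

[Figure: HtmlUnit JS execution flow]

As the figure shows, HtmlUnit downloads the entire page when it executes JS. Often, however, we execute JS only because we need the URL it produces. Frequent, unnecessary page downloads lengthen the program's running time and add to the network load. There are two ways to meet this requirement: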

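1). The first is to grab the URL and return it directly. The call stack is deep, however, so modifying the source this way may touch many places, and the implementation can be complex and difficult.
2). The second is to generate a fake response, instead of actually fetching the page's response, and use it to construct every new page. This approach needs only small code changes and is easy to implement.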

Because the first approach requires extensive source changes and is harder to implement, the second is the one implemented here:

Looking at the source, the class com.gargoylesoftware.htmlunit.javascript.host.Location (Location.java) has this method:

public void jsxSet_href(final String newLocation) throws IOException {
    final HtmlPage page = (HtmlPage) getWindow(getStartingScope()).getWebWindow().getEnclosedPage();
    if (newLocation.startsWith(JavaScriptURLConnection.JAVASCRIPT_PREFIX)) {
        final String script = newLocation.substring(11);
        page.executeJavaScriptIfPossible(script, "new location value", 1);
        return;
    }
    try {
        final URL url = page.getFullyQualifiedUrl(newLocation);
        final URL oldUrl = page.getWebResponse().getWebRequest().getUrl();
        if (url.sameFile(oldUrl) && !StringUtils.equals(url.getRef(), oldUrl.getRef())) {
            // If we're just setting or modifying the hash, avoid a server hit.
            jsxSet_hash(newLocation);
            return;
        }
        final WebWindow webWindow = getWindow().getWebWindow();
        webWindow.getWebClient().download(webWindow, "", new WebRequest(url), "JS set location");
    }
    catch (final MalformedURLException e) {
        LOG.error("jsxSet_location('" + newLocation + "') Got MalformedURLException", e);
        throw e;
    }
}
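The "avoid a server hit" branch in this method depends on one comparison: the two URLs name the same resource and differ only in the fragment (the part after #). A minimal sketch of that check using only the JDK follows; HashJumpCheck and isHashOnlyChange are hypothetical names for illustration, and java.util.Objects.equals stands in for the StringUtils.equals call in the listing.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Objects;

/** Hypothetical helper mirroring the hash-only comparison in jsxSet_href. */
final class HashJumpCheck {
    private HashJumpCheck() {}

    /** True when both URLs name the same file and only the fragment differs. */
    static boolean isHashOnlyChange(String oldLocation, String newLocation) {
        try {
            URL oldUrl = URI.create(oldLocation).toURL();
            URL newUrl = URI.create(newLocation).toURL();
            // sameFile() compares protocol, host, port and file, ignoring the fragment
            return newUrl.sameFile(oldUrl)
                    && !Objects.equals(newUrl.getRef(), oldUrl.getRef());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

For example, going from http://localhost/a.html to http://localhost/a.html#top counts as a hash-only change, while a change of path or query string does not, because URL.getFile() includes the query.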

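The webWindow.getWebClient().download(...) call at the end of the try block is what downloads the page. The download method looks like this: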

public void download(final WebWindow requestingWindow, final String target,
        final WebRequest request, final String description) {
    final WebWindow win = resolveWindow(requestingWindow, target);
    final URL url = request.getUrl();
    boolean justHashJump = false;
    if (win != null) {
        final Page page = win.getEnclosedPage();
        if (page instanceof HtmlPage && !((HtmlPage) page).isOnbeforeunloadAccepted()) {
            return;
        }
        final URL current = page.getWebResponse().getWebRequest().getUrl();
        if (url.sameFile(current) && !StringUtils.equals(current.getRef(), url.getRef())) {
            justHashJump = true;
        }
    }
    // verify if this load job doesn't already exist
    for (final LoadJob loadJob : loadQueue_) {
        if (loadJob.response_ == null) {
            continue;
        }
        final WebRequest otherRequest = loadJob.response_.getWebRequest();
        final URL otherUrl = otherRequest.getUrl();
        // TODO: investigate but it seems that IE considers query string too but not FF
        if (url.getPath().equals(otherUrl.getPath())
                && url.getHost().equals(otherUrl.getHost())
                && url.getProtocol().equals(otherUrl.getProtocol())
                && url.getPort() == otherUrl.getPort()
                && request.getHttpMethod() == otherRequest.getHttpMethod()) {
            return; // skip it;
        }
    }
    final LoadJob loadJob;
    if (justHashJump) {
        loadJob = new LoadJob(win, target, url);
    }
    else {
        try {
            final WebResponse response = loadWebResponse(request);
            loadJob = new LoadJob(requestingWindow, target, response);
        }
        catch (final IOException e) {
            throw new RuntimeException(e);
        }
    }
    loadQueue_.add(loadJob);
}

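The calls to loadWebResponse(request) and new LoadJob(requestingWindow, target, response) obtain the page's response and build a LoadJob object from it, which is then added to the loadQueue_ queue. The LoadJob is later taken off the queue to create the new page and load it into the browser. All we need to change is how the response is produced here. The idea is as follows: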

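If the current thread is executing download for the first time, leave the code as it is and let it produce a real response, then save that response object. When the same thread later executes JS and enters this method again, do not produce a new response; instead take out the saved one, use it directly, and just change its URL to the one generated by the JS: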

response.getWebRequest().setUrl(request.getUrl());
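The idea can be modelled without HtmlUnit at all. The sketch below is only an illustration under stated assumptions: StubRequest and StubResponse are hypothetical stand-ins for WebRequest and WebResponse, the ThreadLocal is one assumed way to keep "the response saved by this thread", and the realFetches counter exists purely to make the saving observable. It is not the patch applied to the HtmlUnit source.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for WebRequest; holds only a mutable URL. */
final class StubRequest {
    private String url;
    StubRequest(String url) { this.url = url; }
    String getUrl() { return url; }
    void setUrl(String url) { this.url = url; }
}

/** Hypothetical stand-in for WebResponse; pairs a request with a body. */
final class StubResponse {
    private final StubRequest request;
    private final String body;
    StubResponse(StubRequest request, String body) {
        this.request = request;
        this.body = body;
    }
    StubRequest getWebRequest() { return request; }
    String getBody() { return body; }
}

final class ResponseReuse {
    /** Counts simulated network fetches. */
    static final AtomicInteger realFetches = new AtomicInteger();

    /** One saved response per thread, filled on that thread's first call. */
    private static final ThreadLocal<StubResponse> CACHED = new ThreadLocal<>();

    private ResponseReuse() {}

    /** Stands in for the expensive loadWebResponse(request) call. */
    private static StubResponse loadWebResponse(StubRequest request) {
        realFetches.incrementAndGet();
        return new StubResponse(new StubRequest(request.getUrl()),
                "body of " + request.getUrl());
    }

    /** First call on a thread fetches for real; later calls reuse and retarget. */
    static StubResponse responseFor(StubRequest request) {
        StubResponse cached = CACHED.get();
        if (cached == null) {
            cached = loadWebResponse(request);
            CACHED.set(cached);
        } else {
            // Same step as response.getWebRequest().setUrl(request.getUrl())
            cached.getWebRequest().setUrl(request.getUrl());
        }
        return cached;
    }

    /** Clears this thread's saved response, e.g. before a pooled thread is reused. */
    static void reset() { CACHED.remove(); }
}
```

In this model the second call on a thread returns the same object with its URL updated and its body unchanged. If threads come from a pool, the saved response should presumably be cleared between tasks, which is what reset is for.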

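Once the JS has finished, the URL of the returned ScriptResult is the URL generated by the JS. If you fetch the page source, however, you get "wrong" data, because the same response is reused every time rather than the response that actually belongs to that URL. Since the goal from the start was to get the correct URL without downloading the whole page, this "wrong" data does not affect our program.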

0 0