The User-Agent Problem in Nutch


This is an original post; please credit the source when reposting (http://blog.csdn.net/panjunbiao/article/details/16960029).

Apache Nutch 1.7 produced the following error while crawling a certain website:

2013-11-25 15:23:37,793 INFO  api.HttpRobotRulesParser - Couldn't get robots.txt for http://www.xxx.com/: java.io.EOFException
2013-11-25 15:23:37,893 ERROR http.Http - Failed to get protocol output
java.io.EOFException
        at org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:427)
        at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:319)
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:154)
        at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
2013-11-25 15:23:37,903 INFO  fetcher.Fetcher - fetch of http://www.xxxx.com/ failed with: java.io.EOFException 
In other words, Nutch could not fetch the site's robots.txt. The HTTP request itself showed nothing obviously wrong, and other sites could be crawled normally.

A Wireshark packet capture showed that the GET requests to the **** site never received a response, which raised the suspicion that the User-Agent was the problem.

To test this with curl, first try a User-Agent containing the keyword Nutch:
Jamess-MacBook-Pro:~ james$ curl -A "Friendly Crawler/Nutch-1.7" -v http://www.****.com/robots.txt
* Adding handle: conn: 0x7fe33400fe00
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7fe33400fe00) send_pipe: 1, recv_pipe: 0
* About to connect() to www.****.com port 80 (#0)
*   Trying 42.121.98.156...
* Connected to www.****.com (42.121.98.156) port 80 (#0)
> GET /robots.txt HTTP/1.1
> User-Agent: Friendly Crawler/Nutch-1.7
> Host: www.****.com
> Accept: */*
>
* Empty reply from server
* Connection #0 to host www.****.com left intact
curl: (52) Empty reply from server

Then try a User-Agent that does not contain the Nutch keyword:
Jamess-MacBook-Pro:~ james$ curl -A "Chrome" -v http://www.****.com/robots.txt
* Adding handle: conn: 0x7f82f180fe00
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7f82f180fe00) send_pipe: 1, recv_pipe: 0
* About to connect() to www.****.com port 80 (#0)
*   Trying 42.121.98.156...
* Connected to www.****.com (42.121.98.156) port 80 (#0)
> GET /robots.txt HTTP/1.1
> User-Agent: Chrome
> Host: www.****.com
> Accept: */*
>
< HTTP/1.1 200 OK
* Server nginx/1.5.6 is not blacklisted
< Server: nginx/1.5.6
< Date: Mon, 25 Nov 2013 08:51:13 GMT
< Content-Type: text/plain; charset=utf-8
< Content-Length: 60
< Last-Modified: Tue, 04 Dec 2012 01:29:47 GMT
< Connection: keep-alive
< ETag: "50bd520b-3c"
< X-UA-Compatible: IE=Edge,chrome=1
< X-XSS-Protection: 1; mode=block
< Accept-Ranges: bytes
<
User-agent: *
Disallow: /robot/trap
Disallow: /page/6559999
* Connection #0 to host www.****.com left intact

Sure enough, the target site filters out requests whose User-Agent contains the Nutch keyword. So where is this User-Agent field set?
Searching the Nutch source code for http.agent.name leads to ./plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:
  // Inherited Javadoc
  public void setConf(Configuration conf) {
      this.conf = conf;
      this.proxyHost = conf.get("http.proxy.host");
      this.proxyPort = conf.getInt("http.proxy.port", 8080);
      this.useProxy = (proxyHost != null && proxyHost.length() > 0);
      this.timeout = conf.getInt("http.timeout", 10000);
      this.maxContent = conf.getInt("http.content.limit", 64 * 1024);
      this.userAgent = getAgentString(conf.get("http.agent.name"), conf.get("http.agent.version"), conf
              .get("http.agent.description"), conf.get("http.agent.url"), conf.get("http.agent.email"));
      this.acceptLanguage = conf.get("http.accept.language", acceptLanguage);
      this.accept = conf.get("http.accept", accept);
      // backward-compatible default setting
      this.useHttp11 = conf.getBoolean("http.useHttp11", false);
      this.robots.setConf(conf);
      logConf();
  } 
Here you can see that the User-Agent string is assembled from the http.agent.* properties, and the Nutch keyword in "Friendly Crawler/Nutch-1.7" comes from http.agent.version. Changing that property fixed the problem.
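
As a minimal sketch: the usual place to override these properties is conf/nutch-site.xml (which takes precedence over nutch-default.xml); the values below are illustrative, not the ones used in the original post, and only http.agent.name is reproduced from the curl test above:

  <!-- conf/nutch-site.xml (illustrative values) -->
  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>Friendly Crawler</value>
    </property>
    <property>
      <!-- by default this property carries the Nutch version string,
           which is where the "Nutch" keyword in the User-Agent comes from -->
      <name>http.agent.version</name>
      <value>1.0</value>
    </property>
  </configuration>

With the Nutch keyword removed from the resulting User-Agent, the site responds to the crawler's requests normally.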

