The User-Agent Problem in Nutch


This is an original post; please credit the source when reposting (http://blog.csdn.net/panjunbiao/article/details/16960029).

Apache Nutch 1.7 produced the following error while crawling a certain website:

2013-11-25 15:23:37,793 INFO  api.HttpRobotRulesParser - Couldn't get robots.txt for http://www.xxx.com/: java.io.EOFException
2013-11-25 15:23:37,893 ERROR http.Http - Failed to get protocol output
java.io.EOFException
        at org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:427)
        at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:319)
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:154)
        at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
2013-11-25 15:23:37,903 INFO  fetcher.Fetcher - fetch of http://www.xxxx.com/ failed with: java.io.EOFException 
In other words, Nutch could not fetch the site's robots.txt. The HTTP request itself showed nothing obviously wrong, and other sites could be crawled normally.

A Wireshark packet capture showed that the GET requests to the **** site never received a response, which raised the suspicion that the User-Agent was the problem.

To test this with curl, first try a User-Agent containing the keyword Nutch:
Jamess-MacBook-Pro:~ james$ curl -A "Friendly Crawler/Nutch-1.7" -v http://www.****.com/robots.txt
* Adding handle: conn: 0x7fe33400fe00
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7fe33400fe00) send_pipe: 1, recv_pipe: 0
* About to connect() to www.****.com port 80 (#0)
*   Trying 42.121.98.156...
* Connected to www.****.com (42.121.98.156) port 80 (#0)
> GET /robots.txt HTTP/1.1
> User-Agent: Friendly Crawler/Nutch-1.7
> Host: www.****.com
> Accept: */*
>
* Empty reply from server
* Connection #0 to host www.****.com left intact
curl: (52) Empty reply from server

Then try a User-Agent that does not contain the Nutch keyword:
Jamess-MacBook-Pro:~ james$ curl -A "Chrome" -v http://www.****.com/robots.txt
* Adding handle: conn: 0x7f82f180fe00
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7f82f180fe00) send_pipe: 1, recv_pipe: 0
* About to connect() to www.****.com port 80 (#0)
*   Trying 42.121.98.156...
* Connected to www.****.com (42.121.98.156) port 80 (#0)
> GET /robots.txt HTTP/1.1
> User-Agent: Chrome
> Host: www.****.com
> Accept: */*
>
< HTTP/1.1 200 OK
* Server nginx/1.5.6 is not blacklisted
< Server: nginx/1.5.6
< Date: Mon, 25 Nov 2013 08:51:13 GMT
< Content-Type: text/plain; charset=utf-8
< Content-Length: 60
< Last-Modified: Tue, 04 Dec 2012 01:29:47 GMT
< Connection: keep-alive
< ETag: "50bd520b-3c"
< X-UA-Compatible: IE=Edge,chrome=1
< X-XSS-Protection: 1; mode=block
< Accept-Ranges: bytes
<
User-agent: *
Disallow: /robot/trap
Disallow: /page/6559999
* Connection #0 to host www.****.com left intact

Sure enough, the target site filters out requests whose User-Agent contains the Nutch keyword. So where is this User-Agent field set?
Searching the Nutch source code for http.agent.name leads to ./plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:
  // Inherited Javadoc
  public void setConf(Configuration conf) {
      this.conf = conf;
      this.proxyHost = conf.get("http.proxy.host");
      this.proxyPort = conf.getInt("http.proxy.port", 8080);
      this.useProxy = (proxyHost != null && proxyHost.length() > 0);
      this.timeout = conf.getInt("http.timeout", 10000);
      this.maxContent = conf.getInt("http.content.limit", 64 * 1024);
      this.userAgent = getAgentString(conf.get("http.agent.name"), conf.get("http.agent.version"), conf
              .get("http.agent.description"), conf.get("http.agent.url"), conf.get("http.agent.email"));
      this.acceptLanguage = conf.get("http.accept.language", acceptLanguage);
      this.accept = conf.get("http.accept", accept);
      // backward-compatible default setting
      this.useHttp11 = conf.getBoolean("http.useHttp11", false);
      this.robots.setConf(conf);
      logConf();
  } 
Here you can see that the User-Agent string is assembled from the http.agent.* properties, and the Nutch keyword in "Friendly Crawler/Nutch-1.7" comes from http.agent.version. Changing that property fixed the problem.
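
As a minimal sketch: the usual place to override these properties is conf/nutch-site.xml (which takes precedence over nutch-default.xml); the values below are illustrative, not the ones used in the original post, and only http.agent.name is reproduced from the curl test above:

  <!-- conf/nutch-site.xml (illustrative values) -->
  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>Friendly Crawler</value>
    </property>
    <property>
      <!-- by default this property carries the Nutch version string,
           which is where the "Nutch" keyword in the User-Agent comes from -->
      <name>http.agent.version</name>
      <value>1.0</value>
    </property>
  </configuration>

With the Nutch keyword removed from the resulting User-Agent, the site responds to the crawler's requests normally.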

