Configuring Nutch 1.2 with Eclipse and Tomcat 6.0


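1. Install Cygwin (software that provides a Linux-like environment on Windows) from http://www.cygwin.com/ . You can install it online or download the packages locally first.

I downloaded and installed it through our campus software download service and picked http://mirrors.163.com/cygwin/ as the mirror. It was surprisingly fast, very nice!

After installing, remember to configure the environment variables. I skipped this at first and later got errors when building in Eclipse.

Add the following to the environment variables:

A CYGWIN variable with the value ntsec

Add E:\myself\Lab\cygwin\bin to the Path variable (that is, the bin folder of your own Cygwin installation). If there is no Path variable, create one.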

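2. Download Nutch from http://labs.renren.com/apache-mirror/nutch/apache-nutch-1.2-bin.tar.gz . I find this address rather funny: it turns out to be hosted by Renren...

You can download a lot of other things there too, which is great. I downloaded nutch-1.2 and extracted it to cygwin\home\Happy (replace Happy with your own user name). This is Cygwin's default home directory, so it makes typing commands in Cygwin easier.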

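3. The crawl

This part is quoted from someone else's blog~~~ T_T (the parts marked in red in the original post are my own annotations)

In nutch-1.2, create a folder named urls. Inside urls, create a text file with any name and add one line to it: http://lucene.apache.org/nutch/ , which is the site to crawl (the URL in the file under urls must end with "/").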
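The seed step above can be sketched in the Cygwin shell as follows. This is only a minimal sketch: the NUTCH_HOME variable and the file name seed.txt are assumptions made for illustration (the text allows any file name), and the default path assumes nutch-1.2 sits in your Cygwin home directory.

```shell
#!/usr/bin/env bash
# Sketch: create the seed list for the crawl.
# NUTCH_HOME and seed.txt are assumed names, not something Nutch requires.
NUTCH_HOME="${NUTCH_HOME:-$HOME/nutch-1.2}"

mkdir -p "$NUTCH_HOME/urls"
# One URL per line; note the trailing "/".
printf '%s\n' 'http://lucene.apache.org/nutch/' > "$NUTCH_HOME/urls/seed.txt"

# Show what was written.
cat "$NUTCH_HOME/urls/seed.txt"
```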

Open the conf folder under nutch-1.2, find crawl-urlfilter.txt, and locate these two lines:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

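The second line is a regular expression, and the URLs you want to crawl must match it. Here I changed it to +^http://([a-z0-9]*\.)*apache.org/

If you want to crawl all pages, you can simply use +^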
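To get a feel for which URLs the accept rule lets through, here is a rough pre-check with grep -E. Nutch evaluates the rule with Java regular expressions, so treat this as an approximation rather than the filter itself; url_accepted is a hypothetical helper name, and the leading "+" of the rule (which means "accept") is not part of the pattern.

```shell
#!/usr/bin/env bash
# Sketch: approximate the "+" accept rule from crawl-urlfilter.txt with grep -E.
# Nutch itself uses Java regular expressions, so this is only a rough pre-check.
PATTERN='^http://([a-z0-9]*\.)*apache.org/'

# url_accepted is a hypothetical helper: exit status 0 when the URL matches.
url_accepted() {
  printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

for url in 'http://lucene.apache.org/nutch/' 'http://www.example.com/'; do
  if url_accepted "$url"; then
    printf 'accept %s\n' "$url"
  else
    printf 'reject %s\n' "$url"
  fi
done
```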

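Edit nutch-site.xml in the conf directory. This file tells the crawled sites who the crawler is, and Nutch will not run unless it is configured.

By default the file looks like this: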
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    Below is an example after my changes:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
        <property>
          <name>http.agent.name</name>  <!-- important: this must be configured -->
          <value>myfirsttest</value> <!-- the value must not be empty -->
          <description>HTTP 'User-Agent' request header. MUST NOT be empty -
          please set this to a single word uniquely related to your organization.

          NOTE: You should also check other related properties:

          http.robots.agents
          http.agent.description
          http.agent.url
          http.agent.email
          http.agent.version

and set their values appropriately.

</description>
</property>

        <property>
          <name>http.agent.description</name>
          <value>myfirsttest</value>
          <description>Further description of our bot- this text is used in
          the User-Agent header. It appears in parenthesis after the agent name.
          </description>
        </property>

        <property>
          <name>http.agent.url</name>
          <value>myfirsttest.com</value>
          <description>A URL to advertise in the User-Agent header. This will
           appear in parenthesis after the agent name. Custom dictates that this
           should be a URL of a page explaining the purpose and behavior of this
           crawler.
          </description>
        </property>

        <property>
          <name>http.agent.email</name>
          <value>test@test.com</value>
          <description>An email address to advertise in the HTTP 'From' request
           header and User-Agent header. A good practice is to mangle this
           address (e.g. 'info at example dot com') to avoid spamming.
          </description>
        </property>

</configuration>
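The file above describes the crawler's name, its description, the website it comes from, a contact email, and so on.

(Back to my own writing from here.)

Then open Cygwin and cd to the nutch-1.2 folder.

Run the command "bin/nutch crawl urls -dir crawled -depth 3 -topN 50 -threads 10 >& crawl.log"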

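The parameters mean the following (from the Apache site, http://lucene.apache.org/nutch/tutorial8.html ):

-dir: the directory where the crawl results are stored; it must not exist beforehand

-threads: the number of threads used for the crawl

-depth: the crawl depth

-topN: the number of URLs crawled at each level, taken from the top of the list

crawl.log: the log file, where you can follow the crawl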
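Since -dir must point at a directory that does not exist yet, a small wrapper can check that before starting. This is a sketch under that one assumption; run_crawl is a hypothetical helper name, and the options are the same ones used in the command above (with "> crawl.log 2>&1" as the portable spelling of ">& crawl.log").

```shell
#!/usr/bin/env bash
# Sketch: wrap the crawl command with a check that the output directory
# does not exist yet. run_crawl is a hypothetical helper name.
run_crawl() {
  local out_dir="$1"
  if [ -e "$out_dir" ]; then
    echo "refusing to crawl: $out_dir already exists" >&2
    return 1
  fi
  # Same options as in the text; stdout and stderr both go to crawl.log.
  bin/nutch crawl urls -dir "$out_dir" -depth 3 -topN 50 -threads 10 > crawl.log 2>&1
}

# Usage, from inside the nutch-1.2 folder:
#   run_crawl crawled
```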

After it runs, a new crawled folder appears under nutch-1.2, containing 5 folders:

① / ② crawldb / linkdb: the web link directories, which store the URLs and the link relationships between them and serve as the basis for crawling and recrawling. Pages expire after 30 days by default (this can be configured in nutch-site.xml, as mentioned later).

③ segments: stores the fetched pages and is tied to the crawl depth above. With depth set to 2, two subfolders named by timestamp, such as "20061014163012", are created under segments. Opening one of them shows 6 more subfolders, which are

(from Apache, http://lucene.apache.org/nutch/tutorial8.html ):

crawl_generate: names a set of urls to be fetched

crawl_fetch: contains the status of fetching each url

content: contains the content of each url

parse_text: contains the parsed text of each url

parse_data: contains outlinks and metadata parsed from each url

crawl_parse: contains the outlink urls, used to update the crawldb
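A quick way to confirm that a segment is complete is to look for these six subfolders. The sketch below does only that; missing_segment_parts is a hypothetical helper name, and it prints each expected subfolder that is absent (so no output means the segment has all six).

```shell
#!/usr/bin/env bash
# Sketch: report which of the six expected subfolders a segment lacks.
# missing_segment_parts is a hypothetical helper name.
missing_segment_parts() {
  local segment="$1" part
  for part in crawl_generate crawl_fetch content parse_text parse_data crawl_parse; do
    # Print the name only when the subfolder is absent.
    [ -d "$segment/$part" ] || echo "$part"
  done
}

# Usage, from inside the nutch-1.2 folder:
#   for s in crawled/segments/*; do echo "$s:"; missing_segment_parts "$s"; done
```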

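④ indexes: the index directory. My run produced a folder named "part-00000" here.

⑤ index: the Lucene index directory (Nutch is built on Lucene; you can see lucene-core-1.9.1.jar under nutch-1.2\lib, and a short guide to the Luke tool comes at the end). It is the complete index produced by merging all the indexes under indexes. Note that the index only indexes the page content and does not store it, so a query has to access the segments directory to get the page content.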

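4. Run a simple test. In Cygwin, enter "bin/nutch org.apache.nutch.searcher.NutchBean apache", which calls the main method of NutchBean to search for the keyword "apache". Cygwin then shows the result: Total hits: 29 (hits are the equivalent of results in JDBC).

Note: if the search always returns 0 hits, edit nutch-site.xml under nutch-1.2\conf and try adding the section below. (The http.agent.name property from earlier must be present; without it the search always returns 0.)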

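<property>

<name>searcher.dir</name>

<value>D:\nutch\crawled</value> <!-- searcher.dir: the path of the crawled folder generated earlier in Cygwin, i.e. the directory holding the crawl results -->

</property>

We can also set the recrawl interval (as mentioned earlier, pages expire after 30 days by default):

<property>

<name>fetcher.max.crawl.delay</name>

</property>

Note: as far as I can tell from nutch-default.xml, fetcher.max.crawl.delay limits the robots.txt Crawl-Delay that the fetcher will tolerate, while the 30-day refetch interval is db.fetch.interval.default (in seconds), so the latter is probably the property to change here.

OK, the search results are finally no longer 0~ Happy~~~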

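5. Installing Tomcat. I struggled with this for a long time. I started with the installer version, but startup kept failing: the window flashed and disappeared...

I never solved that, so I downloaded the "green" (no-install) version of Tomcat instead. Whichever version you use, configure the environment first:

CATALINA_BASE variable: E:\myself\Lab\Tomcat6.0 (all three point to the installation directory)

CATALINA_HOME variable: E:\myself\Lab\Tomcat6.0

TOMCAT_HOME variable: E:\myself\Lab\Tomcat6.0

Add to classpath: %CATALINA_HOME%\lib\servlet-api.jar (in older Tomcat versions servlet-api.jar may not be in this directory, so look around for it)

Add to Path: %TOMCAT_HOME%\bin

That completes the environment variables. One more thing to watch out for: the installation directory name must not contain spaces, for example "Tomcat 6.0", or errors will show up later... T_T I searched in many places before finding this one; it is very well hidden...

Then double-click the startup.bat script in the bin folder under Tomcat6.0. If it starts, congratulations. If it does not, open cmd, go to the bin directory of Tomcat6.0, run startup.bat there, and look at the log files in the Tomcat6.0/logs folder. They are named by date, so they are easy to find, and all log messages from the same day go into one file.

Search online for the error messages in the log to find out what went wrong, then fix it~

When entering localhost:8080 in the browser shows that long-missed cat, congratulations~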
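Because the space-in-path problem is so easy to miss, a tiny check like the one below can catch it before starting Tomcat. It is only a sketch: path_has_space is a hypothetical helper name, and the two paths are the examples from this post.

```shell
#!/usr/bin/env bash
# Sketch: warn when a Tomcat directory path contains a space.
# path_has_space is a hypothetical helper name.
path_has_space() {
  case "$1" in
    *" "*) return 0 ;;   # at least one space somewhere in the path
    *)     return 1 ;;
  esac
}

for dir in 'E:\myself\Lab\Tomcat6.0' 'E:\myself\Lab\Tomcat 6.0'; do
  if path_has_space "$dir"; then
    printf 'contains a space: %s\n' "$dir"
  else
    printf 'ok: %s\n' "$dir"
  fi
done
```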

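6. Deploying Nutch in Tomcat. Copy nutch-1.2.war from the nutch-1.2 folder into Tomcat's webapps folder, then start Tomcat; it automatically unpacks the war file under Tomcat6.0\webapps. Tomcat names the unpacked folder after the war file, so rename the file to nutch.war beforehand if you want the folder to be called nutch, as in the paths below. Then edit /nutch/WEB-INF/classes/nutch-site.xml :

Replace

<nutch-conf>

</nutch-conf>

with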

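<nutch-conf>

<property>

<name>http.agent.name</name>

<!-- add a <value> here; it must not be empty (see the conf/nutch-site.xml example above) -->

</property>

<property>

<name>searcher.dir</name>

<value>Your_crawl_dir_path</value>

</property>

</nutch-conf>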

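Your_crawl_dir_path is the folder where the crawled pages were saved earlier. If the root element of your file is <configuration> (as in the conf/nutch-site.xml example earlier) rather than <nutch-conf>, keep the root element the file already has and add only the two properties.

Finally, enter http://localhost:8080/nutch in the browser and you will see the Nutch search page. Note that Tomcat must be restarted every time nutch-site.xml is changed.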

At this point Nutch may show garbled Chinese characters when searching. This is actually a Tomcat issue.

The fix is to edit server.xml in the /tomcat/apache-tomcat-6.0.20/conf directory:

Change

<Connector port="8080" protocol="HTTP/1.1"

connectionTimeout="20000"

redirectPort="8443" />

to

<Connector port="8080" protocol="HTTP/1.1"

connectionTimeout="20000"

redirectPort="8443"

URIEncoding="UTF-8"

useBodyEncodingForURI="true"/>

Then restart Tomcat and that is it.
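To double-check that the edit was saved before restarting, you can grep server.xml for the new attribute. This is a minimal sketch; has_utf8_uri_encoding is a hypothetical helper name, and the path you pass in is whatever your own Tomcat installation uses.

```shell
#!/usr/bin/env bash
# Sketch: check whether a server.xml already sets URIEncoding="UTF-8".
# has_utf8_uri_encoding is a hypothetical helper name.
has_utf8_uri_encoding() {
  grep -q 'URIEncoding="UTF-8"' "$1"
}

# Usage, from inside the Tomcat folder:
#   has_utf8_uri_encoding conf/server.xml && echo "URIEncoding is set"
```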