Importing Nutch 2.x into Eclipse


The article below was written for Nutch 2.1. I followed it with 2.2.1 and it works.

Reposted from: http://cosmo1987.iteye.com/blog/1826971

Nutch 2.1 in Eclipse

Main goals:
1. Import Nutch 2.1 into Eclipse so that the source code can be debugged and its implementation studied.
2. Make it easier to learn how to write Nutch 2.1 plugins.

Prerequisites:
Linux
Nutch 2.1
MySQL
Java 1.6
Eclipse

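Getting started:
First install JDK 1.6, MySQL, and Eclipse.
Start Eclipse and install IvyDE and Subclipse from the Eclipse Marketplace.
Next, open /etc/my.cnf and add the following under the [mysqld] section:

    innodb_file_format=barracuda
    innodb_file_per_table=true
    innodb_large_prefix=true
    character-set-server=utf8mb4
    collation-server=utf8mb4_unicode_ci

Start the MySQL server.
Change the root user's password to root.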

Create the database:

    CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

Create the webpage table:

    CREATE TABLE `webpage` (
      `id` varchar(767) NOT NULL,
      `headers` blob,
      `text` mediumtext DEFAULT NULL,
      `status` int(11) DEFAULT NULL,
      `markers` blob,
      `parseStatus` blob,
      `modifiedTime` bigint(20) DEFAULT NULL,
      `score` float DEFAULT NULL,
      `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
      `baseUrl` varchar(767) DEFAULT NULL,
      `content` longblob,
      `title` varchar(2048) DEFAULT NULL,
      `reprUrl` varchar(767) DEFAULT NULL,
      `fetchInterval` int(11) DEFAULT NULL,
      `prevFetchTime` bigint(20) DEFAULT NULL,
      `inlinks` mediumblob,
      `prevSignature` blob,
      `outlinks` mediumblob,
      `fetchTime` bigint(20) DEFAULT NULL,
      `retriesSinceFetch` int(11) DEFAULT NULL,
      `protocolStatus` blob,
      `signature` blob,
      `metadata` blob,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB
      ROW_FORMAT=COMPRESSED
      DEFAULT CHARSET=utf8mb4;

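1. Install Nutch 2.1
File > New > Project > SVN > Checkout Projects from SVN
Create new repository location > https://svn.apache.org/repos/asf/nutch/tags/release-2.1
Check out the source code as a project configured using the New Project Wizard.
Click Finish, then when prompted choose Java > Java Project > Next.
Give the project a name (anything will do; I use nutch2.1 here, and it is referred to below), leave everything else at the defaults, and let Eclipse download the Nutch 2.1 source code.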

2. Set up the Nutch 2.1 build environment
In the Project Explorer, right-click the project and choose Properties, then go to Java Build Path.
On the Source tab, the only source folder will be nutch2.1/src (nutch2.1 being the project name chosen earlier). Remove this folder, then click Add Folder, expand trunk/src, and check src/bin, src/java, src/test, and src/testresources.

We must also add each plugin's src/java and src/test folders by hand. This takes a fair amount of time, but it has to be done.
On the Libraries tab, click Add Class Folder and add nutch2.1/conf to the classpath.
Still on the Libraries tab, click Add JARs and add src/plugin/urlfilter-automaton/lib/automaton.jar and src/plugin/parse-swf/lib/javaswf.jar.

On the Libraries tab, click Add Library > IvyDE Managed Dependencies and browse to nutch2.1/ivy/ivy.xml.
On the "Order and Export" tab, select the conf folder and click Top to move it to the very top.

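That completes this part of the configuration. IvyDE will load the dependency JARs automatically; errors may appear if the network connection is poor.
Even if there are no errors, many red error markers will still show up in the nutch2.1 project.

Leave them for now. The next step is to configure the build settings.
In nutch2.1/conf, open
gora.properties
and add: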

    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
    gora.sqlstore.jdbc.user=root
    gora.sqlstore.jdbc.password=root

Also comment out the other database connection settings in that file.
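As a quick sanity check on the edited file, the sketch below parses gora.properties-style text with `java.util.Properties` and reports which of the four MySQL keys are missing. It is only an illustration: the class and method names are made up for this example, the commented-out line in the sample text is hypothetical, and the text is parsed from a string rather than read from nutch2.1/conf.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of Nutch or Gora.
final class GoraPropertiesCheck {
    // The four keys this article adds to gora.properties.
    static final String[] REQUIRED_KEYS = {
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user",
        "gora.sqlstore.jdbc.password"
    };

    // Parses properties text; lines starting with '#' are treated as comments.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("could not parse properties text", e);
        }
        return props;
    }

    // Returns the required keys that are absent, in declaration order.
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            if (props.getProperty(key) == null) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String text =
              "# gora.sqlstore.jdbc.url=jdbc:example:other-store\n" // hypothetical commented-out entry
            + "gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver\n"
            + "gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true\n"
            + "gora.sqlstore.jdbc.user=root\n"
            + "gora.sqlstore.jdbc.password=root\n";
        Properties props = parse(text);
        System.out.println("url = " + props.getProperty("gora.sqlstore.jdbc.url"));
        System.out.println("missing keys = " + missingKeys(props));
    }
}
```

Because the commented-out line is ignored by the parser, only the active MySQL URL is picked up, which is the effect that commenting out the other connections is meant to achieve.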
In ivy/ivy.xml,
uncomment the mysql-connector dependency.

In conf/nutch-site.xml.template, add the following inside the configuration element:

    <property>
      <name>http.agent.name</name>
      <value>Your Nutch Spider</value>
    </property>

    <property>
      <name>http.accept.language</name>
      <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
      <description>Value of the "Accept-Language" request header field.
      This allows selecting non-English language as default one to retrieve.
      It is a useful setting for search engines build for certain national group.
      </description>
    </property>

    <property>
      <name>parser.character.encoding.default</name>
      <value>utf-8</value>
      <description>The character encoding to fall back to when no other information
      is available</description>
    </property>

    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
      <description>Regular expression naming plugin directory names to
      include.  Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.
      </description>
    </property>

    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.sql.store.SqlStore</value>
      <description>The Gora DataStore class for storing and retrieving data.
      Currently the following stores are available: ….
      </description>
    </property>

    <property>
      <name>plugin.folders</name>
      <value>./src/plugin</value>
      <description>Directories where nutch plugins are located.  Each
      element may be a relative or absolute path.  If absolute, it is used
      as is.  If relative, it is searched for on the classpath.</description>
    </property>

In build.xml in the project root, find the following target:

    <target name="resolve-default" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">
      <ivy:resolve file="${ivy.file}" conf="default" log="download-only" />
      <ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />
      <antcall target="copy-libs" />
    </target>

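Change the original

    pattern="${build.lib.dir}/[artifact]-[revision].[ext]"

to

    pattern="${build.lib.dir}/[artifact]-[type]-[revision].[ext]"

This keeps the build from failing when Ivy retrieves the dependencies again. The reason: Ivy downloads both the class JAR and the source JAR of a dependency. With the original pattern the two files end up with the same name and cannot be told apart, which causes a duplicate-file error.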
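The name collision can be illustrated with a small sketch. It is not Ivy's real resolver, just plain token substitution on the two patterns; the artifact name `example-lib`, its revision, the `lib/` prefix standing in for `${build.lib.dir}`, and the type names `jar` and `source` are all assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration of retrieve-pattern expansion; not Ivy's implementation.
final class IvyPatternSketch {
    // Replaces every [token] in the pattern with the matching value.
    static String expand(String pattern, Map<String, String> tokens) {
        String result = pattern;
        for (Map.Entry<String, String> entry : tokens.entrySet()) {
            result = result.replace("[" + entry.getKey() + "]", entry.getValue());
        }
        return result;
    }

    // Builds the token map for one downloaded file of a dependency.
    static Map<String, String> artifact(String name, String type, String revision, String ext) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("artifact", name);
        tokens.put("type", type);
        tokens.put("revision", revision);
        tokens.put("ext", ext);
        return tokens;
    }

    public static void main(String[] args) {
        String oldPattern = "lib/[artifact]-[revision].[ext]";
        String newPattern = "lib/[artifact]-[type]-[revision].[ext]";

        // Hypothetical dependency that ships a class JAR and a source JAR.
        Map<String, String> classes = artifact("example-lib", "jar", "1.0", "jar");
        Map<String, String> sources = artifact("example-lib", "source", "1.0", "jar");

        // Old pattern: both files map to the same path.
        System.out.println(expand(oldPattern, classes));
        System.out.println(expand(oldPattern, sources));

        // New pattern: the [type] token keeps the two paths apart.
        System.out.println(expand(newPattern, classes));
        System.out.println(expand(newPattern, sources));
    }
}
```

With the old pattern both files expand to `lib/example-lib-1.0.jar`; with `[type]` added they become `lib/example-lib-jar-1.0.jar` and `lib/example-lib-source-1.0.jar`.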

Once the changes above are in place, run build.xml as an Ant build. This generates the runtime directory.

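3. Create the debug environment
Our code still shows many red error markers, so let's get rid of them now.

First, look through each plugin directory under runtime/local/plugins for JARs that are not yet on the build path. These are JARs that individual plugins need and that were loaded through each plugin's own Ivy file; adding them clears most of the red markers. In the end, four plugins will still show errors: parse-js, parse-ext, parse-swf, and parse-zip.
This is a problem in the Nutch project itself. These plugins reference classes from the Nutch 1.x packages, so they fail to compile, and for some reason the developers never updated them. They are generally not needed for initial debugging, so you can simply remove them from the source path.

Add a urls folder in the project root and put a seed.txt file in it that contains one site URL, for example: http://nutch.apache.org/
Open the Crawler class in the crawl package under src/java and create a run configuration for it.
The first tab is already filled in by default.
Switch to the second tab, Arguments,
and enter: urls -depth 3 -topN 5
Finally, click Run to crawl the site's link information.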
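If you prefer to create the seed folder programmatically, a minimal sketch follows. The class and method names are made up for this example, and `main` writes into a temporary directory instead of the real project root so that it can be run safely. In the run arguments, `urls` is the seed folder; as far as I understand the Crawler options, `-depth` is the number of crawl rounds and `-topN` the maximum number of URLs fetched per round, but treat that reading as an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch.
final class SeedFileSketch {
    // Creates <projectRoot>/urls/seed.txt with one URL per line and returns its path.
    static Path writeSeedFile(Path projectRoot, List<String> urls) {
        try {
            Path urlsDir = Files.createDirectories(projectRoot.resolve("urls"));
            return Files.write(urlsDir.resolve("seed.txt"), urls, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new IllegalStateException("could not write seed file", e);
        }
    }

    // Reads the seed file back as a list of lines.
    static List<String> readSeeds(Path seedFile) {
        try {
            return Files.readAllLines(seedFile, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new IllegalStateException("could not read seed file", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for the Eclipse project root.
        Path projectRoot = Files.createTempDirectory("nutch-project");
        Path seedFile = writeSeedFile(projectRoot, Arrays.asList("http://nutch.apache.org/"));
        System.out.println("wrote " + seedFile);
        System.out.println("seeds: " + readSeeds(seedFile));
    }
}
```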

References:
http://nlp.solutions.asia/?p=180
http://wiki.apache.org/nutch/RunNutchInEclipse

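Afterword:
Nutch's official documentation and source-code management are quite disappointing. The official documentation contains a fair number of errors and unclear explanations, and it often happens that following it step by step simply does not work. Then there is the source management: when Nutch moved from 1.x to 2.x, all parser plugins other than parse-html (js, ext, zip, swf, and so on) were replaced by parse-tika, yet these plugins were never removed from the Nutch 2.1 source. If you add them to the build path without sorting out what they reference, the errors never go away. Finally, there is too little documentation for Nutch 2.1, so getting this environment set up takes considerable effort. I hope the project will properly resolve these issues.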

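Additional notes:
1. How to add the right plugins
nutch-site.xml contains a plugin.includes property. Use it to see which plugins are needed by default, and when adding source folders add only those required plugins. You can also write your own plugins for your Nutch setup; see the documentation on writing plugins.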
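To work out which plugin source folders are worth adding, you can test directory names against the plugin.includes expression. The sketch below does this with `java.util.regex`. It assumes a plugin is included only when its whole directory name matches the expression, and the candidate directory names in `main` are just a sample; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class PluginIncludesSketch {
    // The plugin.includes value used in nutch-site.xml earlier in this article.
    static final String PLUGIN_INCLUDES =
        "protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)"
        + "|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic";

    // Returns the directory names that fully match the expression, in input order.
    static List<String> included(String regex, List<String> pluginDirs) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String dir : pluginDirs) {
            if (pattern.matcher(dir).matches()) { // assumption: full match required
                result.add(dir);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample of directory names found under src/plugin.
        List<String> dirs = Arrays.asList(
            "parse-html", "parse-js", "parse-swf", "parse-tika",
            "index-basic", "urlfilter-regex", "scoring-opic");
        System.out.println(included(PLUGIN_INCLUDES, dirs));
    }
}
```

Under that assumption, parse-js and parse-swf are filtered out, which is consistent with removing the four broken parser plugins from the source path.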

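2. How to run on Windows
Most of the steps above are the same on Linux and Windows. (So why ask for Linux at all?) The catch is that Nutch is built on Hadoop, and Hadoop is designed to run on Linux. On Linux everything runs fine; on Windows you will hit one problem after another. A key issue is that Linux and Windows path names differ, so the plugin.folders path in the configuration above has to be changed to src/plugin. After installing MySQL, the counterpart of /etc/my.cnf on Windows is the my.ini file in the MySQL installation directory, so the settings added to /etc/my.cnf earlier go into my.ini instead. In addition, Hadoop uses Linux-style paths by default, so the Hadoop source has to be modified. You can download a hadoop-1.0.2-modified.jar from the internet and use it in place of the hadoop-1.0.3 that Ivy downloads; this JAR works around the Linux-specific path handling so that it can be used on Windows. With these changes, Nutch 2.1 can be used on Windows as well.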

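3. MySQL will not restart after the parameters are copied into /etc/my.cnf
This is usually because one of the parameters you pasted already exists in the file, so the two entries conflict. Check whether any of the parameters was already set in the configuration file and remove the old entry. In my case character-set-server was already set to utf8; changing it to utf8mb4 fixed the problem.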
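A quick way to spot such repeated settings is to scan the [mysqld] section for option names that occur more than once. The following is a minimal sketch with made-up class and method names; it works on a list of lines rather than on the real /etc/my.cnf, and it treats dashes and underscores in option names as equivalent, as MySQL does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of MySQL or Nutch.
final class MyCnfDuplicateCheck {
    // Returns option names that occur more than once inside the given section.
    static List<String> duplicateKeys(List<String> lines, String section) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        boolean inSection = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith(";")) {
                continue; // skip blank lines and comments
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                inSection = line.equals("[" + section + "]");
                continue;
            }
            if (!inSection) {
                continue;
            }
            // Normalize so that character_set_server and character-set-server compare equal.
            String key = line.split("=", 2)[0].trim().replace('_', '-');
            if (!seen.add(key)) {
                duplicates.add(key);
            }
        }
        return new ArrayList<>(duplicates);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "[mysqld]",
            "character-set-server=utf8",      // entry that was already in the file
            "innodb_file_per_table=true",
            "character-set-server=utf8mb4",   // entry pasted from this article
            "collation-server=utf8mb4_unicode_ci");
        System.out.println(duplicateKeys(lines, "mysqld"));
    }
}
```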

0 0