Web Crawling and Data Mining with Apache Nutch (Translation + Study Notes)_01

Source: Internet | Published by: 养树的软件 | Editor: 程序博客网 | Time: 2024/06/06 07:05

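Over the next two months, 笨小葱 will finish translating this legendary masterpiece that sells for 418 yuan a copy. My English is poor, so I can only give the general meaning as I understand it. Where the translation falls short, please point it out and I will correct it promptly. Thank you for bearing with me.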

Preface

Apache Nutch is an open source web crawler software that is used for crawling

websites. It is extensible and scalable. It provides facilities for parsing, indexing, and

scoring filters for custom implementations. This book is designed for making you

comfortable in applying web crawling and data mining in your existing application.

It will demonstrate real-world problems and give the solutions to those problems

with appropriate use cases.

This book will demonstrate all the practical implementations hands-on so readers

can perform the examples on their own and make themselves comfortable. The

book covers numerous practical implementations and also covers different types

of integrations.

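Apache Nutch is open source software for crawling websites. It is extensible and scalable, and it provides parsing, indexing, and scoring filters that can be customized.

This book is designed to make it easier to apply web crawling and data mining to your existing projects. It presents real-world problems and solves them with appropriate use cases.

The book walks through the practical implementations so that readers can apply the examples in their own projects and become comfortable with them. It covers many practical implementations as well as different types of integration.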

1.Getting Started with Apache Nutch

Apache Nutch is a very robust and scalable tool for web crawling; it can be

integrated with the scripting language Python for web crawling. You can use it

whenever your application contains huge data and you want to apply crawling on

your data.

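Apache Nutch is a very robust and scalable web crawling tool. It can be integrated with the scripting language Python for web crawling. Whenever your application holds a large amount of data and you want to crawl that data, you can use Nutch.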

This chapter covers the introduction to Apache Nutch and its installation, and also

guides you on crawling, parsing, and creating plugins with Apache Nutch. It will

start from the basics of how to install Apache Nutch and then will gradually take you

to the crawling of a website and creating your own plugin.

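This chapter introduces Apache Nutch, covers its installation, and guides you through crawling, parsing, and creating Apache Nutch plugins. We start with the basics of installing Apache Nutch, then gradually move on to crawling a website and creating your own plugin.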

In this chapter we will cover the following topics:

• Introducing Apache Nutch

• Installing and configuring Apache Nutch

• Verifying your Nutch installation

• Crawling your first website

• Setting up Apache Solr for search

• Integrating Solr with Nutch

• Crawling websites using crawl script

• Crawling the web, URL filters, and the CrawlDb

• Parsing and parsing filters

• Nutch plugins and Nutch plugin architecture

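The sections of this chapter are numbered as follows:

<1>Introducing Apache Nutch

<2>Installing and configuring Apache Nutch

<3>Verifying your Nutch installation

<4>Crawling your first website

<5>Setting up the Apache Solr search engine

<6>Integrating Solr with Nutch

<7>Crawling websites using the crawl script

<8>Crawling the web, URL filters, and the CrawlDb

<9>Parsing and parsing filters

<10>Nutch plugins and the Nutch plugin architecture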

By the end of this chapter, you will be comfortable playing with Apache Nutch as

you will be able to configure Apache Nutch yourself in your own environment and

you will also have a clear understanding about how crawling and parsing take place

with Apache Nutch. Additionally, you will be able to create your own Nutch plugin.

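By the end of this chapter you will be able to configure Apache Nutch on your own in your own environment, and you will have a clear understanding of how Apache Nutch crawls and parses. You will also be able to create your own Nutch plugin.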

<1>Introduction to Apache Nutch

Apache Nutch is open source WebCrawler software that is used for crawling

websites. You can create your own search engine like Google, if you understand

Apache Nutch clearly. It will provide you with your own search engine, which can

increase your application page rank in searching and also customize your application

searching according to your needs. It is extensible and scalable. It facilitates parsing,

indexing, creating your own search engine, customizing search according to needs,

scalability, robustness, and ScoringFilter for custom implementations. ScoringFilter

is a Java class that is used while creating the Apache Nutch plugin. It is used for

manipulating scoring variables.

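Apache Nutch is open source web crawler software for crawling websites. If you understand Apache Nutch well, you can build your own search engine along the lines of Google. It gives you a search engine of your own, which can raise your application's page rank in search and lets you tailor the application's search to your needs. It is extensible and scalable. It supports parsing, indexing, building your own search engine, customizing search as needed, scalability, robustness, and custom implementations of ScoringFilter. ScoringFilter is a Java class used when creating an Apache Nutch plugin; it manipulates the scoring variables.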

We can run Apache Nutch on a single machine as well as on a distributed

environment such as Apache Hadoop. It is written in Java. We can find broken links

using Apache Nutch and create a copy of all the visited pages for searching over,

for example, while building indexes. We can find web page hyperlinks in an

automated manner.

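Apache Nutch can run on a single machine or in a distributed environment such as Apache Hadoop. It is written in Java. With Apache Nutch we can find broken links and keep a copy of every visited page to search over, for example when building indexes. We can also discover the hyperlinks in web pages automatically.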

Apache Nutch can be integrated with Apache Solr easily and we can index all the

web pages that are crawled by Apache Nutch to Apache Solr. We can then use

Apache Solr for searching the web pages which are indexed by Apache Nutch.

Apache Solr is a search platform that is built on top of Apache Lucene. It can be

used for searching any type of data, for example, web pages.

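Apache Nutch integrates easily with Apache Solr, so every web page crawled by Apache Nutch can be indexed into Apache Solr. We can then use Apache Solr to search those pages. Apache Solr is a search platform built on top of Apache Lucene, and it can search any type of data, for example web pages.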

<2>Installing and configuring Apache Nutch

In this section, we are going to cover the installation and configuration steps of

Apache Nutch. So we will first start with the installation dependencies in Apache

Nutch. After that, we will look at the steps for installing Apache Nutch. Finally, we

will test Apache Nutch by applying crawling on it.

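This section covers the installation and configuration steps for Apache Nutch. We first look at the software that Apache Nutch depends on, then install Apache Nutch step by step, and finally run a crawl to check that the installation works.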

Installation dependencies

The dependencies are as follows:

• Apache Nutch 2.2.1

• HBase 0.90.4

• Ant

• JDK 1.6
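Before going further, it may help to confirm that the build tools from the list above are reachable from the shell. The following is only a minimal sketch, assuming a POSIX shell; the helper names have_tool and report_tools are made up for illustration, and the check only tells you whether a command exists on the PATH, not which version it is.

```shell
#!/bin/sh
# have_tool (hypothetical helper): succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# report_tools (hypothetical helper): print one status line per tool name.
report_tools() {
    for tool in "$@"; do
        if have_tool "$tool"; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

# The JDK provides the java command and Ant provides the ant command.
report_tools java ant
```

A "missing" line means that tool should be installed before continuing with the steps below.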

Apache Nutch comes in different branches, for example, 1.x, 2.x, and so on. The key

difference between Apache Nutch 1.x and Apache Nutch 2.x is that in the former,

we have to manually type each command step-by-step for crawling, which will be

explained later in this chapter. In the latter, Apache Nutch developers create a crawl

script that will do crawling for us by just running that script; there is no need to type

commands step-by-step.

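Apache Nutch is released in several branches, such as 1.x and 2.x.

Nutch 1.x and Nutch 2.x differ mainly in their model. In 1.x we have to run each crawl command by hand, step by step. Later the Apache Nutch developers created a crawl script that runs the whole sequence in one go, so typing the commands one by one is no longer necessary.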

There may be more differences but I have covered just one.

I have used Apache Nutch 2.2.1 because it is the latest version at the time of

writing this book. The steps for installation and configuration of Apache Nutch

are as follows:

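There are more differences; this one is given only as an example.

The version used here is Apache Nutch 2.2.1, the latest at the time the book was written. The installation and configuration steps are as follows:

(First install JDK 1.6 or later and Ant. If you are unsure how, see this post by 笨小葱: http://blog.csdn.net/sunshine920103/article/details/46777981)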

1. Download Apache Nutch from the Apache website. You may download

Nutch from http://nutch.apache.org/downloads.html.

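Download Apache Nutch from http://nutch.apache.org/downloads.html. (The latest version is now 2.3. If you use 2.3, the MySQL integration later on runs into some problems, so 笨小葱 suggests staying with Nutch 2.2.1 for now. Scroll down the download page to the link that leads to the earlier Nutch releases.)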

2. Click on apache-nutch-2.2.1-src.tar.gz under the Mirrors column in the

Downloads tab. You can extract it by typing the following commands:

# cd $NUTCH_HOME

# tar -zxvf apache-nutch-2.2.1-src.tar.gz

Here $NUTCH_HOME is the directory where your Apache Nutch resides.

After downloading the tar.gz file, change to the directory where it is stored and run tar -zxvf apache-nutch-2.2.1-src.tar.gz to extract it.
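The extraction step can be wrapped in a small function that fails early when the archive is not where you expect it. This is a sketch rather than part of the book; extract_archive is a hypothetical helper name, and the paths in the example call are assumptions.

```shell
#!/bin/sh
# extract_archive (hypothetical helper): unpack a .tar.gz into a target directory.
# $1: path to the archive, $2: directory to extract into (created if missing)
extract_archive() {
    archive=$1
    target=$2
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$target" && tar -zxf "$archive" -C "$target"
}

# Example call; the download location is an assumption:
# extract_archive "$HOME/Downloads/apache-nutch-2.2.1-src.tar.gz" "$NUTCH_HOME"
```

The same function works for the HBase archive in a later step, since both are ordinary gzip-compressed tarballs.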

3. Download HBase. You can get it from

http://archive.apache.org/dist/hbase/hbase-0.90.4/.

HBase is the Apache Hadoop database that is distributed, a big data store,

scalable, and is used for storing large amounts of data. You should use

Apache HBase when you want real-time read/write accessibility of your big

data. It provides modular and linear scalability. Read and write operations

are very consistent. Here, we will use Apache HBase for storing data, which

is crawled by Apache Nutch. Then we can log in to our database and access it

according to our needs.

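Download HBase. You can get it from http://archive.apache.org/dist/hbase/hbase-0.90.4/. HBase is the distributed Apache Hadoop database, a scalable big data store for large amounts of data. Use Apache HBase when you need real-time read/write access to your data. It provides modular and linear scalability, and its reads and writes are consistent. Here we use Apache HBase to store the data crawled by Apache Nutch; we can then log in to the database and retrieve the data as needed.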

4. We now need to extract HBase, for example, Hbase.x.x.tar.gz. Go to the

terminal and reach up to the path where your Hbase.x.x.tar.gz resides.

Then type the following command for extracting it:

tar -zxvf Hbase.x.x.tar.gz

It will extract all the files in the respective folder.

Next we extract HBase. Open a terminal, go to the directory where Hbase.x.x.tar.gz is stored, and run the following command:

tar -zxvf Hbase.x.x.tar.gz

5. Now we need to do HBase configuration. First, go to hbase-site.xml,

which you will find in <Your HBase home>/conf and modify it as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value><Your path></value>
<!-- You need to create one directory and assign a path up to that
directory. That directory will be used by Apache HBase to store
all relevant information. -->
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value><Your path></value>
<!-- You need to create one directory and assign a path up to
that directory. That directory will be used by Apache HBase
to store all relevant information related to Apache ZooKeeper,
which comes inbuilt with Apache HBase. Apache ZooKeeper is an
open source server which is used for distributed coordination.
You can learn more about Apache ZooKeeper from
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Index -->
</property>
</configuration>

Just make sure that the /etc/hosts file contains the loopback address,

which is 127.0.0.1 (in some Linux distributions, it might be 127.0.1.1).

Otherwise you might face an issue while running Apache HBase.
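If you would rather generate the configuration file than edit it by hand, a here-document can write it. This is a minimal sketch under stated assumptions: write_hbase_site is a hypothetical helper, HBASE_HOME and the directories in the example call are placeholders, and the values you pass are written into the file verbatim.

```shell
#!/bin/sh
# write_hbase_site (hypothetical helper): generate a minimal hbase-site.xml.
# $1: output file, $2: HBase data directory, $3: ZooKeeper data directory
write_hbase_site() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>$2</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Example call; both directories are placeholders that you create yourself:
# mkdir -p /data/hbase /data/zookeeper
# write_hbase_site "$HBASE_HOME/conf/hbase-site.xml" /data/hbase /data/zookeeper
```

The function overwrites the output file, so back up an existing hbase-site.xml first if it holds settings you want to keep.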

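Now we need to configure HBase. First find the hbase-site.xml file, which is in the conf directory under the HBase home directory, and modify it as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value><Your path></value>
<!-- HBase stores all of its data under this path; point it at an existing directory or create a new one. -->
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value><Your path></value>
<!-- HBase stores everything related to Apache ZooKeeper under this path; point it at an existing directory or create a new one. -->
</property>
</configuration>

Make sure that /etc/hosts contains the loopback address 127.0.0.1; otherwise HBase may fail to run.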
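The loopback requirement can be checked from the shell before starting HBase. The sketch below assumes the usual hosts file layout of one address followed by whitespace and host names per line; has_loopback is a hypothetical helper name, and it accepts either 127.0.0.1 or 127.0.1.1 as mentioned above.

```shell
#!/bin/sh
# has_loopback (hypothetical helper): succeeds if the hosts file has an
# uncommented entry for 127.0.0.1 or 127.0.1.1.
# $1: hosts file to inspect (defaults to /etc/hosts)
has_loopback() {
    hosts_file=${1:-/etc/hosts}
    grep -Eq '^[[:space:]]*127\.0\.[01]\.1[[:space:]]' "$hosts_file"
}

if has_loopback; then
    echo "loopback entry present"
else
    echo "loopback entry missing (or hosts file unreadable)"
fi
```

Lines that start with # do not match, so a commented-out entry is reported as missing.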

6. Specify Gora backend in nutch-site.xml. You will find this file at $NUTCH_HOME/conf.

<property>

<name>storage.data.store.class</name>

<value>org.apache.gora.hbase.store.HBaseStore</value>

<description>Default class for storing data</description>

</property>

The explanation of the preceding configuration is as follows:

• Find the name of the data store class for storing data of

Apache Nutch:

<name>storage.data.store.class</name>

• Find the database in which all the data related to HBase will reside:

<value>org.apache.gora.hbase.store.HBaseStore</value>

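Specify the Gora backend in nutch-site.xml. The file is in the conf directory under the Nutch home directory. Modify it as follows:

<property>

<name>storage.data.store.class</name> (the name of the class Apache Nutch uses to store data)

<value>org.apache.gora.hbase.store.HBaseStore</value> (selects HBase as the database)

<description>Default class for storing data</description>

</property>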
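After editing, you can read the setting back to confirm which store class is configured. This is a rough sketch, not a real XML parser: storage_class is a hypothetical helper, and it assumes the <name> and <value> elements sit on separate lines, as in the snippet above.

```shell
#!/bin/sh
# storage_class (hypothetical helper): print the <value> that follows the
# storage.data.store.class <name> line in a nutch-site.xml style file.
# Assumes <name> and <value> are on separate lines.
storage_class() {
    awk '
        /<name>storage\.data\.store\.class<\/name>/ { want = 1; next }
        want && /<value>/ {
            gsub(/.*<value>|<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}

# Example call; NUTCH_HOME is the directory from the extraction step:
# storage_class "$NUTCH_HOME/conf/nutch-site.xml"
```

An empty result means the property was not found in that layout, which is a hint to open the file and look at it directly.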

7. Make sure that the HBase gora-hbase dependency is available in ivy.xml.

You will find this file in <Your Apache Nutch home>/ivy. Put the following

configuration into the ivy.xml file:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

This line is commented out by default, so you need to uncomment it.

In the ivy.xml file, in the ivy directory under the Nutch home directory, uncomment the following dependency:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
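To confirm that the dependency really is active after editing, a small awk script can track whether each line falls inside an XML comment. This is only a sketch: gora_hbase_enabled is a hypothetical helper, and the comment tracking is line based, so it assumes the dependency does not share a line with a closing comment marker.

```shell
#!/bin/sh
# gora_hbase_enabled (hypothetical helper): succeeds if the file contains a
# gora-hbase dependency line that is not inside an XML comment.
# $1: path to ivy.xml
gora_hbase_enabled() {
    awk '
        /<!--/ { in_comment = 1 }
        !in_comment && /name="gora-hbase"/ { found = 1 }
        /-->/ { in_comment = 0 }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# Example call; NUTCH_HOME is the directory from the extraction step:
# gora_hbase_enabled "$NUTCH_HOME/ivy/ivy.xml" && echo "gora-hbase is enabled"
```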

to be continued..................

0 0