Nutch 2.3.1 + HBase: single-node setup

Source: Internet. Editor: 程序博客网. Time: 2024/04/24 13:23

The official Nutch wiki contains the following passage:
URL: https://wiki.apache.org/nutch/Nutch2Tutorial

Download and configure HBase 0.98.8-hadoop2. You can get it here (N.B. Each version of Gora is tied to a particular version of HBase, we therefore suggest you use this version if possible. If you decide to use another version of HBase please do not be surprised if the stack does not work. You should also obtain current documentation for HBase however please again take into consideration that the version of HBase we recommend you use may not correlate to the current documentation. Please keep this in mind and use your initiative.)

In other words, each Gora release is tied to a specific HBase version, so the tutorial recommends using exactly that version where possible. If you choose a different HBase version, do not be surprised if Nutch fails to run.
The latest Gora version is 0.7, but Nutch 2.3 is built against Gora 0.6.1.

Platform

Nutch 2.3.1 + HBase 0.98.8-hadoop2 + Hadoop 2.5.2

My system is CentOS 7. Any Linux platform should work; Ubuntu is fine too.

First: install Ant

yum install ant

Second: install Hadoop

Download the Hadoop package, unpack it, and then work through the steps below.

vim /opt/hadoop-2.5.2/etc/hadoop/core-site.xml

Add the following inside the configuration element:

<!-- Address of the HDFS NameNode -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>
<!-- Directory where Hadoop stores the files it generates at runtime -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-2.5.2/tmp</value>
</property>

vim /opt/hadoop-2.5.2/etc/hadoop/hadoop-env.sh

Change JAVA_HOME to the directory of your own JDK:

# The java implementation to use.
export JAVA_HOME=/lib/jvm/java-1.7.0

vim /opt/hadoop-2.5.2/etc/hadoop/hdfs-site.xml
Add the following inside the configuration element:

<!-- Number of replicas HDFS keeps for each block -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop-2.5.2/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop-2.5.2/dfs/data</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

vim /opt/hadoop-2.5.2/etc/hadoop/mapred-site.xml

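Add the following inside the configuration element (if mapred-site.xml does not exist yet, copy it from mapred-site.xml.template first):

<!-- Tell Hadoop to run MapReduce jobs on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>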

vim /opt/hadoop-2.5.2/etc/hadoop/yarn-site.xml

Add the following inside the configuration element:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
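
After editing the four Hadoop files, it can help to read a property back and confirm that the name and value are what you intended; a misspelled property name is silently ignored by Hadoop. The helper below is only a sketch: get_prop is a hypothetical name, not a Hadoop command, and it assumes each <value> directly follows its <name>, as in the snippets above.

```shell
# get_prop FILE NAME: print the <value> that follows <name>NAME</name> in FILE.
# Hypothetical helper for illustration; prints nothing if NAME is not found.
get_prop() {
    # Join the file into one line so multi-line properties also match,
    # then capture the text of the <value> element right after the name.
    tr -d '\n' < "$1" |
        sed -n "s:.*<name>$2</name>[[:space:]]*<value>\([^<]*\)</value>.*:\1:p"
}

# Example, using the paths from this guide:
# get_prop /opt/hadoop-2.5.2/etc/hadoop/mapred-site.xml mapreduce.framework.name
```

An empty result means the property name was not found in the file, which is usually a sign of a typo.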

Third: install HBase

vim /opt/hbase-0.98.8/conf/hbase-env.sh

Change JAVA_HOME to the location of your own JDK:

# The java implementation to use.  Java 1.6 required.
export JAVA_HOME=/lib/jvm/java-1.7.0

vim /opt/hbase-0.98.8/conf/hbase-site.xml

Add:

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
</configuration>

Add the environment variables

vim /etc/profile

Add:

export JAVA_HOME=/lib/jvm/java-1.7.0
export HADOOP_HOME=/opt/hadoop-2.5.2
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin

Then run

source /etc/profile

so that the environment variables take effect.
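
A quick way to confirm that the variables took effect in the current shell is to check that each one is set and points at an existing directory. This is a minimal sketch; check_env is a hypothetical helper name, and the list of variables is simply the two exported above.

```shell
# check_env: report each variable from this guide that is unset or that
# points at a directory that does not exist. Returns non-zero on any problem.
check_env() {
    rc=0
    for var in JAVA_HOME HADOOP_HOME; do
        # Indirectly read the value of the variable whose name is in $var.
        eval "dir=\${$var:-}"
        if [ -z "$dir" ]; then
            echo "$var is not set"
            rc=1
        elif [ ! -d "$dir" ]; then
            echo "$var points at a missing directory: $dir"
            rc=1
        fi
    done
    return $rc
}

# Usage: check_env && echo "environment looks fine"
```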

Nutch installation steps

Step1:

From the official archive:
http://archive.apache.org/dist/nutch/2.3.1/
download apache-nutch-2.3.1-src.tar.gz, then unpack it.
Command:

tar -zxvf apache-nutch-2.3.1-src.tar.gz

Step2:

Unpacking produces the directory apache-nutch-2.3.1.

vim /apache-nutch-2.3.1/ivysettings.xml

Find

Step3:

vim /apache-nutch-2.3.1/conf/nutch-site.xml

Add the following properties.

<property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
</property>
<property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
</property>
<property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.</description>
</property>

Step4:

vim /apache-nutch-2.3.1/ivy/ivy.xml

Uncomment this dependency:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />

Step5:

vim /apache-nutch-2.3.1/conf/gora.properties

Add:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
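
If you script this step instead of editing by hand, it is worth making the change idempotent so that repeated runs do not pile up duplicate lines. A minimal sketch, assuming the property is simply appended at the end of the file; set_datastore is a hypothetical helper name.

```shell
# set_datastore FILE: make sure FILE selects HBaseStore as the default
# Gora datastore, without adding a duplicate line on repeated runs.
set_datastore() {
    line='gora.datastore.default=org.apache.gora.hbase.store.HBaseStore'
    # -x matches the whole line and -F treats the text literally, so a
    # commented-out copy of the line does not count as already present.
    grep -qxF "$line" "$1" 2>/dev/null || echo "$line" >> "$1"
}

# Usage, with the path from this guide:
# set_datastore /apache-nutch-2.3.1/conf/gora.properties
```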

Step6:

In the /apache-nutch-2.3.1 directory, run

ant runtime -verbose

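Wait until all dependencies have been downloaded. The -verbose flag is there so that you can watch the download progress. How long this takes depends on your network speed, anywhere from 10 minutes to an hour.

Remember to run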

cp hbase/lib/hbase*.jar /nutch-2.3.1/runtime/local/lib

Otherwise Nutch reports an HBaseConfiguration error.
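
To confirm that the copy actually landed, you can count the HBase jars in Nutch's local runtime before starting a crawl. A minimal sketch; count_hbase_jars is a hypothetical helper name, and the directory in the usage line is the copy target from the command above.

```shell
# count_hbase_jars DIR: print how many hbase*.jar files sit directly in DIR.
count_hbase_jars() {
    find "$1" -maxdepth 1 -type f -name 'hbase*.jar' | wc -l | tr -d ' '
}

# Usage: a result of 0 means the jars were not copied.
# count_hbase_jars /nutch-2.3.1/runtime/local/lib
```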

Then create the seed list:

mkdir /opt/urls

cd /opt/urls

touch seeds.txt

echo http://www.amazon.com/ > seeds.txt
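
A seed list holds one URL per line. Nutch works with absolute URLs, so an entry without a scheme is likely to be skipped when the seeds are injected. The sketch below lists such entries; bad_seeds is a hypothetical helper name, and it only checks for an http:// or https:// prefix.

```shell
# bad_seeds FILE: print every non-blank line of FILE that does not start
# with http:// or https://. Prints nothing when all entries look fine
# (the exit status is then non-zero, because the last grep matched nothing).
bad_seeds() {
    grep -v -E '^[[:space:]]*$' "$1" | grep -v -E '^https?://'
}

# Usage: bad_seeds /opt/urls/seeds.txt
```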

First, start Hadoop:

/opt/hadoop-2.5.2/sbin/start-all.sh
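
Before running start-all.sh for the very first time, HDFS needs a formatted NameNode directory; on a fresh install the NameNode will otherwise not come up. Formatting an existing directory wipes the HDFS metadata, so it makes sense to guard the call. A minimal sketch, in which needs_format is a hypothetical helper and the directory is the dfs.namenode.name.dir value configured earlier:

```shell
# needs_format NAME_DIR: succeed when NAME_DIR has not been formatted yet,
# i.e. it contains no current/VERSION file.
needs_format() {
    [ ! -f "$1/current/VERSION" ]
}

# Usage: format only when the directory is still empty.
# needs_format /opt/hadoop-2.5.2/dfs/name && /opt/hadoop-2.5.2/bin/hdfs namenode -format
```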

Then start HBase:

/opt/hbase-0.98.8/bin/start-hbase.sh
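
Before launching the crawl, it is worth checking that all daemons are really up. The JDK's jps tool lists running Java processes as "pid ClassName"; the sketch below compares that list against the daemons I would expect for this layout. missing_daemons is a hypothetical helper, and the daemon list is an assumption: HDFS plus YARN from start-all.sh, and HBase managing its own ZooKeeper (HQuorumPeer).

```shell
# missing_daemons: read `jps` output on stdin and print each expected
# daemon that does not appear in it. The expected list is an assumption
# for the single-node setup described in this guide.
missing_daemons() {
    # Keep only the class-name column of the "pid ClassName" lines.
    seen=$(awk '{print $2}')
    for d in NameNode DataNode ResourceManager NodeManager HMaster HRegionServer HQuorumPeer; do
        # -x requires the whole line to match, so SecondaryNameNode
        # is not mistaken for NameNode.
        echo "$seen" | grep -qx "$d" || echo "$d"
    done
}

# Usage: jps | missing_daemons
```

No output means every expected daemon was found.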

Finally, start the crawler:

/opt/apache-nutch-2.3.1/runtime/local/bin/crawl /opt/urls tablename 2
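
With the HBase backend, Nutch 2.x stores crawl results in a table named after the crawl ID with _webpage appended, so the second argument above (tablename) should produce a table called tablename_webpage. A small sketch for checking the result; webpage_table is a hypothetical helper, and the commented usage line assumes the HBase install path used in this guide.

```shell
# webpage_table CRAWL_ID: print the HBase table name Nutch 2.x uses for
# the given crawl ID.
webpage_table() {
    echo "${1}_webpage"
}

# Usage: count the rows stored by the crawl via the HBase shell.
# echo "count '$(webpage_table tablename)'" | /opt/hbase-0.98.8/bin/hbase shell
```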

For running Nutch in Eclipse, I followed the official wiki directly. These setup steps can be skipped for now; the original page is:
https://wiki.apache.org/nutch/RunNutchInEclipse
