Configuring and Running Nutch 1.0
http://blog.csdn.net/shirdrn/article/details/5922087
I hadn't looked at Nutch for a while, so I took advantage of the recent holiday to try it again; I may gradually dig deeper into Nutch and related projects (such as Hadoop) later. Without further ado, let's get Nutch running with the simplest possible configuration.
This time I downloaded Nutch 1.0, which seems to differ slightly from earlier versions in its configuration.
Since Nutch is built on the Hadoop project, some of the configuration Hadoop needs to run must be in place as well. To explain the actual configuration and run process in more detail, I'll walk through it step by step. I used the root account directly; the Linux distribution is RHEL 5.
1. Passwordless SSH setup on Linux
Open a terminal and run:
- $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
- $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Then run:
- $ ssh localhost
If you can log in without being asked for a password, passwordless SSH is set up correctly.
2. Configure Hadoop
(1) Set JAVA_HOME in conf/hadoop-env.sh.
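For example, the line in conf/hadoop-env.sh would look like this (the JDK path below is only an illustration; substitute the path of your own JDK install):

```shell
# conf/hadoop-env.sh -- point JAVA_HOME at the local JDK
# (example path; adjust to your actual installation)
export JAVA_HOME=/usr/java/jdk1.6.0_21
```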
(2) Edit conf/hadoop-site.xml as follows:
- <?xml version="1.0"?>
- <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
- <!-- Put site-specific property overrides in this file. -->
- <configuration>
- <property>
- <name>fs.default.name</name>
- <value>hdfs://localhost:9000</value>
- </property>
- <property>
- <name>mapred.job.tracker</name>
- <value>localhost:9001</value>
- </property>
- <property>
- <name>dfs.replication</name>
- <value>1</value>
- </property>
- </configuration>
3. Configure Nutch
(1) Edit conf/nutch-site.xml with the following content:
- <?xml version="1.0"?>
- <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
- <!-- Put site-specific property overrides in this file. -->
- <configuration>
- <property>
- <name>http.agent.name</name>
- <value>z-hacker</value>
- </property>
- <property>
- <name>http.agent.description</name>
- <value>I'm z-hacker.</value>
- </property>
- <property>
- <name>http.agent.url</name>
- <value>www.z-hacker.com</value>
- </property>
- <property>
- <name>http.agent.email</name>
- <value>crawler@z-hacker.com</value>
- </property>
- </configuration>
Of these properties, http.agent.name is required; the others are optional.
(2) Edit conf/crawl-urlfilter.txt; changing the following line is enough:
- # accept hosts in MY.DOMAIN.NAME
- +^http://([a-z0-9]*\.)*sina.com.cn/
This rule makes the crawler accept pages from sina.com.cn.
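As a quick sanity check, the filter rule can be tried outside Nutch. Nutch evaluates Java regular expressions, but this particular pattern behaves the same under grep's extended regex syntax, so a rough sketch like the following shows what the rule accepts and rejects (the sample URLs are my own, not part of the original setup):

```shell
# Test the crawl-urlfilter pattern against a few sample URLs.
PATTERN='^http://([a-z0-9]*\.)*sina.com.cn/'
for url in "http://www.sina.com.cn/" \
           "http://news.sina.com.cn/china/" \
           "http://www.example.com/"; do
    if echo "$url" | grep -qE "$PATTERN"; then
        echo "accept  $url"
    else
        echo "reject  $url"
    fi
done
```

Only URLs whose host ends in sina.com.cn are accepted; anything else is filtered out before fetching.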
4. Prepare the seed URL file for the Nutch crawler
Create a directory in the current working directory, e.g. urls, and under it create text files containing one valid URL per line.
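Concretely, the step above might look like this (the directory name urls and the file name seed.txt are arbitrary choices):

```shell
# Create the seed directory and a URL list file, one URL per line
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://www.sina.com.cn/
EOF
cat urls/seed.txt
```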
5. Format the HDFS filesystem
- [root@localhost nutch-1.0]# bin/hadoop namenode -format
If this is the first run on this machine, it should simply complete without errors; if HDFS has been formatted here before, you may be prompted to confirm re-formatting:
Re-format filesystem in /tmp/hadoop-root/dfs/name ? (Y or N) Y
Enter Y to re-format.
6. Start the five Hadoop daemons
- [root@localhost nutch-1.0]# bin/start-all.sh
If no errors occur, the startup succeeded and five daemons are now running. You can verify with the jps command that the following five processes (plus jps itself) are present:
- [root@localhost nutch-1.0]# jps
- 15559 Jps
- 15407 JobTracker
- 15243 DataNode
- 15150 NameNode
- 15512 TaskTracker
- 15349 SecondaryNameNode
If any of them is missing, your configuration still has a problem and you need to check the logs.
For example, my log directory is logs. If jps shows that the NameNode process is missing, the NameNode daemon failed to start; inspect the logs/hadoop-root-namenode-localhost.log file to find out why.
7. Upload the seed URL directory
Upload the locally prepared seed directory (e.g. urls, containing the URL file) to HDFS:
- [root@localhost nutch-1.0]# bin/hadoop fs -put urls/ urls
If no exception occurs, the upload succeeded.
8. Start Nutch in simple (one-shot crawl) mode
Run the following command to start Nutch:
- [root@localhost nutch-1.0]# bin/nutch crawl urls -dir storeDir -depth 10 -threads 5 -topN 50 >&./logs/nutch.log
Here, urls is the seed URL directory uploaded above, and storeDir is the directory where the crawled data will be stored.
If the command runs through the expected stages, the configuration works. You can follow the log with:
- [root@localhost nutch-1.0]# tail -100f logs/nutch.log
- crawl started in: storeDir
- rootUrlDir = urls
- threads = 5
- depth = 10
- topN = 50
- Injector: starting
- Injector: crawlDb: storeDir/crawldb
- Injector: urlDir: urls
- Injector: Converting injected urls to crawl db entries.
- Injector: Merging injected urls into crawl db.
- Injector: done
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: storeDir/segments/20101004091724
- Generator: filtering: true
- Generator: topN: 50
- Generator: Partitioning selected urls by host, for politeness.
- Generator: done.
- Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
- Fetcher: starting
- Fetcher: segment: storeDir/segments/20101004091724
- Fetcher: done
- CrawlDb update: starting
- CrawlDb update: db: storeDir/crawldb
- CrawlDb update: segments: [storeDir/segments/20101004091724]
- CrawlDb update: additions allowed: true
- CrawlDb update: URL normalizing: true
- CrawlDb update: URL filtering: true
- CrawlDb update: Merging segment data into db.
- CrawlDb update: done
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: storeDir/segments/20101004091902
- Generator: filtering: true
- Generator: topN: 50
- Generator: Partitioning selected urls by host, for politeness.
- Generator: done.
- Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
- Fetcher: starting
- Fetcher: segment: storeDir/segments/20101004091902
- Fetcher: done
- CrawlDb update: starting
- CrawlDb update: db: storeDir/crawldb
- CrawlDb update: segments: [storeDir/segments/20101004091902]
- CrawlDb update: additions allowed: true
- CrawlDb update: URL normalizing: true
- CrawlDb update: URL filtering: true
- CrawlDb update: Merging segment data into db.
- CrawlDb update: done
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: storeDir/segments/20101004092053
- Generator: filtering: true
- Generator: topN: 50
- Generator: Partitioning selected urls by host, for politeness.
- Generator: done.
- Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
- Fetcher: starting
- Fetcher: segment: storeDir/segments/20101004092053
- Fetcher: done
- CrawlDb update: starting
- CrawlDb update: db: storeDir/crawldb
- CrawlDb update: segments: [storeDir/segments/20101004092053]
- CrawlDb update: additions allowed: true
- CrawlDb update: URL normalizing: true
- CrawlDb update: URL filtering: true
- CrawlDb update: Merging segment data into db.
- CrawlDb update: done
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: storeDir/segments/20101004092711
- Generator: filtering: true
- Generator: topN: 50
- Generator: Partitioning selected urls by host, for politeness.
- Generator: done.
- Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
- Fetcher: starting
- Fetcher: segment: storeDir/segments/20101004092711
- Fetcher: done
- CrawlDb update: starting
- CrawlDb update: db: storeDir/crawldb
- CrawlDb update: segments: [storeDir/segments/20101004092711]
- CrawlDb update: additions allowed: true
- CrawlDb update: URL normalizing: true
- CrawlDb update: URL filtering: true
- CrawlDb update: Merging segment data into db.
- CrawlDb update: done
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: storeDir/segments/20101004092833
- Generator: filtering: true
- Generator: topN: 50
- Generator: 0 records selected for fetching, exiting ...
- Stopping at depth=4 - no more URLs to fetch.
- LinkDb: starting
- LinkDb: linkdb: storeDir/linkdb
- LinkDb: URL normalize: true
- LinkDb: URL filter: true
- LinkDb: adding segment: hdfs://localhost:9000/user/root/storeDir/segments/20101004091724
- LinkDb: adding segment: hdfs://localhost:9000/user/root/storeDir/segments/20101004091902
- LinkDb: adding segment: hdfs://localhost:9000/user/root/storeDir/segments/20101004092053
- LinkDb: adding segment: hdfs://localhost:9000/user/root/storeDir/segments/20101004092711
- LinkDb: done
- Indexer: starting
- Indexer: done
- Dedup: starting
- Dedup: adding indexes in: storeDir/indexes
- Dedup: done
- merging indexes to: storeDir/index
- Adding hdfs://localhost:9000/user/root/storeDir/indexes/part-00000
- done merging
- crawl finished: storeDir
9. Check the Nutch results
We can inspect what was stored on HDFS under the data directory specified above:
- [root@localhost nutch-1.0]# bin/hadoop fs -ls /user/root/storeDir
- Found 5 items
- drwxr-xr-x - root supergroup 0 2010-10-04 09:28 /user/root/storeDir/crawldb
- drwxr-xr-x - root supergroup 0 2010-10-04 09:31 /user/root/storeDir/index
- drwxr-xr-x - root supergroup 0 2010-10-04 09:30 /user/root/storeDir/indexes
- drwxr-xr-x - root supergroup 0 2010-10-04 09:29 /user/root/storeDir/linkdb
- drwxr-xr-x - root supergroup 0 2010-10-04 09:27 /user/root/storeDir/segments