Nutch 1.0 Log Analysis (reposted)


Creating files and directories on the Hadoop cluster

[nutch@gc01vm13 /]$ cd ./home/nutch/nutchinstall/nutch-1.0/

[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls

Found 1 items

drwxr-xr-x   - nutch supergroup          0 2010-06-09 20:10 /user/nutch/zklin

[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -mkdir crawldatatest   // the path given to mkdir is relative to /user/nutch
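As an aside (not part of the original session), the same directory could be created with an absolute HDFS path, since relative paths resolve against the user's home directory /user/nutch:

bin/hadoop fs -mkdir /user/nutch/crawldatatest   // absolute form of the mkdir above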

[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls

Found 2 items

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:40 /user/nutch/crawldatatest

drwxr-xr-x   - nutch supergroup          0 2010-06-09 20:10 /user/nutch/zklin

 

[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -mkdir urls

[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls

Found 3 items

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:40 /user/nutch/crawldatatest

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:45 /user/nutch/urls

drwxr-xr-x   - nutch supergroup          0 2010-06-09 20:10 /user/nutch/zklin

 

[nutch@gc01vm13 nutchinstall]$ mkdir urls   // first create the urls directory locally

[nutch@gc01vm13 nutchinstall]$ cd ./urls/

[nutch@gc01vm13 urls]$ ls

[nutch@gc01vm13 urls]$ vim urls1           // seed URL file, urls1
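The content of urls1 is not shown in the log. As a sketch, a seed file simply lists one start URL per line; the address below is only a placeholder:

http://www.example.com/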

[nutch@gc01vm13 urls]$ cd ..

[nutch@gc01vm13 nutchinstall]$ cd ./nutch-1.0/

[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -copyFromLocal /home/nutch/nutchinstall/urls/urls1 urls   // copy from local disk to the cluster; the destination is a relative cluster path

 

[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -lsr  // check urls1 on the cluster

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:40 /user/nutch/crawldatatest

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:46 /user/nutch/urls

-rw-r--r--   2 nutch supergroup         31 2010-06-11 00:46 /user/nutch/urls/urls1

drwxr-xr-x   - nutch supergroup          0 2010-06-09 20:10 /user/nutch/zklin

 

[nutch@gc01vm13 nutch-1.0]$ bin/nutch crawl urls1 -dir crawldatatest -depth 3 -topN 10

 

The crawl command needs an absolute path to the URL directory; with a relative path it reports an error:

 

crawl started in: crawldatatest

 

rootUrlDir = urls1

threads = 10

depth = 3

topN = 10

Injector: starting

Injector: crawlDb: crawldatatest/crawldb

Injector: urlDir: urls1

Injector: Converting injected urls to crawl db entries.

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://gc01vm13:9000/user/nutch/urls1

         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)

         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)

         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)

         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)

         at org.apache.nutch.crawl.Injector.inject(Injector.java:160)

         at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)

 

So the crawl command has to be given an absolute path: the relative path urls1 resolves to a location that does not exist, which produces the error above.
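To confirm where the relative path points (an illustrative check, not from the original session), the exception shows that urls1 resolved to /user/nutch/urls1, while the seed file actually lives one level deeper:

bin/hadoop fs -ls /user/nutch/urls1        // does not exist, hence the InvalidInputException
bin/hadoop fs -ls /user/nutch/urls/urls1   // this is where urls1 was copied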

 

Log Analysis

 

[nutch@gc01vm13 nutch-1.0]$ bin/nutch crawl /user/nutch/urls/urls1 -dir crawldatatest -depth 3 -topN 10    // crawldatatest is where the crawl output is stored (a relative path); it should match the searcher.dir setting in nutch-site.xml

crawl started in: crawldatatest         // the name of this crawl (spider), i.e. its output directory

rootUrlDir = /user/nutch/urls/urls1     // the file or directory listing the URLs to be fetched

threads = 10

depth = 3

topN = 10

 

Injector: starting                    // inject the seed URL list

Injector: crawlDb: crawldatatest/crawldb

Injector: urlDir: /user/nutch/urls/urls1

Injector: Converting injected urls to crawl db entries.  // build the database of URLs to fetch from the injected list

Injector: Merging injected urls into crawl db.   // merge step

Injector: done

 

Generator: Selecting best-scoring urls due for fetch.   // rank pages by score to decide the fetch order

Generator: starting

Generator: segment: crawldatatest/segments/20100611004927  // create the segment that will hold this round's fetch results

Generator: filtering: true

Generator: topN: 10

Generator: Partitioning selected urls by host, for politeness.   // the fetch list is split by host and distributed across the datanodes defined in Hadoop's slaves file

Generator: done.

 

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

Fetcher: starting

Fetcher: segment: crawldatatest/segments/20100611004927  // fetch the selected pages into this segment

Fetcher: done
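The warning about 'http.agent.name' above relates to the agent properties in Nutch's configuration. A rough sketch of the relevant entries in conf/nutch-site.xml (the agent name below is made up; the point is that http.robots.agents should list that name first):

  <property>
    <name>http.agent.name</name>
    <value>MyTestSpider</value>          <!-- placeholder agent name -->
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>MyTestSpider,*</value>        <!-- the same name listed first, then * -->
  </property>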

 

CrawlDb update: starting   // after fetching, update the crawl database and add the newly discovered URLs

CrawlDb update: db: crawldatatest/crawldb

CrawlDb update: segments: [crawldatatest/segments/20100611004927]

CrawlDb update: additions allowed: true

CrawlDb update: URL normalizing: true

CrawlDb update: URL filtering: true

CrawlDb update: Merging segment data into db.

CrawlDb update: done

 

// second round of the fetch loop

Generator: Selecting best-scoring urls due for fetch.

Generator: starting

Generator: segment: crawldatatest/segments/20100611005051

Generator: filtering: true

Generator: topN: 10

Generator: Partitioning selected urls by host, for politeness.

Generator: done.

 

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

Fetcher: starting

Fetcher: segment: crawldatatest/segments/20100611005051

Fetcher: done

 

CrawlDb update: starting

CrawlDb update: db: crawldatatest/crawldb

CrawlDb update: segments: [crawldatatest/segments/20100611005051]

CrawlDb update: additions allowed: true

CrawlDb update: URL normalizing: true

CrawlDb update: URL filtering: true

CrawlDb update: Merging segment data into db.

CrawlDb update: done

 

// third round of the fetch loop

Generator: Selecting best-scoring urls due for fetch.

Generator: starting

Generator: segment: crawldatatest/segments/20100611005212

Generator: filtering: true

Generator: topN: 10

Generator: Partitioning selected urls by host, for politeness.

Generator: done.

 

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

Fetcher: starting

Fetcher: segment: crawldatatest/segments/20100611005212

Fetcher: done

 

CrawlDb update: starting

CrawlDb update: db: crawldatatest/crawldb

CrawlDb update: segments: [crawldatatest/segments/20100611005212]

CrawlDb update: additions allowed: true

CrawlDb update: URL normalizing: true

CrawlDb update: URL filtering: true

CrawlDb update: Merging segment data into db.

CrawlDb update: done

 

// the loop runs depth times in all; Nutch's intranet (local) mode crawls breadth-first, finishing all second-level pages before fetching third-level pages
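For reference, the crawl command is essentially a wrapper around the individual Nutch tools; one round of the loop above corresponds roughly to the commands below (the <segment> placeholder stands for the timestamped directory each generate step creates):

bin/nutch generate crawldatatest/crawldb crawldatatest/segments -topN 10
bin/nutch fetch crawldatatest/segments/<segment>
bin/nutch updatedb crawldatatest/crawldb crawldatatest/segments/<segment>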

LinkDb: starting           // analyze the link relationships between pages

LinkDb: linkdb: crawldatatest/linkdb

LinkDb: URL normalize: true    // normalization

LinkDb: URL filter: true        // filtered according to crawl-urlfilter.txt

LinkDb: adding segment: hdfs://gc01vm13:9000/user/nutch/crawldatatest/segments/20100611004927

LinkDb: adding segment: hdfs://gc01vm13:9000/user/nutch/crawldatatest/segments/20100611005051

LinkDb: adding segment: hdfs://gc01vm13:9000/user/nutch/crawldatatest/segments/20100611005212

LinkDb: done    // link analysis finished

 

 

Indexer: starting   // build the index

Indexer: done

Dedup: starting   // deduplicate pages

Dedup: adding indexes in: crawldatatest/indexes

Dedup: done

merging indexes to: crawldatatest/index    // merge the indexes

Adding hdfs://gc01vm13:9000/user/nutch/crawldatatest/indexes/part-00000

done merging

crawl finished: crawldatatest   // finished

[nutch@gc01vm13 nutch-1.0]$
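As a quick sanity check of the new index (a sketch, not part of the original session), the Nutch 1.0 command-line searcher can be run directly; 'apache' is just an example query, and searcher.dir in nutch-site.xml must point at the crawldatatest directory for it to find the index:

bin/nutch org.apache.nutch.searcher.NutchBean apache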

 

[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -copyToLocal crawldatatest /home/nutch/nutchinstall/

[nutch@gc01vm13 nutch-1.0]$ cd ..

[nutch@gc01vm13 nutchinstall]$ ls

confbak  crawldatatest  filesystem  hadoopscheduler  hadooptmp  nutch-1.0  urls

copyToLocal is given a relative path for the HDFS source (resolved against /user/nutch).
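Equivalently, the HDFS source could be written as an absolute path (shown for illustration only):

bin/hadoop fs -copyToLocal /user/nutch/crawldatatest /home/nutch/nutchinstall/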

 

Analysis of the Generated Data

bin/nutch crawl /user/nutch/urls/urls1 -dir crawldatatest -depth 3 -topN 10


 

The crawl creates its directories under the cluster user's home directory (/user/nutch):

[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls

Found 3 items

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:55 /user/nutch/crawldatatest

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:46 /user/nutch/urls

drwxr-xr-x   - nutch supergroup          0 2010-06-09 20:10 /user/nutch/zklin

 

Five directories are generated under crawldatatest: crawldb, segments, index, indexes, and linkdb.

[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls /user/nutch/crawldatatest

Found 5 items

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:53 /user/nutch/crawldatatest/crawldb

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:55 /user/nutch/crawldatatest/index

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:54 /user/nutch/crawldatatest/indexes

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:53 /user/nutch/crawldatatest/linkdb

drwxr-xr-x   - nutch supergroup          0 2010-06-11 00:52 /user/nutch/crawldatatest/segments

 

1) The crawldb directory holds the fetched URLs together with their fetch dates, which are used when checking pages for updates. (A sketch of commands for inspecting these directories follows the list.)

2) The linkdb directory holds the link relationships between URLs, created during the analysis run after fetching; these relationships could be used to implement something like Google PageRank.

3) The segments directory stores the fetched pages. The number of subdirectories matches the crawl depth; since I specified -depth 3, there are three of them here.

  Each segment contains six subdirectories:

  content: the content of the downloaded pages;

  crawl_fetch: the fetch status of each URL;

  crawl_generate: the set of URLs to be fetched, produced by the generate job and extended during the fetch;

  crawl_parse: the outlink data used to update the crawldb;

  parse_data: the outlinks and metadata parsed from each URL;

  parse_text: the parsed text content of each URL;

4) The index directory holds a Lucene-format index: the complete result of merging everything under indexes. A quick look shows the file names here differ from those produced by the Lucene demo; this needs further study.

5) The indexes directory holds the index produced by each round, stored as part-00000.
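To peek inside these directories, Nutch 1.0 ships reader tools. A sketch of how they might be used here (the linkdump and segdump output directory names are placeholders chosen for illustration):

bin/nutch readdb crawldatatest/crawldb -stats                            // summary statistics of the crawldb
bin/nutch readlinkdb crawldatatest/linkdb -dump linkdump                 // dump the link relationships
bin/nutch readseg -dump crawldatatest/segments/20100611004927 segdump    // dump one segment's content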

 
