Nutch 1.0 log analysis (repost)
Source: Internet | Editor: 程序博客网 | Date: 2024/06/06 09:54
Creating files on the Hadoop cluster
[nutch@gc01vm13 /]$ cd ./home/nutch/nutchinstall/nutch-1.0/
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls
Found 1 items
drwxr-xr-x - nutch supergroup 0 2010-06-09 20:10 /user/nutch/zklin
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -mkdir crawldatatest // mkdir paths are relative to /user/nutch
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls
Found 2 items
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:40 /user/nutch/crawldatatest
drwxr-xr-x - nutch supergroup 0 2010-06-09 20:10 /user/nutch/zklin
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -mkdir urls
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls
Found 3 items
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:40 /user/nutch/crawldatatest
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:45 /user/nutch/urls
drwxr-xr-x - nutch supergroup 0 2010-06-09 20:10 /user/nutch/zklin
[nutch@gc01vm13 nutchinstall]$ mkdir urls // first create a urls directory locally
[nutch@gc01vm13 nutchinstall]$ cd ./urls/
[nutch@gc01vm13 urls]$ ls
[nutch@gc01vm13 urls]$ vim urls1 // write the seed URLs into urls1
[nutch@gc01vm13 urls]$ cd ..
[nutch@gc01vm13 nutchinstall]$ cd ./nutch-1.0/
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -copyFromLocal /home/nutch/nutchinstall/urls/urls1 urls // copy from local to the cluster; the cluster-side path is relative
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -lsr // check urls1 on the cluster
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:40 /user/nutch/crawldatatest
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:46 /user/nutch/urls
-rw-r--r-- 2 nutch supergroup 31 2010-06-11 00:46 /user/nutch/urls/urls1
drwxr-xr-x - nutch supergroup 0 2010-06-09 20:10 /user/nutch/zklin
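As the comments above note, a path given to `hadoop fs` without a leading `/` is resolved against the user's HDFS home directory, `/user/<username>`. A minimal sketch of that resolution rule in plain shell (`resolve_hdfs_path` is our own illustrative helper, not part of Hadoop):

```shell
#!/bin/sh
# Sketch of HDFS path resolution: absolute paths are used as-is,
# relative paths are prefixed with the user's home dir /user/<username>.
# resolve_hdfs_path is a hypothetical helper for illustration only.
resolve_hdfs_path() {
  user="$1"; path="$2"
  case "$path" in
    /*) echo "$path" ;;              # absolute: unchanged
    *)  echo "/user/$user/$path" ;;  # relative: under the home dir
  esac
}

resolve_hdfs_path nutch crawldatatest            # /user/nutch/crawldatatest
resolve_hdfs_path nutch /user/nutch/urls/urls1   # unchanged
```

This is why `bin/hadoop fs -mkdir crawldatatest` above created `/user/nutch/crawldatatest`.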
[nutch@gc01vm13 nutch-1.0]$ bin/nutch crawl urls1 -dir crawldatatest -depth 3 -topN 10
The seed path must point at a location that actually exists on HDFS: `urls1` resolves to /user/nutch/urls1, which does not exist, so the crawl fails; the author fixes this below by passing the absolute path:
crawl started in: crawldatatest
rootUrlDir = urls1
threads = 10
depth = 3
topN = 10
Injector: starting
Injector: crawlDb: crawldatatest/crawldb
Injector: urlDir: urls1
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://gc01vm13:9000/user/nutch/urls1
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)
Log analysis
[nutch@gc01vm13 nutch-1.0]$ bin/nutch crawl /user/nutch/urls/urls1 -dir crawldatatest -depth 3 -topN 10 // crawldatatest is where the crawl output is stored (a relative path, matching searcher.dir in nutch-site.xml)
crawl started in: crawldatatest // the directory the crawl writes into
rootUrlDir = /user/nutch/urls/urls1 // the seed list file (or directory) to download
threads = 10
depth = 3
topN = 10
Injector: starting // inject the seed list
Injector: crawlDb: crawldatatest/crawldb
Injector: urlDir: /user/nutch/urls/urls1
Injector: Converting injected urls to crawl db entries. // build the database of URLs to fetch from the injected list
Injector: Merging injected urls into crawl db. // merge them into the crawl db
Injector: done
Generator: Selecting best-scoring urls due for fetch. // score pages by importance to decide fetch order
Generator: starting
Generator: segment: crawldatatest/segments/20100611004927 // create the segment that will hold this round's fetch output
Generator: filtering: true
Generator: topN: 10
Generator: Partitioning selected urls by host, for politeness. // split the fetch list by host, distributing it across the datanodes defined in Hadoop's slaves file
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawldatatest/segments/20100611004927 // download the selected pages into the segment
Fetcher: done
CrawlDb update: starting // after fetching, update the crawl db with the newly discovered URLs
CrawlDb update: db: crawldatatest/crawldb
CrawlDb update: segments: [crawldatatest/segments/20100611004927]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
// fetch loop, second iteration
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawldatatest/segments/20100611005051
Generator: filtering: true
Generator: topN: 10
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawldatatest/segments/20100611005051
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawldatatest/crawldb
CrawlDb update: segments: [crawldatatest/segments/20100611005051]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
// fetch loop, third iteration
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawldatatest/segments/20100611005212
Generator: filtering: true
Generator: topN: 10
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawldatatest/segments/20100611005212
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawldatatest/crawldb
CrawlDb update: segments: [crawldatatest/segments/20100611005212]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
// the loop runs depth times in all; Nutch's intranet mode crawls breadth-first, finishing all second-level pages before fetching third-level pages
LinkDb: starting // analyze the link relationships between pages
LinkDb: linkdb: crawldatatest/linkdb
LinkDb: URL normalize: true // normalize URLs
LinkDb: URL filter: true // filter URLs per crawl-urlfilter.txt
LinkDb: adding segment: hdfs://gc01vm13:9000/user/nutch/crawldatatest/segments/20100611004927
LinkDb: adding segment: hdfs://gc01vm13:9000/user/nutch/crawldatatest/segments/20100611005051
LinkDb: adding segment: hdfs://gc01vm13:9000/user/nutch/crawldatatest/segments/20100611005212
LinkDb: done // link analysis finished
Indexer: starting // build the index
Indexer: done
Dedup: starting // deduplicate pages
Dedup: adding indexes in: crawldatatest/indexes
Dedup: done
merging indexes to: crawldatatest/index // merge the per-round indexes
Adding hdfs://gc01vm13:9000/user/nutch/crawldatatest/indexes/part-00000
done merging
crawl finished: crawldatatest // done
[nutch@gc01vm13 nutch-1.0]$
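The three generate/fetch/updatedb rounds in the log above are what `bin/nutch crawl` runs internally, once per depth level, followed by link inversion and indexing. A dry-run sketch of that loop (commands are only echoed, not executed; `crawl_plan` is our own helper name, and `<segment>` stands for the timestamped segment directory Generator creates each round; sub-command names follow Nutch 1.0's `bin/nutch` usage):

```shell
#!/bin/sh
# Dry-run sketch of the loop that "bin/nutch crawl" runs internally:
# one generate -> fetch -> updatedb round per depth level, then link
# inversion and indexing once at the end. Commands are echoed only.
crawl_plan() {
  DIR=$1; DEPTH=$2; TOPN=$3
  i=1
  while [ "$i" -le "$DEPTH" ]; do
    echo "bin/nutch generate $DIR/crawldb $DIR/segments -topN $TOPN"
    echo "bin/nutch fetch $DIR/segments/<segment>"
    echo "bin/nutch updatedb $DIR/crawldb $DIR/segments/<segment>"
    i=$((i + 1))
  done
  echo "bin/nutch invertlinks $DIR/linkdb $DIR/segments/*"
  echo "bin/nutch index $DIR/indexes $DIR/crawldb $DIR/linkdb $DIR/segments/*"
}

crawl_plan crawldatatest 3 10
```

Running the steps individually like this, instead of the all-in-one `crawl` command, is how the separate Injector/Generator/Fetcher lines in the log come about.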
Analysis of the generated data
bin/nutch crawl /user/nutch/urls/urls1 -dir crawldatatest -depth 3 -topN 10
Copying from the cluster to the local filesystem (copyToLocal takes a relative path on the cluster side):
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -copyToLocal crawldatatest /home/nutch/nutchinstall/
[nutch@gc01vm13 nutch-1.0]$ cd ..
[nutch@gc01vm13 nutchinstall]$ ls
confbak crawldatatest filesystem hadoopscheduler hadooptmp nutch-1.0 urls
The crawl created its directories under the cluster user's home directory (/user/nutch):
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls
Found 3 items
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:55 /user/nutch/crawldatatest
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:46 /user/nutch/urls
drwxr-xr-x - nutch supergroup 0 2010-06-09 20:10 /user/nutch/zklin
Five directories were generated under crawldatatest: crawldb, segments, index, indexes, and linkdb.
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls /user/nutch/crawldatatest
Found 5 items
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:53 /user/nutch/crawldatatest/crawldb
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:55 /user/nutch/crawldatatest/index
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:54 /user/nutch/crawldatatest/indexes
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:53 /user/nutch/crawldatatest/linkdb
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:52 /user/nutch/crawldatatest/segments
1) crawldb holds the downloaded URLs together with their fetch dates, used to schedule page re-fetch checks.
2) linkdb holds the link relationships between URLs, built during post-fetch analysis; these relationships can support a Google-style PageRank feature.
3) segments holds the fetched pages; the number of subdirectories matches the crawl depth. With -depth 3, there are three segment directories here.
Each segment contains six subdirectories:
content: the downloaded page content;
crawl_fetch: the fetch status of each URL;
crawl_generate: the set of URLs to fetch, produced by the generate job and refined during fetching;
crawl_parse: the outlink data used to update the crawldb;
parse_data: the outlinks and metadata parsed from each URL;
parse_text: the parsed text content of each URL;
4) index holds a Lucene-format index, the merged result of everything under indexes; the file names here differ from those produced by a plain Lucene demo, which needs further study;
5) indexes holds the index output of each round (the part-00000 directories).
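To make the segment layout above concrete, here is a purely local illustration that recreates the six subdirectory names (the subdirectory names are Nutch's real ones; the /tmp base path and the timestamp are made up for this demo):

```shell
#!/bin/sh
# Local illustration only: recreate the layout of one Nutch segment.
# The six subdirectory names are the ones Nutch actually writes; the
# base path and timestamp below are invented for this demo.
seg=/tmp/nutch-segment-demo/20100611004927
mkdir -p "$seg/content" "$seg/crawl_fetch" "$seg/crawl_generate" \
         "$seg/crawl_parse" "$seg/parse_data" "$seg/parse_text"
ls "$seg"
```

On the real cluster, the equivalent check would be `bin/hadoop fs -ls` on one of the segment paths shown in the log.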