Nutch | Study Notes on Running in Fully Distributed Mode on Hadoop

Seed URLs:

hdfs://10.66.27.18:9000/user/hadoop/urldir/url.txt --> http://blog.tianya.cn

hdfs://10.66.27.18:9000/user/hadoop/urldir/url2.txt --> http://bbs.tianya.cn

One-step crawl: bin/nutch crawl urldir -dir crawldata -depth 3 -topN 5

This runs the whole crawl as a chain of MapReduce jobs, roughly 12 in total.

Step-by-step run:

The inject step:

Inject the seed URLs: bin/nutch inject crawldatatest/crawldb urldir

Dump the crawldb for inspection: bin/nutch readdb crawldatatest/crawldb -dump tmpdata/test/crawldb/crawldb_dump

Inspect the result of inject:

View: bin/hadoop fs -cat tmpdata/test/crawldb/crawldb_dump/part-00000

Result:

http://blog.tianya.cn/ Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Sep 13 10:57:28 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:

http://www.163.com/ Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Sep 13 10:57:28 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
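
To read a dump like this at a glance, each record can be reduced to one line per URL. The following is a minimal sketch, assuming the dump keeps the layout shown above (a header line holding the URL and `Version:`, followed by a `Status:` line). The variable names `dump`, `summary` and `retry_days` are illustrative, and the sample text is abbreviated from the result above. It also checks that a retry interval of 2592000 seconds is 30 days.

```shell
# Abbreviated sample of the crawldb dump shown above.
dump='http://blog.tianya.cn/ Version: 7
Status: 1 (db_unfetched)
Retry interval: 2592000 seconds (30 days)

http://www.163.com/ Version: 7
Status: 1 (db_unfetched)
Retry interval: 2592000 seconds (30 days)'

# Remember the URL from each record header, then print it next to the status name.
summary=$(printf '%s\n' "$dump" | awk '
  /Version:/ { url = $1 }
  /^Status:/ { gsub(/[()]/, "", $3); print url, $3 }
')
printf '%s\n' "$summary"

# The retry interval is stored in seconds; one day is 86400 seconds.
retry_days=$((2592000 / 86400))
echo "retry interval: $retry_days days"
```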

The generate step:

Generate a fetch list: bin/nutch generate crawldatatest/crawldb crawldatatest/segments
Inspect the result of generate:
Dump: bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
View: bin/hadoop fs -cat tmpdata/*/*/*/dump
Result (this capture also includes ParseText, which -noparsetext normally leaves out):
Recno:: 0
URL:: http://blog.tianya.cn/

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Sep 13 10:57:28 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1379046213002

ParseText::
天涯博客_有见识的人都在此博客首页社会民生国际观察娱乐体育文化历史生活情感财经股市美食旅游最新博文博客达人博客总排行 01 等待温暖的小狐狸 44887595 02 潘文伟 34654676 03 travelisliving 30676532 04 股市掘金 28472831 05 crystalkitty 26283927 06 yuwenyufen 24880887 07 水莫然 24681174 08 李泽辉 22691445 09 钟巍巍 19226129 10 别境 17752691 11 微笑的说我很幸 15912882 12 尤宇 15530802 13 sundaes 14961321 14 郑渝川 14219498 15 黑花黄 13174656 博文排行 01 公益电影《校车》是献给学生 02 《神奇》终极预告：游戏、运 03 陆克文败选在党内受排挤欲 04 奥巴马称若叙交出化武或可免 05 美国“9?11”事件产生有害物 06 NBA众多球星感言“911恐怖袭 07 是否合理？摸底排查机顶盒广 08 撒贝宁23岁新校花女友美艳近 09 富婆把我当性保姆和姐妹共享 10 一句话的感悟社会排行国际排行 01 谁让中国社会动荡不安？ 02 总后勤部副部长谷俊山中将有 03 时代悲歌：老幼病残孕，坑蒙 04 上海高院“招妓门”是别有用 05 再说下河南与中国的被妖魔化 06 涉黄法官有多少枉法审判？ 07 4法官集体招嫖事件中的意外 08 【六朝闲话】王小石们为什么 09 怀胎七月妇被强制引产致精神 10 解决民众就能化解危机？ 01 梁石川：奥巴马与普京呕气谁 02 王猛：卖淫合法的德国样本 03 中国烟草罪孽深重：研究称全 04 不能低估美国武装日本的决心 05 剩女是中国为了推高房价的一 06 笑尿了！两个屌丝上电视，结 07 马来西亚新年快乐 08 看，这些帖子多离谱！ 09 腐败不作为的地方政府和横行 10 中国人偷渡欧洲团伙75人被抓娱乐排行体育排行 01 绝色性感美女 02 盘点做小三失败而背负骂名的 03 《一夜惊喜》：范冰冰的大胆 04 曝陈奕迅患人群恐惧症赴英 05 陈晓扮米奇萌翻全场卖萌推 06 梦鸽控告杨女士主动卖.淫酒 07 宋祖英罕见少女旧照 “辣妹 08 吴奇隆恶搞孙俪蜡像四爷穿 09 林志颖儿子KIMI呆萌可爱近照 10 陈冠希旧爱疑为富二代诞私生 01 ESPN：霍奇森证实鲁尼并没受 02 霍华德缘何成为最被憎恨的人 03 创造与守护：弗格森曼联故事 04 弗格森致英超公开信 05 库班：愿意无限期续约德克 06 科比詹姆斯谁是当今单挑王？ 07 英西联赛来袭英超三人行西 08 浓墨重彩焦点之战 2013赛季C 09 中国百米飞人引世人注目 10 全运男篮分组出炉辽宁避开文化排行历史排行 01 侯夫人：玉颜不及寒鸦色（三 02 乾嘉余脉杂钞（8） 03 用性心理解析某些官员的私德 04 怀念民国其实是怀念自由 05 邓刚：海是我的血，也是我的 06 光线，一根发声的琴弦 07 在场叙事 08 我们的雨巷 09 佛学六道中的畜牲道 10 不问法律问风水？ 01 旧年人物之：楚狂先生 02 乾嘉余脉杂钞（7） 03 侯夫人：玉颜不及寒鸦色（三 04 毛泽东的十个“最后一次” 05 程序之争：上海检方为何敢于 06 第三卷《通经致用的年代》九 07 刺杀袁世凯的民国传奇女律师 08 刺客列传（十九）——古代美 09 1946，中国东北民众促苏军撤 10 非典型性“挺武派”名相—— 生活排行情感排行 01 晨起水的4大好处喝好养生美 02 纸条，在风中飘摇 03 匡诗保 04 刘墉推荐的旅游百忌 05 （风子闹厨房）黑木耳炒山药 06 舞动的雨 07 随感两则 08 花蕊 09 陌生人 10 到此一游 01 年方二八之坐二八望二八 02 年方二八之剩女应无恙 03 中国官员为什么“习惯性说谎 04 湖北宣恩当街杀警事件调查 05 独自莫凭栏，一帘风月闲。 06 前卫妻子背着我去拍大尺度写 07 关注并声援王悦 08 她，12岁，，她是全村第一个 09 请放过叶海燕 10 成都之行略记财经排行股市排行 01 【金业银锭】：6.2周评－－ 02 《黄金航海记》序言 03 客户会向朋友推荐你的公司吗 04 [保险] 险为人知——选购保 05 中国为什么陷入巨额债务泥潭 06 地价10年20倍广州楼价涨势 07 计生委向全国小朋友们的郑重 08 正视现实的经济思想 09 为何一个椅子能值10万块 10 周景朗：5月27日晚间现货黄 01 磁区操作微观定位 02 【下午点位及方向】 03 稳固底部后再谈反弹 04 周末行情如何演绎？ 05 调整的第一支撑是2050点 06 
阿里造智能TV联盟究竟能否 07 盘面继续向上有难度 08 穿头破脚星，黑周四概率多大 09 银行股能否崛起是最后的希望 10 中国手游难逃 “暴毙”的命美食排行旅游排行 01 （风子闹厨房）木糠杯 02 清蒸海鱼老虎斑 03 鸭血粉丝汤——耳语20130828 04 【懒人厨房】消夏小吃——手 05 （风子闹厨房）芒果慕斯杯 06 家乡的美食（一） 07 面条煎蛋饼 08 牛奶花生甜汤 09 （风子闹厨房）苔香曲奇 10 香辣小排骨 01 光盘行动系列篇之十七——婚 02 海口西湖 03 萧艾的儿子在桃树上 04 迪拜......机场，哈哈（旧片 05 鲇鱼效应青蛙现象羊群效应 06 独行曼谷－密集恐惧北揽鳄鱼 07 游迹–上海川沙：内史第、黄 08 鼓浪屿之三 09 宁夏风光，任何一种美丽的遇 10 雨天独家漂流记 1 2 3 4 撒贝宁23岁新校花女友美艳近照（组图）路边的野花采不完鲁迅故居印象梦鸽新版《祥林嫂》何以比原版更绝？母亲谎称不是亲妈是培养孩独立or孤立？激励孩子的方法很多，没有必要为了孩子“有出息”而牺牲亲情，亲情是孩子成长中必不少的元素，缺少这种元素... [详细] 腐败和滥权是“吃空饷”的根源近日，媒体曝光河南周口2个月查出近6千人吃空饷，每年白白吃掉财政开支一个多亿！一位交流到周口市任职的官... [详细] 连发爆炸案：戾气郁结的社会应有法可医三起触目惊心的爆炸事件，给人民生活带来了巨大重创。在这三起爆炸事故中，河南新乡和广州白云爆炸原因相... [详细] 【社论】政府花钱“买好话”反衬民意“无好话” 【质疑】桑拿女被男友虐待两天不呼救有何隐情？【热议】杜绝体制“吃空饷”现象之根本靠什么【见解】中国不必对日本申奥恐惧到坐立不安专栏博文 <<申请加入大V近黄昏？毛牧青娱乐圈10位女星成名前后玉照大PK 渲染流年E 《南行记》的版本及流变灯下醉 9.13【下午点位及方向】买卖点吧2008 金雕轶事四川蒋蓝 9.方城大战(羊虹诗歌) 圆圆梦圆圆大力神在召唤风抹残阳《每日邮报》：鲁尼5成机会出战曼城天空浮沉的鱼《每日邮报》：费莱尼有望周末迎首秀天空浮沉的鱼有多少评比表彰功能被异化被扭曲？花玉喜 “美丽乡村”需要政府放权伍少安名博推荐红豆火警丰雪飘谢不谦金满楼少林修女长沙艾敏冉云飞寒枫化雨雨润de云温墨黑纸白还是定风波温柔恶女缠绕夜色涅阳小生蓝田玉烟古月轩主1 六盘水评论湖畔小子 ayuan566 章半仙李承鹏烟灰醉余晖周禄宝云歇鸢评论杨涛飞一扬老海博客说真话好难童大焕安歌儿陈彤周其仁党国英皮海洲信力建蒋丰张鸣云无心陶短房 ESPN詹俊社会民生更多崔永元与方舟子的转基因之争近日，方舟子发起活动鼓励网友品尝转基因玉米，称应当创造条件让国人可以天天吃转基因食品。崔永元则反对其 [详细] 张曙光多少受贿款用于参评院士？花玉喜历史之镜：总督家的“功课单” 游宇明【六朝沉思】大陆人民的归宿六朝烟水满金陵对三组《爱是你我》演唱的看法毛牧青人生百味：94.伞圆圆梦圆圆谈谈王捕头、薛蛮子和百度时代尖兵国际观察更多这个世界在冷战与热战边缘飘摇稍不留神，叙利亚就有可能成为新的“萨拉热窝”，在此地开启一场规模和时间远超预计的战争。 [详细] 给美国挖坑的竟是这个国家？西山隐智 SOS SOS 奥巴马紧急呼救紧急呼救！叶仲录

Recno:: 1
URL:: http://www.163.com/

ParseText::
网易应用网易新闻网易云音乐网易云阅读有道云笔记网易花田网易公开课网易彩票有道词典
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Sep 13 10:57:28 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1379046213002
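
The `_ngt_` entry in Metadata appears to be the time at which generate selected the URL, stored as milliseconds since the Unix epoch. Below is a small sketch of the conversion, assuming CST in these dumps means UTC+8 (the dumps print epoch 0 as 08:00:00 CST 1970, which fits). It uses only shell arithmetic, so it does not depend on a particular `date` implementation. The variable names are illustrative.

```shell
ngt_ms=1379046213002                  # value copied from the Metadata line above
secs=$((ngt_ms / 1000))               # drop the milliseconds
tod=$(( (secs + 8 * 3600) % 86400 ))  # seconds since local midnight, assuming UTC+8
ngt_time=$(printf '%02d:%02d:%02d' $((tod / 3600)) $((tod % 3600 / 60)) $((tod % 60)))
echo "$ngt_time"
```

The result is 12:23:33, shortly before the time encoded in the segment name 20130913122422 (12:24:22).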

The fetch step:

Fetch the segment: bin/nutch fetch crawldatatest/segments/20130913122422
Inspect the result of fetch:
Dump: bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump_fetch -nocontent -nogenerate -noparse -noparsedata -noparsetext
View: bin/hadoop fs -cat tmpdata/*/*/*fetch*/dump
Result:
Recno:: 0
URL:: http://blog.tianya.cn/
ParseText:: ......
CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Fri Sep 13 12:35:02 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1379046213002 Content-Type: text/html _pst_: success(1), lastModified=0

Recno:: 1
URL:: http://www.163.com/

CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Fri Sep 13 12:35:05 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1379046213002 Content-Type: text/html _pst_: success(1), lastModified=0
ParseText::
网易应用网易新闻网易云音乐网易云阅读有道云笔记网易花田网易公开课网易彩票有道词典
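
The numeric Status values are easier to follow with their names next to them. Here is a small lookup sketch limited to the four codes that appear in these notes. The function name `status_name` is made up for illustration, and Nutch defines further codes that are not listed here.

```shell
# Map a CrawlDatum status code to the name printed beside it in the dumps.
status_name() {
  case $1 in
    1)  echo db_unfetched ;;
    2)  echo db_fetched ;;
    33) echo fetch_success ;;
    67) echo linked ;;
    *)  echo "unknown($1)" ;;
  esac
}

# Order in which a seed URL and its outlinks pass through these notes:
# injected (1), fetched (33), outlinks discovered by parse (67), crawldb updated (2).
for code in 1 33 67 2; do
  printf '%-3s %s\n' "$code" "$(status_name "$code")"
done
```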

The parse step:

Parse the segment: bin/nutch parse crawldatatest/segments/20130913122422
Inspect the result of parse:
Dump: bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump_parse -nocontent -nogenerate -nofetch -noparsedata -noparsetext
View: bin/hadoop fs -cat tmpdata/*/*/*parse*/dump | more
Result:
Recno:: 0
URL:: http://3g.163.com/links/4145

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Fri Sep 13 12:41:48 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.125
Signature: null
Metadata:
................................
Recno:: 64
URL:: http://zjzhoulubao.blog.tianya.cn/

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Fri Sep 13 12:41:48 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.01754386
Signature: null
Metadata:

The parse step therefore yields 65 URLs (Recno 0 through 64), which matches the total reported by readdb -stats below.
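
The two scores shown above look like the seed score of 1.0 divided by the number of outlinks on the seed page: the ParseData output further down reports 8 outlinks for http://www.163.com/ and 57 for http://blog.tianya.cn/. That would fit the OPIC scoring plugin (scoring-opic appears in the plugin list of the parsechecker log) sharing a page's score equally among its outlinks. Treat this as an inference from the numbers, not as a description of the exact implementation. A quick check, with illustrative variable names:

```shell
seed_score=1.0

# Share of the seed score per outlink: 8 outlinks on 163.com, 57 on blog.tianya.cn.
score_163=$(awk -v s="$seed_score" 'BEGIN { printf "%.3f", s / 8 }')
score_tianya=$(awk -v s="$seed_score" 'BEGIN { printf "%.8f", s / 57 }')

# Every outlink becomes one "linked" record in the parse dump.
linked_total=$((8 + 57))

echo "$score_163 $score_tianya $linked_total"
```

This gives 0.125, 0.01754386 and 65, the same values as in the dump.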

The updatedb step:

updatedb command: bin/nutch updatedb crawldatatest/crawldb -dir crawldatatest/segments
Check the effect of the update: bin/nutch readdb crawldatatest/crawldb -stats
13/09/13 12:49:34 INFO crawl.CrawlDbReader: TOTAL urls:65
13/09/13 12:49:34 INFO crawl.CrawlDbReader: retry 0:65
13/09/13 12:49:34 INFO crawl.CrawlDbReader: min score:0.017
13/09/13 12:49:34 INFO crawl.CrawlDbReader: avg score:0.061092306
13/09/13 12:49:34 INFO crawl.CrawlDbReader: max score:1.0
13/09/13 12:49:34 INFO crawl.CrawlDbReader: status 1 (db_unfetched):63
13/09/13 12:49:34 INFO crawl.CrawlDbReader: status 2 (db_fetched):2
13/09/13 12:49:34 INFO crawl.CrawlDbReader: CrawlDb statistics: done
TOTAL urls has grown from the initial 2 to 65: the 2 fetched seeds plus 63 newly discovered URLs that are still unfetched.
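
A quick consistency check on the statistics is to confirm that the per-status counts add up to the total. Below is a minimal sketch, assuming each log line ends with a colon followed by the count, as in the output above. The sample is copied from that output, and the variable names are illustrative.

```shell
stats='13/09/13 12:49:34 INFO crawl.CrawlDbReader: TOTAL urls:65
13/09/13 12:49:34 INFO crawl.CrawlDbReader: status 1 (db_unfetched):63
13/09/13 12:49:34 INFO crawl.CrawlDbReader: status 2 (db_fetched):2'

# Fields are split on ":", so the count is always the last field.
total=$(printf '%s\n' "$stats" | awk -F: '/TOTAL urls/ { print $NF + 0 }')
by_status=$(printf '%s\n' "$stats" | awk -F: '/status [0-9]/ { sum += $NF } END { print sum + 0 }')

if [ "$total" -eq "$by_status" ]; then
  echo "consistent: $total URLs"
else
  echo "mismatch: TOTAL=$total, sum of statuses=$by_status"
fi
```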

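Viewing the content:

Where the content comes from: by default the raw page content is stored in the segment by the fetch step, not by parse; parse adds crawl_parse, parse_data and parse_text.
Dump the content:
bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump_content -noparse -nogenerate -nofetch -noparsedata -noparsetext
View the content: bin/hadoop fs -cat tmpdata/*/*/*content*/dump | more
Content output: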

Recno:: 0
URL:: http://blog.tianya.cn/
ParseText:: ......
Content::
Version: -1
url: http://blog.tianya.cn/
base: http://blog.tianya.cn/
contentType: text/html
metadata: Date=Fri, 13 Sep 2013 04:34:58 GMT Vary=Accept-Encoding Expires=Thu, 01 Nov 2012 10:00:00 GMT Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20130913122422 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx Cache-Control=no-cache Pragma=no-cache
Content:
<!DOCTYPE HTML>
<html>
...... (the fetched page's HTML source appears here)
</html>

Viewing the parse-data:

Dump: bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump_parse_data -noparse -nogenerate -nofetch -nocontent -noparsetext
View: bin/hadoop fs -cat tmpdata/*/*/*parse_data*/dump | more
Output:
Recno:: 0
URL:: http://blog.tianya.cn/

ParseData::
Outlinks: 57
outlink: toUrl: http://blog.tianya.cn/blog/society anchor: 社会民生
outlink: toUrl: http://blog.tianya.cn/blog/international anchor: 国际观察
outlink: toUrl: http://blog.tianya.cn/blog/ent anchor: 娱乐
outlink: toUrl: http://blog.tianya.cn/blog/sports anchor: 体育
outlink: toUrl: http://blog.tianya.cn/blog/culture anchor: 文化
outlink: toUrl: http://blog.tianya.cn/blog/history anchor: 历史
outlink: toUrl: http://blog.tianya.cn/blog/life anchor: 生活
outlink: toUrl: http://blog.tianya.cn/blog/emotion anchor: 情感
outlink: toUrl: http://blog.tianya.cn/blog/finance anchor: 财经
outlink: toUrl: http://blog.tianya.cn/blog/stock anchor: 股市
outlink: toUrl: http://blog.tianya.cn/blog/food anchor: 美食
outlink: toUrl: http://blog.tianya.cn/blog/travel anchor: 旅游
...............................................................

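Viewing the parse-text:

Dump: bin/nutch readseg -dump crawldatatest/segments/20130913122422 tmpdata/test/segments/20130913122422_dump_parse_text -noparse -nogenerate -nofetch -nocontent -noparsedata

View: bin/hadoop fs -cat tmpdata/*/*/*parse_text*/dump | more
Output (the capture below shows the Content and ParseData parts of each record, which appear whenever they are not switched off):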
Recno:: 0
URL:: http://blog.tianya.cn/

Content::
Version: -1
url: http://blog.tianya.cn/
base: http://blog.tianya.cn/
contentType: text/html
metadata: Date=Fri, 13 Sep 2013 04:34:58 GMT Vary=Accept-Encoding Expires=Thu, 01 Nov 2012 10:00:00 GMT Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20130913122422 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx Cache-Control=no-cache Pragma=no-cache
Content:

<!DOCTYPE HTML>
<html>
</html>
ParseData::
Version: 5
Status: success(1,0)
Title: 天涯博客_有见识的人都在此
Outlinks: 57
outlink: toUrl: http://blog.tianya.cn/blog/society anchor: 社会民生
outlink: toUrl: http://blog.tianya.cn/blog/international anchor: 国际观察
outlink: toUrl: http://blog.tianya.cn/blog/ent anchor: 娱乐
outlink: toUrl: http://blog.tianya.cn/blog/sports anchor: 体育
outlink: toUrl: http://blog.tianya.cn/blog/culture anchor: 文化
outlink: toUrl: http://blog.tianya.cn/blog/history anchor: 历史
outlink: toUrl: http://blog.tianya.cn/blog/life anchor: 生活
..................................................................
Recno:: 1
URL:: http://www.163.com/

ParseData::
Version: 5
Status: success(1,0)
Title: 网易
Outlinks: 8
outlink: toUrl: http://m.163.com/ anchor: 应用
outlink: toUrl: http://m.163.com/newsapp/ anchor: 网易新闻
outlink: toUrl: http://music.163.com/ anchor: 网易云音乐
outlink: toUrl: http://yuedu.163.com/ anchor: 网易云阅读
...............................................
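
All of the dumps above follow one pattern: readseg -dump includes every part of the segment unless it is switched off with one of six flags (-nocontent, -nofetch, -nogenerate, -noparse, -noparsedata, -noparsetext). A hypothetical helper, `readseg_only`, that builds the five flags needed to keep just one part can make such commands less error-prone. The helper name and its argument convention are assumptions for illustration; only the flags themselves come from Nutch.

```shell
# Print the flags that switch off every part of a readseg dump except one.
# Argument: the part to keep, named by its flag suffix
# (content, fetch, generate, parse, parsedata or parsetext).
readseg_only() {
  keep=$1
  flags=""
  for part in content fetch generate parse parsedata parsetext; do
    [ "$part" = "$keep" ] || flags="$flags -no$part"
  done
  echo "${flags# }"   # drop the leading space
}

readseg_only parsetext

# Possible use (left unquoted on purpose so the flags split into separate words):
#   bin/nutch readseg -dump <segment> <output_dir> $(readseg_only parsedata)
```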

Merge commands (list them with bin/nutch | grep merge):

bin/nutch mergesegs crawldata/segments_merge -dir crawldata/segments

The invertlinks command:

bin/nutch invertlinks crawldatatest/linkdb -dir crawldatatest/segments
Inspect the result of invertlinks:
Dump: bin/nutch readlinkdb crawldatatest/linkdb -dump tmpdata/test/linkdb/linkdb_dump
View: bin/hadoop fs -cat tmpdata/*/*/*linkdb*/part-*
Result:
http://3g.163.com/links/4145 Inlinks:
fromUrl: http://www.163.com/ anchor: 网易花田

http://aimin_001.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 长沙艾敏

http://anger.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 安歌儿

http://ayuan565656.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: ayuan566

http://caipiao.163.com/mobile/client_cp.jsp Inlinks:
fromUrl: http://www.163.com/ anchor: 网易彩票

http://chaoraoyese.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 缠绕夜色

http://chentongzl.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 陈彤

http://chidezhenxiang.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 云无心

http://coco1918.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 六盘水评论

http://dangguoying.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 党国英

http://dlhaoyiong.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 老海博客

http://espnzhanjun.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: ESPN詹俊

http://fxing.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 丰雪飘

http://haishidingfengbo.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 还是定风波

http://hupanxiaozizl.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 湖畔小子

http://huyang4681.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 古月轩主1

http://jiangfeng2012.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 蒋丰

http://jianhua1962.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 飞一扬

http://jinmanlou.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 金满楼

http://langyaoyuan88.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 说真话好难

http://liantianyuyan.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 蓝田玉烟

http://lichengpeng.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 李承鹏

http://lishiba1.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 章半仙

http://m.163.com/ Inlinks:
fromUrl: http://www.163.com/ anchor: 应用

http://m.163.com/newsapp/ Inlinks:
fromUrl: http://www.163.com/ anchor: 网易新闻

http://moheizhibai.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 墨黑纸白

http://music.163.com/ Inlinks:
fromUrl: http://www.163.com/ anchor: 网易云音乐

http://nieyangxiaosheng.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 涅阳小生

http://note.youdao.com/ Inlinks:
fromUrl: http://www.163.com/ anchor: 有道云笔记

http://open.163.com/ Inlinks:
fromUrl: http://www.163.com/ anchor: 网易公开课

http://phoenixtvleiyu.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 红豆火警

http://pihaizhou.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 皮海洲

http://shaolinxiunv.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 少林修女

http://shishiluntan_love.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 寒枫化雨

http://syhsunyanhui.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 烟灰醉余晖

http://taoduanfang2012.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 陶短房

http://tongdahuan.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 童大焕

http://tufeiwangshanmao.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 冉云飞

http://wr_en.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 温柔恶女

http://www.tianya.cn/mobile Inlinks:
fromUrl: http://blog.tianya.cn/ anchor:

http://xieqian.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 谢不谦

http://xinlijian.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 信力建

http://yangtaopinglun.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 评论杨涛

http://yuedu.163.com/ Inlinks:
fromUrl: http://www.163.com/ anchor: 网易云阅读

http://yunxie_88.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 云歇鸢

http://yurunyunwen.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 雨润de云温

http://zhangmingzhuanlan.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 张鸣

http://zhouqiren.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 周其仁

http://zjzhoulubao.blog.tianya.cn/ Inlinks:
fromUrl: http://blog.tianya.cn/ anchor: 周禄宝
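
Conceptually, invertlinks turns the outlinks recorded per source page into inlinks grouped per target page. Here is a toy sketch of that inversion on three links taken from the output above. It only reorders columns with awk and sort, and is not how Nutch implements the step (which runs as a MapReduce job over the segments). The variable names are illustrative.

```shell
# Columns: source page, target page, anchor text.
outlinks='http://www.163.com/ http://m.163.com/ 应用
http://www.163.com/ http://music.163.com/ 网易云音乐
http://blog.tianya.cn/ http://anger.blog.tianya.cn/ 安歌儿'

# Put the target first, then sort so that all inlinks of one target end up adjacent.
inlinks=$(printf '%s\n' "$outlinks" \
  | awk '{ print $2, "<-", $1, "anchor:", $3 }' \
  | LC_ALL=C sort)
printf '%s\n' "$inlinks"
```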

The parsechecker command: bin/nutch parsechecker http://www.baidu.com

13/09/13 13:12:37 INFO parse.ParserChecker: fetching: http://www.baidu.com
13/09/13 13:12:37 INFO plugin.PluginRepository: Plugins: looking in: /home/hadoop/hadoop-hadoop/hadoop-unjar3881248761965128726/classes/plugins
13/09/13 13:12:38 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
13/09/13 13:12:38 INFO plugin.PluginRepository: Registered Plugins:
13/09/13 13:12:38 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
13/09/13 13:12:38 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
13/09/13 13:12:38 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
13/09/13 13:12:38 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
13/09/13 13:12:38 INFO plugin.PluginRepository: HTTP Framework (lib-http)
13/09/13 13:12:38 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
13/09/13 13:12:38 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
13/09/13 13:12:38 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
13/09/13 13:12:38 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
13/09/13 13:12:38 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
13/09/13 13:12:38 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
13/09/13 13:12:38 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
13/09/13 13:12:38 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
13/09/13 13:12:38 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
13/09/13 13:12:38 INFO plugin.PluginRepository: Registered Extension-Points:
13/09/13 13:12:38 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
13/09/13 13:12:38 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
13/09/13 13:12:38 INFO plugin.PluginRepository: Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
13/09/13 13:12:38 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
13/09/13 13:12:38 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
13/09/13 13:12:38 INFO plugin.PluginRepository: HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
13/09/13 13:12:38 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
13/09/13 13:12:38 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
13/09/13 13:12:38 INFO http.Http: http.proxy.host = null
13/09/13 13:12:38 INFO http.Http: http.proxy.port = 8080
13/09/13 13:12:38 INFO http.Http: http.timeout = 10000
13/09/13 13:12:38 INFO http.Http: http.content.limit = 65536
13/09/13 13:12:38 INFO http.Http: http.agent = Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36/Nutch-1.6
13/09/13 13:12:38 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
13/09/13 13:12:38 INFO http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
13/09/13 13:12:38 INFO conf.Configuration: found resource parse-plugins.xml at file:/home/hadoop/hadoop-hadoop/hadoop-unjar3881248761965128726/parse-plugins.xml
13/09/13 13:12:39 INFO crawl.SignatureFactory: Using Signature impl: org.apache.nutch.crawl.MD5Signature
13/09/13 13:12:39 INFO parse.ParserChecker: parsing: http://www.baidu.com
13/09/13 13:12:39 INFO parse.ParserChecker: contentType: text/html
13/09/13 13:12:39 INFO parse.ParserChecker: signature: de2214c0120f01f00cb1b2c99f193057
---------
Url
---------------
http://www.baidu.com
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: 百度一下，你就知道
Outlinks: 30
outlink: toUrl: http://www.baidu.com/gaoji/preferences.html anchor: 搜索设置
outlink: toUrl: https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F anchor: 登录
outlink: toUrl: https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F anchor: 注册
outlink: toUrl: http://www.baidu.com/img/bdlogo.gif anchor:
outlink: toUrl: http://news.baidu.com anchor: 新?闻
outlink: toUrl: http://tieba.baidu.com anchor: 贴?吧
outlink: toUrl: http://zhidao.baidu.com anchor: 知?道
outlink: toUrl: http://music.baidu.com anchor: 音?乐
outlink: toUrl: http://image.baidu.com anchor: 图?片
outlink: toUrl: http://v.baidu.com anchor: 视?频
outlink: toUrl: http://map.baidu.com anchor: 地?图
outlink: toUrl: http://www.baidu.com# anchor: 手写
outlink: toUrl: http://www.baidu.com# anchor: 拼音
outlink: toUrl: http://www.baidu.com# anchor: 关闭
outlink: toUrl: http://baike.baidu.com anchor: 百科
outlink: toUrl: http://wenku.baidu.com anchor: 文库
outlink: toUrl: http://www.hao123.com anchor: hao123
outlink: toUrl: http://www.baidu.com/more/ anchor: 更多>>
outlink: toUrl: http://www.baidu.com/ anchor: 把百度设为主页
outlink: toUrl: http://www.baidu.com/cache/sethelp/index.html anchor: 把百度设为主页
outlink: toUrl: http://liulanqi.baidu.com/ps.php anchor: 安装百度浏览器
outlink: toUrl: http://e.baidu.com/?refer=888 anchor: 加入百度推广
outlink: toUrl: http://top.baidu.com anchor: 搜索风云榜
outlink: toUrl: http://home.baidu.com anchor: 关于百度
outlink: toUrl: http://ir.baidu.com anchor: About Baidu
outlink: toUrl: http://www.baidu.com/duty/ anchor: 使用百度前必读
outlink: toUrl: http://www.baidu.com/cache/global/img/gs.gif anchor:
outlink: toUrl: http://s1.bdstatic.com/r/www/cache/static/global/js/home_f813a739.js anchor:
outlink: toUrl: http://s1.bdstatic.com/r/www/cache/static/global/js/tangram-1.3.4c1.0_07038476.js anchor:
outlink: toUrl: http://s1.bdstatic.com/r/www/cache/static/user/js/u_ec0ebfe1.js anchor:
Content Metadata: Content-Length=4408 Expires=Fri, 13 Sep 2013 05:12:37 GMT Set-Cookie=BAIDUID=6D9168DE43162206106A095B0D79C9F2:FG=1; expires=Fri, 13-Sep-43 05:12:37 GMT; path=/; domain=.baidu.com Connection=Close Server=BWS/1.0 Cache-Control=private Date=Fri, 13 Sep 2013 05:12:37 GMT BDQID=0xad50667e05149c9b P3P=CP=" OTI DSP COR IVA OUR IND COM " Content-Encoding=gzip BDPAGETYPE=1 Content-Type=text/html;charset=utf-8 BDUSERID=0
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

The nutch command line:

crawl one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Six key commands, in two families: read (readdb, readlinkdb, readseg) and merge (mergedb, mergesegs, mergelinkdb).
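
The step-by-step commands used throughout these notes can be collected into one round of crawling. The sketch below is a dry run by default: it prints the commands instead of executing them, so it can be read and checked without a cluster. The names `run`, `crawl_round` and `DRY_RUN` are hypothetical. In a real run the segment name is created by generate and would have to be looked up afterwards; here it is passed in by hand.

```shell
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to actually execute the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# One crawl round: crawl directory, seed directory, segment name.
crawl_round() {
  crawl=$1; seeds=$2; segment=$3
  run bin/nutch inject "$crawl/crawldb" "$seeds"
  run bin/nutch generate "$crawl/crawldb" "$crawl/segments"
  run bin/nutch fetch "$crawl/segments/$segment"
  run bin/nutch parse "$crawl/segments/$segment"
  run bin/nutch updatedb "$crawl/crawldb" -dir "$crawl/segments"
  run bin/nutch invertlinks "$crawl/linkdb" -dir "$crawl/segments"
}

plan=$(crawl_round crawldatatest urldir 20130913122422)
printf '%s\n' "$plan"
```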
