Nutch 1.x Study Notes
Source: Internet. Time: 2024/06/03 12:53
Running Nutch step by step
1. <Create seed URLs>
mkdir -p urls
touch urls/seed.txt
echo "http://www.qq.com/" >> urls/seed.txt # one seed URL per line
2. <inject>
bin/nutch inject crawl/crawldb urls
3. <generate>
bin/nutch generate crawl/crawldb crawl/segments
4. <fetch>
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
5. <parse>
bin/nutch parse $s1
6. <updatedb>
bin/nutch updatedb crawl/crawldb $s1
7. Repeat steps 3-6 as many times as needed; each round fetches the URLs discovered in the previous one
8. <invertlinks>
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
9. <Indexing into Apache Solr>
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize
10. <Deleting Duplicates>
bin/nutch solrdedup http://localhost:8983/solr
11. <Cleaning Solr>
bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr
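The repeated part of the steps above (generate, fetch, parse, updatedb) and the indexing step can be wrapped in small shell functions. The following is only a sketch: the variable names NUTCH, CRAWL_DIR and SOLR_URL and the function names are made up for this example, and it assumes that generate exits with a non-zero status when there is nothing left to fetch, which should be checked against your Nutch version.

```shell
# Hypothetical settings for this sketch; override them in the environment.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Print the newest segment. Segment directories are named by timestamp,
# so the last entry in sorted order is the most recent one.
latest_segment() {
  ls -d "$CRAWL_DIR"/segments/2* 2>/dev/null | tail -1
}

# Run steps 3-6 the given number of times.
crawl_rounds() {
  rounds="$1"
  round=1
  while [ "$round" -le "$rounds" ]; do
    echo "round $round of $rounds"
    # Assumption: a non-zero exit means no new segment was generated.
    $NUTCH generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" || break
    segment="$(latest_segment)"
    $NUTCH fetch "$segment" || return 1
    $NUTCH parse "$segment" || return 1
    $NUTCH updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1
    round=$((round + 1))
  done
}

# Run steps 8-9 for one segment (defaults to the newest one).
index_segment() {
  segment="${1:-$(latest_segment)}"
  if [ -z "$segment" ]; then
    echo "no segment found under $CRAWL_DIR/segments" >&2
    return 1
  fi
  $NUTCH invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments" || return 1
  $NUTCH solrindex "$SOLR_URL" "$CRAWL_DIR/crawldb/" -linkdb "$CRAWL_DIR/linkdb/" "$segment/" -filter -normalize
}
```

After sourcing the functions, `crawl_rounds 3` followed by `index_segment` would run three rounds and index the newest segment. Setting `NUTCH="echo bin/nutch"` first prints the commands instead of running them, which is a cheap way to check the paths.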
Running Nutch with the crawl script
Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
-i|--index Indexes crawl results into a configured indexer
-D A Java property to pass to Nutch calls
Seed Dir Directory in which to look for a seeds file
Crawl Dir Directory where the crawl/link/segments dirs are saved
Num Rounds The number of rounds to run this crawl for
Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2
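A thin wrapper can check the arguments before handing them to bin/crawl in the order given by the usage text above. This is a sketch only: run_crawl and the CRAWL variable are names invented for this example, and the optional fourth argument (the Solr URL) is a convenience of the wrapper, not something bin/crawl itself accepts in that position.

```shell
# Hypothetical wrapper around bin/crawl; CRAWL can be overridden for a dry run.
CRAWL="${CRAWL:-bin/crawl}"

# run_crawl <Seed Dir> <Crawl Dir> <Num Rounds> [Solr URL]
run_crawl() {
  seed_dir="$1"
  crawl_dir="$2"
  rounds="$3"
  solr_url="$4"

  # The seed directory must already exist and hold the seed file.
  if [ ! -d "$seed_dir" ]; then
    echo "seed dir not found: $seed_dir" >&2
    return 1
  fi

  # Reject an empty or non-numeric round count.
  case "$rounds" in
    ''|*[!0-9]*)
      echo "number of rounds must be an integer: $rounds" >&2
      return 1
      ;;
  esac

  if [ -n "$solr_url" ]; then
    # Index into Solr, passing the URL as a Java property as in the example above.
    $CRAWL -i -D "solr.server.url=$solr_url" "$seed_dir" "$crawl_dir" "$rounds"
  else
    $CRAWL "$seed_dir" "$crawl_dir" "$rounds"
  fi
}
```

For instance, `run_crawl urls/ TestCrawl/ 2 http://localhost:8983/solr/` builds the same command line as the example above.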
Reference: http://wiki.apache.org/nutch/NutchTutorial