Nutch in Practice: Merging Crawls
Source: Internet  Editor: 程序博客网  Time: 2024/05/16 06:41
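When do you need to merge crawls? The typical case is when you have added new seed URLs, crawled them separately, and now want to combine that crawl with an existing one.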
First, create a file named mergecrawl in the bin directory under $NUTCH_HOME, make it executable, and give it the following contents:
>> CODE
#!/bin/bash
# Nutch merge crawls script.
# Based on recrawl script
#
# The script merges two Nutch crawls into a single new crawl.
# Only the first two source crawl paths are read; extra arguments are ignored.
#
# USE ABSOLUTE PATHS for the script args
# e.g. bin/mergecrawl /home/ren/nutch/trunk/build/crawl /home/ren/nutch/trunk/build_f/crawl/ /home/ren/nutch/trunk/build_w/crawl/
usage="Usage: bin/mergecrawl newcrawl-path crawl1-path crawl2-path, USE ABSOLUTE PATHS"

# Argument 1: path of the new (merged) crawl; it must not exist yet.
if [ -n "$1" ]
then
    crawl_dir=$1
    if [ -d "$1" ]; then
        echo "error: crawl already exists: '$1'"
        exit 1
    fi
else
    echo "$usage"
    exit 1
fi

# Argument 2: first source crawl.
if [ -n "$2" ]
then
    crawl_1=$2
else
    echo "$usage"
    exit 1
fi

# Argument 3: second source crawl.
if [ -n "$3" ]
then
    crawl_2=$3
else
    echo "$usage"
    exit 1
fi
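# Sets the path to bin
nutch_dir=$(dirname "$0")

echo "Creating new crawl in: $crawl_dir"
# Stop here if the directory cannot be created; the later steps depend on it.
mkdir "$crawl_dir" || exit 1

# Layout of the new crawl directory
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index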
echo "Merge linkdb"
"$nutch_dir/nutch" mergelinkdb "$linkdb_dir" "$crawl_1/linkdb" "$crawl_2/linkdb"

echo "Merge crawldb"
"$nutch_dir/nutch" mergedb "$webdb_dir" "$crawl_1/crawldb" "$crawl_2/crawldb"

echo "Merge segments"
segments_1=$(ls -d "$crawl_1"/segments/*)
#echo 1 $segments_1
segments_2=$(ls -d "$crawl_2"/segments/*)
#echo 2 $segments_2
# $segments_1 and $segments_2 are left unquoted on purpose so that each
# segment path becomes a separate argument.
"$nutch_dir/nutch" mergesegs "$segments_dir" $segments_1 $segments_2

# From there, identical to recrawl.sh
echo "Update segments"
"$nutch_dir/nutch" invertlinks "$linkdb_dir" -dir "$segments_dir"

echo "Index segments"
new_indexes=$crawl_dir/newindexes
# Index the most recent segment in the merged segments directory.
segment=$(ls -d "$segments_dir"/* | tail -1)
"$nutch_dir/nutch" index "$new_indexes" "$webdb_dir" "$linkdb_dir" "$segment"

echo "De-duplicate indexes"
"$nutch_dir/nutch" dedup "$new_indexes"

echo "Merge indexes"
"$nutch_dir/nutch" merge "$index_dir" "$new_indexes"

echo "Some stats"
"$nutch_dir/nutch" readdb "$webdb_dir" -stats
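The script is invoked with the following arguments (as written, it reads only the first two source crawl paths):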
bin/mergecrawl newcrawl-path crawl1-path crawl2-path ....
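The usage line hints at more than two source crawls, while the script only looks at `$2` and `$3`. Below is a minimal sketch of how the inputs could be gathered for any number of crawls. The helper name `build_merge_args` and the array names are hypothetical and not part of the original script. The sketch also assumes that `mergelinkdb`, `mergedb`, and `mergesegs` accept several inputs after the output path, which should be checked against the Nutch version in use.

```bash
#!/bin/bash
# Sketch only: build_merge_args is a hypothetical helper, not part of the
# original mergecrawl script.

build_merge_args() {
    # Fills three global arrays with the inputs found under each source crawl.
    linkdbs=()
    crawldbs=()
    segments=()
    local crawl seg
    for crawl in "$@"; do
        crawl=${crawl%/}                  # drop one trailing slash, if present
        linkdbs+=("$crawl/linkdb")
        crawldbs+=("$crawl/crawldb")
        for seg in "$crawl"/segments/*; do
            # When a crawl has no segments the pattern stays unexpanded;
            # the directory test skips it.
            if [ -d "$seg" ]; then
                segments+=("$seg")
            fi
        done
    done
}

# Possible use inside the script, once the new crawl path has been shifted off
# (assumes the merge commands take multiple inputs):
#   shift
#   build_merge_args "$@"
#   "$nutch_dir/nutch" mergelinkdb "$linkdb_dir"   "${linkdbs[@]}"
#   "$nutch_dir/nutch" mergedb     "$webdb_dir"    "${crawldbs[@]}"
#   "$nutch_dir/nutch" mergesegs   "$segments_dir" "${segments[@]}"
```

Arrays keep every path as a separate argument even when it contains spaces, which the `ls`-based expansion in the script cannot guarantee.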
Note: after the merge, don't forget to reload Tomcat so that the web application picks up the merged crawl.
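The script creates the new crawl directory before it runs any merge step, and it refuses to start if that directory already exists. A mistyped source path therefore leaves a partial crawl behind that has to be deleted by hand. A minimal sketch of a check that could run before the merge is shown below. The function name `check_crawl_layout` is hypothetical, and the expected subdirectories (`crawldb`, `linkdb`, `segments`) are simply the ones the script reads from each source crawl.

```bash
#!/bin/bash
# Sketch only: check_crawl_layout is a hypothetical helper, not part of the
# original mergecrawl script.

check_crawl_layout() {
    # Returns 0 if the directory has the subdirectories the merge reads,
    # otherwise prints one error per missing directory and returns 1.
    local crawl=$1 sub status=0
    for sub in crawldb linkdb segments; do
        if [ ! -d "$crawl/$sub" ]; then
            echo "error: missing directory: '$crawl/$sub'" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible use near the top of the script, before mkdir is reached:
#   check_crawl_layout "$crawl_1" || exit 1
#   check_crawl_layout "$crawl_2" || exit 1
```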