apache-nutch-1.10: Quick Start with parsechecker


Prerequisites:

  • Install Java
  • Set JAVA_HOME
  • Install Apache Ant: brew install ant on Mac OS X, apt-get install ant on Ubuntu/Linux
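The prerequisites above can be checked before starting. The following is a minimal sketch, not part of Nutch: check_prereqs is a hypothetical helper name, and svn is included only because the checkout step later needs it.

```shell
#!/bin/sh
# check_prereqs is a hypothetical helper, not part of Nutch.
# It prints the names of any missing prerequisites, space-separated.
check_prereqs() {
    missing=""
    # java and ant come from the prerequisite list; svn is an addition
    # here because the checkout step uses it.
    for tool in java ant svn; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    # JAVA_HOME must be set and non-empty.
    [ -n "${JAVA_HOME:-}" ] || missing="$missing JAVA_HOME"
    # Strip the leading space left by the concatenation.
    echo "${missing# }"
}

result=$(check_prereqs)
if [ -z "$result" ]; then
    echo "All prerequisites found"
else
    echo "Missing: $result"
fi
```

The check only confirms that the commands are on PATH and that JAVA_HOME is non-empty; it does not verify versions.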

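Installation steps:

  • Create a new directory
  • cd into that directory
  • svn co https://svn.apache.org/repos/asf/nutch/trunk/
  • cd into the trunk folder
  • Run $ ant runtime
  • cd runtime/local/
  • Edit conf/nutch-site.xml: add the code below inside the <configuration> section, replacing "Value_name" with the name you want.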
<property>
  <name>http.agent.name</name>
  <value>Value_name</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>
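The checkout and build steps above could be collected into one script. This is only a sketch under stated assumptions: run_step, build_nutch, the DRY_RUN variable and the default directory name nutch-build are all hypothetical names introduced here, and the commands themselves are the ones from the list.

```shell
#!/bin/sh
# run_step is a hypothetical helper: it prints the command, then executes it
# unless DRY_RUN=1 is set.
run_step() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# build_nutch is a hypothetical wrapper around the steps listed above.
# $1 is the working directory to create (the default name is an assumption).
# The body runs in a subshell so the caller's current directory is unchanged.
build_nutch() (
    workdir=${1:-nutch-build}
    run_step mkdir -p "$workdir" || exit 1
    run_step cd "$workdir" || exit 1
    run_step svn co https://svn.apache.org/repos/asf/nutch/trunk/ || exit 1
    run_step cd trunk || exit 1
    run_step ant runtime || exit 1
    echo "Next: edit runtime/local/conf/nutch-site.xml under the trunk folder"
)
```

Running `DRY_RUN=1; build_nutch my-nutch` prints the commands without touching the network or the disk, which is a cheap way to review the sequence first.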

Test example

./bin/nutch parsechecker -dumpText http://www.jpl.nasa.gov > jpl_out.txt
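The same command can be wrapped so that a failed run is easier to notice. The sketch below is illustrative only: parse_to_file, check_url and the NUTCH_BIN override are hypothetical names, and whether the exit status of parsechecker reflects parse failures is an assumption, so the sketch also checks that the output file is non-empty.

```shell
#!/bin/sh
# NUTCH_BIN is an assumed override for the launcher path; it defaults to the
# ./bin/nutch used in the command above.
NUTCH_BIN=${NUTCH_BIN:-./bin/nutch}

# parse_to_file is a hypothetical helper: it runs parsechecker on one URL and
# saves the output to the given file, returning the command's exit status.
parse_to_file() {
    url=$1
    outfile=$2
    "$NUTCH_BIN" parsechecker -dumpText "$url" > "$outfile"
}

# check_url reports whether the run succeeded and produced any output.
check_url() {
    if parse_to_file "$1" "$2" && [ -s "$2" ]; then
        echo "OK: $1 -> $2"
    else
        echo "FAILED: $1" >&2
        return 1
    fi
}
```

From runtime/local, `check_url http://www.jpl.nasa.gov jpl_out.txt` would reproduce the example above with a one-line status report.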

The original text follows.

Requirements

  • Install Java
  • Set JAVA_HOME
  • Install Apache Ant: brew install ant on Mac OS X, apt-get install ant on Ubuntu/Linux

Steps

  • Create a new directory
  • cd to the directory
  • svn co https://svn.apache.org/repos/asf/nutch/trunk/
  • cd to the trunk folder
  • Run $ ant runtime
  • cd runtime/local/
  • Edit conf/nutch-site.xml
  • Add the code below inside the <configuration> section and replace “Value_name” with the desired name


<property>
  <name>http.agent.name</name>
  <value>Value_name</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>
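Since http.agent.name must not be empty, it may help to confirm the edit before running anything. The following is a rough sketch, assuming simple text matching is good enough for this file: agent_name and check_agent_name are hypothetical helpers, and they use tr and sed rather than a real XML parser, so commented-out properties would confuse them.

```shell
#!/bin/sh
# agent_name is a hypothetical helper: it prints the <value> that directly
# follows <name>http.agent.name</name> in the given config file.
# Assumption: plain text matching, not real XML parsing.
agent_name() {
    # Join all lines so the name and value may sit on separate lines.
    tr -d '\n' < "$1" |
        sed -n 's:.*<name>http\.agent\.name</name>[[:space:]]*<value>\([^<]*\)</value>.*:\1:p'
}

# check_agent_name fails if the value is missing, empty, or still the
# "Value_name" placeholder from the snippet above.
check_agent_name() {
    name=$(agent_name "$1")
    if [ -n "$name" ] && [ "$name" != "Value_name" ]; then
        echo "http.agent.name = $name"
    else
        echo "http.agent.name is empty or still the placeholder" >&2
        return 1
    fi
}
```

From runtime/local, this would be called as `check_agent_name conf/nutch-site.xml`.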



For example, run parsechecker against the NASA JPL website:

./bin/nutch parsechecker -dumpText http://www.jpl.nasa.gov > jpl_out.txt
