Apache Nutch 1.10: Quick Start with parsechecker
Source: Internet. Editor: 程序博客网. Date: 2024/05/22 02:07
Prerequisites:
- Install Java
- Set JAVA_HOME
- Install Apache Ant: brew install ant on Mac, apt-get install ant on Ubuntu/Linux
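Before building, it can help to confirm that the required tools are actually reachable. The following is a minimal sketch, not part of the original guide: the helper name `missing_tools` is hypothetical, and svn is included only because the checkout step uses it.

```shell
#!/bin/sh
# Print each named command that is not found on PATH, one per line.
# (missing_tools is a hypothetical helper name used only in this sketch.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Report on the tools the steps in this guide rely on.
missing=$(missing_tools java ant svn)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing"
else
    echo "java, ant and svn are all on PATH"
fi

# The guide asks for JAVA_HOME to be set; this only checks that it is non-empty.
if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
fi
```

The script only reports what it finds and never exits with an error, so it is safe to paste into an interactive shell.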
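Installation steps:
- Create a new directory
- cd into that directory
- svn co https://svn.apache.org/repos/asf/nutch/trunk/
- cd into the trunk folder
- Run $ ant runtime
- cd runtime/local/
- Edit conf/nutch-site.xml: add the property below inside the <configuration> section, replacing "Value_name" with the name you want.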
<property>
  <name>http.agent.name</name>
  <value>Value_name</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>
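For a fresh checkout, the same property can be written from the shell. This is a minimal sketch under stated assumptions: `write_nutch_site` is a hypothetical helper, it emits a file containing only `http.agent.name`, and it overwrites its target, so merge by hand if the file already holds other properties. The agent name is assumed to be a single plain word with no XML special characters.

```shell
#!/bin/sh
# Write a minimal nutch-site.xml containing only http.agent.name.
# Usage: write_nutch_site <target-file> <agent-name>
# (hypothetical helper; it OVERWRITES <target-file>)
write_nutch_site() {
    target=$1
    agent=$2
    # The property description says the value must not be empty.
    if [ -z "$agent" ]; then
        echo "http.agent.name must not be empty" >&2
        return 1
    fi
    cat > "$target" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent</value>
  </property>
</configuration>
EOF
}

# Demonstration against a scratch file rather than the real conf/nutch-site.xml.
demo_file=$(mktemp)
write_nutch_site "$demo_file" "Value_name"
cat "$demo_file"
rm -f "$demo_file"
```

To apply it for real, call `write_nutch_site conf/nutch-site.xml YourName` from runtime/local/ after backing up the existing file.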
Test example (run from runtime/local/):
./bin/nutch parsechecker -dumpText http://www.jpl.nasa.gov > jpl_out.txt
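Because the command redirects everything to a file, a failure can go unnoticed. A small sketch for checking the result follows; `preview_output` is a hypothetical helper, and it makes no assumption about what parsechecker prints beyond the file being non-empty.

```shell
#!/bin/sh
# Succeed only if the given file exists and is non-empty, then show a preview.
# Usage: preview_output <file> [lines]   (hypothetical helper for this sketch)
preview_output() {
    file=$1
    lines=${2:-20}
    if [ ! -s "$file" ]; then
        echo "$file is missing or empty; check the command's error output" >&2
        return 1
    fi
    # tr strips the padding some wc implementations add around the count.
    echo "$file: $(wc -l < "$file" | tr -d ' ') lines"
    head -n "$lines" "$file"
}

# Preview the file from the example above, if it has been produced.
if [ -e jpl_out.txt ]; then
    preview_output jpl_out.txt
fi
```

An empty jpl_out.txt usually means the error message went to the terminal instead, since only standard output is redirected.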
The original English text follows.
Requirements
- Install Java
- Set JAVA_HOME
- Install Apache Ant: brew install ant on Mac OS X, apt-get install ant on Ubuntu/Linux
Steps
- Create a new directory
- cd into the directory
- svn co https://svn.apache.org/repos/asf/nutch/trunk/
- cd into the trunk folder
- Run $ ant runtime
- cd runtime/local/
- Edit conf/nutch-site.xml
- Add the code below inside the <configuration> section, replacing "Value_name" with the desired name
<property>
  <name>http.agent.name</name>
  <value>Value_name</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>
- Run parsechecker against a site, for example the NASA JPL website:
./bin/nutch parsechecker -dumpText http://www.jpl.nasa.gov > jpl_out.txt
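Taken together, the checkout and build steps in this guide can be scripted. The sketch below is illustrative only: `run` and `nutch_quickstart` are hypothetical helper names, and the work directory name is an assumption. With `DRY_RUN=1` it prints the plan instead of executing it, which is how it is invoked here.

```shell
#!/bin/sh
# Echo the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Check out, build, and enter runtime/local under the given work directory.
# Usage: nutch_quickstart <workdir>
# When not in dry-run mode this changes the calling shell's current directory.
nutch_quickstart() {
    workdir=$1
    run mkdir -p "$workdir" &&
    run cd "$workdir" &&
    run svn co https://svn.apache.org/repos/asf/nutch/trunk/ &&
    run cd trunk &&
    run ant runtime &&
    run cd runtime/local
}

# Print the plan without touching the network or the file system.
(DRY_RUN=1; nutch_quickstart nutch-work)
```

Chaining with `&&` stops at the first failing step, so a failed checkout does not lead to an `ant` run in the wrong directory.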