Installing nutch-1.0 on Ubuntu and troubleshooting configuration errors


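1. Install the JDK. I recommend installing SUN-JDK6 in the native way; for installation instructions and troubleshooting see http://blog.163.com/sukerl@126/blog/static/112027649200941110432596
2. Make sure Tomcat is installed and working; for installation instructions and troubleshooting see
http://blog.163.com/sukerl@126/blog/static/11202764920094610148888/
3. Download nutch-1.0, extract it, and copy it to /opt/. Then try running the crawl command: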
cd  /opt/nutch-1.0
root@fjadmin-webcrawler:/opt/nutch-1.0# sh bin/nutch crawl
If JAVA_HOME and the related environment variables have not been set, this usually reports the following error:
[: 72: ==: unexpected operator
Error: JAVA_HOME is not set.
To fix this error, edit /etc/environment.
Switch to the root account:
sudo su
Then edit the environment file:
gedit /etc/environment
After editing, the file reads as follows:
JAVA_HOME="/usr/lib/jvm/jdk/jdk1.6.0_13"
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/lib/jvm/jdk/jdk1.6.0_13/bin"
LANGUAGE="zh_CN:zh:en_US:en"
LANG="zh_CN.UTF-8"
CLASSPATH=/usr/lib/jvm/jdk/jdk1.6.0_13/lib:/usr/lib/jvm/jdk/jdk1.6.0_13/jre/lib



Here /usr/lib/jvm/jdk/jdk1.6.0_13 is the directory where I installed SUN-JDK.

Then reboot Ubuntu, or log out and back in.
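After logging back in, it may be worth confirming that the variable actually took effect before rerunning nutch. The following is a minimal sketch, not part of the original post; the function name check_java_home is hypothetical, and it only tests that a java executable exists under the given directory.

```shell
# check_java_home DIR: hypothetical helper (not part of nutch).
# Prints a short diagnosis and returns non-zero if DIR does not look like a JDK.
check_java_home() {
  if [ -z "$1" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ ! -x "$1/bin/java" ]; then
    echo "no java executable under $1/bin"
    return 1
  fi
  echo "ok: $1/bin/java"
}

# Check the current environment; "|| true" keeps the script going either way.
check_java_home "${JAVA_HOME:-}" || true
```

If this prints "JAVA_HOME is not set", the edit to /etc/environment has not been picked up by the current session yet.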
Next, switch to the root account again
sudo su
and retry:
cd  /opt/nutch-1.0
root@fjadmin-webcrawler:/opt/nutch-1.0# sh bin/nutch crawl
This will probably fail again, with a different error:
root@fjadmin-webcrawler:/opt/nutch-1.0# sh bin/nutch crawl
[: 72: ==: unexpected operator
[: 132: ==: unexpected operator
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/crawl/Crawl
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: org.apache.nutch.crawl.Crawl.  Program will exit.
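At first glance this looks like a CLASSPATH misconfiguration, but that is not the whole story.
Open the nutch script with
gedit bin/nutch
to see what is actually going on.
The script contains the following (my comments on the key parts are interleaved):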
# some Java parameters
if [ "$NUTCH_JAVA_HOME" != "" ]; then
  #echo "run java in $NUTCH_JAVA_HOME"
  JAVA_HOME=$NUTCH_JAVA_HOME
fi

if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi
The check above shows that when JAVA_HOME is not set, the script prints Error: JAVA_HOME is not set. This is the error we ran into a moment ago.

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

# check envvars which might override default args
if [ "$NUTCH_HEAPSIZE" != "" ]; then
  #echo "run with heapsize $NUTCH_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
  #echo $JAVA_HEAP_MAX
fi

# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar

# so that filenames w/ spaces are handled correctly in loops below
IFS=

# for developers, add plugins, job & test code to CLASSPATH
if [ -d "$NUTCH_HOME/build/plugins" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build
fi
if [ -d "$NUTCH_HOME/build/test/classes" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes
fi

if [ $IS_CORE == 0 ]
then
  for f in $NUTCH_HOME/build/nutch-*.job; do
    CLASSPATH=${CLASSPATH}:$f;
  done

  # for releases, add Nutch job to CLASSPATH
  for f in $NUTCH_HOME/nutch-*.job; do
    CLASSPATH=${CLASSPATH}:$f;
  done
else
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes
fi

# add plugins to classpath
if [ -d "$NUTCH_HOME/plugins" ]; then
  CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
fi

# add libs to CLASSPATH
for f in $NUTCH_HOME/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done
The loop above is the key to this error. Listing the lib directory shows the following files ($NUTCH_HOME is the directory nutch runs from, here /opt/nutch-1.0):
apache-solr-common-1.3.0.jar  commons-httpclient-3.0.1.jar   icu4j-4_0_1.LICENSE.txt  jetty-ext                native                   xerces-2_6_2.jar
apache-solr-solrj-1.3.0.jar    commons-lang-2.1.jar           jakarta-oro-2.0.8.jar     junit-3.8.1.jar          servlet-api.jar
commons-beanutils-1.8.0.jar   commons-logging-1.0.4.jar      jets3t-0.6.1.jar         junit-3.8.1.LICENSE.txt  taglibs-i18n.jar
commons-cli-2.0-SNAPSHOT.jar  commons-logging-api-1.0.4.jar  jets3t-0.6.1.LICENSE.txt log4j-1.2.15.jar         taglibs-i18n.tld
commons-codec-1.3.jar         hadoop-0.19.1-core.jar         jetty-5.1.4.jar          lucene-core-2.4.0.jar    tika-0.1-incubating.jar
commons-collections-3.2.1.jar icu4j-4_0_1.jar                jetty-5.1.4.LICENSE.txt  lucene-misc-2.4.0.jar    xerces-2_6_2-apis.jar
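Yet none of these is the JAR that contains org.apache.nutch.crawl.Crawl. The class files are all packaged into a JAR, so where did that JAR go?
Go to /opt/nutch-1.0
and run ls: it turns out the JAR sits in the top-level directory!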
       CHANGES.txt  default.properties  KEYS  LICENSE.txt  NOTICE.txt    nutch-1.0.war  nutchstart.sh~  README.txt  taxurls.txt  webapps

build.xml  conf         docs                lib   logs       nutch-1.0.jar  nutch-1.0.job  nutchstart.sh  plugins         src         taxweb
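Note that nutch-1.0.jar is the package of all the core org.apache.nutch class files (including Crawl). The cause is now clear: nutch-1.0.jar is never on the CLASSPATH, because it is not matched by $NUTCH_HOME/lib/*.jar, so the class org.apache.nutch.crawl.Crawl cannot be found.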
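The manual search described here can also be scripted. The sketch below is an illustration rather than the method used in this post: it assumes the unzip tool is installed, and the function name find_class_jar is hypothetical. It prints every JAR in the given directories whose listing mentions the class file.

```shell
# find_class_jar CLASSFILE DIR...: hypothetical helper (not part of nutch).
# Prints each *.jar directly inside the given directories that contains CLASSFILE.
# Assumes the unzip tool is available.
find_class_jar() {
  class_file=$1
  shift
  for dir in "$@"; do
    for jar in "$dir"/*.jar; do
      [ -e "$jar" ] || continue   # the glob matched nothing
      if unzip -l "$jar" 2>/dev/null | grep -F -- "$class_file" >/dev/null; then
        echo "$jar"
      fi
    done
  done
}

# Example: look in the nutch top-level directory and in lib for the Crawl class.
find_class_jar org/apache/nutch/crawl/Crawl.class /opt/nutch-1.0 /opt/nutch-1.0/lib
```

Any JAR that is printed but does not live in a directory the script adds to the CLASSPATH is a candidate cause of the NoClassDefFoundError.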


for f in $NUTCH_HOME/lib/jetty-ext/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

# cygwin path translation
if $cygwin; then
  CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi

The problem has been found.
Now copy nutch-1.0.jar into /opt/nutch-1.0/lib.
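The copy step can be sketched as follows. The wrapper function install_core_jar is hypothetical and is added only so the step can be tried against any directory; on the machine described here it amounts to a single cp of /opt/nutch-1.0/nutch-1.0.jar into /opt/nutch-1.0/lib.

```shell
# install_core_jar NUTCH_DIR: hypothetical helper (not part of nutch).
# Copies the core JAR (nutch-*.jar) from the top level of NUTCH_DIR into
# NUTCH_DIR/lib, where the "add libs to CLASSPATH" loop picks it up.
install_core_jar() {
  for jar in "$1"/nutch-*.jar; do
    [ -e "$jar" ] || continue   # no core JAR found, nothing to copy
    cp "$jar" "$1/lib/"
  done
}

# On the setup described in this post:
install_core_jar /opt/nutch-1.0
```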
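OK.
That solved the problem (I googled it for a long time without finding an answer, so I had to work it out myself). Running the command again now prints the usage message: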
root@fjadmin-webcrawler:/opt/nutch-1.0# sh bin/nutch crawl
[: 72: ==: unexpected operator
[: 132: ==: unexpected operator
Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]
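The two "==: unexpected operator" lines are still printed. A likely explanation, which this post does not investigate, is that sh on Ubuntu is usually dash, whose [ builtin accepts only = for string comparison, while the script uses the bash-style == (as in the "if [ $IS_CORE == 0 ]" test shown earlier). When such a test fails, the else branch runs, so the loop that adds nutch-*.job to the CLASSPATH may have been skipped as well. Running the script with bash instead of sh may therefore be worth trying, though that was not tested here. A minimal sketch of the portable form of the comparison, with a made-up IS_CORE value:

```shell
# Portable string comparison: POSIX test/[ defines "=", not "==".
IS_CORE=0   # hypothetical value, for illustration only

if [ "$IS_CORE" = 0 ]; then
  branch="release"   # corresponds to the branch that adds nutch-*.job
else
  branch="build"     # corresponds to the branch that adds build/classes
fi
echo "$branch"
```

Quoting "$IS_CORE" also keeps the test well-formed when the variable is empty.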

 

Original post: http://blog.csdn.net/deepfuture/archive/2009/12/23/5064991.aspx
