Installing sparkcrawler (Sparkler)

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 03:21
#1. Create the working directory
mkdir -p ~/sparkler
#2. Download the Solr release archive
cd ~/sparkler
wget http://mirror.bit.edu.cn/apache/lucene/solr/6.5.1/solr-6.5.1.tgz 
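
Mirrors usually keep only current releases, so the URL above may stop working over time. Older Solr releases should still be available from the Apache archive; the sketch below builds that fallback URL. The helper name `solr_archive_url` is made up for illustration and is not part of the original steps.

```shell
# Hypothetical helper: print the Apache archive URL for a given Solr version.
solr_archive_url() {
  version=$1
  printf 'https://archive.apache.org/dist/lucene/solr/%s/solr-%s.tgz\n' "$version" "$version"
}

# Example: fall back to the archive if the mirror download fails
# wget "$(solr_archive_url 6.5.1)"
```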


#3. Clone the Sparkler source code
cd ~/sparkler
git clone https://github.com/USCDataScience/sparkler.git


#4. Extract solr-6.5.1.tgz
tar -zxvf solr-6.5.1.tgz


#5. Delete solr-6.5.1.tgz
rm -f solr-6.5.1.tgz


#6. Enter the sparkler-ui directory of the Sparkler source tree (I use a full path here)
cd ~/sparkler/sparkler/sparkler-ui


#7. Create the banana folder (if it does not exist yet), enter it, and initialize the git submodules
mkdir -p banana
cd banana
git submodule init
git submodule update


#8. Go back to the sparkler-ui directory and run mvn clean package
cd ~/sparkler/sparkler/sparkler-ui
mvn clean package


#9. Go to the root of the Sparkler source tree and run mvn clean install -DskipTests
cd ~/sparkler/sparkler/
mvn clean install -DskipTests


#10. Enter the Solr directory
cd ~/sparkler/solr-6.5.1


#11. Copy the dashboard, the Jetty context file, and the crawldb configuration into Solr
cp -r ~/sparkler/sparkler/sparkler-ui/sparkler-dashboard ~/sparkler/solr-6.5.1/server/solr-webapp/


cp -r ~/sparkler/sparkler/conf/solr/sparkler-jetty-context.xml ~/sparkler/solr-6.5.1/server/contexts/


cp -rv ~/sparkler/sparkler/conf/solr/crawldb ~/sparkler/solr-6.5.1/server/solr/configsets/


cp -r ~/sparkler/solr-6.5.1/server/solr/configsets/crawldb ~/sparkler/solr-6.5.1/server/solr/


#12. Start Solr
~/sparkler/solr-6.5.1/bin/solr start
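
Solr may need a few seconds before it answers requests after the start command returns. A minimal sketch for waiting until the admin endpoint responds is shown below. The helper name `wait_for_url`, the retry count, and the delay are illustrative assumptions, and `curl` is assumed to be installed.

```shell
# Hypothetical helper: poll a URL until it answers or the retries run out.
# Usage: wait_for_url URL [RETRIES] [DELAY_SECONDS]
wait_for_url() {
  url=$1
  retries=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$retries" ]; do
    # -s: quiet, -f: fail on HTTP errors, -o /dev/null: discard the body
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait for the Solr instance started in step 12
# wait_for_url "http://localhost:8983/solr/" 30 1
```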


#13. Open http://localhost:8983/solr/#/~cores/ in a browser
# Click Add Core and set both the name and instanceDir fields to crawldb
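
As an alternative to the admin UI, the same core can probably be created through Solr's CoreAdmin HTTP API. This is a sketch, not part of the original walkthrough: the helper name `core_create_url` is made up for illustration, and the call assumes Solr is listening on localhost:8983 and that the crawldb directory from step 11 is in place.

```shell
# Hypothetical helper: build the CoreAdmin CREATE URL for a core whose
# name and instanceDir are identical (as in step 13).
core_create_url() {
  core=$1
  base=${2:-http://localhost:8983/solr}
  printf '%s/admin/cores?action=CREATE&name=%s&instanceDir=%s\n' "$base" "$core" "$core"
}

# Example (requires a running Solr and curl):
# curl "$(core_create_url crawldb)"
```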


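#14. Open http://localhost:8983/banana/#/dashboard in a browser
# 1. Click the file icon in the upper-right corner (the third small icon)
# 2. Choose the file ~/sparkler/sparkler/sparkler-ui/dashboard/Sparkler-Dashboard-Basic
# 3. Click the save icon in the upper-right corner (the fourth small icon)
# 4. Click the Set as Browser Default option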


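#15. Go to the root of the Sparkler source tree, inject a seed URL, and start a crawl
cd ~/sparkler/sparkler/
bin/sparkler.sh inject -su http://www.sina.com.cn/
# The inject command returns a jobId; note it down (here it was sjob-1496713811764)
# local[*] is quoted so the shell does not treat it as a glob pattern
bin/sparkler.sh crawl -id sjob-1496713811764 -m 'local[*]' -i 1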
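
Copying the job id by hand is easy to get wrong, so it can be pulled out of the inject output instead. The sketch below assumes the output contains the id in the form sjob-<digits>, as in the example above; the exact output format is not confirmed here, and the helper name `extract_job_id` is hypothetical.

```shell
# Hypothetical helper: read text on stdin and print the first token that
# looks like a Sparkler job id (sjob-<digits>).
extract_job_id() {
  grep -o 'sjob-[0-9][0-9]*' | head -n 1
}

# Example (assumes the inject output contains the id in this form):
# job_id=$(bin/sparkler.sh inject -su http://www.sina.com.cn/ 2>&1 | extract_job_id)
# bin/sparkler.sh crawl -id "$job_id" -m 'local[*]' -i 1
```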




The crawled data can now be viewed at http://localhost:8983/banana/#/dashboard

The related data can also be inspected at http://localhost:8983/solr/#/~cores/crawldb
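
The same data can also be checked from the command line through Solr's standard select handler. The sketch below only builds the query URL; the helper name `crawldb_query_url` and the choice of `rows=0` (return just the match count) are assumptions for illustration.

```shell
# Hypothetical helper: build a select URL that matches every document in a core.
# rows=0 asks Solr for the hit count only; wt=json requests a JSON response.
crawldb_query_url() {
  core=${1:-crawldb}
  base=${2:-http://localhost:8983/solr}
  printf '%s/%s/select?q=*:*&rows=0&wt=json\n' "$base" "$core"
}

# Example (requires a running Solr and curl); numFound in the response
# is the number of documents in the core:
# curl -s "$(crawldb_query_url)"
```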

Prerequisites: git and a JDK must be installed on the local machine; the mvn steps also require Maven.
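
A quick way to confirm the tools are present before starting is sketched below. The helper name `missing_tools` is hypothetical; the tool list in the example (git, java, mvn, wget) is taken from the commands used in this guide.

```shell
# Hypothetical helper: print each named tool that is not on PATH.
# Returns 0 when everything is present, 1 otherwise.
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}

# Example: tools used in this guide
# missing_tools git java mvn wget
```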
