Setting Up the CloudSuite Web Search Benchmark
1. This benchmark uses the Nutch search engine to exercise index serving. It consists of a client machine (simulating real clients) and a frontend server (which accepts client requests and dispatches them to the index nodes for processing).
Original installation guide: http://parsa.epfl.ch/cloudsuite/search.html
2. Required software (just install the JDK and Ant in advance):
Nutch (v1.2 was used) (server)
Faban kit (client)
Tomcat (frontend)
Search client driver, located in the package (frontend)
Apache Ant and a JDK (needed on all nodes)
3. Four nodes: 10.1.1.101 (client, runs the Faban kit), 10.1.1.103 (master and frontend), 10.1.1.104 (slave), 10.1.1.105 (slave).
4. Note that every machine in the cluster needs an SSH server installed, and root privileges are strongly recommended. With that in place, we can start the installation on the four nodes.
5. Software bundle download, containing the Nutch, Tomcat, and Faban kit packages: http://parsa.epfl.ch/cloudsuite/software/search.tar.gz
(Note: while installing Faban I found that JDK 7 did not work with it, so node 101 is configured with JDK 5 and the other nodes with JDK 7.)
6. Set up the cluster directory layout; every node except 101 needs this. Under /home/username create a directory nutch-test, and inside it create six directories: dis_search, search, filesystem, local, home, and tomcat.
7. First, configure the master node (103):
- Set up the JDK and Ant (covered in an earlier post), and make sure JAVA_HOME points to the correct Java directory;
- Extract the Nutch tarball from the search benchmark package: tar -zxvf apache-nutch-1.2-src.tar.gz
- In the extracted directory, create a build.properties file containing the following line: dist.dir=/home/username/nutch-test/search
- Also in the extracted directory, create a build directory;
- Build Nutch with Ant from the extracted directory, using: ant package
- Edit the /home/username/nutch-test/search/conf/hadoop-env.sh file and make sure its settings are correct;
- Make sure the /home/username/nutch-test/search/conf/slaves file contains localhost;
- Configure the /home/username/nutch-test/search/conf/core-site.xml file; fs.default.name should point to the master node's IP address. Here is my configuration:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/liulan/nutch-test/filesystem/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/liulan/nutch-test/filesystem/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Here dfs.name.dir is where the namenode stores its metadata, and dfs.data.dir is where the file blocks are stored. These settings must be identical on the master and the slave nodes; the slave configuration is covered below.
- Configure /home/username/nutch-test/search/conf/mapred-site.xml, adding the following content (my example):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>10.1.1.103:9001</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/liulan/nutch-test/filesystem/mapreduce/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/liulan/nutch-test/filesystem/mapreduce/local</value>
  </property>
</configuration>
mapred.job.tracker points at the master node.
- Since the nodes need to connect to each other and our SSH logins currently require a password, let's set up passwordless SSH login:
- Run ssh-keygen -t rsa on every node, pressing Enter at each prompt until it finishes;
- On the master node, run the following commands:
- cd /home/user-name/.ssh (locate the .ssh directory; I ran everything as root, so mine is under /root/)
- cp id_rsa.pub authorized_keys
- Then copy the master node's authorized_keys into the .ssh directory of every other node.
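The key generation and distribution steps above can be sketched as a script. This is a local demonstration in a throwaway directory so it runs anywhere; on the real cluster you would operate on ~/.ssh, and the node IPs in the comments are the ones from this guide:

```shell
#!/bin/sh
# Sketch of the passwordless-SSH setup, demonstrated in a temporary
# directory. On a real node, replace $SSH_DIR with ~/.ssh.
set -e
SSH_DIR=$(mktemp -d)/.ssh
mkdir -p "$SSH_DIR"
chmod 700 "$SSH_DIR"

# 1. Generate an RSA key pair non-interactively (equivalent to pressing
#    Enter through all the ssh-keygen prompts).
ssh-keygen -t rsa -N "" -f "$SSH_DIR/id_rsa" -q

# 2. Authorize the key locally, as in the cp step above.
cp "$SSH_DIR/id_rsa.pub" "$SSH_DIR/authorized_keys"
chmod 600 "$SSH_DIR/authorized_keys"

# 3. On the real cluster, push the file to each other node, e.g.:
#    scp ~/.ssh/authorized_keys 10.1.1.104:~/.ssh/
#    scp ~/.ssh/authorized_keys 10.1.1.105:~/.ssh/
ls -l "$SSH_DIR"
```

After distributing the file, `ssh 10.1.1.104` from the master should log in without asking for a password.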
- Next, format the namenode: on the master, run $HADOOP_HOME/bin/hadoop namenode -format
- Start Hadoop (single-node for now): $HADOOP_HOME/bin/start-all.sh. Then try visiting http://master_ip_address:50070 and http://master_ip_address:50030; if everything is fine you will see that the master is up.
- To check that the filesystem works, run the following commands:
mkdir $HADOOP_HOME/urlsdir
echo http://lucene.apache.org > $HADOOP_HOME/urlsdir/urllist.txt
$HADOOP_HOME/bin/hadoop dfs -put $HADOOP_HOME/urlsdir urlsdir
$HADOOP_HOME/bin/hadoop dfs -ls urlsdir
See the original installation guide for the details. (Note: these files disappear every time the namenode is reformatted.)
- Copy the master's /home/username/nutch-test/search directory into the search folder of every slave node (104 and 105), keeping the directory layout identical between master and slaves;
- On the master, run $HADOOP_HOME/bin/stop-all.sh to stop the Hadoop processes;
- Edit the master's slaves file (mentioned above), adding the IPs of all slave nodes (not including the client). Here is my example:
localhost
10.1.1.105
10.1.1.104
- Start Hadoop again: $HADOOP_HOME/bin/start-all.sh. If this fails with a "permission denied" error, it is a permissions problem; fix the permissions on the affected directories.
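One possible shape of that permissions fix, demonstrated on a temporary copy of the directory tree so it runs anywhere (on a real node the target would be /home/username/nutch-test, and the owner would be the user running Hadoop):

```shell
#!/bin/sh
# Illustrative permission fix for the nutch-test tree. The temp directory
# stands in for /home/username/nutch-test on a real node.
set -e
TARGET=$(mktemp -d)/nutch-test
mkdir -p "$TARGET/filesystem" "$TARGET/search/logs"

# Give the owning user read/write/traverse on everything Hadoop touches.
chmod -R u+rwX "$TARGET"
# On a real machine, if the tree is owned by the wrong user, also run
# (as root):  chown -R username:username /home/username/nutch-test
ls -ld "$TARGET"
```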
- Now check the live node count at http://10.1.1.103:50070/ and make sure it is correct. If it is not, look at the log files (there are links on the page, or go to the logs directories on the slaves and master) to find the cause. In my case the live node count was wrong, and the logs showed an error whose fix is described at http://www.cnblogs.com/justinzhang/p/4255303.html: reformatting the namenode changed its namespaceID, so the datanodes' namespaceID no longer matched the master's. There was also a second error visible in the slave logs;
its fix was to disable the firewall on the master node, after which the 50070 page showed two live nodes.
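A quick way to diagnose the namespaceID mismatch described above is to compare the VERSION files under dfs.name.dir and dfs.data.dir. This sketch fakes the two files in a temp directory so it is runnable anywhere; on a real node they live at <dfs.name.dir>/current/VERSION and <dfs.data.dir>/current/VERSION, and the ID values below are made-up placeholders:

```shell
#!/bin/sh
# Compare namenode and datanode namespaceIDs and sync them on mismatch.
set -e
FS=$(mktemp -d)
mkdir -p "$FS/name/current" "$FS/data/current"
# Fake a mismatch like the one produced by re-running 'namenode -format'
# (both IDs are placeholder values for the demo):
echo "namespaceID=1499922318" > "$FS/name/current/VERSION"
echo "namespaceID=313062657"  > "$FS/data/current/VERSION"

name_id=$(grep '^namespaceID=' "$FS/name/current/VERSION" | cut -d= -f2)
data_id=$(grep '^namespaceID=' "$FS/data/current/VERSION" | cut -d= -f2)

if [ "$name_id" != "$data_id" ]; then
  echo "mismatch: namenode=$name_id datanode=$data_id"
  # Fix option 1: overwrite the datanode's ID with the namenode's.
  sed -i "s/^namespaceID=.*/namespaceID=$name_id/" "$FS/data/current/VERSION"
  # Fix option 2 (destructive): rm -rf <dfs.data.dir> on the datanode,
  # then restart it so it re-registers with a fresh ID.
fi
```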
Edit the crawl-urlfilter.txt file in the master's /home/username/nutch-test/search/conf directory, changing +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ to +^http://([a-z0-9]*\.)*apache.org/
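That edit can also be done with a sed one-liner; this sketch demonstrates it on a temp copy (on the master, the real file is /home/username/nutch-test/search/conf/crawl-urlfilter.txt):

```shell
#!/bin/sh
# Rewrite the MY.DOMAIN.NAME placeholder in crawl-urlfilter.txt to
# apache.org, demonstrated on a temp file.
set -e
f=$(mktemp)
printf '%s\n' '+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/' > "$f"
sed -i 's/MY\.DOMAIN\.NAME/apache.org/' "$f"
cat "$f"   # prints: +^http://([a-z0-9]*\.)*apache.org/
```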
Add the following content to the /home/username/nutch-test/search/conf/nutch-site.xml file on all slaves (and on the master too):
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>myOrganization</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization.</description>
  </property>
</configuration>
- Now we can run the crawl. Execute the following commands (see the original guide for an explanation):
$HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 3
$HADOOP_HOME/bin/hadoop dfs -copyToLocal crawl /home/username/nutch-test
$HADOOP_HOME/bin/stop-all.sh
- Next, extract the Tomcat package from the bundle into the /home/username/nutch-test/tomcat directory (defined as TOMCAT_HOME) and set the environment variables;
- Go to the $TOMCAT_HOME/bin directory and run: tar zxvf commons-daemon-native.tar.gz
- Then switch to the $TOMCAT_HOME/bin/commons-daemon-1.0.7-native-src/unix/ directory and run the following commands (make sure gcc is installed first):
./configure
make
cp jsvc $TOMCAT_HOME/bin
- Switch to the $TOMCAT_HOME directory and run the following command:
bin/jsvc -cp ./bin/bootstrap.jar:./bin/tomcat-juli.jar -outfile ./logs/catalina.out -errfile ./logs/catalina.err org.apache.catalina.startup.Bootstrap
- Then run the following commands (to unpack the war file):
rm -rf $TOMCAT_HOME/webapps/ROOT/*
cd $TOMCAT_HOME/webapps/ROOT
cp $HADOOP_HOME/nutch-1.2.war ./
jar -xvf nutch-1.2.war
- Edit the $TOMCAT_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml file as shown below; note that the searcher.dir value must be adapted to your own directory:
<property>
  <name>fs.default.name</name>
  <value>local</value>
</property>
<property>
  <name>searcher.dir</name>
  <value>/home/username/nutch-test</value>
</property>
<property>
  <name>searcher.max.hits</name>
  <value>-1</value>
</property>
<property>
  <name>searcher.max.time.tick_count</name>
  <value>-1</value>
</property>
<property>
  <name>searcher.max.time.tick_length</name>
  <value>30</value>
</property>
- Edit the /home/username/nutch-test/copy_index_and_segments.sh file on the master node. My example:
echo "Creating indexes directory on servers"
# added 2015.10.28
nutch_parent_path=/home/liulan
#server2=10.1.1.101
server3=10.1.1.105
mkdir $nutch_parent_path/nutch-test/local/indexes
#ssh $server2 mkdir $nutch_parent_path/nutch-test/local/indexes &
ssh $server3 mkdir $nutch_parent_path/nutch-test/local/indexes &
echo "Indexes directory created. Copying indexes partitions to servers"
cp -r $nutch_parent_path/nutch-test/crawl/indexes/part-00000 $nutch_parent_path/nutch-test/local/indexes
cp -r $nutch_parent_path/nutch-test/crawl/indexes/part-00001 $nutch_parent_path/nutch-test/local/indexes
#scp -r $nutch_parent_path/nutch-test/crawl/indexes/part-00002 $server2:$nutch_parent_path/nutch-test/local/indexes
scp -r $nutch_parent_path/nutch-test/crawl/indexes/part-00003 $server3:$nutch_parent_path/nutch-test/local/indexes
echo "Creating segments directory on servers"
mkdir $nutch_parent_path/nutch-test/local/segments
#ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments &
ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments &
for i in $( ls $nutch_parent_path/nutch-test/crawl/segments ); do
    echo "Creating directories on server1"
    mkdir $nutch_parent_path/nutch-test/local/segments/$i
    mkdir $nutch_parent_path/nutch-test/local/segments/$i/content
    mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_fetch
    mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_generate
    mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_parse
    mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_data
    mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_text
    #echo "Creating directories on server2"
    #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i &
    #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/content &
    #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_fetch &
    #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_generate &
    #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_parse &
    #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_data &
    #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_text &
    #echo "Directories created on servers"
    echo "Creating directories on server3"
    ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i &
    ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/content &
    ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_fetch &
    ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_generate &
    ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_parse &
    ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_data &
    ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_text &
    echo "Directories created on servers"
    for k in $( ls $nutch_parent_path/nutch-test/crawl/segments/$i ); do
        cp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00000 $nutch_parent_path/nutch-test/local/segments/$i/$k
        cp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00001 $nutch_parent_path/nutch-test/local/segments/$i/$k
        #scp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00002 $server2:$nutch_parent_path/nutch-test/local/segments/$i/$k
        #scp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00003 $server2:$nutch_parent_path/nutch-test/local/segments/$i/$k
        scp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00002 $server3:$nutch_parent_path/nutch-test/local/segments/$i/$k
        scp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00003 $server3:$nutch_parent_path/nutch-test/local/segments/$i/$k
    done
done
My node 104 was down, so I commented out its related lines in the script.
- Then run the script: ./copy_index_and_segments.sh. If you get "cannot create directory..." errors, fix the permissions on the corresponding node's directories;
- Next, on each slave node (including the master), copy the files from the search directory into the dis_search directory;
- Create a hadoop-site.xml file in the dis_search/conf directory with the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>
  </property>
</configuration>
- On the master node, create a search-servers.txt file in the /home/username/nutch-test/ directory, listing the slave nodes (including the master), as shown below:
10.1.1.103 8890
10.1.1.105 8890
10.1.1.104 8890
- Start the index serving service: run the following command on every slave node, adjusting the path as needed:
/home/username/nutch-test/dis_search/bin/nutch server 8890 /home/username/nutch-test/local &
- Start Tomcat: $TOMCAT_HOME/bin/startup.sh. Check Tomcat's catalina.out log file to see whether it started cleanly, then run curl "http://the.ip.address.of.frontend:8080/search.jsp?query=apache" to verify that searching works;
- Extract Faban into any directory, e.g. /home/username/faban (defined as $FABAN_HOME);
- Copy the search folder from the originally downloaded bundle into the $FABAN_HOME directory;
- Start the Faban service: $FABAN_HOME/master/bin/startup.sh
- Edit the build.properties file in the $FABAN_HOME/search directory:
bench.shortname=search
faban.home=/faban
ant.home=/usr/local/ant
faban.url=http://localhost:9980/
deploy.user=deployer
deploy.password=adminadmin
deploy.clearconfig=true
compiler.target.version=1.5
- Run ant deploy in the search directory;
- Edit the following four sections of the $FABAN_HOME/search/deploy/run.xml file:
<jvmConfig xmlns="http://faban.sunsource.net/ns/fabanharness">
  <javaHome>/usr/lib/jvm/jdk5</javaHome>
  <jvmOptions>-Xmx1g -Xms256m -XX:+DisableExplicitGC</jvmOptions>
</jvmConfig>

<fa:hostConfig>
  <fa:host>10.1.1.101</fa:host>
  <fh:enabled>true</fh:enabled>
  <fh:cpus>0</fh:cpus>
  <fh:tools>NONE</fh:tools>
  <fh:userCommands></fh:userCommands>
</fa:hostConfig>

<serverConfig>
  <ipAddress>10.1.1.103</ipAddress>
  <portNumber>8080</portNumber>
</serverConfig>

<filesConfig>
  <logFile>/faban/queries.out</logFile>
  <termsFile>/faban/search/src/sample/searchdriver/terms_en.out</termsFile>
</filesConfig>
- Change the client's hostname to its IP address;
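One common way to make a hostname resolve to the intended IP is an /etc/hosts entry. This sketch works on a copy of the file so it runs without root; 10.1.1.101 is this guide's client node and "client-host" is a placeholder hostname:

```shell
#!/bin/sh
# Map the client's hostname to its IP, demonstrated on a copy of
# /etc/hosts. On the real client you would edit /etc/hosts itself.
set -e
hosts=$(mktemp)
cp /etc/hosts "$hosts" 2>/dev/null || touch "$hosts"
# Append the mapping only if the IP is not already present.
grep -q '10\.1\.1\.101' "$hosts" || printf '%s\n' '10.1.1.101 client-host' >> "$hosts"
cat "$hosts"
```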
- JDK 5 seems to impose some socket access restrictions, which caused a RuntimeException during the run. The fix is to edit the /usr/lib/jvm/jdk5/jre/lib/security/java.policy file and add the following inside grant{}:
permission java.net.SocketPermission "10.1.1.101:1024-", "accept,resolve";
Note the IP address (it is the client's); in essence this accepts connections on all ports above 1024. This was the first time I had run into this problem, and I was glad to get it solved.
- Run sh $FABAN_HOME/run.sh; if that file is not in this directory, search the Faban tree for it (it seems to be under search).
- If all goes well, the benchmark should now run. This is only a small benchmark; running the others will probably bring a fresh set of problems. Wish me luck. Below is my summary.xml output (be patient, it takes a while to appear):
<responseTimes unit="seconds">
  <operation name="GET" r90th="0.500">
    <avg>0.004</avg>
    <max>0.067</max>
    <sd>0.002</sd>
    <p90th>0.008</p90th>
    <passed>true</passed>
    <p99th>0.015</p99th>
  </operation>
</responseTimes>