Setting Up the CloudSuite Web Search Benchmark Environment


1. This benchmark uses the Nutch search engine to test the index-processing path. It consists of one client machine (simulating real clients) and one front-end server (which accepts client requests and forwards them to the index nodes for processing).

 Original installation guide: http://parsa.epfl.ch/cloudsuite/search.html

2. Required software (just have the JDK and Ant installed in advance):

    Nutch (v1.2 is used). (server)

    Faban kit. (client)

    Tomcat. (frontend)

    Search client driver (located in the package). (frontend)

    Apache Ant and the JDK (needed on all nodes)

 

3. Four nodes: 10.1.1.101 (client, runs the Faban kit), 10.1.1.103 (master and frontend), 10.1.1.104 (slave), 10.1.1.105 (slave).


4. Note that every machine in the cluster needs the SSH service installed, and it is best to have root privileges. With that in place, we can start the installation on the four cluster nodes.


5. Software package download location, containing the Nutch, Tomcat, and Faban kit archives: http://parsa.epfl.ch/cloudsuite/software/search.tar.gz (fetch and unpack it as sketched below)

   (PS: while installing Faban I found it would not run under JDK 7, so node 101 is configured with JDK 5 while the other nodes use JDK 7.)
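   A minimal sketch of fetching and unpacking the package (any working directory will do):

    wget http://parsa.epfl.ch/cloudsuite/software/search.tar.gz
    tar -zxvf search.tar.gz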


6.  Set up the cluster directory layout; every node (except node 101) needs this. Create a nutch-test directory under /home/username, then create six directories inside it: dis_search, search, filesystem, local, home, and tomcat. A sketch of the commands follows.
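    A minimal sketch, assuming your home directory is /home/username:

    # run on every node except 101
    mkdir -p /home/username/nutch-test
    cd /home/username/nutch-test
    mkdir dis_search search filesystem local home tomcat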


7. First, configure the master node (103):

  • Set up the JDK and Ant (covered in an earlier post), making sure JAVA_HOME points to the correct Java directory;
  • Unpack the Nutch tarball from the search benchmark package: tar -zxvf apache-nutch-1.2-src.tar.gz;
  • In the unpacked directory, create a build.properties file containing the line: dist.dir=/home/username/nutch-test/search;
  • Also in the unpacked directory, create a build directory;
  • In the unpacked directory, compile Nutch with Ant using: ant package;
 At this point the compiled Nutch output has been generated under /home/username/nutch-test/search. The whole sequence is summarized below.
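   A consolidated sketch of this step (the username and unpacked directory name are placeholders):

    tar -zxvf apache-nutch-1.2-src.tar.gz
    cd apache-nutch-1.2    # or whatever the unpacked directory is called
    echo "dist.dir=/home/username/nutch-test/search" > build.properties
    mkdir build
    ant package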

8. Next, configure the Hadoop environment:
  • Edit the /home/username/nutch-test/search/conf/hadoop-env.sh file and make sure it is configured correctly; at minimum, JAVA_HOME must point at the JDK used on that node. A minimal fragment follows.
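    A minimal hadoop-env.sh fragment (the paths are assumptions; use your own):

    # point JAVA_HOME at the JDK installed on this node
    export JAVA_HOME=/usr/lib/jvm/jdk7
    # optional: keep the Hadoop logs under the search tree
    export HADOOP_LOG_DIR=/home/username/nutch-test/search/logs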
        
  • Make sure the /home/username/nutch-test/search/conf/slaves file contains the single line: localhost
             
  • Configure the /home/username/nutch-test/search/conf/core-site.xml file, where fs.default.name should be set to the master node's address; the dfs.* storage properties I used are shown below:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <value>/home/liulan/nutch-test/filesystem/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/home/liulan/nutch-test/filesystem/data</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
    dfs.name.dir is where the name node stores its metadata, and dfs.data.dir is where the file blocks live. Remember: these settings must be identical on the master and the slave nodes; slave configuration is covered below.
  • Configure /home/username/nutch-test/search/conf/mapred-site.xml, adding the following (my example):
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>10.1.1.103:9001</value>
      </property>
      <property>
        <name>mapred.map.tasks</name>
        <value>4</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>4</value>
      </property>
      <property>
        <name>mapred.system.dir</name>
        <value>/home/liulan/nutch-test/filesystem/mapreduce/system</value>
      </property>
      <property>
        <name>mapred.local.dir</name>
        <value>/home/liulan/nutch-test/filesystem/mapreduce/local</value>
      </property>
    </configuration>
    mapred.job.tracker points at the master node.
  • The nodes need to connect to each other, and our SSH logins currently require a password, so here is how to set up passwordless SSH login:
    1. Run ssh-keygen -t rsa on every node, pressing Enter at each prompt until it finishes;
    2. On the master node, run the following:
      • cd /home/user-name/.ssh (find the directory containing .ssh; I ran everything as root, so mine is under /root/)
      • cp id_rsa.pub authorized_keys
      • Then copy the master's authorized_keys into the .ssh directory of every other node, e.g. with scp as sketched below.
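      A minimal sketch of distributing the key (the root user is an assumption; repeat for any other node that needs it):

        scp ~/.ssh/authorized_keys root@10.1.1.104:~/.ssh/
        scp ~/.ssh/authorized_keys root@10.1.1.105:~/.ssh/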
  • Next, format the namenode: on the master, run $HADOOP_HOME/bin/hadoop namenode -format
  • Start Hadoop (single-node for now): $HADOOP_HOME/bin/start-all.sh, then try visiting http://master_ip_address:50070 and http://master_ip_address:50030; if all went well, you will see that the master is up.
  • To test that the filesystem works, run the following commands:
    mkdir $HADOOP_HOME/urlsdir
    echo http://lucene.apache.org > $HADOOP_HOME/urlsdir/urllist.txt
    $HADOOP_HOME/bin/hadoop dfs -put $HADOOP_HOME/urlsdir urlsdir
    $HADOOP_HOME/bin/hadoop dfs -ls urlsdir
    See the original installation guide for what each command does. (PS: these files disappear every time the name node is reformatted.)
9. Deploy Nutch on multiple machines (so far it is only on the master; note that the master also acts as a slave):

  • Copy the master's /home/username/nutch-test/search directory into the search folder on every slave node (104 and 105), keeping the directory layout identical between master and slaves, e.g. as sketched below;
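    A sketch of the copy, assuming passwordless SSH is already set up (user and paths are placeholders):

    scp -r /home/username/nutch-test/search/* root@10.1.1.104:/home/username/nutch-test/search/
    scp -r /home/username/nutch-test/search/* root@10.1.1.105:/home/username/nutch-test/search/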
  • On the master node, run $HADOOP_HOME/bin/stop-all.sh to stop the Hadoop processes;
  • Edit the master's slaves file (mentioned above), adding the IPs of all slave nodes (not including the client); here is my example:
    localhost
    10.1.1.105
    10.1.1.104

  • Start Hadoop: $HADOOP_HOME/bin/start-all.sh. If this step fails with a "permission denied" error, it is a permissions problem; just set the directory permissions accordingly (a sketch follows).
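    An illustrative fix (user and paths are assumptions; apply it to whichever directory the error names):

    chown -R username:username /home/username/nutch-test
    chmod -R 755 /home/username/nutch-test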
  • Now visit http://10.1.1.103:50070/ and check the live-node count to make sure it is right. If it is not, look at the log files (linked from that page, or found under the logs directory on the master and slaves) for the exact cause. My live-node count was wrong, and the logs showed a namespaceID mismatch; the reason is well known by now: formatting the namenode changes its namespaceID, so the datanodes' namespaceIDs no longer match the master's (fix described at http://www.cnblogs.com/justinzhang/p/4255303.html). The slave logs also showed a second, connection-related error; the fix for that one was to turn off the firewall on the master node, after which port 50070 showed two live nodes. A sketch of both fixes follows.
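    An illustrative sketch of both fixes (assumptions: the VERSION path follows the dfs.data.dir configured earlier, and the firewall command assumes an iptables-based distro):

    # on each affected slave: edit the datanode's namespaceID to match the
    # master's (or wipe the data directory and let it re-register; data is lost)
    vi /home/username/nutch-test/filesystem/data/current/VERSION

    # on the master: stop the firewall so the datanodes can reach the namenode
    service iptables stop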

  • Edit the crawl-urlfilter.txt file under the master's /home/username/nutch-test/search/conf directory, changing +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ to +^http://([a-z0-9]*\.)*apache.org/;

  • In the /home/username/nutch-test/search/conf/nutch-site.xml file on every slave (the master too), add the following:

    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>myOrganization</value>
        <description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization.</description>
      </property>
    </configuration>

  • Now we can run our crawl with the following commands (see the source document for what they mean):
    $HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 3
    $HADOOP_HOME/bin/hadoop dfs -copyToLocal crawl /home/username/nutch-test
    $HADOOP_HOME/bin/stop-all.sh

10. Good; next comes the frontend configuration, again on node 103. Note that the client must not be installed on any of these nodes:
  • First, unpack the Tomcat archive from the package into the /home/username/nutch-test/tomcat directory (defined as TOMCAT_HOME) and set the environment variable, e.g. as sketched below;
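    A sketch of this step (the tarball name inside the package may differ):

    # unpack so that bin/, conf/, webapps/ end up directly under TOMCAT_HOME
    tar -zxvf apache-tomcat.tar.gz -C /home/username/nutch-test/tomcat
    export TOMCAT_HOME=/home/username/nutch-test/tomcat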
  • Go to the $TOMCAT_HOME/bin directory and run: tar zxvf commons-daemon-native.tar.gz
  • Then switch to the $TOMCAT_HOME/bin/commons-daemon-1.0.7-native-src/unix/ directory and run the following (make sure gcc is installed first):
    ./configure
    make
    cp jsvc $TOMCAT_HOME/bin
  • Switch to the $TOMCAT_HOME directory and run the following command:
    bin/jsvc -cp ./bin/bootstrap.jar:./bin/tomcat-juli.jar -outfile ./logs/catalina.out -errfile ./logs/catalina.err org.apache.catalina.startup.Bootstrap

  • Next, run the following (unpack the war file):
    rm -rf $TOMCAT_HOME/webapps/ROOT/*
    cd $TOMCAT_HOME/webapps/ROOT
    cp $HADOOP_HOME/nutch-1.2.war ./
    jar -xvf nutch-1.2.war

11. Next is the configuration for the distributed-search phase (on the frontend node):
  • Edit the $TOMCAT_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml file as shown below; note that the searcher.dir value must be rewritten to match your own directory:
    <property>
      <name>fs.default.name</name>
      <value>local</value>
    </property>
    <property>
      <name>searcher.dir</name>
      <value>/home/username/nutch-test</value>
    </property>
    <property>
      <name>searcher.max.hits</name>
      <value>-1</value>
    </property>
    <property>
      <name>searcher.max.time.tick_count</name>
      <value>-1</value>
    </property>
    <property>
      <name>searcher.max.time.tick_length</name>
      <value>30</value>
    </property>

  • Edit the /home/username/nutch-test/copy_index_and_segments.sh file on the master node; my version is below:
    echo "Creating indexes directory on servers"#add 2015.10.28nutch_parent_path=/home/liulan#server2=10.1.1.101server3=10.1.1.105mkdir $nutch_parent_path/nutch-test/local/indexes #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/indexes &ssh $server3 mkdir $nutch_parent_path/nutch-test/local/indexes &echo "Indexes directory created. Copying indexes partitions to servers"cp -r $nutch_parent_path/nutch-test/crawl/indexes/part-00000 $nutch_parent_path/nutch-test/local/indexescp -r $nutch_parent_path/nutch-test/crawl/indexes/part-00001 $nutch_parent_path/nutch-test/local/indexes#scp -r $nutch_parent_path/nutch-test/crawl/indexes/part-00002 $server2:$nutch_parent_path/nutch-test/local/indexesscp -r $nutch_parent_path/nutch-test/crawl/indexes/part-00003 $server3:$nutch_parent_path/nutch-test/local/indexes#scp -r $nutch_parent_path/nutch-test/crawl/indexes/part-00002 $server2:$nutch_parent_path/nutch-test/local/indexesscp -r $nutch_parent_path/nutch-test/crawl/indexes/part-00003 $server3:$nutch_parent_path/nutch-test/local/indexesecho "Creating segments directory on servers"mkdir $nutch_parent_path/nutch-test/local/segments #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments &ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments &for i in $( ls $nutch_parent_path/nutch-test/crawl/segments ); do  echo "Creating directories on server1"  mkdir $nutch_parent_path/nutch-test/local/segments/$i   mkdir $nutch_parent_path/nutch-test/local/segments/$i/content   mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_fetch   mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_generate   mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_parse   mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_data   mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_text   #echo "Creating directories on server2" # ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i &  #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/content &  #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_fetch &  #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_generate &  #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_parse &  #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_data &  #ssh $server2 mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_text &  #echo "Directories created on servers"  echo "Creating directories on server3"  ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i &  ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/content &  ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_fetch &  ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_generate &  ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/crawl_parse &  ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_data &  ssh $server3 mkdir $nutch_parent_path/nutch-test/local/segments/$i/parse_text &  echo "Directories created on servers"    for k in $( ls $nutch_parent_path/nutch-test/crawl/segments/$i);  do     cp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00000 $nutch_parent_path/nutch-test/local/segments/$i/$k     cp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00001 $nutch_parent_path/nutch-test/local/segments/$i/$k     #scp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00002 
$server2:$nutch_parent_path/nutch-test/local/segments/$i/$k     #scp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00003 $server2:$nutch_parent_path/nutch-test/local/segments/$i/$k     scp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00002 $server3:$nutch_parent_path/nutch-test/local/segments/$i/$k     scp -r $nutch_parent_path/nutch-test/crawl/segments/$i/$k/part-00003 $server3:$nutch_parent_path/nutch-test/local/segments/$i/$k  donedone
    My node 104 was down, so I commented out everything related to it;
  • Then run the script: ./copy_index_and_segments.sh. You may hit "cannot create directory..." errors; just fix the permissions on the directory in question on the corresponding node;
  • Next, on each slave node (the master included), copy the contents of the search directory into the dis_search directory, e.g. as sketched below;
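    A sketch (run on every search node, master included; the path is a placeholder):

    cp -r /home/username/nutch-test/search/* /home/username/nutch-test/dis_search/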
  • Create a hadoop-site.xml file in the dis_search/conf directory with the following contents:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>local</value>
      </property>
    </configuration>
  • On the master node, create a search-servers.txt file in /home/username/nutch-test/ listing the slave nodes (master included), as shown below:
    10.1.1.103 8890
    10.1.1.105 8890
    10.1.1.104 8890

  • Start the index-serving processes: run the following command on every slave node, adjusting the paths as needed:
    /home/username/nutch-test/dis_search/bin/nutch server 8890 /home/username/nutch-test/local &

  • Start the Tomcat service: $TOMCAT_HOME/bin/startup.sh. Check Tomcat's catalina.out log file to confirm it started cleanly, then run curl "http://the.ip.address.of.frontend:8080/search.jsp?query=apache" to see whether a search succeeds;
12. Next is the client configuration (node 101):
  • Unpack Faban into any directory, e.g. /home/username/faban (defined as $FABAN_HOME);
  • Copy the search folder from the package downloaded at the start into the $FABAN_HOME directory;
  • Start the Faban service: $FABAN_HOME/master/bin/startup.sh
  • Edit the build.properties file under $FABAN_HOME/search:
    bench.shortname=search
    faban.home=/faban
    ant.home=/usr/local/ant
    faban.url=http://localhost:9980/
    deploy.user=deployer
    deploy.password=adminadmin
    deploy.clearconfig=true
    compiler.target.version=1.5
  • Run the ant deploy command in the search directory;
  • Edit these four fragments of the $FABAN_HOME/search/deploy/run.xml file:
     <jvmConfig xmlns="http://faban.sunsource.net/ns/fabanharness">
        <javaHome>/usr/lib/jvm/jdk5</javaHome>
        <jvmOptions>-Xmx1g -Xms256m -XX:+DisableExplicitGC</jvmOptions>
    </jvmConfig>

     <fa:hostConfig>
        <fa:host>10.1.1.101</fa:host>
        <fh:enabled>true</fh:enabled>
        <fh:cpus>0</fh:cpus>
        <fh:tools>NONE</fh:tools>
        <fh:userCommands></fh:userCommands>
    </fa:hostConfig>

    <serverConfig>
        <ipAddress>10.1.1.103</ipAddress>
        <portNumber>8080</portNumber>
    </serverConfig>

    <filesConfig>
        <logFile>/faban/queries.out</logFile>
        <termsFile>/faban/search/src/sample/searchdriver/terms_en.out</termsFile>
    </filesConfig>

  • Change the client's hostname to its IP address;
  • JDK 5 seems to impose some socket-access restrictions that cause a RuntimeException during the run. The fix is to edit the /usr/lib/jvm/jdk5/jre/lib/security/java.policy file and add the following inside the grant {} block:

    permission java.net.SocketPermission "10.1.1.101:1024-", "accept,resolve";

    Note the IP address here is the client's; the grant essentially accepts connections on every port from 1024 up. First time I had run into this problem, and solving it felt great;
  • Run sh $FABAN_HOME/run.sh; if that file is not in this directory, search for it inside the Faban tree (it seems to live under search);
  • If everything is normal, the benchmark should now run. This is only one small benchmark, and running the others will no doubt bring a fresh batch of problems; wish me luck. Below is my summary.xml output (give it a while, don't rush):
    <responseTimes unit="seconds">
        <operation name="GET" r90th="0.500">
            <avg>0.004</avg>
            <max>0.067</max>
            <sd>0.002</sd>
            <p90th>0.008</p90th>
            <passed>true</passed>
            <p99th>0.015</p99th>
        </operation>
    </responseTimes>

Finally finished. This whole week went to the servers and this benchmark. Onward!
