How to Setup Nutch and Hadoop

After searching the web and mailing lists, it seems that there is very little information on how to set up Nutch using the Hadoop (formerly NDFS) distributed file system (HDFS) and MapReduce. The purpose of this tutorial is to provide a step-by-step method to get Nutch running with the Hadoop file system on multiple machines, including being able to both index (crawl) and search across multiple machines.

This document does not go into the Nutch or Hadoop architecture. It only tells how to get the systems up and running. At the end of the tutorial, though, I will point you to relevant resources if you want to know more about the architecture of Nutch and Hadoop.

Some things are assumed for this tutorial:

First, I performed some of the setup using root-level access. This included setting up the same user across multiple machines and setting up a local filesystem outside of the user's home directory. Root access is not required to set up Nutch and Hadoop (although sometimes it is convenient). If you do not have root access, you will need the same user set up across all machines which you are using and you will probably need to use a local filesystem inside of your home directory.

Two, all boxes will need an SSH server running (not just a client) as Hadoop uses SSH to start slave servers.

Three, this tutorial uses Whitebox Enterprise Linux 3 Respin 2 (WHEL). For those of you who don't know Whitebox, it is a RedHat Enterprise Linux clone. You should be able to follow along for any linux system, but the systems I use are Whitebox.

Four, this tutorial uses Nutch 0.8 Dev Revision 385702, and may not be compatible with future releases of either Nutch or Hadoop.

Five, for this tutorial we set up nutch across 6 different computers. If you are using a different number of machines you should still be fine, but you should have at least two different machines to prove the distributed capabilities of both HDFS and MapReduce.

Six, in this tutorial we build Nutch from source. There are nightly builds of both Nutch and Hadoop available and I will give you those urls later.

Seven, remember that this is a tutorial from my personal experience setting up Nutch and Hadoop. If something doesn't work for you, try searching and sending a message to the Nutch or Hadoop users mailing list. And as always, suggestions are welcome to help improve this tutorial for others.

 

Our Network Setup

 


First let me lay out the computers that we used in our setup. To set up Nutch and Hadoop we had 7 commodity computers ranging from 750 MHz to 1.0 GHz. Each computer had at least 128 megabytes of RAM and at least a 10 gigabyte hard drive. One computer had dual 750 MHz CPUs and another had dual 30 gigabyte hard drives. All of these computers were purchased for under $500.00 at a liquidation sale. I am telling you this to let you know that you don't have to have big hardware to get up and running with Nutch and Hadoop. Our computers were named like this:

 

devcluster01
devcluster02
devcluster03
devcluster04
devcluster05
devcluster06

 

Our master node was devcluster01. By master node I mean that it ran the Hadoop services that coordinated with the slave nodes (all of the other computers) and it was the machine on which we performed our crawl and deployed our search website.

 

Downloading Nutch and Hadoop

 


Both Nutch and Hadoop are downloadable from the Apache website. The necessary Hadoop files are bundled with Nutch, so unless you are going to be developing Hadoop you only need to download Nutch.

We built Nutch from source after downloading it from its subversion repository. There are nightly builds of both Nutch and Hadoop here:

http://cvs.apache.org/dist/lucene/nutch/nightly/

http://cvs.apache.org/dist/lucene/hadoop/nightly/

I am using Eclipse for development, so I used the Eclipse plugin for subversion to download both the Nutch and Hadoop repositories. The subversion plugin for Eclipse can be downloaded through the update manager using the url:

http://subclipse.tigris.org/update_1.0.x

If you are not using Eclipse you will need to get a subversion client. Once you have a subversion client you can either browse the Nutch subversion webpage at:

http://lucene.apache.org/nutch/version_control.html

Or you can access the Nutch subversion repository through the client at:

http://svn.apache.org/repos/asf/lucene/nutch/

I checked out the main trunk into my Eclipse workspace, but it can be checked out to a standard filesystem as well. We are going to use ant to build it, so if you have java and ant installed you should be fine.

I am not going to go into how to install java or ant; if you are working with this level of software you should know how to do that, and there are plenty of tutorials on building software with ant. If you want a complete reference for ant, pick up Erik Hatcher's book "Java Development with Ant":

http://www.manning.com/hatcher

 

Building Nutch and Hadoop

 


Once you have Nutch downloaded go to the download directory where you should see the following folders and files:

 

+ bin
+ conf
+ docs
+ lib
+ site
+ src
build.properties (add this one)
build.xml
CHANGES.txt
default.properties
index.html
LICENSE.txt
README.txt

 

Add a build.properties file and inside of it add a variable called dist.dir with its value being the location where you want to build nutch. So if you are building on a linux machine it would look something like this:

 

dist.dir=/path/to/build

 

This step is actually optional, as Nutch will create a build directory inside of the directory where you unzipped it by default, but I prefer building it to an external directory. You can name the build directory anything you want, but I recommend using a new empty folder to build into. Remember to create the build folder if it doesn't already exist.

To build nutch call the package ant task like this:

 

ant package

 

This should build nutch into your build folder. When it is finished you are ready to move on to deploying and configuring nutch.

 

Setting Up The Deployment Architecture

 


Once we get nutch deployed to all six machines we are going to call a script called start-all.sh that starts the services on the master node and data nodes. This means that the script is going to start the hadoop daemons on the master node and then will ssh into all of the slave nodes and start daemons on the slave nodes.

The start-all.sh script is going to expect that nutch is installed in exactly the same location on every machine. It is also going to expect that Hadoop is storing the data at the exact same filepath on every machine.

The way we did it was to create the following directory structure on every machine. The search directory is where Nutch is installed. The filesystem is the root of the hadoop filesystem. The home directory is the nutch user's home directory. On our master node we also installed a tomcat 5.5 server for searching.

 

/nutch
/search
(nutch installation goes here)
/filesystem
/local (used for local directory for searching)
/home
(nutch user's home directory)
/tomcat (only on one server for searching)

 

I am not going to go into detail about how to install tomcat, as again there are plenty of tutorials on how to do that. I will say that we removed all of the wars from the webapps directory and created a folder called ROOT under webapps, into which we unzipped the Nutch war file (nutch-0.8-dev.war). This makes it easy to edit configuration files inside of the Nutch war.
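
Roughly, the commands we used for this looked like the following. The war file path is a placeholder for wherever your build put nutch-0.8-dev.war, so adjust it for your machine:

cd /nutch/tomcat/webapps
rm -rf ROOT                            # clear out the default ROOT application
mkdir ROOT
cd ROOT
jar xf /path/to/nutch-0.8-dev.war      # extract the Nutch war (it is just a zip file)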

So log into the master node and all of the slave nodes as root. Create the nutch user and the different filesystems with the following commands:

 

ssh -l root devcluster01

mkdir /nutch
mkdir /nutch/search
mkdir /nutch/filesystem
mkdir /nutch/local
mkdir /nutch/home

groupadd users
useradd -d /nutch/home -g users nutch
chown -R nutch:users /nutch
passwd nutch (enter a password for the nutch user when prompted)

 

Again, if you don't have root level access you will still need the same user on each machine, as the start-all.sh script expects it. It doesn't have to be a user named nutch, although that is what we use. Also, you could put the filesystem under the common user's home directory. Basically, you don't have to be root, but it helps.

The start-all.sh script that starts the daemons on the master and slave nodes is going to need to be able to use a password-less login through ssh. For this we are going to have to set up ssh keys on each of the nodes. Since the master node is going to start daemons on itself, we also need the ability to use a password-less login on itself.

You might have seen some old tutorials or information floating around the user lists that said you would need to edit the SSH daemon to allow the property PermitUserEnvironment and to set up local environment variables for the ssh logins through an environment file. This has changed. We no longer need to edit the ssh daemon and we can set up the environment variables inside of the hadoop-env.sh file. Open the hadoop-env.sh file inside of vi:

 

cd /nutch/search/conf
vi hadoop-env.sh

 

Below is a template for the environment variables that need to be changed in the hadoop-env.sh file:

 

export HADOOP_HOME=/nutch/search
export JAVA_HOME=/usr/java/jdk1.5.0_06
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

 

There are other variables in this file which will affect the behavior of Hadoop. If, when you start running the scripts later, you start getting ssh errors, try changing the HADOOP_SSH_OPTS variable. Note also that, after the initial copy, you can set HADOOP_MASTER in your conf/hadoop-env.sh and it will use rsync to sync changes on the master to each slave node. There is a section below on how to do this.
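
For example, if the ssh connections to the slaves are slow or you keep getting interactive host-key prompts, something along these lines in conf/hadoop-env.sh can help. The option values here are only a suggestion and were not part of our original setup:

export HADOOP_SSH_OPTS="-o ConnectTimeout=5 -o StrictHostKeyChecking=no"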

Next we are going to create the keys on the master node and copy them over to each of the slave nodes. This must be done as the nutch user we created earlier. Don't just su in as the nutch user; start up a new shell and log in as the nutch user. If you su in, the password-less login we are about to set up will not work in testing, but it will work when a new session is started as the nutch user.

 

cd /nutch/home

ssh-keygen -t rsa (Use empty responses for each prompt)
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /nutch/home/.ssh/id_rsa.
Your public key has been saved in /nutch/home/.ssh/id_rsa.pub.
The key fingerprint is:
a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc nutch@localhost

 

On the master node you will copy the public key you just created to a file called authorized_keys in the same directory:

 

cd /nutch/home/.ssh
cp id_rsa.pub authorized_keys

 

You only have to run the ssh-keygen on the master node. On each of the slave nodes, after the filesystem is created, you will just need to copy the keys over using scp.

 

scp /nutch/home/.ssh/authorized_keys nutch@devcluster02:/nutch/home/.ssh/authorized_keys
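
If you have several slave nodes, a small loop saves some typing. The host names below are the ones from our cluster, and the first command creates the .ssh directory on the slave in case it does not exist yet (you will be prompted for the nutch password on each host):

for host in devcluster02 devcluster03 devcluster04 devcluster05 devcluster06; do
  ssh nutch@$host 'mkdir -p /nutch/home/.ssh && chmod 700 /nutch/home/.ssh'
  scp /nutch/home/.ssh/authorized_keys nutch@$host:/nutch/home/.ssh/
done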

 

You will have to enter the password for the nutch user the first time. An ssh prompt will appear the first time you log in to each computer asking if you want to add the computer to the known hosts. Answer yes to the prompt. Once the key is copied you shouldn't have to enter a password when logging in as the nutch user. Test it by logging into the slave nodes that you just copied the keys to:

 

ssh devcluster02
nutch@devcluster02$ (a command prompt should appear without requiring a password)
hostname (should return the name of the slave node, here devcluster02)

 

Once we have the ssh keys created we are ready to start deploying nutch to all of the slave nodes.

 

Deploy Nutch to Single Machine

 


First we will deploy nutch to a single node, the master node, but operate it in distributed mode. This means that it will use the Hadoop filesystem instead of the local filesystem. We will start with a single node to make sure that everything is up and running and will then move on to adding the other slave nodes. All of the following should be done from a session started as the nutch user. We are going to set up nutch on the master node and then, when we are ready, we will copy the entire installation to the slave nodes.

First copy the files from the nutch build to the deploy directory using something like the following command:

 

cp -R /path/to/build/* /nutch/search

 

Then make sure that all of the shell scripts are in unix format and are executable.

 

dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
chmod 700 /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
dos2unix /nutch/search/conf/*.sh
chmod 700 /nutch/search/conf/*.sh

 

When we were first trying to set up nutch we were getting bad interpreter and command not found errors because the scripts were in dos format on linux and not executable. Notice that we are doing both the bin and conf directories. In the conf directory there is a file called hadoop-env.sh that is called by other scripts.

There are a few scripts that you will need to be aware of. In the bin directory there are the nutch script, the hadoop script, the start-all.sh script and the stop-all.sh script. The nutch script is used to do things like start the nutch crawl. The hadoop script allows you to interact with the hadoop file system. The start-all.sh script starts all of the servers on the master and slave nodes. The stop-all.sh script stops all of the servers.

If you want to see options for nutch use the following command:

 

bin/nutch

 

Or if you want to see the options for hadoop use:

 

bin/hadoop

 

If you want to see options for Hadoop components such as the distributed filesystem, then use the component name as input like below:

 

bin/hadoop dfs

 

There are also files that you need to be aware of. In the conf directory there are the nutch-default.xml, the nutch-site.xml, the hadoop-default.xml and the hadoop-site.xml. The nutch-default.xml file holds all of the default options for nutch, and the hadoop-default.xml file does the same for hadoop. To override any of these options, we copy the properties to their respective *-site.xml files and change their values. Below I will give you an example hadoop-site.xml file and later a nutch-site.xml file.

There is also a file named slaves inside the conf directory. This is where we put the names of the slave nodes. Since we are running a slave data node on the same machine we are running the master node, we will also need the local computer in this slave list. Here is what the slaves file will look like to start.

 

localhost

 

It comes this way to start, so you shouldn't have to make any changes. Later we will add all of the nodes to this file, one node per line. Below is an example hadoop-site.xml file.

 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>fs.default.name</name>
<value>devcluster01:9000</value>
<description>
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
</description>
</property>

<property>
<name>mapred.job.tracker</name>
<value>devcluster01:9001</value>
<description>
The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and
reduce task.
</description>
</property>

<property>
<name>mapred.map.tasks</name>
<value>2</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>

<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>

<property>
<name>dfs.name.dir</name>
<value>/nutch/filesystem/name</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/nutch/filesystem/data</value>
</property>

<property>
<name>mapred.system.dir</name>
<value>/nutch/filesystem/mapreduce/system</value>
</property>

<property>
<name>mapred.local.dir</name>
<value>/nutch/filesystem/mapreduce/local</value>
</property>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

</configuration>

 

The fs.default.name property is used by nutch to determine the filesystem that it is going to use. Since we are using the hadoop filesystem, we have to point this to the hadoop master, or name node. In this case it is devcluster01:9000, which is the server that houses the name node on our network.

The hadoop package really comes with two components. One is the distributed filesystem. Two is the mapreduce functionality. While the distributed filesystem allows you to store and replicate files over many commodity machines, the mapreduce package allows you to easily perform parallel programming tasks.

The distributed file system has name nodes and data nodes. When a client wants to manipulate a file in the file system it contacts the name node, which then tells it which data node to contact to get the file. The name node is the coordinator and stores what blocks (not really files, but you can think of them as such for now) are on what computers and what needs to be replicated to different data nodes. The data nodes are just the workhorses. They store the actual files, serve them up on request, etc. So if you are running a name node and a data node on the same computer it is still communicating over sockets as if the data node was on a different computer.

I won't go into detail here about how mapreduce works; that is a topic for another tutorial, and when I have learned it better myself I will write one. But simply put, mapreduce breaks programming tasks into map operations (a -> b,c,d) and reduce operations (list -> a). Once a problem has been broken down into map and reduce operations, multiple map operations and multiple reduce operations can be distributed to run on different servers in parallel. So instead of handing off a file to a filesystem node, we are handing off a processing operation to a node which then processes it and returns the result to the master node. The coordination server for mapreduce is called the mapreduce job tracker. Each node that performs processing has a daemon called a task tracker that runs and communicates with the mapreduce job tracker.
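
As a loose analogy only (this is ordinary shell, not Hadoop code, and input.txt is just a placeholder file), the classic word count job can be pictured as a pipeline: the "map" step emits one word per line and the "reduce" step groups identical words and counts them.

cat input.txt | tr -s ' ' '\n' | sort | uniq -c    # map: split into words; reduce: group and count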

The nodes for both the filesystem and mapreduce communicate with their masters through a continuous heartbeat (like a ping) every 5-10 seconds or so. If the heartbeat stops, then the master assumes the node is down and doesn't use it for future operations.

The mapred.job.tracker property specifies the master mapreduce tracker, so I guess it is possible to have the name node and the mapreduce tracker on different computers. That is something I have not done yet.

The mapred.map.tasks and mapred.reduce.tasks properties tell how many tasks you want to run in parallel. This should be a multiple of the number of computers that you have. In our case, since we are starting out with 1 computer, we will have 2 map and 2 reduce tasks. Later we will increase these values as we add more nodes.

The dfs.name.dir property is the directory used by the name node to store tracking and coordination information for the data nodes.

The dfs.data.dir property is the directory used by the data nodes to store the actual filesystem data blocks. Remember that this is expected to be the same on every node.

The mapred.system.dir property is the directory that the mapreduce tracker uses to store its data. This is only on the tracker and not on the mapreduce hosts.

The mapred.local.dir property is the directory on the nodes that mapreduce uses to store its local data. I have found that mapreduce uses a huge amount of local space to perform its tasks (i.e. in the gigabytes). That may just be how I have my servers configured though. I have also found that the intermediate files produced by mapreduce don't seem to get deleted when the task exits. Again, that may be my configuration. This property is also expected to be the same on every node.

The dfs.replication property states how many servers a single file should be replicated to before it becomes available. Because we are using only a single server for right now, we have this at 1. If you set this value higher than the number of data nodes that you have available, then you will start seeing a lot of (Zero targets found, forbidden1.size=1) type errors in the logs. We will increase this value as we add more nodes.

Before you start the hadoop server, make sure you format the distributed filesystem for the name node:

 

bin/hadoop namenode -format

 

Now that we have our hadoop configured and our slaves file configured, it is time to start up hadoop on a single node and test that it is working properly. To start up all of the hadoop servers on the local machine (name node, data node, job tracker, task tracker), use the following command as the nutch user:

 

cd /nutch/search
bin/start-all.sh

 

To stop all of the servers you would use the following command:

 

bin/stop-all.sh

 

If everything has been set up correctly you should see output saying that the name node, data node, job tracker, and task tracker services have started. If this happens then we are ready to test out the filesystem. You can also take a look at the log files under /nutch/search/logs to see output from the different daemons we just started.
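
For example, to watch what a daemon is doing you can tail its log. The exact file names depend on your user and host names, so list the directory first and pick the file for the daemon you are interested in; the name node log shown here is only an example:

cd /nutch/search/logs
ls
tail -f hadoop-nutch-namenode-devcluster01.log     # substitute whichever log file ls actually shows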

To test the filesystem we are going to create a list of urls that we are going to use later for the crawl. Run the following commands:

 

cd /nutch/search
mkdir urls
vi urls/urllist.txt

http://lucene.apache.org

 

You should now have a urls/urllist.txt file with the one line pointing to the apache lucene site. Now we are going to add that directory to the filesystem. Later the nutch crawl will use this file as a list of urls to crawl. To add the urls directory to the filesystem, run the following command:

 

cd /nutch/search
bin/hadoop dfs -put urls urls

 

You should see output stating that the directory was added to the filesystem. You can also confirm that the directory was added by using the ls command:

 

cd /nutch/search
bin/hadoop dfs -ls

 

Something interesting to note about the distributed filesystem is that it is user specific. If you store a directory urls in the filesystem as the nutch user, it is actually stored as /user/nutch/urls. What this means to us is that the user that does the crawl and stores it in the distributed filesystem must also be the user that starts the search, or no results will come back. You can try this yourself by logging in with a different user and running the ls command as shown. It won't find the directories because it is looking under a different directory, /user/username, instead of /user/nutch.
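
You can see the per-user layout by listing with absolute paths. The second command assumes some other user name and is only there to show that their listing would come back empty (or with an error):

bin/hadoop dfs -ls /user/nutch            # shows the urls directory we just added
bin/hadoop dfs -ls /user/someotheruser    # hypothetical user; nothing useful comes back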

If everything worked then you are good to add other nodes and start the crawl.

 

Deploy Nutch to Multiple Machines

 


Once you have got the single node up and running, we can copy the configuration to the other slave nodes and set up those slave nodes to be started by our start script. First, if you still have the servers running on the local node, stop them with the stop-all.sh script.

To copy the configuration to the other machines, run the following command. If you have followed the configuration up to this point, things should go smoothly:

 

cd /nutch/search
scp -r /nutch/search/* nutch@computer:/nutch/search

 

Do this for every computer you want to use as a slave node. Then edit the slaves file, adding each slave node name to the file, one per line (see the example below). You will also want to edit the hadoop-site.xml file and change the values for the map and reduce task numbers, making this a multiple of the number of machines you have. For our system, which has 6 data nodes, I put in 32 as the number of tasks. The replication property can also be changed at this time. A good starting value is something like 2 or 3. (See the note at the bottom about possibly having to clear the filesystem of new datanodes.) Once this is done you should be able to start up all of the nodes.
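
For our six machines the conf/slaves file ends up looking something like this, one host per line. The master is listed as well because it also runs a data node:

devcluster01
devcluster02
devcluster03
devcluster04
devcluster05
devcluster06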

To start all of the nodes we use the exact same command as before:

 

cd /nutch/search
bin/start-all.sh

 

A command like 'bin/slaves.sh uptime' is a good way to test that things are configured correctly before attempting to call the start-all.sh script.

The first time all of the nodes are started, there may be ssh dialogs asking to add the hosts to the known_hosts file. You will have to type in yes for each one and hit enter. The output may be a little weird the first time, but just keep typing yes and hitting enter if the dialogs keep appearing. You should see output showing all the servers starting on the local machine and the data node and task tracker servers starting on the slave nodes. Once this is complete we are ready to begin our crawl.

 

Performing a Nutch Crawl

 


Now that we have the distributed file system up and running, we can perform our nutch crawl. In this tutorial we are only going to crawl a single site. I am not as concerned with someone being able to learn the crawling aspect of nutch as I am with being able to set up the distributed filesystem and mapreduce.

To make sure we crawl only a single site, we are going to edit the crawl-urlfilter.txt file and set the filter to only pick up lucene.apache.org:

 

cd /nutch/search
vi conf/crawl-urlfilter.txt

change the line that reads: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to read: +^http://([a-z0-9]*\.)*apache.org/

 

We have already added our urls to the distributed filesystem and we have edited our urlfilter, so now it is time to begin the crawl. To start the nutch crawl use the following command:

 

cd /nutch/search
bin/nutch crawl urls -dir crawled -depth 3

 

We are using the nutch crawl command. The urls is the urls directory that we added to the distributed filesystem. The -dir crawled is the output directory. This will also go to the distributed filesystem. The depth is 3, meaning it will only get 3 page links deep. There are other options you can specify; see the command documentation for those options.

You should see the crawl start up and see output for jobs running and map and reduce percentages. You can keep track of the jobs by pointing your browser to the master name node:

http://devcluster01:50030

You can also start up new terminals into the slave machines and tail the log files to see detailed output for that slave node. The crawl will probably take a while to complete. When it is done, we are ready to do the search.

 

 


Performing a Search


To perform a search on the index we just created within the distributed filesystem, we need to do two things. First we need to pull the index to a local filesystem, and second we need to set up and configure the nutch war file. Although technically possible, it is not advisable to do searching using the distributed filesystem.

The DFS is great for holding the results of the MapReduce processes, including the completed index, but for searching it simply takes too long. In a production system you are going to want to create the indexes using the MapReduce system and store the result on the DFS. Then you are going to want to copy those indexes to a local filesystem for searching. If the indexes are too big (i.e. you have a 100 million page index), you are going to want to break the index up into multiple pieces (1-2 million pages each), copy the index pieces to local filesystems from the DFS, and have multiple search servers read from those local index pieces. A full distributed search setup is the topic of another tutorial, but for now realize that you don't want to search using DFS; you want to search using local filesystems.

Once the index has been created on the DFS you can use the hadoop copyToLocal command to move it to the local file system as such.

 

bin/hadoop dfs -copyToLocal crawled /d01/local/

 

Your crawl directory should have an index directory which should contain the actual index files. Later when working with Nutch and Hadoop, if you have an indexes directory with folders such as part-xxxxx inside of it, you can use the nutch merge command to merge segment indexes into a single index. The search website, when pointed to local, will look for a directory in which there is an index folder that contains merged index files or an indexes folder that contains segment indexes. This can be a tricky part because your search website can be working properly, but if it doesn't find the indexes, all searches will return nothing.
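
As a rough sketch: if your copied crawl directory only has an indexes folder with part-xxxxx pieces in it, a merge along these lines produces the single index folder the website looks for. Run bin/nutch merge without arguments first to confirm the exact usage on your build, since the argument order shown here is an assumption:

cd /nutch/search
bin/nutch merge /d01/local/crawled/index /d01/local/crawled/indexes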

If you set up the tomcat server as we stated earlier, then you should have a tomcat installation under /nutch/tomcat, and in the webapps directory you should have a folder called ROOT with the nutch war file unzipped inside of it. Now we just need to configure the application to use the local filesystem for searching. We do this by editing the nutch-site.xml file under the WEB-INF/classes directory. Use the following commands:

 

cd /nutch/tomcat/webapps/ROOT/WEB-INF/classes
vi nutch-site.xml

 

Below is a template nutch-site.xml file:

 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>fs.default.name</name>
<value>local</value>
</property>

<property>
<name>searcher.dir</name>
<value>/d01/local/crawled</value>
</property>

</configuration>

 

The fs.default.name property is now pointed locally for searching the local index. Understand that at this point we are not using the DFS or MapReduce to do the searching; all of it is on a local machine.

The searcher.dir property is the directory where the index and resulting databases are stored on the local filesystem. In our crawl command earlier we used the crawled directory, which stored the results in crawled on the DFS. Then we copied the crawled folder to our /d01/local directory on the local filesystem. So here we point this property to /d01/local/crawled. The directory which it points to should contain not just the index directory but also the linkdb, segments, etc. All of these different databases are used by the search. This is why we copied over the crawled directory and not just the index directory.
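
As a quick sanity check, listing the copied directory should show the databases the search needs. The exact contents depend on your crawl, but it will look roughly like this:

ls /d01/local/crawled
crawldb  index  indexes  linkdb  segments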

Once the nutch-site.xml file is edited, the application should be ready to go. You can start tomcat with the following command:

 

cd /nutch/tomcat
bin/startup.sh

 

Then point your browser to http://devcluster01:8080 (your search server) to see the Nutch search web application. If everything has been configured correctly then you should be able to enter queries and get results. If the website is working but you are getting no results, it probably has to do with the index directory not being found. The searcher.dir property must be pointed to the parent of the index directory. That parent must also contain the segments, linkdb, and crawldb folders from the crawl. The index folder must be named index and contain merged segment indexes, meaning the index files are in the index directory and not in a directory below index named part-xxxx for example, or the index directory must be named indexes and contain segment indexes of the name part-xxxxx which hold the index files. I have had better luck with merged indexes than with segment indexes.

 

Distributed Searching

 


Although not really the topic of this tutorial, distributed searching needs to be addressed. In a production system, you would create your indexes and corresponding databases (i.e. crawldb) using the DFS and MapReduce, but you would search them using local filesystems on dedicated search servers for speed and to avoid network overhead.

Briefly, here is how you would set up distributed searching. Inside of the tomcat WEB-INF/classes directory, in the nutch-site.xml file, you would point the searcher.dir property to a directory that contains a search-servers.txt file. The search-servers.txt file would look like this.

 

devcluster01 1234
devcluster01 5678
devcluster02 9101

 

Each line contains a machine name and port that represents a search server. This tells the website to connect to search servers on those machines at those ports.

On each of the search servers, since we are searching local directories, you need to make sure that the filesystem in the nutch-site.xml file is pointing to local. One of the problems that I came across is that I was using the same nutch distribution to act as a slave node for DFS and MapReduce as I was using to run the distributed search server. The problem with this was that when the distributed search server started up, it was looking in the DFS for the files to read. It couldn't find them, and I would get log messages saying x servers with 0 segments.

I found it easiest to create another nutch distribution in a separate folder. I would then start the distributed search server from this separate distribution. I just used the default nutch-site.xml and hadoop-site.xml files, which have no configuration. This defaults the filesystem to local and the distributed search server is able to find the files it needs on the local box.

Whichever way you want to do it, if your index is on the local filesystem then the configuration needs to be pointed to use the local filesystem as shown below. This is usually set in the hadoop-site.xml file.

 

<property>
<name>fs.default.name</name>
<value>local</value>
<description>The name of the default file system. Either the
literal string "local" or a host:port for DFS.</description>
</property>

 

On each of the search servers you would start up the distributed search server using the nutch server command like this:

 

bin/nutch server 1234 /d01/local/crawled

 

The arguments are the port to start the server on, which must correspond with what you put into the search-servers.txt file, and the local directory that is the parent of the index folder. Once the distributed search servers are started on each machine, you can start up the website. Searching should then happen normally, with the exception of search results being pulled from the distributed search server indexes. In the logs on the search website (usually the catalina.out file), you should see messages telling you the number of servers and segments the website is attached to and searching. This will allow you to know if you have your setup correct.
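
Matching the example search-servers.txt above, the startup would look something like this. Each command is run on the machine named in the file, and the index paths are hypothetical; use whatever local directories actually hold your index pieces:

(on devcluster01, two index pieces on two ports)
bin/nutch server 1234 /d01/local/crawled
bin/nutch server 5678 /d02/local/crawled

(on devcluster02)
bin/nutch server 9101 /d01/local/crawled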

There is no command to shut down the distributed search server process; you will simply have to kill it by hand. The good news is that the website polls the servers in its search-servers.txt file to constantly check if they are up, so you can shut down a single distributed search server, change out its index, and bring it back up, and the website will reconnect automatically. This way the entire search is never down at any one point in time; only specific parts of the index would be down.

In a production environment searching is the biggest cost, both in machines and electricity. The reason is that once an index piece gets beyond about 2 million pages it takes too much time to read from the disk, so you can't have a 100 million page index on a single machine no matter how big the hard disk is. Fortunately, using distributed searching, you can have multiple dedicated search servers, each with their own piece of the index, that are searched in parallel. This allows very large index systems to be searched efficiently.

Doing the math, a 100 million page system would take about 50 dedicated search servers to serve 20+ queries per second. One way to get around having to have so many machines is by using multi-processor machines with multiple disks running multiple search servers, each using a separate disk and index. Going down this route you can cut machine costs by as much as 50% and electricity costs by as much as 75%. A multi-disk machine can't handle the same number of queries per second as a dedicated single disk machine, but the number of index pages it can handle is significantly greater, so it averages out to be much more efficient.
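
To make the arithmetic explicit: at roughly 2 million pages per index piece, a 100 million page index works out to about 50 pieces, which is where the estimate of roughly 50 dedicated search servers comes from.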

 

Rsyncing Code to Slaves

 


Nutch and Hadoop provide the ability to rsync master changes to the slave nodes. This is optional though, because it slows down the startup of the servers and because you might not want to have changes automatically synced to the slave nodes.

If you do want this capability enabled, then below I will show you how to configure your servers to rsync from the master. There are a couple of things you should know first. One, even though the slave nodes can rsync from the master, you still have to copy the base installation over to the slave node the first time so that the scripts are available to rsync. This is the way we did it above, so that shouldn't require any changes. Two, the way the rsync happens is that the master node does an ssh into the slave node and calls bin/hadoop-daemon.sh. The script on the slave node then calls the rsync back to the master node. What this means is that you have to have a password-less login from each of the slave nodes to the master node. Before, we set up password-less login from the master to the slaves; now we need to do the reverse. Three, if you have problems with the rsync options (I did, and I had to change the options because I am running an older version of ssh), look in the bin/hadoop-daemon.sh script around line 82 for where it calls the rsync command.

So the first thing we need to do is set up the hadoop master variable in the conf/hadoop-env.sh file. The variable will need to look like this:

 

export HADOOP_MASTER=devcluster01:/nutch/search

 

This will need to be copied to all of the slave nodes like this:

 

scp /nutch/search/conf/hadoop-env.sh nutch@devcluster02:/nutch/search/conf/hadoop-env.sh

 

And finally you will need to log into each of the slave nodes, create a default ssh key for each machine, and then copy it back to the master node where you will append it to the /nutch/home/.ssh/authorized_keys file. Here are the commands for each slave node; be sure to change the slave node name when you copy the key file back to the master node so you don't overwrite files:

 

ssh -l nutch devcluster02
cd /nutch/home/.ssh

ssh-keygen -t rsa (Use empty responses for each prompt)
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /nutch/home/.ssh/id_rsa.
Your public key has been saved in /nutch/home/.ssh/id_rsa.pub.
The key fingerprint is:
a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc nutch@localhost

scp id_rsa.pub nutch@devcluster01:/nutch/home/devcluster02.pub

 

Once you have done that for each of the slave nodes you can append the files to the authorized_keys file on the master node:

 

cd /nutch/home
cat devcluster*.pub >> .ssh/authorized_keys

 

With this setup, whenever you run the bin/start-all.sh script, files should be synced from the master node to each of the slave nodes.

 

Conclusion

 


I know this has been a lengthy tutorial, but hopefully it has gotten you familiar with both nutch and hadoop. Both Nutch and Hadoop are complicated applications and setting them up as you have learned is not necessarily an easy task. I hope that this document has helped to make it easier for you.

If you have any comments or suggestions feel free to email them to me at nutch-dev@dragonflymc.com. If you have questions about Nutch or Hadoop they should be addressed to their respective mailing lists. Below are general resources that are helpful with operating and developing Nutch and Hadoop.

 

Updates

 


 

  • I don't use rsync to sync code between the servers any more. Now I am using expect scripts and python scripts to manage and automate the system.
  • I use distributed searching with 1-2 million pages per index piece. We now have servers with multiple processors and multiple disks (4 per machine) running multiple search servers (1 per disk) to decrease cost and power requirements. With this, a single server holding 8 million pages can serve a constant 10 queries a second.

 

Resources

 


Google MapReduce Paper: If you want to understand more about the MapReduce architecture used by Hadoop it is useful to read about the Google implementation.

http://labs.google.com/papers/mapreduce.html

Google File System Paper: If you want to understand more about the Hadoop Distributed Filesystem architecture used by Hadoop, it is useful to read about the Google Filesystem implementation.

http://labs.google.com/papers/gfs.html

Building Nutch - Open Source Search: A useful paper co-authored by Doug Cutting about open source search and Nutch in particular.

http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=144

Hadoop 0.1.2-dev API:

http://www.netlikon.de/docs/javadoc-hadoop-0.1/overview-summary.html


 

  • I, Stephen Halsey, have used this tutorial and found it very useful, but when I tried to add additional datanodes I got error messages in the logs of those datanodes saying "2006-07-07 18:58:18,345 INFO org.apache.hadoop.dfs.DataNode: Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.UnregisteredDatanodeException: Data node linux89-1:50010 is attempting to report storage ID DS-1437847760. Expecting DS-1437847760.". I think this was because the hadoop/filesystem/data/storage file was the same on the new data nodes and they had the same data as the one that had been copied from the original. To get round this I turned everything off using bin/stop-all.sh on the name-node, deleted everything in the /filesystem directory on the new datanodes so they were clean, and ran bin/start-all.sh on the namenode. I then saw that the filesystem on the new datanodes had been created with new hadoop/filesystem/data/storage files and new directories, and everything seemed to work fine from then on. This probably is not a problem if you do follow the above process without starting any datanodes, because they will all be empty, but it was for me because I put some data onto the dfs of the single datanode system before copying it all onto the new datanodes. I am not sure if I made some other error in following this process, but I have just added this note in case people who read this document experience the same problem. Well done for the tutorial by the way, very helpful. Steve.


 

  • Nice tutorial! I tried to set it up without having fresh boxes available, just for testing (nutch 0.8). I ran into a few problems, but I finally got it to work. Some gotchas:
    • Use absolute paths for the DFS locations. Sounds strange that I used relative ones, but I wanted to set up a single hadoop node on my Windows laptop, then extend on a Linux box. So relative path names would have come in handy, as they would be the same for both machines. Don't try that. Won't work. The DFS showed a ".." directory which disappeared when I switched to absolute paths.
    • I had problems getting DFS to run on Windows at all. I always ended up getting this exception: "Could not complete write to file e:/dev/nutch-0.8/filesystem/mapreduce/system/submit_2twsuj/.job.jar.crc by DFSClient_-1318439814" - it seems nutch hasn't been tested much on Windows. So, use Linux.
    • Don't use DFS on an NFS mount (this would be pretty stupid anyway, but just for testing, one might set it up in an NFS home directory). DFS uses locks, and NFS may be configured to not allow them.
    • When you first start up hadoop, there's a warning in the namenode log, "dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove e:/dev/nutch-0.8/filesystem/mapreduce/.system.crc because it does not exist" - you can ignore that.

    • If you get errors like "failed to create file [...] on client [foo] because target-length is 0, below MIN_REPLICATION (1)", this means a block could not be distributed. Most likely there is no datanode running, or the datanode has some severe problem (like the lock problem mentioned above).

 

  • This tutorial worked well for me; however, I ran into a problem where my crawl wasn't working. It turned out it was because I needed to set the user agent and other properties for the crawl. If anyone is reading this and running into the same problem, look at the updated tutorial http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial?highlight=%28hadoop%29%7C%28tutorial%29


 

  • By default Nutch will read only the first 100 links on a page. This will result in incomplete indexes when scanning file trees. So I set the "max outlinks per page" option to -1 in nutch-site.xml and got complete indexes.

 

<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>