hadoop ubuntu
Source: Internet · Editor: 程序博客网 · Time: 2024/05/03 23:12
Michael G. Noll
Applied Research. Big Data. Distributed Systems. Open Source.
Running Hadoop on Ubuntu Linux (Single-Node Cluster)
In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux.
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.
The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.
This tutorial has been tested with the following software versions:
- Ubuntu Linux 10.04 LTS (deprecated: 8.10 LTS, 8.04, 7.10, 7.04)
- Hadoop 1.0.3, released May 2012
Prerequisites
Sun Java 6
Hadoop requires a working Java 1.5+ (aka Java 5) installation. However, using Java 1.6 (aka Java 6) is recommended for running Hadoop. For the sake of this tutorial, I will therefore describe the installation of Java 1.6.
The full JDK will be placed in /usr/lib/jvm/java-6-sun (well, this directory is actually a symlink on Ubuntu).
After installation, make a quick check whether Sun’s JDK is correctly set up:
Adding a dedicated Hadoop system user
We will use a dedicated Hadoop user account for running Hadoop. While that’s not required it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).
This will add the user hduser and the group hadoop to your local machine.
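The commands this step refers to are not present in this copy. A minimal sketch, assuming Ubuntu's standard addgroup and adduser tools; the wrapper function create_hadoop_user is a hypothetical name used here so that nothing runs until you call it:

```shell
# Hypothetical helper: creates the "hadoop" group and the "hduser" account.
# Needs sudo; adduser will prompt for a password and user details.
create_hadoop_user() {
  sudo addgroup hadoop
  sudo adduser --ingroup hadoop hduser
}
```

Call create_hadoop_user once on the machine that will run Hadoop.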
Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.
I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are several online guides available.
First, we have to generate an SSH key for the hduser user.
The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).
Second, you have to enable SSH access to your local machine with this newly created key.
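The two steps above (generate a passphrase-less RSA key, then authorize it) can be sketched as follows. This is a sketch under assumptions: KEY_DIR is a hypothetical variable that defaults to a scratch directory so the example does not touch a real account; in the tutorial's setup the files would live in $HOME/.ssh of the hduser user.

```shell
# Scratch directory so the sketch does not touch a real ~/.ssh;
# for the tutorial's setup this would be hduser's $HOME/.ssh instead.
KEY_DIR="${KEY_DIR:-$(mktemp -d)}"

if command -v ssh-keygen >/dev/null 2>&1; then
  # RSA key pair with an empty passphrase (-P ""), written without prompts.
  ssh-keygen -q -t rsa -P "" -f "$KEY_DIR/id_rsa"
  # Authorize the new public key for logins to this account.
  cat "$KEY_DIR/id_rsa.pub" >> "$KEY_DIR/authorized_keys"
  chmod 600 "$KEY_DIR/authorized_keys"
else
  echo "ssh-keygen not found; install the OpenSSH client first" >&2
fi
```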
The final step is to test the SSH setup by connecting to your local machine with the hduser user. The step is also needed to save your local machine’s host key fingerprint to the hduser user’s known_hosts file. If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
If the SSH connect should fail, these general tips might help:
- Enable debugging with ssh -vvv localhost and investigate the error in detail.
- Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hduser user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
Disabling IPv6
One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box. In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.
To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:
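The lines themselves are missing from this copy. A sketch of what they most likely are, using the kernel's standard disable_ipv6 sysctl keys; SYSCTL_FILE is a hypothetical variable that defaults to a scratch file, so point it at /etc/sysctl.conf (with root privileges) only once you are happy with the result:

```shell
# Scratch target by default; the real file is /etc/sysctl.conf.
SYSCTL_FILE="${SYSCTL_FILE:-$(mktemp)}"

# Append the IPv6 switches for all, default and loopback interfaces.
cat >> "$SYSCTL_FILE" <<'EOF'
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
EOF

cat "$SYSCTL_FILE"
```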
You have to reboot your machine in order to make the changes take effect.
You can check whether IPv6 is enabled on your machine with the following command:
A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).
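The check described above reads a kernel flag under /proc. A small sketch, assuming a Linux system; the variable names are illustrative, and the guard covers machines where the IPv6 module is not loaded at all:

```shell
# Kernel flag: 0 = IPv6 enabled, 1 = IPv6 disabled.
IPV6_FLAG=/proc/sys/net/ipv6/conf/all/disable_ipv6

if [ -r "$IPV6_FLAG" ]; then
  IPV6_DISABLED="$(cat "$IPV6_FLAG")"
else
  # The file is absent when the kernel has no IPv6 support loaded.
  IPV6_DISABLED="unknown"
fi
echo "disable_ipv6=$IPV6_DISABLED"
```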
Alternative
You can also disable IPv6 only for Hadoop as documented in HADOOP-3437. You can do so by adding the following line to conf/hadoop-env.sh:
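The line is missing from this copy. The setting discussed in HADOOP-3437 is the JVM's preferIPv4Stack property, so the line is most likely the following:

```shell
# Line for conf/hadoop-env.sh: make Hadoop's JVMs prefer the IPv4 stack.
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
```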
Hadoop
Installation
Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:
(Just to give you the idea, YMMV – personally, I create a symlink from hadoop-1.0.3 to hadoop.)
Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.
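The lines are missing from this copy. A minimal sketch, assuming the installation path /usr/local/hadoop and the JDK path /usr/lib/jvm/java-6-sun used elsewhere in this tutorial:

```shell
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (it is also configured directly for Hadoop in hadoop-env.sh)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Add the Hadoop bin/ directory to PATH
export PATH="$PATH:$HADOOP_HOME/bin"
```

After opening a new shell (or sourcing .bashrc), the hadoop command should resolve from any directory.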
You can repeat this exercise also for other users who want to use Hadoop.
Excursus: Hadoop Distributed File System (HDFS)
Before we continue let us briefly learn a bit more about Hadoop’s distributed file system.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which is part of the Apache Lucene project.
Source: The Hadoop Distributed File System: Architecture and Design, hadoop.apache.org/hdfs/docs/…
The following picture gives an overview of the most important HDFS components.
Configuration
Our goal in this tutorial is a single-node setup of Hadoop. More information about what we do in this section is available on the Hadoop Wiki.
hadoop-env.sh
The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.
Change the commented-out JAVA_HOME line in that file so that it is active and points to /usr/lib/jvm/java-6-sun, the JDK location mentioned in the prerequisites.
Note: If you are on a Mac with OS X 10.7, JAVA_HOME in conf/hadoop-env.sh has to point to the Mac's own JDK location instead.
conf/*-site.xml
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.
You can leave the settings below “as is” with the exception of the hadoop.tmp.dir parameter – this parameter you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.
Now we create the directory and set the required ownerships and permissions:
If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.
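The directory step above can be sketched as follows. Assumptions: HADOOP_TMP is a hypothetical variable that defaults to a scratch location so the sketch runs without root, and mode 750 is an illustrative choice; the tutorial's real target is /app/hadoop/tmp, which needs sudo to create.

```shell
# Default to a scratch location; the tutorial's real target is /app/hadoop/tmp.
HADOOP_TMP="${HADOOP_TMP:-$(mktemp -d)/app/hadoop/tmp}"
mkdir -p "$HADOOP_TMP"

# Owner may read/write/enter, group may read/enter, others get nothing.
chmod 750 "$HADOOP_TMP"

# On the real path, also hand the directory to the Hadoop account:
#   sudo chown hduser:hadoop /app/hadoop/tmp
ls -ld "$HADOOP_TMP"
```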
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file conf/core-site.xml:
In file conf/mapred-site.xml:
In file conf/hdfs-site.xml:
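The three snippets are missing from this copy. The sketch below writes minimal versions of the files into a scratch directory (CONF_DIR is a hypothetical variable) so that they can be inspected before being copied into conf/. The property names are standard Hadoop 1.x settings; the ports 54310 and 54311 and the replication factor of 1 are assumptions for a single-node setup, not values confirmed by this text.

```shell
# Scratch directory; copy the files to /usr/local/hadoop/conf/ when satisfied.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

# core-site.xml: base temp directory and the default file system URI.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value> <!-- port is an assumption -->
  </property>
</configuration>
EOF

# mapred-site.xml: host and port of the JobTracker.
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value> <!-- port is an assumption -->
  </property>
</configuration>
EOF

# hdfs-site.xml: one copy of each block is enough on a single node.
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

ls "$CONF_DIR"
```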
See Getting Started with Hadoop and the documentation in Hadoop’s API Overview if you have any questions about Hadoop’s configuration options.
Formatting the HDFS filesystem via the NameNode
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
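The command is missing from this copy. A guarded sketch, assuming the installation path /usr/local/hadoop from this tutorial and the Hadoop 1.x namenode -format subcommand; FORMAT_STATUS is an illustrative variable. Run it as hduser, and only on a fresh installation, because formatting erases existing HDFS metadata.

```shell
HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"

if [ -x "$HADOOP_HOME/bin/hadoop" ]; then
  # Initializes the NameNode storage directory (destroys existing metadata).
  "$HADOOP_HOME/bin/hadoop" namenode -format
  FORMAT_STATUS=$?
else
  echo "no hadoop executable under $HADOOP_HOME/bin" >&2
  FORMAT_STATUS=127
fi
echo "format exit status: $FORMAT_STATUS"
```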
The output will look like this:
Starting your single-node cluster
Run the command:
This will start up a NameNode, DataNode, JobTracker and TaskTracker on your machine.
The output will look like this:
A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0). See also How to debug MapReduce programs.
You can also check with netstat if Hadoop is listening on the configured ports.
If there are any errors, examine the log files in the /logs/ directory.
Stopping your single-node cluster
Run the command
to stop all the daemons running on your machine.
Example output:
Running a MapReduce job
We will now run your first Hadoop MapReduce job. We will use the WordCount example job which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information about what happens behind the scenes is available at the Hadoop Wiki.
Download example input data
We will use three ebooks from Project Gutenberg for this example:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
Download each ebook as text files in Plain Text UTF-8 encoding and store the files in a local temporary directory of choice, for example /tmp/gutenberg.
Restart the Hadoop cluster
Restart your Hadoop cluster if it’s not running already.
Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.
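The copy step can be sketched like this, assuming the Hadoop 1.x dfs -copyFromLocal and dfs -ls subcommands, the local directory /tmp/gutenberg from the previous section, and the HDFS target /user/hduser/gutenberg that the job below reads from. LOCAL_DIR and HDFS_DIR are illustrative variable names.

```shell
HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
LOCAL_DIR=/tmp/gutenberg
HDFS_DIR=/user/hduser/gutenberg

if [ -x "$HADOOP_HOME/bin/hadoop" ] && [ -d "$LOCAL_DIR" ]; then
  # Upload the whole local directory into HDFS, then list what arrived.
  "$HADOOP_HOME/bin/hadoop" dfs -copyFromLocal "$LOCAL_DIR" "$HDFS_DIR"
  "$HADOOP_HOME/bin/hadoop" dfs -ls "$HDFS_DIR"
else
  echo "skipping: need $HADOOP_HOME/bin/hadoop and $LOCAL_DIR" >&2
fi
```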
Run the MapReduce job
Now, we actually run the WordCount example job.
This command will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and store the result in the HDFS directory /user/hduser/gutenberg-output.
Note: The command above may fail with the following error message:
Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
at org.apache.hadoop.util.RunJar.main (RunJar.java: 90)
Caused by: java.util.zip.ZipException: error in opening zip file
In this case, re-run the command with the full name of the Hadoop Examples JAR file, for example:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
Example output of the previous command in the console:
Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output:
If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the "-D" option:
An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint. But it accepts the user specified mapred.reduce.tasks and doesn’t manipulate that. You cannot force mapred.map.tasks but you can specify mapred.reduce.tasks.
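As a sketch of the "-D" usage described above: the generic option goes right after the program name (wordcount) and before the input and output paths. The JAR name is the one used earlier in this section; REDUCERS=16 is an illustrative value, not a recommendation.

```shell
HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
REDUCERS=16   # illustrative number of reduce tasks

if [ -x "$HADOOP_HOME/bin/hadoop" ]; then
  # Run from the Hadoop directory in a subshell so the caller's cwd is kept.
  (
    cd "$HADOOP_HOME" &&
    bin/hadoop jar hadoop-examples-1.0.3.jar wordcount \
      -D mapred.reduce.tasks="$REDUCERS" \
      /user/hduser/gutenberg /user/hduser/gutenberg-output
  )
else
  echo "skipping: no hadoop executable under $HADOOP_HOME/bin" >&2
fi
```

The output directory must not exist yet; Hadoop refuses to overwrite an existing job output directory.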
Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command
to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.
Note that in this specific output the quote signs (“) enclosing the words in the head output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the part-00000 file further to see it for yourself.
The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
- http://localhost:50070/ – web UI of the NameNode daemon
- http://localhost:50030/ – web UI of the JobTracker daemon
- http://localhost:50060/ – web UI of the TaskTracker daemon
These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.
NameNode Web Interface (HDFS layer)
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.
By default, it’s available at http://localhost:50070/.
JobTracker Web Interface (MapReduce layer)
The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine’s Hadoop log files (the machine on which the web UI is running).
By default, it’s available at http://localhost:50030/.
TaskTracker Web Interface (MapReduce layer)
The task tracker web UI shows you running and non-running tasks. It also gives access to the local machine’s Hadoop log files.
By default, it’s available at http://localhost:50060/.
What’s next?
If you’re feeling comfortable, you can continue your Hadoop experience with my follow-up tutorial Running Hadoop On Ubuntu Linux (Multi-Node Cluster) where I describe how to build a Hadoop multi-node cluster with two Ubuntu boxes (this will increase your current cluster size by 100%, heh).
In addition, I wrote a tutorial on how to code a simple MapReduce job in the Python programming language which can serve as the basis for writing your own MapReduce programs.
Related Links
From yours truly:
- Running Hadoop On Ubuntu Linux (Multi-Node Cluster)
- Writing An Hadoop MapReduce Program In Python
From other people:
- How to debug MapReduce programs
- Hadoop API Overview (for Hadoop 2.x)
Change Log
Only important changes to this article are listed here:
- 2011-07-17: Renamed the Hadoop user from hadoop to hduser based on readers’ feedback. This should make the distinction between the local Hadoop user (now hduser), the local Hadoop group (hadoop), and the Hadoop CLI tool (hadoop) more clear.
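Hadoop 2.2.0 has been released. There are many installation tutorials online, but they all tend to have some problems, so here I share the steps I followed.
Ubuntu 12.04.3 LTS Server x64, or another long-term support release, is recommended.
If you are working in a virtual machine, the 32-bit version of Ubuntu is sufficient. It is advisable to bring the system up to date before installing:
- $ sudo apt-get update
- $ sudo apt-get upgrade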
Preparation
Java
Oracle Java JDK 6 or later is recommended.
- $ chmod 755 jdk-7u45-linux-x64.bin
- $ ./jdk-7u45-linux-x64.bin
- $ sudo mv jdk1.7.0_45 /opt
JAVA_HOME can then be configured as follows by editing /etc/profile:
- $ sudo vim /etc/profile
- export JAVA_HOME=/opt/jdk1.7.0_45
- export JRE_HOME=$JAVA_HOME/jre
- export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
- export PATH=$PATH:$JAVA_HOME/bin
hadoop 2.2.0
- http://mirrors.cnnic.cn/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
Hadoop path setup
- $ sudo chown cloud:cloud /opt
- $ tar xzvf hadoop-2.2.0.tar.gz
- $ mv hadoop-2.2.0 /opt
- $ ln -s /opt/hadoop-2.2.0 /opt/hadoop
Passwordless SSH to localhost
Set up a public key; the default settings are fine:
- $ ssh-keygen
- $ cd ~/.ssh
- $ cat id_rsa.pub >> authorized_keys
Then run the following command:
- $ ssh localhost
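Global variables
Append the following to the end of /etc/profile:
- export HADOOP_HOME=/opt/hadoop
- export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
This ensures that the hadoop command can be run from any directory.
Then load the updated profile: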
- $ source /etc/profile
Directory setup
Create the Hadoop data storage directories and change their ownership:
- $ sudo mkdir /hadoop
- $ sudo chown cloud:cloud /hadoop
- $ mkdir /hadoop/dfs
- $ mkdir /hadoop/tmp
Configuring Hadoop
Configuring hadoop-env.sh
- $ cd /opt/hadoop/etc/hadoop
- $ vim hadoop-env.sh
Change the default JAVA_HOME to:
- export JAVA_HOME=/opt/jdk1.7.0_45
Configuring core-site.xml
Edit the core-site.xml file:
- $ cd /opt/hadoop/etc/hadoop
- $ vim core-site.xml
Add the following inside the <configuration> tag (that is, nested within it):
- <property>
- <name>hadoop.tmp.dir</name>
- <value>/hadoop/tmp</value>
- <description>temporary directories.</description>
- </property>
- <property>
- <name>fs.defaultFS</name>
- <value>hdfs://192.168.1.100:9000</value>
- <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.
- </description>
- <final>true</final>
- </property>
For more configuration options, see:
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/core-default.xml
Configuring hdfs-site.xml
Edit hdfs-site.xml:
- $ vim hdfs-site.xml
Add the following inside the <configuration> tag (that is, nested within it):
- <property>
- <name>dfs.namenode.name.dir</name>
- <value>file:/hadoop/dfs/name</value>
- <description>Determines where on the local filesystem the DFS name node should store the name table.</description>
- <final>true</final>
- </property>
- <property>
- <name>dfs.datanode.data.dir</name>
- <value>file:/hadoop/dfs/data</value>
- <description>Determines where on the local filesystem a DFS data node should store its blocks.
- </description>
- <final>true</final>
- </property>
- <property>
- <name>dfs.replication</name>
- <value>1</value>
- </property>
- <property>
- <name>dfs.permissions</name>
- <value>false</value>
- </property>
For more hdfs-site.xml configuration options, see:
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
Configuring mapred-site.xml
This file does not exist by default and has to be created:
- $ cp mapred-site.xml.template mapred-site.xml
- $ vim mapred-site.xml
Add the following inside the <configuration> tag (that is, nested within it):
- <property>
- <name>mapreduce.framework.name</name>
- <value>yarn</value>
- </property>
- <property>
- <name>mapred.system.dir</name>
- <value>file:/hadoop/mapred/system</value>
- <final>true</final>
- </property>
- <property>
- <name>mapred.local.dir</name>
- <value>file:/hadoop/mapred/local</value>
- <final>true</final>
- </property>
Configuring yarn-site.xml
Run the following command:
- $ vim yarn-site.xml
Add the following inside the <configuration> tag (that is, nested within it):
- <property>
- <name>yarn.nodemanager.aux-services</name>
- <value>mapreduce_shuffle</value>
- <description>shuffle service that needs to be set for Map Reduce to run</description>
- </property>
- <property>
- <name>yarn.resourcemanager.hostname</name>
- <value>192.168.1.100</value>
- <description>hostname of RM</description>
- </property>
Once yarn.resourcemanager.hostname is set, the other ports use their defaults. For details, see:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Initialization
Format the NameNode:
- $ hdfs namenode -format
Starting DFS
- $ hadoop-daemon.sh start namenode
- $ hadoop-daemon.sh start datanode
Use jps to check whether the processes have started:
- $ jps
Also check the following web page:
- http://202.117.16.170:50070/dfshealth.jsp
Starting YARN
- $ yarn-daemon.sh start resourcemanager
- $ yarn-daemon.sh start nodemanager
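Troubleshooting
If the daemons fail to start, clearing the data directories and reformatting the NameNode resets HDFS (this deletes all HDFS data):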
- $ rm -rf /hadoop/dfs/*
- $ rm -rf /hadoop/tmp/*
- $ hdfs namenode -format
JDK+MySQL+Tomcat+Eclipse+MyEclipse+Hadoop+Mahout Installation on Ubuntu 12.04
Mahout integrates many common machine learning algorithms, which helps anyone who wants to do research in data mining. It is based on Java, and a lot needs to be set up before it works. At a minimum you will need the JDK, Eclipse, Hadoop and Mahout, but I strongly recommend installing everything listed below.
I JDK
II mysql
III Tomcat
IV Eclipse and MyEclipse
V Maven
VI Hadoop and Mahout
VII Test
VIII k-means Algorithm Test
I JDK
sudo gedit /etc/profile
#set java environment
export JAVA_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21
export JRE_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21/jre
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
Reboot
Test:
vim hello.java
public class hello{
public static void main(String args[]){
System.out.println("Hello World!");
}
}
javac hello.java
java hello
II mysql
sudo apt-get install mysql-server mysql-client
And test:
sudo netstat -tap | grep mysql
A graphical tool is recommended. Search for mysql-admin in Synaptic and install it:
III Tomcat
http://mirror.bjtu.edu.cn/apache/tomcat/tomcat-7/v7.0.40/bin/
apache-tomcat-7.0.40.tar.gz
Add this:
JAVA_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21
JAVA_OPTS="-server -Xms512m -Xmx1024m -XX:PermSize=600M -XX:MaxPermSize=600m -Dcom.sun.management.jmxremote"
In front of:
cygwin=false
os400=false
darwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
OS400*) os400=true;;
Darwin*) darwin=true;;
Add:
JAVA_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21
export JRE_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21/jre
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
To the end
then
Type: localhost:8080 in your browser
IV Eclipse and MyEclipse
http://www.eclipse.org/downloads/
I chose the first one
MyEclipse:
Modify default jdk:
sudo update-alternatives --install "/usr/bin/java" "java" "/home/lethic/Documents/Softwares/jdk1.7.0_21/bin/java" 300
sudo update-alternatives --install "/usr/bin/javac" "javac" "/home/lethic/Documents/Softwares/jdk1.7.0_21/bin/javac" 300
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/home/lethic/Documents/Softwares/jdk1.7.0_21/bin/javaws" 300
sudo update-alternatives --config java
sudo update-alternatives --config javac
sudo update-alternatives --config javaws
Download:
http://www.myeclipseide.com/module-htmlpages-display-pid-4.html
Build a shortcut for MyEclipse
lethic@lethic:~/Documents/Softwares$ sudo chown -R root:root MyEclispse
lethic@lethic:~/Documents/Softwares$ sudo chmod -R +r MyEclispse
lethic@lethic:~/Documents/Softwares$ cd 'MyEclispse/MyEclipse 10/'
lethic@lethic:~/Documents/Softwares/MyEclispse/MyEclipse 10$ sudo chown -R root:root myeclipse
lethic@lethic:~/Documents/Softwares/MyEclispse/MyEclipse 10$ sudo chmod -R +r myeclipse
sudo gedit /usr/bin/MyEclipse
#!/bin/sh
export MYECLIPSE_HOME="/home/lethic/Documents/Softwares/MyEclispse/MyEclipse 10/myeclipse"
"$MYECLIPSE_HOME/myeclipse" "$@"
sudo chmod 755 /usr/bin/MyEclipse
sudo chmod -R 777 /home/lethic/Documents/Softwares/MyEclispse
sudo gedit /usr/share/applications/MyEclipse.desktop
[Desktop Entry]
Encoding=UTF-8
Name=MyEclipse 10
Comment=IDE for JavaEE
Exec=/home/lethic/Documents/Softwares/MyEclispse/MyEclipse\ 10/myeclipse
Icon=/home/lethic/Documents/Softwares/MyEclispse/MyEclipse\ 10/icon.xpm
Terminal=false
Type=Application
Categories=GNOME;Application;Development;
StartupNotify=true
Then initialize it:
'/usr/MyEclipse/MyEclipse 10/myeclipse' -clean
V Maven
Apache Maven 3.0.5
http://maven.apache.org/docs/3.0.5/release-notes.html
tar -xvzf apache-maven-3.0.5-bin.tar.gz
#create a link for it to make it easy to upgrade
ln -s apache-maven-3.0.5 apache-maven
#reboot and test
VI Hadoop and Mahout
Hadoop:
http://mirror.bit.edu.cn/apache/hadoop/common/stable/
hadoop-1.1.2.tar.gz
tar zxvf hadoop-1.1.2.tar.gz
Mahout:
http://mirror.bit.edu.cn/apache/mahout/0.6/
tar zxvf mahout-distribution-0.6.tar.gz
Add this to /etc/profile
export HADOOP_HOME=/home/lethic/Documents/Softwares/hadoop-1.1.2
export HADOOP_CONF_DIR=/home/lethic/Documents/Softwares/hadoop-1.1.2/conf
export MAHOUT_HOME=/home/lethic/Documents/Softwares/mahout-distribution-0.6
export PATH=$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$PATH
Then refresh the profile again:
source /etc/profile
VII Test
I modified my /etc/profile again and finally the part I added looks like this:
umask 022
#set java environment
#JAVA_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21
export JAVA_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21
export JRE_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21/jre
#export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
export GTK_IM_MODULE=ibus
export XMODIFIERS="@im=ibus"
export QT_IM_MODULE=ibus
export MAVEN_HOME=/home/lethic/Documents/Softwares/apache-maven-3.0.5
export HADOOP_HOME=/home/lethic/Documents/Softwares/hadoop-1.1.2
export HADOOP_CONF_DIR=/home/lethic/Documents/Softwares/hadoop-1.1.2/conf
export MAHOUT_HOME=/home/lethic/Documents/Softwares/mahout-distribution-0.6
export PATH=$JAVA_HOME/bin:$MAVEN_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$PATH
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export HADOOP_HOME_WARN_SUPPRESS=1
NOTICE that all the “/home/lethic/Documents/Softwares/” should be changed to your own path.
TEST:
Java:
javac
Remember to add this to /etc/profile or Hadoop will show a warning:
export HADOOP_HOME_WARN_SUPPRESS=1
Hadoop:
Mahout:
It says: MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath
I think this is not an error, because the mahout launcher script contains:
if [ "$MAHOUT_LOCAL" != "" ]; then
echo "MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath."
else
echo "MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath."
CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR
fi
Which means that whenever MAHOUT_LOCAL is empty (not set), it will echo “MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.”.
And notice that:
# MAHOUT_LOCAL set to anything other than an empty string to force
# mahout to run locally even if
# HADOOP_CONF_DIR and HADOOP_HOME are set
Which means that if you want to run Mahout on Hadoop rather than locally, you should leave MAHOUT_LOCAL empty.
Thus we may conclude that whenever we run Mahout on Hadoop, it will echo “MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.”, which is not an error.
All of the above is my opinion and it may be wrong because I’m still a fledgling. But at least everything still works well and I have not met any problem since then.
VIII k-means Algorithm Test
Test k-means:
Download the data:
http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
And copy it to $MAHOUT_HOME
Get Hadoop started:
$HADOOP_HOME/bin/start-all.sh
Then import the data into ‘testdata’ (NOTICE that the name ‘testdata’ must not be changed; the example job reads its input from a directory with exactly this name):
$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put $MAHOUT_HOME/synthetic_control.data testdata
Kmeans algorithm:
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/mahout-examples-0.6-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
It will take a few minutes
To see the results:
$HADOOP_HOME/bin/hadoop fs -lsr output
$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples
$ cd $MAHOUT_HOME/examples/output
$ ls
And if you see:
clusteredPoints clusters-0 clusters-1 clusters-10 clusters-2 clusters-3 clusters-4
clusters-5 clusters-6 clusters-7 clusters-8 clusters-9 data
Your Mahout is properly installed.
1. Prerequisites:
- Java 6
- Dedicated unix user(hadoop) for hadoop
- SSH configured
- hadoop 2.x tarball ( hadoop-2.2.0.tar.gz )
2.Installation
$ tar -xvzf hadoop-2.2.0.tar.gz
$ mv hadoop-2.2.0 /home/hadoop/yarn/hadoop-2.2.0
$ cd /home/hadoop/yarn
$ sudo chown -R hadoop:hadoop hadoop-2.2.0
$ sudo chmod -R 755 hadoop-2.2.0
3. Setup Environment Variables in .bashrc
# Setup for Hadoop 2.0.
export HADOOP_HOME=$HOME/Programs/Hadoop/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HOME/Programs/Hadoop/hadoop-2.2.0
export HADOOP_COMMON_HOME=$HOME/Programs/Hadoop/hadoop-2.2.0
export HADOOP_HDFS_HOME=$HOME/Programs/Hadoop/hadoop-2.2.0
export YARN_HOME=$HOME/Programs/Hadoop/hadoop-2.2.0
export HADOOP_CONF_DIR=$HOME/Programs/Hadoop/hadoop-2.2.0/etc/hadoop
After Adding these lines at bottom of the .bashrc file
$ source .bashrc
4. Create Hadoop Data Directories
# Two directories, for the name node and the data node.
$ mkdir -p $HOME/yarn/yarn_data/hdfs/namenode
$ mkdir -p $HOME/yarn/yarn_data/hdfs/datanode
5. Configuration
# Base Directory.
$ cd $YARN_HOME
Add the following contents inside the configuration tag:
$ vi etc/hadoop/yarn-site.xml
# etc/hadoop/yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
Add the following contents inside the configuration tag:
$ vi etc/hadoop/core-site.xml
# etc/hadoop/core-site.xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
Add the following contents inside the configuration tag:
$ vi etc/hadoop/hdfs-site.xml
# etc/hadoop/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hadoop/yarn/yarn_data/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hadoop/yarn/yarn_data/hdfs/datanode</value>
</property>
If this file does not exist, create it and paste the content provided below:
$ vi etc/hadoop/mapred-site.xml
# etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
6. Format namenode(Onetime Process)
# Command for formatting the name node.
$ bin/hadoop namenode -format
7. Starting HDFS processes and Map-Reduce Process
# HDFS (NameNode & DataNode).
$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
# MR (Resource Manager, Node Manager & Job History Server).
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
$ sbin/mr-jobhistory-daemon.sh start historyserver
8. Verifying Installation
$ jps
# Console Output.
22844 Jps
28711 DataNode
29281 JobHistoryServer
28887 ResourceManager
29022 NodeManager
28180 NameNode
Running Word count Example Program
$ mkdir input
$ cat > input/file
This is word count example
using hadoop 2.2.0
$ bin/hdfs dfs -copyFromLocal input /input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /input /output
http://localhost:50070
9. Stopping the Daemons
# Commands.
$ sbin/hadoop-daemon.sh stop namenode
$ sbin/hadoop-daemon.sh stop datanode
$ sbin/yarn-daemon.sh stop resourcemanager
$ sbin/yarn-daemon.sh stop nodemanager
$ sbin/mr-jobhistory-daemon.sh stop historyserver