RHADOOP
Author:Lishu
Date: 2013-10-23
Weibo: @maggic_rabbit
I recently took on a project that uses RHadoop, so I started learning R and Hadoop bit by bit. Here I record the problems I ran into during learning and installation, and how I solved them. My environment is 14 servers, all running Ubuntu Server 12.04 LTS; I will not go into reinstalling the OS here. My goal is to build a 14-machine Hadoop cluster and install RHadoop on top of it.
1. Installing Hadoop
There are plenty of open resources on Hadoop; I used O'Reilly's Hadoop: The Definitive Guide, 3rd Edition. A Chinese translation should exist as well. As the title says, it is the most authoritative and comprehensive guide, covering essentially every aspect of Hadoop.
For installing Hadoop, see Appendix A and Chapter 9 of the book, or these tutorials:
Running Hadoop on Ubuntu Linux (Single-Node Cluster)
Running Hadoop on Ubuntu Linux (Multi-Node Cluster)
The concrete steps are as follows:
1. Prerequisites
Every machine needs Java and SSH, so on each machine do the following:
- Install Java:
- $ sudo apt-get install openjdk-7-jdk
Check the Java version:
- $ java -version
- java version "1.7.0_25"
- OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
- OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
- Install and set up SSH:
In principle you should generate an RSA key on every machine and add each public key to the authorized_keys of all the other machines. To simplify this, I used the same RSA key pair on every machine, i.e. everyone uses the same key to open the same lock; be aware that this is a security risk.
- $ sudo apt-get install ssh
- $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa # generate a passwordless RSA key
- $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys # add the public key to authorized_keys to allow access
- $ ssh localhost # test that it works
- Configure /etc/hosts:
- 127.0.0.1 localhost
- #127.0.1.1 serv20
- 10.11.8.27 serv07
- 10.11.8.34 serv08
- 10.11.8.42 serv09
- 10.11.8.44 serv10
- 10.11.8.48 serv11
- 10.11.8.47 serv12
- 10.11.8.49 serv13
- 10.11.8.51 serv14
- 10.11.8.52 serv15
- 10.11.8.55 serv16
- 10.11.8.53 serv17
- 10.11.8.36 serv18
- 10.11.8.54 serv19
- 10.11.8.56 serv20
Note: my 14 machines are named serv07 through serv20; serv20 is the example here. Add an entry for every machine's IP, and be sure to comment out the original 127.0.1.1 line, otherwise Hadoop will fail with errors like this:
- 2013-09-03 15:29:04,374 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: serv20/127.0.1.1:54310.
- Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
- 2013-09-03 15:29:05,375 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: serv20/127.0.1.1:54310.
- Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
- 2013-09-03 15:29:06,376 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: serv20/127.0.1.1:54310.
- Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
- 2013-09-03 15:29:07,376 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: serv20/127.0.1.1:54310.
- Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
- 2013-09-03 15:29:08,377 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: serv20/127.0.1.1:54310.
- Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
- 2013-09-03 15:29:09,377 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: serv20/127.0.1.1:54310.
- Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
If this is set up correctly, you should be able to SSH from each machine to the others directly by hostname:
- $ ssh serv19 # instead of ssh 10.11.8.54
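To verify the whole cluster in one go, a loop like the following works; this is just a sketch, assuming the serv07~serv20 hostnames above and bash brace expansion:
- $ for h in serv{07..20}; do ssh -o BatchMode=yes "$h" hostname || echo "cannot reach $h"; done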
2. Installing and configuring Hadoop
Download the Hadoop version you need from the Apache Hadoop Releases page; I used Hadoop 1.2.1 (stable).
- Unpack it:
- $ tar xzf hadoop-1.2.1.tar.gz
- Set the environment variables:
- $ export HADOOP_HOME=/home/hadoop/hadoop-1.2.1 # my username is hadoop
- $ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
- $ export PATH=$PATH:$HADOOP_HOME/bin
Note: to make these settings permanent, put them in ~/.bashrc, the shell startup file that normally lives in the user's home directory. If your home directory does not have one, copy it over from root:
- $ sudo cp /root/.bashrc ~
After editing .bashrc, reload it or log in again:
- $ source ~/.bashrc
- Edit conf/hadoop-env.sh and add the following lines:
- export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
- export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true # disable IPv6
Note: even though JAVA_HOME is already set as an environment variable, it has to be set again here for Hadoop, otherwise java will not be found.
- Edit conf/core-site.xml:
- <configuration>
- <property>
- <name>hadoop.tmp.dir</name>
- <value>/data/hadoop</value>
- </property>
- <property>
- <name>fs.default.name</name>
- <value>hdfs://serv20:54310</value>
- </property>
- </configuration>
Note: serv20 here is the machine I chose as the namenode; replace serv20 with the hostname / IP address of the machine you want as the namenode.
* For the default values and descriptions of these properties, see core-default.
Also note that hadoop.tmp.dir must already exist, so create the following directories beforehand for later use:
- $ mkdir -p /data
- $ mkdir -p /data/hadoop
- $ mkdir -p /data/hadoop/dfs
- $ mkdir -p /data/hadoop/dfs/data
- $ mkdir -p /data/hadoop/dfs/local
- $ mkdir -p /data/hadoop/dfs/name
- $ mkdir -p /data/hadoop/dfs/namesecondary
- $ mkdir -p /tmp/hadoop/mapred
- $ mkdir -p /tmp/hadoop/mapred/local
- $ chown -R hadoop:hadoop /data
- $ chown -R hadoop:hadoop /tmp/hadoop
- Edit conf/mapred-site.xml:
- <configuration>
- <property>
- <name>mapred.job.tracker</name>
- <value>serv20:54311</value>
- </property>
- <property>
- <name>mapred.local.dir</name>
- <value>/data/hadoop/dfs/local</value>
- </property>
- <property>
- <name>mapred.child.java.opts</name>
- <value>-Xmx512m</value>
- </property>
- <property>
- <name>mapred.system.dir</name>
- <value>/tmp/hadoop/mapred/local</value>
- </property>
- <property>
- <name>mapred.tasktracker.map.tasks.maximum</name>
- <value>4</value>
- </property>
- <property>
- <name>mapred.tasktracker.reduce.tasks.maximum</name>
- <value>4</value>
- </property>
- </configuration>
Note: I also put the jobtracker on serv20, but the namenode and the jobtracker do not have to be the same machine, as I mention below.
* For the default values and descriptions of these properties, see mapred-default.
- Edit conf/hdfs-site.xml:
- <configuration>
- <property>
- <name>dfs.replication</name>
- <value>3</value>
- </property>
- <property>
- <name>fs.checkpoint.dir</name>
- <value>/data/hadoop/dfs/namesecondary</value>
- </property>
- <property>
- <name>fs.checkpoint.size</name>
- <value>1048580</value>
- </property>
- <property>
- <name>dfs.http.address</name>
- <value>hdfs://serv19:50070</value>
- </property>
- <property>
- <name>dfs.name.dir</name>
- <value>/data/hadoop/dfs/name</value>
- </property>
- <property>
- <name>dfs.data.dir</name>
- <value>/data/hadoop/dfs/data</value>
- </property>
- <property>
- <name>dfs.permissions.superusergroup</name>
- <value>hadoop</value>
- </property>
- </configuration>
Note: here I set the secondarynamenode to serv19.
* For the default values and descriptions of these properties, see hdfs-default.
Everything up to this point has to be installed and configured on every machine (the configuration is identical everywhere). In fact you can write a single shell script that does the installation and then scp these conf files into place on every machine, which saves time, as long as all the machines have the same environment.
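A rough sketch of such a script (not the exact one I used; it assumes passwordless SSH as set up above and the same home-directory layout on every node):
- #!/bin/bash
- # push the unpacked Hadoop tree (including the edited conf/ files) and the
- # shell profile from the master (serv20) to every other node
- for h in serv{07..19}; do
-     scp -r ~/hadoop-1.2.1 "$h":~/
-     scp ~/.bashrc "$h":~/
- done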
- Edit conf/masters (master only):
A few concepts need attention here. The namenode is an HDFS concept: it is the machine on which you run bin/start-dfs.sh. The jobtracker is a MapReduce concept: it is the machine on which you run bin/start-mapred.sh. Normally the namenode and the jobtracker run on the same machine, which is called the master; the other machines, each running a datanode and a tasktracker, are called slaves. In my setup the master is serv20 and all the other machines are slaves. However, what conf/masters actually defines is not the master in this sense but the secondarynamenode. I set the secondarynamenode to serv19, so on serv20 (the master) conf/masters contains:
- serv19
Note: only conf/masters on the master needs to be set, not on the slaves.
- Edit conf/slaves (master only):
This file lists all the slave hosts, one per line. It also only needs to be set on the master.
- serv20
- serv19
- serv18
- serv17
- serv16
- serv15
- serv14
- serv13
- serv12
- serv11
- serv10
- serv09
- serv08
- serv07
Note that serv20 is also listed as a slave, which means serv20 is both master and slave.
3. Formatting HDFS (namenode)
Format HDFS on the namenode, i.e. on serv20 in my case. In principle you need to reformat HDFS whenever the configuration files above change.
- hadoop@serv20:~$ hadoop namenode -format
If you have already run the command above but still get an error like the following:
- 13/05/23 04:11:37 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
- java.io.IOException: NameNode is not formatted.
- at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:330)
- at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
- at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:411)
then try this command:
- hadoop@serv20:~$ hadoop namenode -format -force
If you get an error like the following:
- 2013-10-22 14:16:33,269 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in
- /data/hadoop/dfs/data: namenode namespaceID = 2142266875; datanode namespaceID = 894925905
- at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:232)
- at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:147)
- at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:414)
- at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:321)
- at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1712)
- at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1651)
- at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1669)
- at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1795)
- at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1812)
- 2013-10-22 14:16:33,270 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
- /************************************************************
- SHUTDOWN_MSG: Shutting down DataNode at serv07/127.0.1.1
- ************************************************************/
it means the datanode's namespaceID does not match the namenode's, which can happen after reformatting the namenode. Find the VERSION file under the dfs.data.dir directory defined in hdfs-site.xml (in my case /data/hadoop/dfs/data) and edit it with:
- $ vim /data/hadoop/dfs/data/current/VERSION
- #Tue Oct 22 14:49:40 CDT 2013
- namespaceID=894925905 → change to 2142266875
- storageID=DS-1631780027-10.11.8.27-50010-1380139765131
- cTime=0
- storageType=DATA_NODE
- layoutVersion=-41
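Alternatively (my own workaround, and only if the datanode holds no data you need to keep), you can wipe the datanode's storage directory and restart it so that it re-registers with the namenode's new namespaceID:
- $ bin/hadoop-daemon.sh stop datanode    # run on the affected datanode
- $ rm -rf /data/hadoop/dfs/data/*
- $ bin/hadoop-daemon.sh start datanode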
If you get an error like the following:
- 2013-07-05 14:04:40,557 INFO org.apache.hadoop.ipc.Server: Stopping server on 50010
- 2013-07-05 14:04:40,564 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.
- InconsistentFSStateException: Directory /tmp/haloop/dfs/name is in an inconsistent state: storage directory does not exist or
- is not accessible.
- at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:290)
- at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
- at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
- at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:292)
- at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
- at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:279)
- at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
- at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
- 2013-07-05 14:04:40,572 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
- /**********************************
- SHUTDOWN_MSG: Shutting down NameNode at elmorsy/127.0.1.1
- **********************************/
this error says the configured namenode directory is in an inconsistent state. A directory under /tmp, like the one here, is prone to this because it may be cleaned up or deleted, which is why I ended up using /data/hadoop/dfs/name. The same problem can also occur on the secondarynamenode.
4. Running Hadoop
Running Hadoop takes two steps: start HDFS, then start MapReduce:
- $ bin/start-dfs.sh
- $ bin/start-mapred.sh
My namenode and jobtracker are on the same machine, so I can use this single command instead:
- $ bin/start-all.sh
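A quick sanity check after start-up (standard Hadoop 1.x commands) is to ask HDFS for a cluster report and list the root directory:
- $ hadoop dfsadmin -report   # should list all 14 datanodes as live
- $ hadoop fs -ls /           # HDFS root should be browsable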
To check whether each machine is running properly, use jps (it has to be installed first):
- hadoop@serv20:~$ jps
- 13121 JobTracker
- 29872 Jps
- 13434 TaskTracker
- 12937 DataNode
- 12634 NameNode
- hadoop@serv19:~$ jps
- 17291 TaskTracker
- 17051 SecondaryNameNode
- 16793 DataNode
- 27317 Jps
- hadoop@serv18:~$ jps
- 13764 Jps
- 16157 DataNode
- 16386 TaskTracker
serv20 is both master and slave, so it runs the NameNode, JobTracker, DataNode, and TaskTracker.
serv19 is the secondarynamenode and a slave, so it runs the SecondaryNameNode, DataNode, and TaskTracker.
serv18 is only a slave, so it runs just the DataNode and TaskTracker.
5. Debugging Hadoop jobs
- To view the Hadoop web UIs, see the table below:
  Daemon                     Default port   Configuration parameter
  HDFS
    Namenode                 50070          dfs.http.address
    Datanodes                50075          dfs.datanode.http.address
    Secondarynamenode        50090          dfs.secondary.http.address
    Backup/Checkpoint node*  50105          dfs.backup.http.address
  MapReduce
    Jobtracker               50030          mapred.job.tracker.http.address
    Tasktrackers             50060          mapred.task.tracker.http.address
  * Replaces the secondarynamenode in 0.21.
For more, see: hadoop default ports quick reference
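For example, my jobtracker runs on serv20 with the default web port, so its UI should be reachable at http://serv20:50030; a quick reachability check from the shell (a sketch using curl):
- $ curl -sI http://serv20:50030/ | head -n 1   # should return an HTTP response header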
- To view the Hadoop logs:
- On the jobtracker:
  HADOOP_HOME/logs/hadoop-username-jobtracker-hostname.log* => daemon logs
  HADOOP_HOME/logs/job_*.xml => job configuration XML logs
  HADOOP_HOME/logs/history/*_conf.xml => job configuration logs
  HADOOP_HOME/logs/history/<everything else> => job statistics logs
- On the namenode:
  HADOOP_HOME/logs/hadoop-username-namenode-hostname.log* => daemon logs
- On the secondarynamenode:
  HADOOP_HOME/logs/hadoop-username-secondarynamenode-hostname.log* => daemon logs
- On the datanodes:
  HADOOP_HOME/logs/hadoop-username-datanode-hostname.log* => daemon logs
- On the tasktrackers:
  HADOOP_HOME/logs/hadoop-username-tasktracker-hostname.log* => daemon logs
  HADOOP_HOME/logs/userlogs/attempt_*/stderr => standard error logs
  HADOOP_HOME/logs/userlogs/attempt_*/stdout => standard out logs
  HADOOP_HOME/logs/userlogs/attempt_*/syslog => log4j logs
For more, see: apache hadoop log files where to find them in cdh and what info they contain
For more Hadoop commands, see the commands manual.
- WordCount
For details, see wordcount v1.0; a minimal run of the bundled example is sketched below.
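A minimal run of the WordCount example that ships with Hadoop 1.2.1 might look like this (a sketch; the local file name and HDFS paths are placeholders of my choosing):
- $ hadoop fs -mkdir /user/hadoop/input
- $ hadoop fs -put mytext.txt /user/hadoop/input
- $ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount /user/hadoop/input /user/hadoop/wc-out
- $ hadoop fs -cat /user/hadoop/wc-out/part-* | head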
2. Installing R
R can be installed with apt-get:
- $ sudo apt-get install r-base
Check the R version information:
- $ R --version
- R version 2.14.1 (2011-12-22)
- Copyright (C) 2011 The R Foundation for Statistical Computing
- ISBN 3-900051-07-0
- Platform: x86_64-pc-linux-gnu (64-bit)
- R is free software and comes with ABSOLUTELY NO WARRANTY.
- You are welcome to redistribute it under the terms of the
- GNU General Public License version 2.
- For more information about these matters see
- http://www.gnu.org/licenses/.
This installs R 2.14; the latest version at the moment is R 3.0. To install the latest version, update the apt source list:
- $ sudo apt-get remove r-base-core # Uninstall old R
- $ sudo vim /etc/apt/sources.list # Adding deb to sources.list
- # add this line
- deb http://cran.r-project.org/bin/linux/ubuntu precise/ # precise is your ubuntu release name (may be different)
- $ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9 # Add key to sign CRAN packages
- $ sudo apt-get update
- $ sudo apt-get upgrade
- $ sudo apt-get install r-base
See here for the Ubuntu release names.
Check the version again:
- $ R --version
- R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
- Copyright (C) 2013 The R Foundation for Statistical Computing
- Platform: x86_64-pc-linux-gnu (64-bit)
- R is free software and comes with ABSOLUTELY NO WARRANTY.
- You are welcome to redistribute it under the terms of the
- GNU General Public License versions 2 or 3.
- For more information about these matters see
- http://www.gnu.org/licenses/.
At this point R is installed. Since I am using Ubuntu Server, the R discussed here is the command-line R, not RStudio and not Revolution R.
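Keep in mind that the streaming map and reduce tasks later run R on every tasktracker, so R (and, later, the R packages) must be present on every node, not just the master. A quick check across all 14 machines (my sketch, with the same hostname assumptions as before):
- $ for h in serv{07..20}; do ssh "$h" 'R --version | head -n 1'; done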
3. Installing RHadoop
Installing so-called RHadoop does not mean installing another separate piece of software; it means installing the R packages rhdfs, rmr2, and rhbase.
As the names suggest, rhdfs is the R interface to HDFS, rmr2 corresponds to MapReduce, and rhbase to HBase. I do not use HBase here, so I will skip the rhbase installation for now and add it later if I ever need it.
- Install the dependency libraries:
- $ sudo apt-get install git libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev
- Configure Java for R:
- $ sudo R CMD javareconf
Note: this requires root privileges, otherwise you will get Permission Denied.
- Set the environment variables:
As mentioned earlier, there are two ways: export them directly, or edit the .bashrc file.
- export HADOOP_CMD=$HADOOP_HOME/bin/hadoop
- export HADOOP_STREAMING=$HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar
- export LD_LIBRARY_PATH=/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server
The other way is to set them inside R. Be careful about permissions here: if you edit the hadoop user's .bashrc or export the variables as the hadoop user, they only apply to the hadoop user, but the next step requires root, so you may need to set them again inside R. Use the following commands to set the variables in R and to check whether they are set:
- > Sys.setenv("HADOOP_CMD"="/home/hadoop/hadoop-1.2.1/bin/hadoop")
- > Sys.setenv("HADOOP_STREAMING"="/home/hadoop/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar")
- > Sys.setenv("LD_LIBRARY_PATH"="/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server")
- > Sys.getenv("HADOOP_CMD")
- Install the R package dependencies:
Start R with root privileges here, so that these packages get installed into the default system R library, /usr/lib/R/library/:
- $ sudo R
- > install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))
If you do not, R will ask whether to use a personal library instead:
- > install.packages("Rcpp")
- Installing package into ‘/usr/local/lib/R/site-library’
- (as ‘lib’ is unspecified)
- Warning in install.packages("Rcpp") :
- 'lib = "/usr/local/lib/R/site-library"' is not writable
- Would you like to use a personal library instead? (y/n) y
- Would you like to create a personal library
- ~/R/x86_64-pc-linux-gnu-library/3.0
- to install packages into? (y/n)
As shown above, if you use a personal library, R creates an R directory under your $HOME, and the package does install and load successfully that way, as the library(rmr2) output below shows. However, I found that running RHadoop afterwards then fails with errors like the java.lang.RuntimeException that follows it (also visible in the web UI):
- > library(rmr2)
- Loading required package: Rcpp
- Loading required package: RJSONIO
- Loading required package: bitops
- Loading required package: digest
- Loading required package: functional
- Loading required package: stringr
- Loading required package: plyr
- Loading required package: reshape2
- java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
- at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
- at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
- at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
- at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
- at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
- at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
- at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
- at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
- at java.security.AccessController.doPrivileged(Native Method)
- at javax.security.auth.Subject.doAs(Subject.java:415)
- at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
- at org.apache.hadoop.mapred.Child.main(Child.java:249)
or you see a message like this on the command line:
- 13/08/21 18:30:25 ERROR streaming.StreamJob: Job not Successful!
- Streaming Command Failed!
- Error in mr(map = map, reduce = reduce, combine = combine, in.folder = if (is.list(input)) { :
- hadoop streaming failed with error code 1
Both of these messages are only generic Hadoop and Java errors and do not state the real cause. The real cause has to be dug out of the tasktracker stderr log shown below: the package was installed into the personal library, so R knows where to find it, but Hadoop only looks in the default R system directory, /usr/local/lib/R/. A simple fix is to copy the package from the personal library into the system directory (see the sketch after the log).
- $ vim $HADOOP_HOME/logs/userlogs/job_201310221441_0002/attempt_201310221441_0002_m_000000_1/stderr
- Error in loadNamespace(name) : there is no package called ‘Rcpp’
- Calls: <Anonymous> ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
- Execution halted
- java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
- at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
- at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
- at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
- at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
- at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
- at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
- at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
- at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
- at java.security.AccessController.doPrivileged(Native Method)
- at javax.security.auth.Subject.doAs(Subject.java:415)
- at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
- at org.apache.hadoop.mapred.Child.main(Child.java:249)
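The copy itself is a one-liner; here is a sketch using the personal and system library paths shown above (adjust the R version in the path if yours differs):
- $ sudo cp -r ~/R/x86_64-pc-linux-gnu-library/3.0/Rcpp /usr/lib/R/library/   # or /usr/local/lib/R/site-library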
- Install rmr2 and rhdfs:
These packages are not on CRAN, so you have to download them yourself from https://github.com/RevolutionAnalytics/RHadoop/wiki. Once downloaded, install them with:
- > install.packages('/home/hadoop/rmr2_2.3.0.tar.gz',repo=NULL,type="source")
- > install.packages('/home/hadoop/rhdfs_1.0.7.tar.gz',repo=NULL,type="source")
If you ran everything above as root, the installation will go through without problems, but when you run RHadoop you will hit a Permission Denied error:
- org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="":
- hadoop:supergroup:rwxr-xr-x
That is because root is not a user of my Hadoop cluster. There are two fixes: either relax the Hadoop permission checks by adding the following to conf/hdfs-site.xml:
- <property>
- <name>dfs.permissions</name>
- <value>false</value>
- </property>
or run R as the hadoop user. Doing that, however, can produce this error:
- > library(rhdfs)
- Loading required package: rJava
- Error : .onLoad failed in loadNamespace() for 'rJava', details:
- call: dyn.load(file, DLLpath = DLLpath, ...)
- error: unable to load shared object '/usr/lib/R/site-library/rJava/libs/rJava.so':
- libjvm.so: cannot open shared object file: No such file or directory
- Error: package ‘rJava’ could not be loaded
which happens because I had not yet set LD_LIBRARY_PATH (needed for rJava.so to find libjvm.so); if you set LD_LIBRARY_PATH as described earlier, this problem does not occur.
If you see the following error:
- > library("rhdfs")
- Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
- call: fun(libname, pkgname)
- error: Environment variable HADOOP_CMD must be set before loading package rhdfs
- Error: package/namespace load failed for ‘rhdfs’
it means HADOOP_CMD is not set for the current R user; as I mentioned earlier, if you run sudo R the user is root, and I had only set the environment variables for the hadoop user.
If both rmr2 and rhdfs load without problems, RHadoop is installed; a quick non-interactive check is sketched below.
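A quick non-interactive way to confirm this for the hadoop user (a sketch; it assumes the environment variables above are set in that shell):
- $ Rscript -e 'library(rmr2); library(rhdfs); hdfs.init()'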
- Running an RHadoop program:
- > library(rmr2)
- Loading required package: Rcpp
- Loading required package: RJSONIO
- Loading required package: bitops
- Loading required package: digest
- Loading required package: functional
- Loading required package: stringr
- Loading required package: plyr
- Loading required package: reshape2
- > library(rhdfs)
- Loading required package: rJava
- HADOOP_CMD=/home/hadoop/hadoop-1.2.1/bin/hadoop
- Be sure to run hdfs.init()
- > map <- function(k,lines) {
- + words.list <- strsplit(lines, '\\s')
- + words <- unlist(words.list)
- + return( keyval(words, 1) )
- + }
- > reduce <- function(word, counts) {
- + keyval(word, sum(counts))
- + }
- > wordcount <- function (input, output=NULL) {
- + mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
- + }
- > hdfs.data <- '/user/hadoop/input'
- > hdfs.out <- '/user/hadoop/output'
- > out <- wordcount(hdfs.data, hdfs.out)
- packageJobJar: [/tmp/RtmpbgF5IT/rmr-local-env6eb71e7e6c51, /tmp/RtmpbgF5IT/rmr-global-env6eb74c4d3f75, /tmp/RtmpbgF5IT/rmr-streaming-map6eb75e339403, /tmp/RtmpbgF5IT/rmr-streaming-reduce6eb711a6cae0, /data/hadoop/hadoop-unjar7453012150899590081/] [] /tmp/streamjob5631425967008143655.jar tmpDir=null
- 13/10/23 10:31:26 INFO util.NativeCodeLoader: Loaded the native-hadoop library
- 13/10/23 10:31:26 WARN snappy.LoadSnappy: Snappy native library not loaded
- 13/10/23 10:31:26 INFO mapred.FileInputFormat: Total input paths to process : 3
- 13/10/23 10:31:26 INFO streaming.StreamJob: getLocalDirs(): [/data/hadoop/dfs/local]
- 13/10/23 10:31:26 INFO streaming.StreamJob: Running job: job_201310221441_0004
- 13/10/23 10:31:26 INFO streaming.StreamJob: To kill this job, run:
- 13/10/23 10:31:26 INFO streaming.StreamJob: /home/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -Dmapred.job.tracker=serv20:54311 -kill job_201310221441_0004
- 13/10/23 10:31:26 INFO streaming.StreamJob: Tracking URL: http://serv20:50030/jobdetails.jsp?jobid=job_201310221441_0004
- 13/10/23 10:31:27 INFO streaming.StreamJob: map 0% reduce 0%
- 13/10/23 10:31:32 INFO streaming.StreamJob: map 33% reduce 0%
- 13/10/23 10:31:33 INFO streaming.StreamJob: map 67% reduce 0%
- 13/10/23 10:31:34 INFO streaming.StreamJob: map 100% reduce 0%
- 13/10/23 10:31:39 INFO streaming.StreamJob: map 100% reduce 33%
- 13/10/23 10:31:42 INFO streaming.StreamJob: map 100% reduce 82%
- 13/10/23 10:31:45 INFO streaming.StreamJob: map 100% reduce 94%
- 13/10/23 10:31:48 INFO streaming.StreamJob: map 100% reduce 100%
- 13/10/23 10:31:51 INFO streaming.StreamJob: Job complete: job_201310221441_0004
- 13/10/23 10:31:51 INFO streaming.StreamJob: Output: /user/hadoop/output
- > results <- from.dfs(out)
- > results.df <- as.data.frame(results, stringsAsFactors=F)
- > colnames(results.df) <- c('word', 'count')
- > head(results.df)
- word count
- 1 17259
- 2 % 2
- 3 & 21
- 4 ( 1
- 5 ) 3
- 6 * 90