Hadoop pseudo-distributed mode: a test run


  In pseudo-distributed mode, data is read from HDFS. To use HDFS, first create a user directory in it:

hdfs dfs -mkdir -p /user/hadoop

  Next, copy local files into HDFS to serve as job input; for example, copy Hadoop's XML configuration files into /user/hadoop/input. (Relative HDFS paths like input resolve under the user directory, i.e. /user/hadoop/input.)

hdfs dfs -mkdir input
hdfs dfs -put ./etc/hadoop/*.xml input

  Running a MapReduce job in pseudo-distributed mode works the same way as in standalone mode, except that input and output go through HDFS.

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha4.jar wordcount input output

  In theory that should be it: wait for the job to finish and look for the results under output. But instead it failed with this error:

2017-11-03 16:56:34,091 INFO mapreduce.Job: Job job_1509699271441_0001 failed with state FAILED due to: Application application_1509699271441_0001 failed 2 times due to AM Container for appattempt_1509699271441_0001_000002 exited with exitCode: 1
Failing this attempt.
Diagnostics: [2017-11-03 16:56:33.411]Exception from container-launch.
Container id: container_1509699271441_0001_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:994)
    at org.apache.hadoop.util.Shell.run(Shell.java:887)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:295)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:455)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:275)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:90)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

  The fix, following the official docs, was to add the following configuration to mapred-site.xml and yarn-site.xml respectively:

<property>
    <name>mapreduce.application.classpath</name>
    <value>
        /usr/local/hadoop/share/hadoop/mapreduce/*,
        /usr/local/hadoop/share/hadoop/mapreduce/lib/*
    </value>
</property>
<property>
    <name>yarn.application.classpath</name>
    <value>
        /usr/local/hadoop/etc/hadoop,
        /usr/local/hadoop/share/hadoop/common/*,
        /usr/local/hadoop/share/hadoop/common/lib/*,
        /usr/local/hadoop/share/hadoop/hdfs/*,
        /usr/local/hadoop/share/hadoop/hdfs/lib/*,
        /usr/local/hadoop/share/hadoop/yarn/*,
        /usr/local/hadoop/share/hadoop/yarn/lib/*
    </value>
</property>

  Don't use variables like $HADOOP_HOME the way the official docs do; write absolute paths instead. I tried the variable form many times and it didn't work.

  After restarting with this configuration, the next pitfall: the MapReduce job appeared to run normally, but when it finished, the output directory was empty. The logs showed this:
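(As an aside: a likely reason the variable form fails is that the NodeManager only passes whitelisted environment variables into launched containers. If you do want to keep variables in the classpath values, the Hadoop 3 single-node docs pair them with the yarn.nodemanager.env-whitelist property in yarn-site.xml. The sketch below is an assumption based on that documented approach, not something tested in this walkthrough; the exact variable list may vary by version.)

```xml
<!-- yarn-site.xml: let containers inherit these variables from the NodeManager,
     so values like $HADOOP_MAPRED_HOME in mapreduce.application.classpath expand -->
<property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME,HADOOP_HOME</value>
</property>
```

Absolute paths sidestep the whole question, which is why they worked here.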

2017-11-03 21:25:37,004 INFO mapreduce.Job: Task Id : attempt_1509715385291_0001_m_000005_2, Status : FAILED
[2017-11-03 21:25:34.662] Container [pid=17753,containerID=container_1509715385291_0001_01_000018] is running beyond virtual memory limits. Current usage: 126.4 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.

  A quick search suggests the container exceeded its virtual memory limit, so YARN killed it. The check is controlled by yarn.nodemanager.vmem-check-enabled in yarn-site.xml, which defaults to true; add a property setting it to false to disable the check:

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
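(Instead of disabling the check outright, you can also raise the limit. The 2.1 GB ceiling in the log is the 1 GB physical allocation multiplied by yarn.nodemanager.vmem-pmem-ratio, which defaults to 2.1. A sketch of that alternative, with 4 as an arbitrary example value:)

```xml
<!-- yarn-site.xml: allow 4x virtual memory per unit of physical memory
     (default ratio is 2.1; 4 here is just an illustrative value) -->
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
</property>
```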

  OK, with that in place, wordcount finally ran to completion. View the output:

hdfs dfs -cat output/*

You can also fetch it from HDFS to the local filesystem:

hdfs dfs -get output ./output

Done!