Setting Up a Hadoop Development Environment on Windows and Submitting Jobs Remotely


This article explains how to set up a Hadoop development environment and describes in detail how to develop Hadoop MapReduce programs in IntelliJ IDEA and submit them to a remote cluster.
Prerequisites:

  • Download Hadoop to the local machine. It does not need to be configured or installed, but HADOOP_HOME, JAVA_HOME, etc. must be set.
  • Download winutils and extract it into the $HADOOP_HOME/bin directory.
  • If the cluster configuration refers to nodes by hostname, add the cluster hostname mappings to your local hosts file (optional, but less convenient without them; see the example below).
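
For instance (a sketch using this article's cluster address and the Hadoop path that appears later in the error logs; substitute your own values), C:\Windows\System32\drivers\etc\hosts would gain a line such as:

192.168.89.135 master

and the environment variables would look something like:

HADOOP_HOME=E:\hadoop-2.8.0
PATH=%PATH%;%HADOOP_HOME%\bin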

Method 1: Maven project

1. Creating a Maven project in IntelliJ IDEA needs no elaboration; just create one.
2. Configure the pom.xml file. Once it is filled in, IDEA automatically downloads and imports the jars.

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>2.8.0</version>
    </dependency>
</dependencies>
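
All seven Hadoop artifacts must stay on the same release (2.8.0 here). One optional refactoring, not from the original, is to centralize the version in a Maven property and reference it as <version>${hadoop.version}</version> in each dependency:

<properties>
    <hadoop.version>2.8.0</hadoop.version>
</properties>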

Method 2: Plain Java project

1. Create a Java project in IntelliJ IDEA.

2. Add the dependencies manually, e.g. via File => Project Structure => Libraries, pointing at the jars shipped under $HADOOP_HOME/share/hadoop and its subdirectories.


After the import succeeds, the jars appear under External Libraries.


3. Write the code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {

    // Emits (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts for each word; also used as the combiner.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Deletes a directory on the FileSystem configured in conf, if it exists.
    private static void deleteDir(Configuration conf, String dirPath) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path targetPath = new Path(dirPath);
        if (fs.exists(targetPath)) {
            boolean delResult = fs.delete(targetPath, true);
            if (delResult) {
                System.out.println(targetPath + " has been deleted successfully.");
            } else {
                System.out.println(targetPath + " deletion failed.");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        /*
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        // delete the output directory first
        deleteDir(conf, otherArgs[otherArgs.length - 1]);
        */
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The program counts how many times each word occurs across all files in the directory given by the first argument.
The results are written to the directory given by the second argument, which is created automatically; make sure it does not exist before the run (or clear it programmatically, as sketched below).
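
The WordCount class above already includes a deleteDir helper for exactly this, used in the commented-out block in main. Uncommenting that block, or calling the helper directly as in this sketch, clears the output directory before each run:

// remove any previous output directory so reruns do not fail
deleteDir(conf, args[args.length - 1]);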

4、编辑configuration
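In Run => Edit Configurations, set the main class to WordCount and supply the input and output paths as Program arguments, e.g. (these local paths are illustrative):

E:\hadoop\input E:\hadoop\output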

5. Run the program; it should complete successfully.


Remote configuration

Create a Resource directory and mark it as the project's Resources root.


Add a core-site.xml file to the Resource directory:


<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.89.135:9000</value>
    </property>
</configuration>

It can be copied directly from the Hadoop configuration files on the cluster.

Modify the run configuration:
change the input and output paths to remote HDFS addresses.
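For example (the host and port match the fs.defaultFS value above; the /input and /output paths are illustrative):

hdfs://192.168.89.135:9000/input hdfs://192.168.89.135:9000/output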

Local submission

If your Hadoop and IDEA are on the same machine, you can choose local submission.
1. Copy core-site.xml and log4j.properties into the project's source root, so that after compilation both files can be found in the class output directory. Why? Because a job submitted directly from IDEA loads the configuration files from the classpath. Without log4j.properties you get a "log4j not initialized" warning and no job progress is printed; likewise, without core-site.xml on the classpath you run into HDFS permission errors and the like.
2. Run the class's main method directly in IDEA; the job is submitted to the local pseudo-distributed Hadoop installation, and the code can be debugged.
3. Note: although mapred-site.xml in the Hadoop configuration specifies YARN as the framework, debugging shows that job submission actually goes through LocalJobRunner, not YARN. There are two reasons:
Reason 1: mapred-site.xml and yarn-site.xml must also be placed in the resource folder.
Reason 2: the program must be packaged into a jar before it can be submitted remotely; see the next section, Remote submission.

Remote submission

If your Hadoop is a cluster or lives on a different server from IDEA, you can choose remote submission; in hadoop-2.8.0, scheduling goes through YARN.
1. Put core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, log4j.properties, etc., into the resource directory. If these files are not added, the corresponding settings must be made in code:

conf.set("mapreduce.job.jar", "E:\\hadoop\\myhadoop\\out\\artifacts\\wordcount\\wordcount.jar");//指定Jar包,也可以在job中设置conf.set("mapreduce.framework.name", "yarn");//以yarn形式提交conf.set("yarn.resourcemanager.hostname", "master");conf.set("mapreduce.app-submission.cross-platform", "true");//跨平台提交

If the cluster restricts HDFS access, e.g. only a designated user xxx is allowed in, you can set the user inside the program (do this before any FileSystem or Job object is created):

System.setProperty("HADOOP_USER_NAME", "xxx")

2. Package the project first, using Maven or IDEA's artifact build.

  • Maven:
mvn package
  • IDEA artifact build:

    Because the cluster already has the Hadoop environment, there is no need to bundle dependencies into the jar; choose Empty. This also keeps the Build fast while debugging.

    Project Structure => Artifacts => click + in the top left => Empty => Output Layout + => Module Output => select the project folder => click the jar and set the Main Class.

3. Set job.setJar in the program code (this matches the mapreduce.job.jar setting shown above):

job.setJar("E:\\hadoop\\myhadoop\\out\\artifacts\\wordcount\\wordcount.jar");

4. The program also talks to port 10020, the Hadoop job history service, which must be started on the server:

mr-jobhistory-daemon.sh start historyserver & # start the history server
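
Port 10020 is the default for mapreduce.jobhistory.address. If the cluster uses a different host or port, set it explicitly in mapred-site.xml (the hostname master is this article's example):

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
</property>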

5. Run the program in IDEA to submit the job. This submission method also lets you debug the source code from within IDEA.

6. Automatically deploy the jar to the cluster (optional)
Tools -> Deployment -> Configuration, click + in the top left, choose SFTP as the Type, then configure the server IP, deployment path, username, password, and so on. If you then enable automatic deployment, every change is uploaded to the server automatically; you can also right-click and choose Deployment => Upload to …

Common problems:

Problem 1:

Exception in thread "main" java.lang.RuntimeException: java.io.FileNotFoundException: Could not locate Hadoop executable: E:\hadoop-2.8.0\bin\winutils.exe -see https://wiki.apache.org/hadoop/WindowsProblems
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:716)
    at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:250)
    at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:267)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:771)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:515)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:555)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:533)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:313)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:133)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:146)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1359)
    at WordCount.main(WordCount.java:92)
Caused by: java.io.FileNotFoundException: Could not locate Hadoop executable: E:\hadoop-2.8.0\bin\winutils.exe -see https://wiki.apache.org/hadoop/WindowsProblems
    at org.apache.hadoop.util.Shell.getQualifiedBinInner(Shell.java:598)
    at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:572)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:669)
    at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:441)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:487)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
    at WordCount.main(WordCount.java:71)
Process finished with exit code 1

Solution: put winutils.exe in the $HADOOP_HOME/bin directory.

Problem 2:

2017-08-04 12:31:00,668 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-08-04 12:31:01,230 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1181)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2017-08-04 12:31:01,230 INFO  [main] jvm.JvmMetrics (JvmMetrics.java:init(79)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2017-08-04 12:31:01,495 WARN  [main] mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(171)) - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2017-08-04 12:31:01,542 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(289)) - Total input files to process : 1
2017-08-04 12:31:01,870 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(200)) - number of splits:1
2017-08-04 12:31:02,104 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(289)) - Submitting tokens for job: job_local1047774324_0001
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:606)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:958)
    at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:203)
    at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:190)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:124)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:314)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:377)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:151)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:116)
    at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:125)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:171)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:758)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:242)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
    at java.security.AccessController.doPrivileged(Native Method)
2017-08-04 12:31:02,167 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(251)) - Cleaning up the staging area file:/tmp/hadoop/mapred/staging/alex1047774324/.staging/job_local1047774324_0001
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1359)
    at WordCount.main(WordCount.java:92)

Solution: hadoop.dll is missing; put hadoop.dll in the $HADOOP_HOME/bin directory.

Problem 3:

2017-08-04 12:47:49,125 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(897)) - Retrying connect to server: master/192.168.89.135:9000. Already tried 0 time(s); maxRetries=45

Solution: Hadoop has not been started on the remote host. If it has, check whether firewalld.service and iptables.service have been shut down, as shown below.
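
For example, on a CentOS 7 host (assuming systemd; older iptables-based systems use the service command instead):

systemctl stop firewalld     # stop the firewall for this session
systemctl disable firewalld  # keep it off across reboots
service iptables stop        # older init-based systems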

Problem 4:

2017-11-29 21:10:22,214 INFO  [main] client.RMProxy (RMProxy.java:createRMProxy(123)) - Connecting to ResourceManager at master/192.168.89.136:8032
2017-11-29 21:10:23,259 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(289)) - Total input files to process : 1
2017-11-29 21:10:24,216 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(200)) - number of splits:1
2017-11-29 21:10:24,769 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(289)) - Submitting tokens for job: job_1511957984981_0007
2017-11-29 21:10:24,984 INFO  [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(296)) - Submitted application application_1511957984981_0007
2017-11-29 21:10:25,024 INFO  [main] mapreduce.Job (Job.java:submit(1345)) - The url to track the job: http://master:8088/proxy/application_1511957984981_0007/
2017-11-29 21:10:25,024 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1390)) - Running job: job_1511957984981_0007
2017-11-29 21:10:28,088 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1411)) - Job job_1511957984981_0007 running in uber mode : false
2017-11-29 21:10:28,090 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1418)) -  map 0% reduce 0%
2017-11-29 21:10:28,164 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1431)) - Job job_1511957984981_0007 failed with state FAILED due to: Application application_1511957984981_0007 failed 2 times due to AM Container for appattempt_1511957984981_0007_000002 exited with  exitCode: 1
Failing this attempt. Diagnostics: Exception from container-launch.
Container id: container_1511957984981_0007_02_000001
Exit code: 1
Exception message: /bin/bash: line 0: fg: no job control
Stack trace: ExitCodeException exitCode=1: /bin/bash: line 0: fg: no job control
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
For more detailed output, check the application tracking page: http://master:8088/cluster/app/application_1511957984981_0007 Then click on links to logs of each attempt.. Failing the application.
2017-11-29 21:10:28,199 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1436)) - Counters: 0
Process finished with exit code 1

This is caused by the cross-platform mismatch between Windows and the remote Linux cluster.

Solution: add the following to the code:

conf.set("mapreduce.app-submission.cross-platform", "true");