Submitting Jobs to a Hadoop Cluster from IntelliJ IDEA


1. Overview

  • Local environment: IntelliJ IDEA 15.0.2, jdk-7u65-windows-x64.exe, hadoop-2.6.1.tar.gz
  • For the cluster environment and its configuration, see: http://blog.csdn.net/qq_28039433/article/details/78147172
  • This article originally followed the setup in http://blog.csdn.net/uq_jin/article/details/52235121, but that configuration can only submit jobs to Hadoop running on the local machine. It was later combined with http://blog.csdn.net/u011654631/article/details/70037219 so that IDEA can submit jobs to the remote Hadoop cluster.

2. Configure the Local Hadoop Environment

2.1 Extract hadoop-2.6.1.tar.gz to any directory

I extract it to E:\java\hadoop-2.6.1 here.

2.2 Set the Hadoop environment variables

Note that HADOOP_USER_NAME must be set to the user name used on the Hadoop cluster; otherwise you will get org.apache.hadoop.security.AccessControlException. My cluster's user name is hadoop.

HADOOP_HOME=E:\java\hadoop-2.6.1
HADOOP_BIN_PATH=%HADOOP_HOME%\bin
HADOOP_PREFIX=%HADOOP_HOME%
HADOOP_USER_NAME=hadoop

Also append %HADOOP_HOME%\bin;%HADOOP_HOME%\sbin; to Path.
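If you prefer not to rely on a system-wide environment variable, the same user name can usually be set from code before the first HDFS or job call, since the Hadoop client also reads HADOOP_USER_NAME from the JVM system properties. A minimal sketch (the class name is just an example):

public class HadoopUserSetup {
    public static void main(String[] args) {
        // Must be set before any FileSystem/Job access, because the login user
        // is resolved only once; mirrors the HADOOP_USER_NAME variable above.
        System.setProperty("HADOOP_USER_NAME", "hadoop");
        System.out.println("HADOOP_USER_NAME = " + System.getProperty("HADOOP_USER_NAME"));
    }
}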

2.3 Configure host name mappings

Append the following three lines to the end of C:\Windows\System32\drivers\etc\hosts, identical to the /etc/hosts configuration on the CentOS 6.5 nodes:

192.168.48.101 hdp-node-01
192.168.48.102 hdp-node-02
192.168.48.103 hdp-node-03
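To confirm the mapping works on the Windows side before submitting any job, a small check like the following can be run (the host names are the three added above):

import java.net.InetAddress;

public class HostsCheck {
    public static void main(String[] args) throws Exception {
        String[] hosts = {"hdp-node-01", "hdp-node-02", "hdp-node-03"};
        for (String host : hosts) {
            // Throws UnknownHostException if the hosts-file mapping is missing.
            InetAddress addr = InetAddress.getByName(host);
            System.out.println(host + " -> " + addr.getHostAddress());
        }
    }
}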

3. Set Up the Project

JDK installation is not covered in detail here; the local JDK version should match the Hadoop cluster's JDK version as closely as possible.

3.1 Create a new Maven project

(Screenshots: creating a new Maven project in IDEA.)

3.2 Add dependencies to pom.xml

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.6.1</version>
    </dependency>
</dependencies>

After this, if the dependencies do not appear under External Libraries, the Event Log in the bottom-right corner shows "Maven projects need to be imported: Import Changes / Enable Auto-Import"; click Import Changes.

3.3 Add the configuration files

Copy core-site.xml, mapred-site.xml and yarn-site.xml from the Hadoop cluster into the resources directory unchanged. My configuration files are listed below.
core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hdp-node-01:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/apps/hadoop-2.6.1/tmp</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml

<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hdp-node-01</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Site specific YARN configuration properties -->
</configuration>

log4j.properties

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} | %-5.5p | %-16.16t | %-32.32c{1} | %-32.32C %4L | %m%n
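With these files under resources they end up on the classpath, and Configuration picks up fs.defaultFS and the YARN settings automatically. A minimal sketch to confirm the client can reach HDFS before running a real job (the /wordcount path is only an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck {
    public static void main(String[] args) throws Exception {
        // core-site.xml on the classpath is loaded automatically.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        // Example path; replace with any directory you expect to exist.
        System.out.println("/wordcount exists: " + fs.exists(new Path("/wordcount")));
        fs.close();
    }
}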

3.4 Write the program

WordCountMapper.java

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

WordCountReducer.java

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}

WordCountRunner.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.text.SimpleDateFormat;
import java.util.Date;

public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Run on the cluster via YARN rather than locally.
        config.set("mapreduce.framework.name", "yarn");
        // Cross-platform submission: without this, submitting from Windows fails with
        // "/bin/bash: line 0: fg: no job control". Many answers online suggest patching
        // YarnRunner.java, but setting this property is enough.
        config.set("mapreduce.app-submission.cross-platform", "true");
        // Path to the exported jar (see section 3.5).
        config.set("mapreduce.job.jar", "D:\\wordcount\\out\\artifacts\\wordcount_jar\\wordcount.jar");

        Job job = Job.getInstance(config);
        job.setJarByClass(WordCountRunner.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths on HDFS.
        FileInputFormat.setInputPaths(job, "hdfs://hdp-node-01:9000/wordcount/input/somewords.txt");
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy_MM_dd_HH_mm_ss");
        FileOutputFormat.setOutputPath(job, new Path("hdfs://hdp-node-01:9000/wordcount/output/"
                + simpleDateFormat.format(new Date(System.currentTimeMillis()))));

        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}

Note that mapreduce.job.jar must be set to the path of the exported jar.
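The input path hdfs://hdp-node-01:9000/wordcount/input/somewords.txt must also exist before the job is submitted. If it does not, a small upload sketch like the following can create it (the local path D:\somewords.txt is only an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class UploadInput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect as the cluster user "hadoop" (see section 2.2).
        FileSystem fs = FileSystem.get(URI.create("hdfs://hdp-node-01:9000"), conf, "hadoop");
        // Example local file; adjust to your own test data.
        fs.copyFromLocalFile(new Path("D:\\somewords.txt"),
                new Path("/wordcount/input/somewords.txt"));
        fs.close();
    }
}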

3.5 Export the jar

Click File -> Project Structure.
(Screenshots: configuring the jar artifact under Project Structure -> Artifacts.)
Make sure the Build on make option is checked. The mapreduce.job.jar path from 3.4 must share the same prefix as the Output directory shown here.
Finally, click Build -> Build Artifacts -> Build; an out directory is generated under the project root.

3.6 Run the program

Start the Hadoop cluster before running the program.
Download winutils.exe from http://download.csdn.net/detail/u010435203/9606355 and place it under hadoop/bin.
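If Hadoop still cannot find winutils.exe at runtime, the local Hadoop directory can also be pointed to from code before any other Hadoop call; Hadoop's Shell utility falls back to the hadoop.home.dir system property when HADOOP_HOME is not visible to the process. A minimal sketch, assuming the path from section 2.1:

public class WinutilsSetup {
    public static void main(String[] args) {
        // Equivalent to HADOOP_HOME from section 2.2; Hadoop looks for
        // bin\winutils.exe under this directory.
        System.setProperty("hadoop.home.dir", "E:\\java\\hadoop-2.6.1");
    }
}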

On success, the console output looks like this:

16:44:07,037 | WARN  | main             | NativeCodeLoader                 | che.hadoop.util.NativeCodeLoader   62 | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16:44:11,203 | INFO  | main             | RMProxy                          | pache.hadoop.yarn.client.RMProxy   98 | Connecting to ResourceManager at hdp-node-01/192.168.48.101:8032
16:44:13,785 | WARN  | main             | JobResourceUploader              | op.mapreduce.JobResourceUploader   64 | Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16:44:17,581 | INFO  | main             | FileInputFormat                  | reduce.lib.input.FileInputFormat  281 | Total input paths to process : 1
16:44:18,055 | INFO  | main             | JobSubmitter                     | he.hadoop.mapreduce.JobSubmitter  199 | number of splits:1
16:44:18,780 | INFO  | main             | JobSubmitter                     | he.hadoop.mapreduce.JobSubmitter  288 | Submitting tokens for job: job_1506933793385_0001
16:44:20,138 | INFO  | main             | YarnClientImpl                   | n.client.api.impl.YarnClientImpl  251 | Submitted application application_1506933793385_0001
16:44:20,307 | INFO  | main             | Job                              | org.apache.hadoop.mapreduce.Job  1301 | The url to track the job: http://hdp-node-01:8088/proxy/application_1506933793385_0001/
16:44:20,309 | INFO  | main             | Job                              | org.apache.hadoop.mapreduce.Job  1346 | Running job: job_1506933793385_0001
16:45:03,829 | INFO  | main             | Job                              | org.apache.hadoop.mapreduce.Job  1367 | Job job_1506933793385_0001 running in uber mode : false
16:45:03,852 | INFO  | main             | Job                              | org.apache.hadoop.mapreduce.Job  1374 |  map 0% reduce 0%
16:45:40,267 | INFO  | main             | Job                              | org.apache.hadoop.mapreduce.Job  1374 |  map 100% reduce 0%
16:46:08,081 | INFO  | main             | Job                              | org.apache.hadoop.mapreduce.Job  1374 |  map 100% reduce 100%
16:46:09,121 | INFO  | main             | Job                              | org.apache.hadoop.mapreduce.Job  1385 | Job job_1506933793385_0001 completed successfully
16:46:09,562 | INFO  | main             | Job                              | org.apache.hadoop.mapreduce.Job  1392 | Counters: 49
    File System Counters
        FILE: Number of bytes read=256
        FILE: Number of bytes written=212341
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=259
        HDFS: Number of bytes written=152
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=30792
        Total time spent by all reduces in occupied slots (ms)=24300
        Total time spent by all map tasks (ms)=30792
        Total time spent by all reduce tasks (ms)=24300
        Total vcore-seconds taken by all map tasks=30792
        Total vcore-seconds taken by all reduce tasks=24300
        Total megabyte-seconds taken by all map tasks=31531008
        Total megabyte-seconds taken by all reduce tasks=24883200
    Map-Reduce Framework
        Map input records=1
        Map output records=18
        Map output bytes=214
        Map output materialized bytes=256
        Input split bytes=118
        Combine input records=0
        Combine output records=0
        Reduce input groups=15
        Reduce shuffle bytes=256
        Reduce input records=18
        Reduce output records=15
        Spilled Records=36
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=533
        CPU time spent (ms)=5430
        Physical memory (bytes) snapshot=311525376
        Virtual memory (bytes) snapshot=1680896000
        Total committed heap usage (bytes)=136122368
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=141
    File Output Format Counters
        Bytes Written=152

Process finished with exit code 0

4. FAQ

4.1 Permission problems

Exception in thread "main" org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=dvqfq6prcjdsh4p\hadoop, access=WRITE, inode="hadoop":hadoop:supergroup:rwxr-xr-x

Add the following to hdfs-site.xml:

<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

Also add HADOOP_USER_NAME=hadoop to the environment variables; see section 2.2 for details.

4.2 Clock synchronization problems

Container launch failed for container_1506950816832_0005_01_000002 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired. current time is 1506954189368 found 1506953252362

Synchronize the clocks of the DataNodes with the NameNode. On each server run ntpdate time.nist.gov and confirm that the time is in sync.
It is best to also add the following line to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w

4.3 Jar path problems

Stack trace: ExitCodeException exitCode=1: /bin/bash: line 0: fg: no job control

The jar path is wrong; check the mapreduce.job.jar setting (see section 3.4).