Hadoop Programming (2): Setting Up the Development and Local Test Environment


As developers, we can set aside cluster deployment for the moment and focus first on the development environment. This article describes a way to run and debug MapReduce programs inside the IDE, so you can start writing big-data MapReduce code as quickly as possible.

Maven dependencies

Create a new Maven project following the standard layout; here is my pom:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.lanqiao</groupId>
  <artifactId>bigData</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>

  <properties>
    <!-- logger -->
    <slf4j-api.version>1.7.25</slf4j-api.version>
    <logback.version>1.2.3</logback.version>
    <java.version>1.8</java.version>
    <!-- hadoop -->
    <hadoop-core.version>1.2.1</hadoop-core.version>
    <hadoop.version>2.6.5</hadoop.version>
    <junit.version>4.12</junit.version>
  </properties>

  <dependencies>
    <!-- logger -->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>${slf4j-api.version}</version>
    </dependency>
    <!-- Hadoop main client artifact -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <!-- Unit test artifacts -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>${junit.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.assertj</groupId>
      <artifactId>assertj-core</artifactId>
      <version>3.6.2</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.mrunit</groupId>
      <artifactId>mrunit</artifactId>
      <version>1.1.0</version>
      <classifier>hadoop2</classifier>
      <scope>test</scope>
    </dependency>
    <!-- Hadoop test artifact for running mini clusters -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-minicluster</artifactId>
      <version>${hadoop.version}</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <!-- package resource files alongside the classes -->
    <resources>
      <resource>
        <directory>src/main/java</directory>
        <excludes>
          <exclude>**/*.java</exclude>
        </excludes>
      </resource>
      <resource>
        <directory>src/main/resources</directory>
      </resource>
    </resources>
    <plugins>
      <!-- compiler plugin -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <!-- jar plugin -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>org.lanqiao.mr.WordCount</mainClass>
              <!--<addClasspath>true</addClasspath>-->
              <!--<classpathPrefix>lib/</classpathPrefix>-->
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

For building MapReduce jobs, you only need to have the hadoop-client dependency, which contains all the Hadoop client-side classes needed to interact with HDFS and MapReduce. For running unit tests, we use junit, and for writing MapReduce tests, we use mrunit. The hadoop-minicluster library contains the “mini-” clusters that are useful for testing with Hadoop clusters running in a single JVM.
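
The hadoop-minicluster dependency is not exercised further in this article, but for reference, here is a minimal sketch of what an in-JVM HDFS test could look like. The class name, test method, and paths are made up for illustration; only the MiniDFSCluster builder API comes from the hadoop-minicluster artifact itself.

package org.lanqiao.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.Test;

import static org.assertj.core.api.Assertions.assertThat;

public class MiniClusterSmokeTest {
  @Test
  public void hdfsRoundTrip() throws Exception {
    Configuration conf = new Configuration();
    // start an in-process HDFS with a single DataNode
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
    try {
      FileSystem fs = cluster.getFileSystem();
      Path file = new Path("/smoke/hello.txt");
      // write a small file and read it back to verify the mini cluster works
      try (java.io.OutputStream out = fs.create(file)) {
        out.write("hello".getBytes("UTF-8"));
      }
      assertThat(fs.exists(file)).isTrue();
    } finally {
      cluster.shutdown();
    }
  }
}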

The WordCount program

TokenizerMapper

package org.lanqiao.mr;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.stream.Stream;

/**
 * A mapper extends the Mapper class;
 * the four type parameters are the types of the map function's input key, input value, output key and output value.
 */
public class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Hadoop defines its own serializable types in place of the Java primitives; IntWritable replaces int
  private final IntWritable one = new IntWritable(1);
  // Text replaces String
  private Text word = new Text();

  /* The data-preparation logic goes here.
   * With the default text input format, the framework passes <byte offset of the line, the line's text> to this method. */
  @Override
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // Split value (one line of text) into tokens and, using a Java 8 stream, write a <word, 1> pair to the context for each token.
    // After the map phase there is a sort-and-merge step that groups all values with the same key into one collection,
    // so the combined mapper output becomes <word, [1, 1, 1, 1, ...]>.
    Stream.of(value.toString().split("\\s|\\.|,|=")).forEach((e) -> {
      try {
        word.set(e);
        context.write(word, one);
      } catch (IOException | InterruptedException e1) {
        e1.printStackTrace();
      }
    });
  }
}

IntSumReducer

package org.lanqiao.mr;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.Iterator;

/**
 * A reducer extends the Reducer class.
 */
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private static Logger logger = LoggerFactory.getLogger(IntSumReducer.class);

  /**
   * The aggregation logic goes here.
   * @param key    a key output by the mappers
   * @param values the collection of values associated with that key
   */
  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context) throws IOException, InterruptedException {
    logger.debug("Counting word: " + key.toString());
    int sum = 0;
    // iterate over values and add each value (every value is 1 in this example) to sum
    for (Iterator<IntWritable> iter = values.iterator(); iter.hasNext(); ) {
      sum += iter.next().get();
    }
    logger.debug("Count: " + sum);
    // Hadoop expects an IntWritable rather than a plain int, so wrap sum in an IntWritable.
    // Each reduce() call handles a single key and writes its result to the Context.
    context.write(key, new IntWritable(sum));
  }
}

Writing unit tests for MapReduce

During development we do not want to start HDFS or a cluster, or submit jobs anywhere remote; we just want to test the algorithm logic. That is what unit tests are for.

MRUnit

MRUnit is a framework developed by Cloudera specifically for writing unit tests for Hadoop MapReduce programs.

It works with the classic org.apache.hadoop.mapred.* model from the 0.18.x releases as well as the new org.apache.hadoop.mapreduce.* model introduced in 0.20.x.

The official description is as follows:

MRUnit is a unit test library designed to facilitate easy integration between your MapReduce development process and standard development and testing tools such as JUnit. MRUnit contains mock objects that behave like classes you interact with during MapReduce execution (e.g., InputSplit and OutputCollector) as well as test harness “drivers” that test your program’s correctness while maintaining compliance with the MapReduce semantics. Mapper and Reducer implementations can be tested individually, as well as together to form a full MapReduce job.

Installing MRUnit

<dependency>
  <groupId>org.apache.mrunit</groupId>
  <artifactId>mrunit</artifactId>
  <version>1.1.0</version>
  <classifier>hadoop2</classifier>
  <scope>test</scope>
</dependency>

The unit test code

package org.lanqiao.mr;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.util.Arrays;

public class WordCountTest {
  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
  ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
  MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;

  @Before
  public void setUp() {
    final TokenizerMapper mapper = new TokenizerMapper();
    final IntSumReducer reducer = new IntSumReducer();
    mapDriver = MapDriver.newMapDriver(mapper);
    reduceDriver = ReduceDriver.newReduceDriver(reducer);
    mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer);
  }

  @Test
  public void testMapper() throws IOException {
    mapDriver
        .withInput(new LongWritable(0), new Text("zhangsan lisi zhangsan"))
        .withOutput(new Text("zhangsan"), new IntWritable(1))
        .withOutput(new Text("lisi"), new IntWritable(1))
        .withOutput(new Text("zhangsan"), new IntWritable(1))
        .runTest();
  }

  @Test
  public void testReducer() throws IOException {
    reduceDriver
        .withInput(new Text("zhangsan"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
        .withOutput(new Text("zhangsan"), new IntWritable(2))
        .runTest();
  }

  @Test
  public void testMapperReducer() throws IOException {
    mapReduceDriver
        .withInput(new LongWritable(0), new Text("zhangsan lisi zhangsan"))
        // .withInput(new LongWritable(1), new Text("hello  zhangsan"))
        .withOutput(new Text("lisi"), new IntWritable(1))
        .withOutput(new Text("zhangsan"), new IntWritable(2))
        .runTest();
  }
}

This is the complete test code for WordCount.
Run it through JUnit and you can quickly and conveniently test MapReduce logic locally, without any cluster environment.
The code is straightforward: withInput supplies the input records and withOutput sets the expected output. If the actual output does not match the expected output, the test case fails.
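
If you would rather assert on the output yourself, for example with AssertJ (already declared in the pom), the MRUnit drivers also offer run(), which returns the emitted pairs instead of comparing them internally. A minimal sketch that could be added to the WordCountTest class above; the test method name is made up, and it assumes a static import of org.assertj.core.api.Assertions.assertThat:

@Test
public void testMapperWithRun() throws IOException {
  // run() returns the emitted <key, value> pairs rather than checking them against withOutput()
  final java.util.List<org.apache.hadoop.mrunit.types.Pair<Text, IntWritable>> output =
      mapDriver
          .withInput(new LongWritable(0), new Text("zhangsan lisi zhangsan"))
          .run();
  assertThat(output).hasSize(3);
  assertThat(output.get(0).getFirst()).isEqualTo(new Text("zhangsan"));
}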

Limitations

Reading the MRUnit source code reveals that:

  1. The partitioning and sorting steps of the MapReduce framework are not supported: values emitted by the map side go through a simplified shuffle and are fed straight into the reduce side.
  2. MapReduce jobs implemented with Streaming are not supported.

Despite these limitations, MRUnit is sufficient for most needs.

References

http://www.cloudera.com/hadoop-mrunit

Small data sets: submitting a local job

WordCountDriver

package org.lanqiao.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class WordCountDriver extends Configured implements Tool {
  private static Logger logger = LoggerFactory.getLogger(WordCountDriver.class);

  public static void main(String[] args) throws Exception {
    final WordCountDriver driver = new WordCountDriver();
    Configuration conf = new Configuration();
    // configuration file
    conf.addResource("hadoop-local.xml");
    driver.setConf(conf);
    // input location; replace with a small sample data set
    Path in = new Path("src/main/resources/log4j.properties");
    // output location
    Path out = new Path("output");
    // delete the output directory first: Hadoop will not overwrite an existing directory and fails if it already exists
    FileSystem fs = FileSystem.get(conf);
    fs.delete(out, true);
    // run the job
    int exitCode = ToolRunner.run(driver, new String[]{in.toString(), out.toString()});
    System.exit(exitCode);
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // handle the arguments passed from main
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      ToolRunner.printGenericCommandUsage(System.err);
      System.exit(-1);
    }
    String jobName = "word count";
    // job settings
    Job job = Job.getInstance(conf, jobName); // replaces the deprecated new Job(conf, "word count")
    job.setJarByClass(getClass());
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
}

hadoop-local.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
  </property>
</configuration>

Notes

The main method uses a Configuration to control how the job runs, hard-codes the input and output paths, and finally submits the job with ToolRunner. The run method of the Tool interface is mainly responsible for building the Job instance.

Because the job is submitted locally, the input and output paths are both on the local file system, so you can inspect the result data in the output directory directly.
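
Because the driver is submitted through ToolRunner and uses GenericOptionsParser, the same configuration can also be supplied as generic -D options on the argument list instead of an XML resource; the parser strips them out before run() sees the remaining input and output paths. A minimal sketch, where the property values simply mirror hadoop-local.xml:

// -D options are consumed by GenericOptionsParser; run() only sees the remaining paths
int exitCode = ToolRunner.run(new WordCountDriver(), new String[]{
    "-D", "fs.defaultFS=file:///",
    "-D", "mapreduce.framework.name=local",
    "src/main/resources/log4j.properties", "output"});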

A unit test for the driver

The main method can easily be rewritten as a test method:

package org.lanqiao.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;
import org.junit.Test;

import static org.assertj.core.api.Assertions.assertThat;

public class WordCountDriverTest {
  @Test
  public void run() throws Exception {
    final WordCountDriver driver = new WordCountDriver();
    Configuration conf = new Configuration();
    // configuration file
    conf.addResource("hadoop-local.xml");
    driver.setConf(conf);
    // input location; replace with a small sample data set
    Path in = new Path("src/main/resources/log4j.properties");
    // output location
    Path out = new Path("output");
    // delete the output directory first: Hadoop will not overwrite an existing directory and fails if it already exists
    FileSystem fs = FileSystem.get(conf);
    fs.delete(out, true);
    // run the job
    int exitCode = ToolRunner.run(driver, new String[]{in.toString(), out.toString()});
    // assert on the exit code
    assertThat(exitCode).isEqualTo(0);
    // checkOutput(conf, output); // the output files could be verified against expectations here
  }
}

Small data sets: submitting the job to a (pseudo-distributed) cluster

As described in the previous part, start HDFS for the pseudo-distributed cluster.

Modifying the driver's unit test

    // the rest of the code stays the same
    // configuration file
    // conf.addResource("hadoop-local.xml");
    conf.addResource("hadoop-localhost.xml"); // connection info for the local pseudo-distributed cluster
    driver.setConf(conf);
    // input location; replace with a small sample data set
    // Path in = new Path("src/main/resources/log4j.properties");
    Path in = new Path("input/log4j.properties"); // path on HDFS
    // output location
    Path out = new Path("output");
    // the rest of the code is omitted ...

All we need to do is swap the configuration file for hadoop-localhost.xml and remember that the data set path now refers to the HDFS file system, so the data must be uploaded beforehand.
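
The upload can be done with the hdfs dfs -put command, or directly through the FileSystem API. A minimal sketch of the latter, using the same paths as the example; note that the relative path input resolves to the user's home directory on HDFS:

// copy the local sample file into the input/ directory on HDFS before running the job
Configuration conf = new Configuration();
conf.addResource("hadoop-localhost.xml");
FileSystem fs = FileSystem.get(conf);
fs.mkdirs(new Path("input"));
fs.copyFromLocalFile(new Path("src/main/resources/log4j.properties"),
                     new Path("input/log4j.properties"));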

The contents of hadoop-localhost.xml are as follows:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000/</value>
  </property>
</configuration>

The address and port must match the configuration we used when setting up the pseudo-distributed environment in the previous chapter.

Summary

Many beginners are put off at first by the complexity of the Hadoop environment, but in fact we need almost no environment at all to develop MapReduce programs.
With the hadoop-client API on the classpath we can write MapReduce programs, and with the MRUnit API we can unit test the Mapper and the Reducer.
Going one step further, we can use the local job runner to analyze a small local sample data set and inspect the results.
One step beyond that, we can submit the job to a cluster. This article, however, is written from the developer's point of view: in production there is no IDE and parameters are not hard-coded, so the job must be packaged and deployed, which we will explore in depth in the cluster part.
For now, being able to develop MapReduce programs is enough.
