Avro Notes: Avro in MapReduce Applications, and Google's protobuf

Source: Internet | Published by: 朗诵软件手机版 | Editor: 程序博客网 | Time: 2024/06/12 20:21

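The value after "namespace" corresponds to the Java package name.

Compile the avsc file.

Copy and paste the result into the Maven project.

The User class is generated by the compiler and can be instantiated and used directly.

For new File(), you can specify a path of your own; the code above uses a relative path.

Deserialization:

Without generated source code, serialize and deserialize directly through the Schema:

Integration with MR: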

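The Apache Avro framework provides:

Rich data types (primitive and complex)
A compact, fast binary file format (.avro)
A container file for storing Avro data
RPC
Easy integration with dynamic languages, with no code generation required. Code generation is an optimization that only pays off in statically typed languages.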
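
As a rough illustration of why the binary format is compact: the Avro specification writes int and long values with variable-length zig-zag coding, so values of small magnitude take a single byte. The sketch below is plain Java with no Avro dependency. The names ZigZagVarintSketch, encodeLong and hex are made up for this example and are not part of the Avro API.

```java
import java.io.ByteArrayOutputStream;

// Illustrative helper only; not part of the Avro API.
final class ZigZagVarintSketch {

    // Encodes a long the way the Avro specification describes:
    // zig-zag first, then 7 bits per byte with a continuation bit.
    static byte[] encodeLong(long n) {
        // Zig-zag maps signed to unsigned: 0->0, -1->1, 1->2, -2->3, ...
        long z = (n << 1) ^ (n >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Least significant 7-bit group first; high bit set means "more bytes follow".
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    // Renders bytes as lowercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (long n : new long[] {0, -1, 1, -64, 64}) {
            System.out.println(n + " -> " + hex(encodeLong(n)));
        }
    }
}
```

For the five sample values, the encodings are 00, 01, 02, 7f and 8001: only 64 needs a second byte.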

Below is an Avro MapReduce example: a MapReduce job that computes statistics over the data in an Avro file. The schema of the objects in the Avro file is:

    {"namespace": "me.lin.avro.mapreduce",
     "type": "record",
     "name": "User",
     "fields": [
         {"name": "name", "type": "string"},
         {"name": "favorite_number",  "type": ["int", "null"]},
         {"name": "favorite_color", "type": ["string", "null"]}
     ]
    }

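We first generate the data with code. Viewed with avro-tools, the generated Avro file looks like this:

[Figure: avro-tools view of the generated data]

We want to count, for a given file, how many people like each color. The result looks like this:

[Figure: number of people per favorite color]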

Creating the Maven project

We create the project with Maven. The POM file is:

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>me.lin.avro</groupId>
      <artifactId>avro-mapreduce</artifactId>
      <version>0.0.1-SNAPSHOT</version>
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
              <source>1.7</source>
              <target>1.7</target>
            </configuration>
          </plugin>
          <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <executions>
              <execution>
                <phase>package</phase>
                <goals>
                  <goal>single</goal>
                </goals>
              </execution>
            </executions>
            <configuration>
              <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
              </descriptorRefs>
            </configuration>
          </plugin>
          <plugin>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-maven-plugin</artifactId>
            <version>1.8.1</version>
            <executions>
              <execution>
                <phase>generate-sources</phase>
                <goals>
                  <goal>schema</goal>
                </goals>
                <configuration>
                  <sourceDirectory>${project.basedir}/../</sourceDirectory>
                  <outputDirectory>${project.basedir}/src/main/java</outputDirectory>
                </configuration>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
      <dependencies>
        <dependency>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro</artifactId>
          <version>1.8.1</version>
        </dependency>
        <dependency>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro-mapred</artifactId>
          <version>1.8.1</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>2.6.0</version>
        </dependency>
      </dependencies>
    </project>

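Generating the code

We generate the User class from the schema simply by running mvn package. The User class and its related inner classes then appear in the source directory:

[Figure: the generated User class in the source directory]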

Generating random data

Next, using the schema above, we generate random records of people who like different colors. The utility class is:

    package me.lin.avro.mapreduce;

    import java.io.File;
    import java.io.IOException;
    import java.util.Random;

    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.specific.SpecificDatumWriter;

    public class GenerateData {
        public static final String[] COLORS = {"red", "orange", "yellow", "green", "blue", "purple", null};
        public static final int USERS = 100;
        public static final String PATH = "./input/users.avro";

        public static void main(String[] args) throws IOException {
            // Open data file
            File file = new File(PATH);
            if (file.getParentFile() != null) {
                file.getParentFile().mkdirs();
            }
            DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
            DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
            dataFileWriter.create(User.SCHEMA$, file);

            // Create random users
            User user;
            Random random = new Random();
            for (int i = 0; i < USERS; i++) {
                user = new User("user", null, COLORS[random.nextInt(COLORS.length)]);
                dataFileWriter.append(user);
                System.out.println(user);
            }

            dataFileWriter.close();
        }
    }

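The number of records to generate is configurable; here we generate 100. The resulting users.avro file contains the data shown earlier.

Mapper definition

What makes this an Avro Mapper is mainly that Avro is used for its input and output data types. The code: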

    public static class ColorCountMapper extends
            Mapper<AvroKey<User>, NullWritable, Text, IntWritable> {

        @Override
        public void map(AvroKey<User> key, NullWritable value, Context context)
                throws IOException, InterruptedException {

            CharSequence color = key.datum().getFavoriteColor();
            if (color == null) {
                color = "none";
            }
            context.write(new Text(color.toString()), new IntWritable(1));
        }
    }

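This Mapper receives each record of the Avro file as the key; the value is empty (NullWritable). It emits the color together with a count of 1 to the Reducer.

Reducer definition

The Reducer counts the number of people for each color: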

    public static class ColorCountReducer
            extends
            Reducer<Text, IntWritable, AvroKey<CharSequence>, AvroValue<Integer>> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {

            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(new AvroKey<CharSequence>(key.toString()),
                    new AvroValue<Integer>(sum));
        }
    }

It sums the counts and then writes the color-count pairs to an Avro file.
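
To check the counting logic without a cluster, the map and reduce steps can be imitated in plain Java. This is only a sketch: the class ColorCountSketch and its method are invented for illustration, a list of strings stands in for the User records, and no Hadoop or Avro classes are involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the Mapper/Reducer pair; not the job itself.
final class ColorCountSketch {

    // Takes the favorite_color of each record (null allowed) and returns color -> count.
    static Map<String, Integer> countColors(List<String> favoriteColors) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String color : favoriteColors) {
            // "Map" step: a missing color becomes the key "none"; each record counts as 1.
            String key = (color == null) ? "none" : color;
            // "Reduce" step: add up the 1s that share the same key.
            Integer current = counts.get(key);
            counts.put(key, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("red", null, "blue", "red", null, "green");
        System.out.println(countColors(colors));
    }
}
```

With the six sample values in main, the result is blue=1, green=1, none=2, red=2.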

Job configuration

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err
                    .println("Usage: MapReduceColorCount <input path> <output path>");
            return -1;
        }

        Job job = new Job(getConf());
        job.setJarByClass(MapReduceColorCount.class);
        job.getConfiguration().setBoolean(Job.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);
        job.setJobName("Color Count");

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setInputFormatClass(AvroKeyInputFormat.class);
        job.setMapperClass(ColorCountMapper.class);
        AvroJob.setInputKeySchema(job, User.getClassSchema());
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
        job.setReducerClass(ColorCountReducer.class);
        AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
        AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.INT));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new MapReduceColorCount(), args);
        System.exit(res);
    }

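Packaging

Run mvn package; the target directory of the project then contains the packaged artifacts:

[Figure: packaged artifacts in the target directory]

For convenience we also bundle the dependencies into the jar, which is what the assembly plugin configuration does. Upload the jar with dependencies and the generated data to a client machine of the Hadoop cluster.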

Running

Upload the Avro data generated earlier to HDFS:

    hadoop fs -copyFromLocal /opt/job/users.avro /input/avro

Submit the job:

    hadoop jar avro-mapreduce-0.0.1-SNAPSHOT.jar  me.lin.avro.mapreduce.MapReduceColorCount /input/avro/users.avro /output/avro/mr

[Figure: output of the job run]

When the job was first run, the following error appeared:

[Figure: the error message]

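Investigation showed that the Avro version bundled with Hadoop differs from the one our code uses. Hadoop ships version 1.7.4, which has no createDatumWriter method. To make the job prefer our jar over Hadoop's (/share/mapreduce/lib/avro-1.7.4.jar) at run time, we configure the job to put user jars first on the classpath: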

    job.getConfiguration().setBoolean(Job.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);

The corresponding property name is mapreduce.job.user.classpath.first.

------------------------------------------------------------------------------------------------------------------------------------

In a MapReduce job, the framework guarantees that the keys a Reducer receives are sorted. We can use this to sort an Avro file.

Suppose we have the following schema:

    {"namespace": "me.lin.avro.mapreduce",
     "type": "record",
     "name": "User",
     "fields": [
         {"name": "name", "type": "string"},
         {"name": "favorite_number",  "type": ["int", "null"]},
         {"name": "favorite_color", "type": ["string", "null"]}
     ]
    }

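We now have an unsorted Avro file; a sample:

[Figure: sample of the unsorted data]

For how this data is generated, see the "Generating random data" section above.

The requirement is to sort by favorite_color.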

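One thing that sets Avro apart from other serialization frameworks is that the schema used for reading may differ from the one used for writing. A file written with the schema above can later be read with another, compatible schema. To sort, we add an order to the favorite_color field: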

    {"namespace": "me.lin.avro.mapreduce",
     "type": "record",
     "name": "User",
     "fields": [
         {"name": "name", "type": "string"},
         {"name": "favorite_number",  "type": ["int", "null"]},
         {"name": "favorite_color", "type": ["string", "null"] , "order":"descending"}
     ]
    }

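Note that order is set to descending, that is, reverse alphabetical order. We read the original, unordered file with this ordered schema and rely on the sorting that happens during the MapReduce shuffle to achieve the final sort. Avro compares records field by field in declaration order, so favorite_color decides the result here only because every generated record has the same name ("user") and a null favorite_number; for data in which those fields vary, they would need "order": "ignore" to be skipped.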
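
The ordering this schema implies can be sketched with an ordinary Comparator. This follows the sort-order rules as I read them in the Avro specification (records compare field by field, a union compares by branch position before value, and descending reverses that one field). It is not Avro's own comparator. The Rec class and all method names are hypothetical, and Java's String.compareTo matches Avro's byte-wise string order only for ASCII text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the User record; not the class generated by Avro.
final class UserOrderSketch {

    static final class Rec {
        final String name;
        final Integer favoriteNumber; // null models the "null" branch of ["int", "null"]
        final String favoriteColor;   // null models the "null" branch of ["string", "null"]

        Rec(String name, Integer favoriteNumber, String favoriteColor) {
            this.name = name;
            this.favoriteNumber = favoriteNumber;
            this.favoriteColor = favoriteColor;
        }
    }

    // Union [T, "null"]: branch 0 (a real value) sorts before branch 1 (null).
    static <T extends Comparable<T>> int compareUnion(T a, T b) {
        if (a == null || b == null) {
            return Integer.compare(a == null ? 1 : 0, b == null ? 1 : 0);
        }
        return a.compareTo(b);
    }

    // Fields are compared in declaration order; the first difference decides.
    static final Comparator<Rec> ORDER = new Comparator<Rec>() {
        @Override
        public int compare(Rec x, Rec y) {
            int c = x.name.compareTo(y.name);                      // ascending (default)
            if (c != 0) {
                return c;
            }
            c = compareUnion(x.favoriteNumber, y.favoriteNumber);  // ascending (default)
            if (c != 0) {
                return c;
            }
            return compareUnion(y.favoriteColor, x.favoriteColor); // descending: operands swapped
        }
    };

    // Sorts a copy of the records and returns their colors in sorted order.
    static List<String> sortedColors(List<Rec> records) {
        List<Rec> copy = new ArrayList<Rec>(records);
        Collections.sort(copy, ORDER);
        List<String> colors = new ArrayList<String>();
        for (Rec r : copy) {
            colors.add(r.favoriteColor);
        }
        return colors;
    }

    public static void main(String[] args) {
        List<Rec> recs = Arrays.asList(
                new Rec("user", null, "red"), new Rec("user", null, null),
                new Rec("user", null, "blue"), new Rec("user", null, "yellow"),
                new Rec("user", null, "green"));
        System.out.println(sortedColors(recs));
    }
}
```

With identical names and numbers, the colors come out as null, yellow, red, green, blue: under these rules the null branch sorts first when the field is descending. If the names differ, name decides first, so ("alice", blue) precedes ("bob", yellow) regardless of color.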

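Mapper and Reducer

The Mapper reads the records in the file (each arrives as the input key) and, for every key-value pair, writes a key-key pair to the Context. In other words, the input key is emitted as both the output key and the output value and passed on to the Reducer. The code: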

    public static class SortMapper<K> extends
            Mapper<AvroKey<K>, NullWritable, AvroKey<K>, AvroValue<K>> {

        @Override
        protected void map(AvroKey<K> key, NullWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(key, new AvroValue<K>(key.datum()));
        }
    }

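The Reducer receives key-value pairs grouped by key and already sorted by key. Because the Mapper emitted, for every key, a value equal to that key, the value list the Reducer gets for each key is in the right order as well, and writing the values straight to the Avro file completes the sort. The code: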
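
Why the record is also sent as the value becomes clearer in a small plain-Java model of the shuffle. The reduce step runs once per distinct key, so duplicates survive only through the value list. This is a sketch under simplifying assumptions: strings stand in for Avro records, a TreeMap with a reversed comparator stands in for the shuffle, and the class name ShuffleSortSketch is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of map -> shuffle -> reduce; no Hadoop or Avro involved.
final class ShuffleSortSketch {

    static List<String> sortViaShuffle(List<String> records) {
        // "Shuffle": group the map output by key and keep the keys ordered
        // (reverse order here, mirroring the descending sort).
        Map<String, List<String>> grouped =
                new TreeMap<String, List<String>>(Collections.<String>reverseOrder());
        for (String record : records) {
            // "Map": emit the record as both key and value.
            List<String> values = grouped.get(record);
            if (values == null) {
                values = new ArrayList<String>();
                grouped.put(record, values);
            }
            values.add(record);
        }
        // "Reduce": one call per distinct key; writing every value keeps duplicates.
        List<String> output = new ArrayList<String>();
        for (List<String> values : grouped.values()) {
            output.addAll(values);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(sortViaShuffle(Arrays.asList("red", "blue", "red", "green")));
    }
}
```

For the input red, blue, red, green the output is red, red, green, blue. Writing only the key in the reduce step would drop the second red.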

    public static class SortReducer<K> extends
            Reducer<AvroKey<K>, AvroValue<K>, AvroKey<K>, NullWritable> {

        @Override
        protected void reduce(AvroKey<K> key, Iterable<AvroValue<K>> values,
                Context context) throws IOException, InterruptedException {
            for (AvroValue<K> val : values) {
                context.write(new AvroKey<K>(val.datum()), NullWritable.get());
            }
        }
    }

Running the job

We run the job through Tool. The code:

    @Override
    public int run(String[] args) throws Exception {

        if (args.length != 3) {
            System.err
                    .printf("Usage: %s [generic options] <input> <output> <schema-file>\n",
                            getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }

        String input = args[0];
        String output = args[1];
        String schemaFile = args[2];

        @SuppressWarnings("deprecation")
        Job job = new Job(getConf());
        job.setJarByClass(getClass());

        job.getConfiguration().setBoolean(
                Job.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);

        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        AvroJob.setDataModelClass(job, GenericData.class);

        Schema schema = new Schema.Parser().parse(new File(schemaFile));
        AvroJob.setInputKeySchema(job, schema);
        AvroJob.setMapOutputKeySchema(job, schema);
        AvroJob.setMapOutputValueSchema(job, schema);
        AvroJob.setOutputKeySchema(job, schema);

        job.setInputFormatClass(AvroKeyInputFormat.class);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);

        job.setOutputKeyClass(AvroKey.class);
        job.setOutputValueClass(NullWritable.class);

        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String... args) throws Exception {
        int exitCode = ToolRunner.run(new AvroSort(), args);
        System.exit(exitCode);
    }

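All of the code lives in AvroSort.java.

Package the project, prepare the file to be sorted and the schema that carries the order attribute, then run the job: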

    hadoop jar  avro-mapreduce-0.0.1-SNAPSHOT-jar-with-dependencies.jar  me.lin.avro.mapreduce.AvroSort /input/avro/users.avro /output/avro/mr-sort ./user.avsc

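The entry class AvroSort takes three arguments: the input file, the output directory, and the schema file. Note that the schema file path refers to the local machine, not to HDFS. The run looks like this:

[Figure: output of the sort job run]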

Use the tojson tool from the avro-tools jar to view the sorted result:

    java -jar avro-tools-1.8.1.jar tojson part-r-00000-sort.avro

[Figure: tojson output of the sorted file]

As you can see, the records are sorted by favorite_color in descending order.

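protobuf (what is transmitted is the field number):

Download it, unpack it, and add it to the PATH environment variable.

The dot is the output path; what follows it is the path of the file.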
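
The remark that protobuf transmits field numbers can be made concrete with the wire format. Each field is preceded by a tag computed as (field_number << 3) | wire_type, so field names never appear in the bytes. The sketch below encodes a single varint field by hand in plain Java. The class ProtoTagSketch and its methods are invented for illustration, and real code would use the classes generated by protoc.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only; not part of the protobuf library.
final class ProtoTagSketch {

    // Base-128 varint: 7 bits per byte, least significant group first,
    // high bit set on every byte except the last.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Encodes one field of wire type 0 (varint) holding a non-negative value.
    static byte[] encodeVarintField(int fieldNumber, long value) {
        final int wireTypeVarint = 0;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, ((long) fieldNumber << 3) | wireTypeVarint); // tag: field number + wire type
        writeVarint(out, value);                                      // payload
        return out.toByteArray();
    }

    // Renders bytes as lowercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Field number 1 with value 150.
        System.out.println(hex(encodeVarintField(1, 150)));
    }
}
```

Field number 1 with value 150 encodes to the three bytes 08 96 01: one tag byte carrying the field number, then the two-byte varint for 150.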
