5.2 Avro Serialization


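  Avro is a popular serialization framework. Its main features are:

  • It supports serialization of many kinds of data structures.
  • It supports many programming languages, serializes quickly, and produces compact bytes.
  • Code generation is optional. Data can be read and written, or sent over RPC, without generating any classes or code.

  Avro reads and writes data according to a schema. The schema makes it possible to identify serialized objects compactly. Java serialization writes the object's type metadata into the serialized byte stream; because an Avro schema describes the data itself, Avro can avoid doing so. Schemas are written in JavaScript Object Notation (JSON), an object notation that is popular in web programming. When a schema changes, Avro copes by keeping both the old and the new schema available while the data is processed.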
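  To make the point about Java serialization concrete, the following sketch serializes a small object with the JDK's own ObjectOutputStream and checks that the class name and field names end up inside the byte stream. It uses only the JDK; the class JavaSerializationMetadataSketch and the nested CountryPojo type are hypothetical names introduced for this illustration and are not part of the original example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; none of these names come from the original text.
class JavaSerializationMetadataSketch {

    // A tiny serializable value type with two string fields.
    static class CountryPojo implements Serializable {
        private static final long serialVersionUID = 1L;
        final String countryCode;
        final String countryName;

        CountryPojo(String countryCode, String countryName) {
            this.countryCode = countryCode;
            this.countryName = countryName;
        }
    }

    // Serializes one object with standard Java serialization and returns the raw bytes.
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Returns true if the byte stream contains the given ASCII text.
    static boolean containsText(byte[] bytes, String text) {
        // ISO-8859-1 maps every byte to exactly one char, so the search is lossless.
        return new String(bytes, StandardCharsets.ISO_8859_1).contains(text);
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new CountryPojo("ad", "Andorra"));
        int fieldData = "ad".length() + "Andorra".length();
        System.out.println("serialized size: " + bytes.length
                + " bytes for " + fieldData + " bytes of field data");
        System.out.println("class name in stream: " + containsText(bytes, "CountryPojo"));
        System.out.println("field name in stream: " + containsText(bytes, "countryName"));
    }
}
```

  The stream carries the class descriptor alongside the two values, which is the overhead that a schema kept outside the record lets Avro avoid.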

  Below are two Avro schema files. The first is the schema for worldcitiespop.txt, and the second is the schema for countrycodes.txt:
worldcitiespop.avschema

{"namespace": "MasteringHadoop.avro",
 "type": "record",
 "name": "City",
 "fields": [
     {"name": "countryCode", "type": "string"},
     {"name": "cityName",  "type": "string"},
     {"name": "cityFullName", "type": "string"},
     {"name": "regionCode", "type": ["int","null"]},
     {"name": "population", "type": ["long", "null"]},
     {"name": "latitude", "type": ["float", "null"]},
     {"name": "longitude", "type": ["float", "null"]}
 ]
}

allcountries.avschema

{"namespace": "MasteringHadoop.avro",
 "type": "record",
 "name": "Country",
 "fields": [
     {"name": "countryCode", "type": "string"},
     {"name": "countryName",  "type": "string"}
 ]
}

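  The schemas are self-describing, and the JSON notation keeps them readable. Avro supports all the standard primitive data types as well as complex types such as unions. A nullable field is a union of null and the field's own type. Syntactically, a union is written as a JSON array.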
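  As a rough illustration of what a nullable union costs on the wire, the sketch below hand-encodes a value of the ["long", "null"] union following the binary encoding described in the Avro specification: the zero-based index of the chosen branch is written first as a variable-length zig-zag integer, then the value itself, and null occupies zero bytes. It uses only the JDK; UnionEncodingSketch and its methods are hypothetical names, and real code would rely on Avro's own encoders rather than this hand-rolled version.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical, JDK-only sketch of Avro's binary layout for the union ["long", "null"].
class UnionEncodingSketch {

    // Zig-zag maps signed values to unsigned ones so small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Writes n as a zig-zag varint: 7 bits per byte, high bit set on all but the last byte.
    static void writeVarLong(ByteArrayOutputStream out, long n) {
        long v = zigZag(n);
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Encodes a value of the union ["long", "null"]: branch index, then the value.
    static byte[] encodeNullableLong(Long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value == null) {
            writeVarLong(out, 1);      // branch 1 is "null"; null itself has no bytes
        } else {
            writeVarLong(out, 0);      // branch 0 is "long"
            writeVarLong(out, value);  // followed by the long value
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeNullableLong(null))); // prints [2]
        System.out.println(Arrays.toString(encodeNullableLong(3L)));   // prints [0, 6]
        System.out.println(encodeNullableLong(20000L).length + " bytes for 20000");
    }
}
```

  A missing population therefore takes a single byte, and a present one takes one index byte plus a short variable-length value.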

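  We now use the City schema defined above to convert the CSV text file worldcitiespop.txt into an Avro file. The code below shows the important steps in writing an Avro file. The static method CsvToAvro contains the main conversion code. It takes csvFilePath, avroFilePath (the path of the output file), and the path of the schema file. Avro has a dedicated Schema class, and parsing the schema file produces an instance of it. No code is generated from the schema here, so we use GenericRecord to build records against the schema and write data points with them. Had the schema been used to generate code, the result would be a City class that could be imported into the code below like any other Java class.

  The DataFileWriter class writes the actual records to the file. Its create method creates the Avro output file. A BufferedReader lets us read city records from the CSV file one line at a time. The getCity helper method reads a line, splits it on commas into tokens, and produces a GenericRecord. The GenericData.Record class is used to instantiate an Avro record; its constructor takes a Schema object.

  A GenericRecord is populated by calling its put method with a field name and the corresponding value. The isNumeric method checks whether a token is a number. Bad records are skipped and never written to the Avro file. A field that is not set with put is treated as null: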
MasteringHadoopCsvToAvro.java

package MasteringHadoop;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

import java.io.*;

public class MasteringHadoopCsvToAvro {

    public static void CsvToAvro(String csvFilePath, String avroFilePath, String schemaFile) throws IOException {
        // Read the schema
        Schema schema = (new Schema.Parser()).parse(new File(schemaFile));
        File avroFile = new File(avroFilePath);
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);

        // try-with-resources closes the writer and the reader even if a record fails
        try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
             BufferedReader bufferedReader = new BufferedReader(new FileReader(csvFilePath))) {
            dataFileWriter.create(schema, avroFile);

            // Pick the record builder that matches the schema's record name (City or Country)
            boolean isCitySchema = "City".equals(schema.getName());

            String commaSeparatedLine;
            while ((commaSeparatedLine = bufferedReader.readLine()) != null) {
                GenericRecord record = isCitySchema
                        ? getCity(commaSeparatedLine, schema)
                        : getCountry(commaSeparatedLine, schema);
                // Bad lines yield null and are skipped
                if (record != null)
                    dataFileWriter.append(record);
            }
        }
    }

    private static GenericRecord getCountry(String commaSeparatedLine, Schema schema) {
        GenericRecord country = null;
        String[] tokens = commaSeparatedLine.split(",");
        if (tokens.length == 2) {
            country = new GenericData.Record(schema);
            country.put("countryCode", tokens[0]);
            country.put("countryName", tokens[1]);
        }
        return country;
    }

    private static GenericRecord getCity(String commaSeparatedLine, Schema schema) {
        GenericRecord city = null;
        String[] tokens = commaSeparatedLine.split(",");
        // Filter out the bad tokens
        if (tokens.length == 7) {
            city = new GenericData.Record(schema);
            city.put("countryCode", tokens[0]);
            city.put("cityName", tokens[1]);
            city.put("cityFullName", tokens[2]);
            // Optional fields are left unset (null) when the token is empty or not numeric
            if (tokens[3] != null && tokens[3].length() > 0 && isNumeric(tokens[3])) {
                city.put("regionCode", Integer.parseInt(tokens[3]));
            }
            if (tokens[4] != null && tokens[4].length() > 0 && isNumeric(tokens[4])) {
                city.put("population", Long.parseLong(tokens[4]));
            }
            if (tokens[5] != null && tokens[5].length() > 0 && isNumeric(tokens[5])) {
                city.put("latitude", Float.parseFloat(tokens[5]));
            }
            if (tokens[6] != null && tokens[6].length() > 0 && isNumeric(tokens[6])) {
                city.put("longitude", Float.parseFloat(tokens[6]));
            }
        }
        return city;
    }

    public static void main(String[] args) {
        try {
            CsvToAvro(args[0], args[1], args[2]);
        }
        catch (IOException iox) {
            iox.printStackTrace();
        }
        System.out.println("Task has Finished!");
    }

    public static boolean isNumeric(String str) {
        try {
            Double.parseDouble(str);
        }
        catch (NumberFormatException nfe) {
            return false;
        }
        return true;
    }
}

Run arguments:

./input/countrycodes.txt ./output/countrycodes.avro  ./input/allcountries.avschema
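  One detail of the token-count check in getCity is worth seeing in isolation: String.split(",") discards trailing empty strings, so a line whose last fields are empty yields fewer than seven tokens and is treated as a bad record, while an empty field in the middle is kept as an empty token. The sketch below demonstrates this with the JDK alone. The class name CsvSplitSketch and both sample lines are made up for illustration; whether the real worldcitiespop.txt contains such lines is not something the original text states.

```java
// Hypothetical, JDK-only sketch of how String.split treats empty CSV fields.
class CsvSplitSketch {

    // Same call as in the listing: trailing empty tokens are discarded.
    static String[] splitDefault(String line) {
        return line.split(",");
    }

    // A negative limit keeps every field, including trailing empty ones.
    static String[] splitKeepingEmpty(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        // Made-up sample: the last four fields are empty.
        String trailingEmpty = "ad,aixas,Aixas,,,,";
        System.out.println(splitDefault(trailingEmpty).length);      // prints 3
        System.out.println(splitKeepingEmpty(trailingEmpty).length); // prints 7

        // Made-up sample: only the population field (index 4) is empty.
        String middleEmpty = "ad,aixas,Aixas,06,,42.48,1.47";
        String[] tokens = splitDefault(middleEmpty);
        System.out.println(tokens.length);       // prints 7
        System.out.println(tokens[4].isEmpty()); // prints true
    }
}
```

  If lines with trailing empty fields should be kept with null values instead of being dropped, passing a negative limit to split is one way to do it.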

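5.2.1 Avro and MapReduce

  Hadoop has broad support for Avro serialization and deserialization in MapReduce jobs. In Hadoop 1.x, the special classes AvroMapper and AvroReducer are required. In Hadoop 2.x, the built-in Mapper and Reducer classes can simply be reused, with AvroKey serving as an input or output type of the Mapper and Reducer classes.

  AvroKeyInputFormat is a special InputFormat class that reads AvroKey objects from input files. The following code reads worldcitiespop.avro, generated by the previous program, and computes the population of each country.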
MasteringHadoopAvroMapReduce.java

package MasteringHadoop;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class MasteringHadoopAvroMapReduce {

    private static String citySchema = "{\"namespace\": \"MasteringHadoop.avro\",\n" +
            " \"type\": \"record\",\n" +
            " \"name\": \"City\",\n" +
            " \"fields\": [\n" +
            "     {\"name\": \"countryCode\", \"type\": \"string\"},\n" +
            "     {\"name\": \"cityName\",  \"type\": \"string\"},\n" +
            "     {\"name\": \"cityFullName\", \"type\": \"string\"},\n" +
            "     {\"name\": \"regionCode\", \"type\": [\"int\",\"null\"]},\n" +
            "     {\"name\": \"population\", \"type\": [\"long\", \"null\"]},\n" +
            "     {\"name\": \"latitude\", \"type\": [\"float\", \"null\"]},\n" +
            "     {\"name\": \"longitude\", \"type\": [\"float\", \"null\"]}\n" +
            " ]\n" +
            "}";

    public static class MasteringHadoopAvroMapper extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, LongWritable> {
        private Text ccode = new Text();
        private LongWritable population = new LongWritable();
        private String inputSchema;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The schema string set by the driver is read back from the Configuration
            inputSchema = context.getConfiguration().get("citySchema");
        }

        @Override
        protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context) throws IOException, InterruptedException {
            GenericRecord record = key.datum();
            // Generic records return strings as Utf8 by default, so convert instead of casting
            String countryCode = record.get("countryCode").toString();
            Long cityPopulation = (Long) record.get("population");
            // Cities without a population value are ignored
            if (cityPopulation != null) {
                ccode.set(countryCode);
                population.set(cityPopulation.longValue());
                context.write(ccode, population);
            }
        }
    }

    public static class MasteringHadoopAvroReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private LongWritable total = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the populations of all cities that share a country code
            long totalPopulation = 0;
            for (LongWritable pop : values) {
                totalPopulation += pop.get();
            }
            total.set(totalPopulation);
            context.write(key, total);
        }
    }

    public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException, URISyntaxException {
        GenericOptionsParser parser = new GenericOptionsParser(args);
        Configuration config = parser.getConfiguration();
        String[] remainingArgs = parser.getRemainingArgs();

        config.set("citySchema", citySchema);

        Job job = Job.getInstance(config, "MasteringHadoop-AvroMapReduce");
        job.setJarByClass(MasteringHadoopAvroMapReduce.class);

        // Map output types must match what the mapper emits: Text keys, LongWritable values
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        job.addCacheFile(new URI(remainingArgs[2]));
        job.setMapperClass(MasteringHadoopAvroMapper.class);
        job.setReducerClass(MasteringHadoopAvroReducer.class);
        job.setNumReduceTasks(1);

        // The input key schema is parsed from the schema file given as the third argument
        Schema schema = (new Schema.Parser()).parse(new File(remainingArgs[2]));
        AvroJob.setInputKeySchema(job, schema);

        job.setInputFormatClass(AvroKeyInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        AvroKeyInputFormat.addInputPath(job, new Path(remainingArgs[0]));
        TextOutputFormat.setOutputPath(job, new Path(remainingArgs[1]));

        job.waitForCompletion(true);
    }
}

  Here the schema also travels to the tasks by a second route, as a string: the driver sets it under a key on the Configuration object. DistributedCache could equally be used to distribute the schema file. The overridden setup method reads the schema back in the map task.

Run arguments:

./input/worldcitiespop.avro ./output ./input/worldcitiespop.avschema

  Two details in this job are easy to get wrong and make it fail at run time. The map output classes declared in the driver must match the types the mapper actually emits (Text and LongWritable), and an Avro string field read through GenericRecord is a Utf8 object by default, so it has to be converted with toString() and cannot be cast to String.
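  Stripped of the Hadoop and Avro machinery, the job computes a per-country sum that skips cities whose population is null. The following JDK-only sketch reproduces that map and reduce logic in memory, which can be handy for checking expected output on a small sample. The class PopulationByCountrySketch, its City type, and the sample figures are all made up for illustration and are not taken from the data set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory equivalent of the mapper and reducer logic.
class PopulationByCountrySketch {

    // Minimal stand-in for the two City fields the job actually reads.
    static final class City {
        final String countryCode;
        final Long population; // null when the source record had no population

        City(String countryCode, Long population) {
            this.countryCode = countryCode;
            this.population = population;
        }
    }

    // "Map" step: drop cities without a population. "Reduce" step: sum per country code.
    static Map<String, Long> totalPopulation(List<City> cities) {
        Map<String, Long> totals = new TreeMap<>();
        for (City city : cities) {
            if (city.population != null) {
                totals.merge(city.countryCode, city.population, Long::sum);
            }
        }
        return totals;
    }

    // Made-up sample records, not real population figures.
    static List<City> sampleCities() {
        return Arrays.asList(
                new City("ad", 100L),
                new City("ad", null),
                new City("ad", 50L),
                new City("fr", 7L));
    }

    public static void main(String[] args) {
        System.out.println(totalPopulation(sampleCities())); // prints {ad=150, fr=7}
    }
}
```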

5.2.2 Avro and Pig

5.2.3 Avro and Hive

5.2.4 Avro and Protocol Buffers/Thrift