[Hadoop] Avro: A Data Serialization System
Published on 程序博客网, 2024/05/22 01:55
- Introduction to Avro
- Schema
- File layout
- Header and DataBlock declarations
- Test code
- Serialization and deserialization
- specific
- generic
- References
1. Introduction to Avro
Avro is a data serialization system created by Doug Cutting (the father of Hadoop). It was designed to address a shortcoming of Hadoop's Writable types: their lack of language portability. To support cross-language use, Avro's schemas are independent of any particular language's type system. See the official documentation for more of Avro's features [1].
Avro files are read and written according to a schema. Typically the schema is written in JSON, while the data itself is binary-encoded and compressed to reduce the amount of data transferred.
Schema
The field types in a schema fall into two groups:
- primitive types: null, boolean, int, long, float, double, bytes, and string
- complex types: record, enum, array, map, union, and fixed
Among the complex types, record is the most commonly used. Taking the twitter.avro file from [2] as an example, opening the file shows the following header:
```
Objavro.codecnullavro.schemaò{"type":"record","name":"twitter_schema","namespace":"com.miguno.avro","fields":[{"name":"username","type":"string","doc":"Name of the user account on Twitter.com"},{"name":"tweet","type":"string","doc":"The content of the user's Twitter message"},{"name":"timestamp","type":"long","doc":"Unix epoch time in milliseconds"}],"doc":"A basic schema for storing Twitter messages"}
```
After formatting, the schema looks like this:
```json
{
  "type": "record",
  "name": "twitter_schema",
  "namespace": "com.miguno.avro",
  "fields": [
    { "name": "username",  "type": "string", "doc": "Name of the user account on Twitter.com" },
    { "name": "tweet",     "type": "string", "doc": "The content of the user's Twitter message" },
    { "name": "timestamp", "type": "long",   "doc": "Unix epoch time in milliseconds" }
  ],
  "doc": "A basic schema for storing Twitter messages"
}
```
Here, name is the field's name, type specifies the field's type, and doc is a more detailed description of the field.
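As a minimal sketch, a schema like the one above can also be parsed programmatically with Avro's Schema.Parser (this assumes the Avro Java library, org.apache.avro, is on the classpath; the doc attributes are dropped here for brevity):

```java
import org.apache.avro.Schema;

public class ParseSchema {
    // Parse a JSON schema string into an Avro Schema object
    static Schema parse(String json) {
        return new Schema.Parser().parse(json);
    }

    public static void main(String[] args) {
        String json = "{\"type\":\"record\",\"name\":\"twitter_schema\","
                + "\"namespace\":\"com.miguno.avro\",\"fields\":["
                + "{\"name\":\"username\",\"type\":\"string\"},"
                + "{\"name\":\"tweet\",\"type\":\"string\"},"
                + "{\"name\":\"timestamp\",\"type\":\"long\"}]}";
        Schema schema = parse(json);
        // The parsed object exposes the record's name and field types
        System.out.println(schema.getName());                                // twitter_schema
        System.out.println(schema.getField("timestamp").schema().getType()); // LONG
    }
}
```

The same Schema object can then be handed to a DatumReader or DatumWriter, as in the serialization examples later in this article.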
2. File Layout
The figure in [3] describes the Avro file format in detail: a file consists of a header followed by one or more data blocks. The header consists of metadata plus a 16-byte sync marker; the metadata contains the codec and the schema. The codec is the compression applied to the data blocks, either null (no compression) or deflate. Deflate is the same algorithm gzip uses; in my experience it achieves a compression ratio of 6x or better on this kind of data, though I have not measured this rigorously. A sync marker is also written between consecutive data blocks; see [4] for details. The sync marker is what makes file splitting and synchronization possible in the MapReduce phase; indeed, Avro was designed with MapReduce in mind.
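To make the codec choice concrete, here is a minimal sketch of writing an Avro file with the deflate codec via DataFileWriter.setCodec (assuming the Avro Java library is on the classpath; the file name, compression level, and record values are illustrative):

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class DeflateWrite {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"twitter_schema\",\"fields\":["
            + "{\"name\":\"username\",\"type\":\"string\"},"
            + "{\"name\":\"tweet\",\"type\":\"string\"},"
            + "{\"name\":\"timestamp\",\"type\":\"long\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("username", "miguno");
        rec.put("tweet", "Rock: Nerf paper, scissors is fine.");
        rec.put("timestamp", 1366150681L);

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        // Select the deflate codec before create(); levels 1-9 trade speed for ratio
        writer.setCodec(CodecFactory.deflateCodec(6));
        writer.create(schema, new File("twitter-deflate.avro"));
        writer.append(rec);
        writer.close(); // flushes the last block and its trailing sync marker
    }
}
```

The codec name ends up in the header metadata under "avro.codec", which is why the reader in the test code below can report it via getMetaString.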
Header and DataBlock declarations
The Header and DataBlock declarations appear in the Avro source in org.apache.avro.file.DataFileStream.java:
```java
// org.apache.avro.file.DataFileStream.java
public static final class Header {
    Schema schema;
    Map<String, byte[]> meta = new HashMap<String, byte[]>();
    private transient List<String> metaKeyList = new ArrayList<String>();
    byte[] sync = new byte[DataFileConstants.SYNC_SIZE]; // byte[16]
    private Header() {}
}

static class DataBlock {
    private byte[] data;
    private long numEntries;
    private int blockSize;
    private int offset = 0;
    private boolean flushOnWrite = true;
    private DataBlock(long numEntries, int blockSize) {
        this.data = new byte[blockSize];
        this.numEntries = numEntries;
        this.blockSize = blockSize;
    }
}
```
Test code
The following test code exercises the Header and DataBlock. There are two ways to obtain the schema:
- getSchema() returns Header.schema directly;
- getMetaString("avro.schema") looks the schema up as a byte[] in Map<String, byte[]> meta and converts it to a String. The keySet of meta is ["avro.codec", "avro.schema"].
```java
DataFileReader<Void> reader = new DataFileReader<Void>(
        new FsInput(new Path("twitter.avro"), new Configuration()),
        new GenericDatumReader<Void>());
// print schema
System.out.println(reader.getSchema().toString(true));
// print meta
List<String> metaKeyList = reader.getMetaKeys();
System.out.println(metaKeyList.toString());
System.out.println(reader.getMetaString("avro.codec"));
System.out.println(reader.getMetaString("avro.schema"));
// print block count
System.out.println(reader.getBlockCount());
// print the data in the data block
System.out.println(reader.next());
```
3. Serialization and Deserialization
The official documentation demonstrates two serialization styles: specific and generic.
specific
The specific style works from the generated User class, extracting the schema from it to drive Avro parsing.
```java
// Serialize user1, user2 and user3 to disk
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();

// Deserialize Users from disk
File file = new File("users.avro");
DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
DataFileReader<User> dataFileReader = new DataFileReader<User>(file, userDatumReader);
User user = null;
while (dataFileReader.hasNext()) {
    // Reuse user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}
```
generic
The generic style parses against a schema obtained in advance. Because an Avro file carries its schema in the file header, the generic style is the more common way to read Avro files.
```java
GenericRecord user1 = new GenericData.Record(schema);
user1.put("name", "Alyssa");
user1.put("favorite_number", 256);
// Leave favorite color null

GenericRecord user2 = new GenericData.Record(schema);
user2.put("name", "Ben");
user2.put("favorite_number", 7);
user2.put("favorite_color", "red");

// Serialize user1 and user2 to disk
File file = new File("users.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.close();

// Deserialize users from disk
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, datumReader);
GenericRecord user = null;
while (dataFileReader.hasNext()) {
    // Reuse user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}
```
The avro-tools jar provides a rich set of operations on Avro files, including splitting them, which is handy for producing test data.
```
Available tools:
  compile        Generates Java code for the given schema.
  concat         Concatenates avro files without re-compressing.
  fragtojson     Renders a binary-encoded Avro datum as JSON.
  fromjson       Reads JSON records and writes an Avro data file.
  fromtext       Imports a text file into an avro data file.
  getmeta        Prints out the metadata of an Avro data file.
  getschema      Prints out schema of an Avro data file.
  idl            Generates a JSON schema from an Avro IDL file
  induce         Induce schema/protocol from Java class/interface via reflection.
  jsontofrag     Renders a JSON-encoded Avro datum as binary.
  recodec        Alters the codec of a data file.
  rpcprotocol    Output the protocol of a RPC service
  rpcreceive     Opens an RPC Server and listens for one message.
  rpcsend        Sends a single RPC message.
  tether         Run a tethered mapreduce job.
  tojson         Dumps an Avro data file as JSON, one record per line.
  totext         Converts an Avro data file to a text file.
  trevni_meta    Dumps a Trevni file's metadata as JSON.
  trevni_random  Create a Trevni file filled with random instances of a schema.
  trevni_tojson  Dumps a Trevni file as JSON.
```
References
1. Apache Avro documentation.
2. miguno, avro-cli-examples.
3. xyw_Eliot, Avro简介.
4. guibin, AVRO文件结构分析.