[Hadoop] Avro Source Code Analysis (Part 2): Generic Deserialization
Source: Internet. Editor: 程序博客网. Date: 2024/05/11 11:38
- File Reading
- Class Hierarchy
- DataFileStream
- DataFileReader
- Reading the Header and Data Blocks
- Initializing the Header
- Reading Data Blocks
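This article focuses on Avro's Generic deserialization path. We start from the following code, which reads (deserializes) an Avro file.

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumReader;

    File file = new File("e://twitter.avro");
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
    DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, datumReader);
    GenericRecord datum = null;
    while (dataFileReader.hasNext()) {
        // Passing the previous record lets the reader reuse it instead of allocating a new one.
        datum = dataFileReader.next(datum);
        System.out.println(datum);
    }
    dataFileReader.close();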
1. File Reading
Class Hierarchy
DataFileReader extends DataFileStream and implements the FileReader interface:

    public class DataFileReader<D> extends DataFileStream<D> implements FileReader<D>

The class diagram is as follows:
[Figure: class diagram of DataFileReader, DataFileStream and FileReader]
DataFileStream
The fields of DataFileStream:

    private DatumReader<D> reader;
    private long blockSize;
    private boolean availableBlock = false;
    private Header header;

    /** Decoder on raw input stream. (Used for metadata.) */
    BinaryDecoder vin;
    /** Secondary decoder, for datums.
     * (Different than vin for block segments.) */
    BinaryDecoder datumIn = null;

    ByteBuffer blockBuffer;
    long blockCount;     // # entries in block
    long blockRemaining; // # entries remaining in block
    byte[] syncBuffer = new byte[DataFileConstants.SYNC_SIZE];
    private Codec codec;

Here, vin is used to read the header, while datumIn is used to read the datums inside a data block.
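DataFileReader
The DataFileReader constructors build the reader from a File and a DatumReader. The File is first wrapped in a SeekableFileInput, which the second constructor then wraps in a SeekableInputStream.

    /** Construct a reader for a file. */
    public DataFileReader(File file, DatumReader<D> reader) throws IOException {
        this(new SeekableFileInput(file), reader);
    }

    /** Construct a reader for a file. */
    public DataFileReader(SeekableInput sin, DatumReader<D> reader) throws IOException {
        super(reader);
        this.sin = new SeekableInputStream(sin);
        initialize(this.sin);
        blockFinished();
    }

Here, initialize() initializes the header; it is described in detail below. SeekableInput is an interface, and both of the following classes implement it:

    class SeekableFileInput extends FileInputStream implements SeekableInput
    static class SeekableInputStream extends InputStream implements SeekableInput

Their inheritance relationships:
[Figure: class diagram of SeekableInput, SeekableFileInput and SeekableInputStream]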
2. Reading the Header and Data Blocks
As mentioned in the previous article, an Avro file consists of a header followed by multiple data blocks.
Initializing the Header
Once the file has been opened, the header is initialized in DataFileStream's initialize() method. It reads the magic bytes, then the metadata map, then the sync marker, and finally parses the schema stored in the metadata and hands it to the DatumReader.

    /** Initialize the stream by reading from its head. */
    void initialize(InputStream in) throws IOException {
        this.header = new Header();
        this.vin = DecoderFactory.get().binaryDecoder(in, vin);
        byte[] magic = new byte[DataFileConstants.MAGIC.length];
        try {
            vin.readFixed(magic); // read magic
        } catch (IOException e) {
            throw new IOException("Not a data file.");
        }
        if (!Arrays.equals(DataFileConstants.MAGIC, magic))
            throw new IOException("Not a data file.");

        long l = vin.readMapStart(); // read meta data
        if (l > 0) {
            do {
                for (long i = 0; i < l; i++) {
                    String key = vin.readString(null).toString();
                    ByteBuffer value = vin.readBytes(null);
                    byte[] bb = new byte[value.remaining()];
                    value.get(bb);
                    header.meta.put(key, bb);
                    header.metaKeyList.add(key);
                }
            } while ((l = vin.mapNext()) != 0);
        }
        vin.readFixed(header.sync); // read sync

        // finalize the header
        header.metaKeyList = Collections.unmodifiableList(header.metaKeyList);
        header.schema = Schema.parse(getMetaString(DataFileConstants.SCHEMA), false);
        this.codec = resolveCodec();
        reader.setSchema(header.schema);
    }
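To make the byte-level steps of initialize() concrete, here is a minimal sketch that parses a container-file header using only the JDK. It is not Avro's implementation: HeaderSketch, parse, sampleHeader and the write helpers are hypothetical names introduced for illustration, and the sketch assumes the layout described by the Avro specification (magic bytes "Obj" plus a version byte of 1, a metadata map encoded in blocks with zigzag variable-length counts, and a 16-byte sync marker).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the header parsing in initialize(); not Avro's own code. */
final class HeaderSketch {
    static final byte[] MAGIC = {'O', 'b', 'j', 1}; // assumed equal to DataFileConstants.MAGIC
    static final int SYNC_SIZE = 16;

    final Map<String, byte[]> meta = new LinkedHashMap<>();
    final byte[] sync = new byte[SYNC_SIZE];

    /** Reads a zigzag-encoded variable-length long. */
    static long readLong(DataInputStream in) throws IOException {
        long raw = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            int b = in.readUnsignedByte();       // throws EOFException on truncation
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return (raw >>> 1) ^ -(raw & 1); // undo zigzag
            }
        }
        throw new IOException("Invalid long encoding");
    }

    /** Reads a length-prefixed byte sequence (strings use the same wire format). */
    static byte[] readBytes(DataInputStream in) throws IOException {
        byte[] buf = new byte[Math.toIntExact(readLong(in))];
        in.readFully(buf);
        return buf;
    }

    static HeaderSketch parse(byte[] file) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            byte[] magic = new byte[MAGIC.length];
            in.readFully(magic);
            if (!Arrays.equals(MAGIC, magic)) {
                throw new IOException("Not a data file.");
            }
            HeaderSketch h = new HeaderSketch();
            // The metadata map arrives in blocks; a block count of 0 terminates it.
            for (long n = readLong(in); n != 0; n = readLong(in)) {
                if (n < 0) {      // negative count: a byte size follows, skipped here
                    n = -n;
                    readLong(in);
                }
                for (long i = 0; i < n; i++) {
                    String key = new String(readBytes(in), StandardCharsets.UTF_8);
                    h.meta.put(key, readBytes(in));
                }
            }
            in.readFully(h.sync);
            return h;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // --- helpers that build a tiny header so the sketch can be exercised ---
    static void writeLong(ByteArrayOutputStream out, long v) {
        long raw = (v << 1) ^ (v >> 63); // zigzag
        while ((raw & ~0x7FL) != 0) {
            out.write((int) ((raw & 0x7F) | 0x80));
            raw >>>= 7;
        }
        out.write((int) raw);
    }

    static void writeBytes(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, b.length);
        out.write(b, 0, b.length);
    }

    static byte[] sampleHeader() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MAGIC, 0, MAGIC.length);
        writeLong(out, 2);                 // one map block with two entries
        writeBytes(out, "avro.schema");
        writeBytes(out, "\"string\"");
        writeBytes(out, "avro.codec");
        writeBytes(out, "null");
        writeLong(out, 0);                 // end of map
        byte[] sync = new byte[SYNC_SIZE];
        Arrays.fill(sync, (byte) 7);       // arbitrary marker for the example
        out.write(sync, 0, sync.length);
        return out.toByteArray();
    }
}
```

Calling HeaderSketch.parse(HeaderSketch.sampleHeader()) yields a header whose meta map holds the two entries written by sampleHeader(). The real initialize() goes one step further and parses the "avro.schema" entry into a Schema.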
Reading Data Blocks
The class GenericDatumReader implements the DatumReader interface. Class diagram:
[Figure: class diagram of GenericDatumReader and DatumReader]
The fields of GenericDatumReader:

    private final GenericData data;
    private Schema actual;
    private Schema expected;
    private ResolvingDecoder creatorResolver = null;
    private final Thread creator;

Next, how does dataFileReader.next(datum) read a datum from a data block? In DataFileStream, next() reads the datum by calling DatumReader.read():

    public D next(D reuse) throws IOException {
        if (!hasNext())
            throw new NoSuchElementException();
        D result = reader.read(reuse, datumIn);
        if (0 == --blockRemaining) {
            blockFinished();
        }
        return result;
    }
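The bookkeeping in next() can be modelled in memory. The sketch below is a simplified illustration and not Avro's code: BlockIterSketch and its blocksFinished counter are hypothetical, each inner list stands in for one data block, and hasNext() is assumed to load the next block whenever the current one is exhausted.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical in-memory model of block-by-block iteration; not Avro's implementation. */
final class BlockIterSketch<D> implements Iterator<D> {
    private final Iterator<List<D>> blocks; // each list stands in for one data block
    private Iterator<D> current;            // plays the role of datumIn
    private long blockRemaining;            // entries remaining in the current block
    int blocksFinished;                     // how many blocks have been fully consumed

    BlockIterSketch(List<List<D>> blocks) {
        this.blocks = blocks.iterator();
    }

    @Override
    public boolean hasNext() {
        // Load blocks until one has entries; in this sketch empty blocks are skipped.
        while (blockRemaining == 0) {
            if (!blocks.hasNext()) {
                return false;
            }
            List<D> block = blocks.next();
            current = block.iterator();
            blockRemaining = block.size();
        }
        return true;
    }

    @Override
    public D next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        D result = current.next();
        if (0 == --blockRemaining) {
            blocksFinished++; // the point where DataFileStream calls blockFinished()
        }
        return result;
    }
}
```

The counter reaches zero on the last datum of each block, so the end-of-block hook fires once per block and not once per datum.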
The call reader.setSchema(header.schema); in initialize() above has already set the DatumReader's actual schema to header.schema. Because no separate reader schema was supplied in this example, expected is set to the same schema. Now look at the implementation of GenericDatumReader.read():

    @Override
    @SuppressWarnings("unchecked")
    public D read(D reuse, Decoder in) throws IOException {
        ResolvingDecoder resolver = getResolver(actual, expected);
        resolver.configure(in);
        D result = (D) read(reuse, expected, resolver);
        resolver.drain();
        return result;
    }

    /** Called to read data. */
    protected Object read(Object old, Schema expected, ResolvingDecoder in) throws IOException {
        Object datum = readWithoutConversion(old, expected, in);
        LogicalType logicalType = expected.getLogicalType();
        if (logicalType != null) {
            Conversion<?> conversion = getData().getConversionFor(logicalType);
            if (conversion != null) {
                return convert(datum, expected, logicalType, conversion);
            }
        }
        return datum;
    }

    protected Object readWithoutConversion(Object old, Schema expected, ResolvingDecoder in) throws IOException {
        switch (expected.getType()) {
            case RECORD:  return readRecord(old, expected, in);
            case ENUM:    return readEnum(expected, in);
            case ARRAY:   return readArray(old, expected, in);
            case MAP:     return readMap(old, expected, in);
            case UNION:   return read(old, expected.getTypes().get(in.readIndex()), in);
            case FIXED:   return readFixed(old, expected, in);
            case STRING:  return readString(old, expected, in);
            case BYTES:   return readBytes(old, expected, in);
            case INT:     return readInt(old, expected, in);
            case LONG:    return in.readLong();
            case FLOAT:   return in.readFloat();
            case DOUBLE:  return in.readDouble();
            case BOOLEAN: return in.readBoolean();
            case NULL:    in.readNull(); return null;
            default: throw new AvroRuntimeException("Unknown type: " + expected);
        }
    }
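The type dispatch in readWithoutConversion, and the UNION case in particular, can be illustrated with a small JDK-only decoder. This is a sketch under stated assumptions and not Avro's API: DispatchSketch, Node and Type are hypothetical names, only five types are covered, and the wire format assumed is the one from the Avro specification (zigzag variable-length integers, length-prefixed UTF-8 strings, a one-byte boolean, and a branch index in front of a union value).

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical miniature of schema-driven decoding; not Avro's implementation. */
final class DispatchSketch {
    enum Type { NULL, BOOLEAN, LONG, STRING, UNION }

    /** Minimal schema node: a type plus union branches (empty for non-unions). */
    static final class Node {
        final Type type;
        final List<Node> branches;

        Node(Type type, Node... branches) {
            this.type = type;
            this.branches = Arrays.asList(branches);
        }
    }

    /** Reads a zigzag-encoded variable-length long. */
    static long readLong(DataInputStream in) throws IOException {
        long raw = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            int b = in.readUnsignedByte();
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return (raw >>> 1) ^ -(raw & 1);
            }
        }
        throw new IOException("Invalid long encoding");
    }

    /** The schema decides how the next bytes are interpreted, as in readWithoutConversion. */
    static Object read(Node schema, DataInputStream in) throws IOException {
        switch (schema.type) {
            case NULL:
                return null;                         // null occupies zero bytes
            case BOOLEAN:
                return in.readUnsignedByte() == 1;
            case LONG:
                return readLong(in);
            case STRING: {
                byte[] utf8 = new byte[Math.toIntExact(readLong(in))];
                in.readFully(utf8);
                return new String(utf8, StandardCharsets.UTF_8);
            }
            case UNION: {
                // Read the branch index first, then recurse with that branch's schema.
                int index = Math.toIntExact(readLong(in));
                return read(schema.branches.get(index), in);
            }
            default:
                throw new IllegalStateException("Unknown type: " + schema.type);
        }
    }

    static Object decode(Node schema, byte[] bytes) {
        try {
            return read(schema, new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a union of NULL and STRING, the bytes 02 04 68 69 select branch 1 (zigzag 02), announce a length of 2 (zigzag 04), and carry the UTF-8 text "hi". A single byte 00 selects branch 0 and decodes to null.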
The read logic differs for each schema type. As a concrete case, here is the implementation of readRecord:

    /** Called to read a record instance. May be overridden for alternate record
     * representations. */
    protected Object readRecord(Object old, Schema expected, ResolvingDecoder in) throws IOException {
        Object r = data.newRecord(old, expected);
        Object state = data.getRecordState(r, expected);

        for (Field f : in.readFieldOrder()) {
            int pos = f.pos();
            String name = f.name();
            Object oldDatum = null;
            if (old != null) {
                oldDatum = data.getField(r, name, pos, state);
            }
            readField(r, f, oldDatum, in, state);
        }

        return r;
    }
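One consequence of readRecord is worth seeing in isolation: when a previous record is passed in as old, it may be reused and overwritten in place. The following sketch models that behaviour with plain JDK types. ReuseSketch and Rec are hypothetical; the reuse condition here compares field-name lists, which is an assumption made for the example and may differ from the check GenericData.newRecord actually performs.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical model of record reuse in readRecord; not Avro's implementation. */
final class ReuseSketch {
    /** A positional record, loosely analogous to an indexed generic record. */
    static final class Rec {
        final List<String> fieldNames;
        final Object[] values;

        Rec(List<String> fieldNames) {
            this.fieldNames = fieldNames;
            this.values = new Object[fieldNames.size()];
        }

        @Override
        public String toString() {
            return fieldNames + "=" + Arrays.toString(values);
        }
    }

    /** Reuses old when its shape matches (assumed condition), else allocates a new record. */
    static Rec newRecord(Rec old, List<String> fieldNames) {
        return (old != null && old.fieldNames.equals(fieldNames)) ? old : new Rec(fieldNames);
    }

    /** Fills the record field by field; the iterator stands in for the decoder. */
    static Rec readRecord(Rec old, List<String> fieldNames, Iterator<Object> in) {
        Rec r = newRecord(old, fieldNames);
        for (int pos = 0; pos < fieldNames.size(); pos++) {
            r.values[pos] = in.next();
        }
        return r;
    }
}
```

Reading a second record with the first passed as old returns the same instance with new field values. This is why a loop of the form datum = dataFileReader.next(datum) should copy any record it needs to keep beyond the current iteration.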