Source-Code Analysis of Core Big Data Technologies: Avro, Part 2
Source: Internet. Editor: 程序博客网. Time: 2024/05/23 12:03
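With the Avro-trunk source checked out, the first thing to analyze is
the source under avro-trunk_src\lang\java.
The source tree includes avro, compiler, ipc, mapred, protobuf, thrift and so on.
We start with avro.
Its top-level classes center on JsonProperties [the top-level abstract class],
Schema and Protocol [both extend JsonProperties],
SchemaNormalization and SchemaBuilder,
and the exception classes.
This layout explains why the Avro core supports schemas written in JSON.
The Schema class shows which schema types are supported: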
public enum Type {
  RECORD, ENUM, ARRAY, MAP, UNION, FIXED, STRING, BYTES,
  INT, LONG, FLOAT, DOUBLE, BOOLEAN, NULL;
  private String name;
  private Type() { this.name = this.name().toLowerCase(); }
  public String getName() { return name; }
};
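The lower-casing in the enum constructor is what ties each constant to the type name that appears in a JSON schema ("record", "int", and so on). A minimal stand-alone sketch of that mapping follows; TypeNameDemo and fromJsonName are hypothetical names used only for illustration and are not part of Avro.

```java
import java.util.Locale;

public class TypeNameDemo {
    // Hypothetical stand-in for Schema.Type, reduced to three constants.
    public enum Type {
        RECORD, STRING, INT;

        private final String name;

        Type() { this.name = name().toLowerCase(Locale.ENGLISH); }

        public String getName() { return name; }
    }

    // Maps a JSON type name back to its constant; assumes the name is valid.
    public static Type fromJsonName(String jsonName) {
        return Type.valueOf(jsonName.toUpperCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        System.out.println(Type.RECORD.getName());
        System.out.println(fromJsonName("int"));
    }
}
```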
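Protocol, in turn, contains two kinds of Message.
Internally, JsonProperties keeps its properties in
Map<String,JsonNode> props = new LinkedHashMap<String,JsonNode>(1);
Note its two synchronized methods:
public synchronized JsonNode getJsonProp(String name) {
  return props.get(name);
}
and
public synchronized void addProp(String name, JsonNode value) {}
Both lock on the same object, so reads and writes of props are synchronized with each other.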
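The synchronized read/write pair can be sketched without any Avro or Jackson dependency. In the sketch below, PropsDemo is a hypothetical class, String values stand in for JsonNode, and the "set once" rule inside addProp is an assumption made for illustration, since the excerpt above elides the method body.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PropsDemo {
    // Insertion-ordered map, mirroring the LinkedHashMap shown above.
    private final Map<String, String> props = new LinkedHashMap<String, String>(1);

    // Reader and writer lock on the same monitor (this), so a read never
    // observes the map in the middle of a write.
    public synchronized String getProp(String name) {
        return props.get(name);
    }

    // Assumption: a property may be set once. Re-adding the same value is a
    // no-op; a different value is rejected.
    public synchronized void addProp(String name, String value) {
        String old = props.get(name);
        if (old == null) {
            props.put(name, value);
        } else if (!old.equals(value)) {
            throw new IllegalStateException("Can't overwrite property: " + name);
        }
    }

    public static void main(String[] args) {
        PropsDemo p = new PropsDemo();
        p.addProp("doc", "a user record");
        System.out.println(p.getProp("doc"));
    }
}
```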
Message and TwoWayMessage are defined inside Protocol as follows:
public class Message extends JsonProperties {
  private String name;
  private String doc;
  private Schema request;
TwoWayMessage is as follows:
private class TwoWayMessage extends Message {
  private Schema response;
  private Schema errors;
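SchemaBuilder, as its name suggests, builds Schema instances.
It contains builders for the various schema types,
as well as the FieldDefault family of classes and Completion:
private abstract static class Completion<R> {
  protected abstract R complete(Schema schema);
}
FieldDefault is defined as follows: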
private static abstract class FieldDefault<R, S extends FieldDefault<R, S>> extends Completion<S> {
  private final FieldBuilder<R> field;
  private Schema schema;
  protected FieldDefault(FieldBuilder<R> field) {
    this.field = field;
  }
  /** Completes this field with no default value **/
  public final FieldAssembler<R> noDefault() {
    return field.completeField(schema);
  }
  private FieldAssembler<R> usingDefault(Object defaultVal) {
    return field.completeField(schema, defaultVal);
  }
  @Override
  protected final S complete(Schema schema) {
    this.schema = schema;
    return self();
  }
  protected abstract S self();
}
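The bound S extends FieldDefault<R, S> together with the abstract self() method is the self-typed (recursive generic) builder pattern: complete() can return the concrete subclass without a cast, so subclass-only methods can be chained. A reduced sketch of the idea follows; SelfTypeDemo, Step and IntDefault are hypothetical names, and a String stands in for Schema.

```java
public class SelfTypeDemo {
    // Base step: stores a value and returns the concrete subtype from complete().
    public abstract static class Step<S extends Step<S>> {
        protected String schema;

        public final S complete(String schema) {
            this.schema = schema;
            return self();   // no cast needed: the subclass supplies itself
        }

        protected abstract S self();
    }

    // Concrete step that adds a method existing only on the subtype.
    public static final class IntDefault extends Step<IntDefault> {
        @Override
        protected IntDefault self() { return this; }

        public String intDefault(int value) {
            return schema + " default " + value;
        }
    }

    public static void main(String[] args) {
        // complete() returns IntDefault, so intDefault() chains directly.
        System.out.println(new IntDefault().complete("int").intDefault(42));
    }
}
```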
Also note the last method, toJsonNode:
// create default value JsonNodes from objects
private static JsonNode toJsonNode(Object o) {
  try {
    String s;
    if (o instanceof ByteBuffer) {
      // special case since GenericData.toString() is incorrect for bytes
      // note that this does not handle the case of a default value with nested bytes
      ByteBuffer bytes = ((ByteBuffer) o);
      bytes.mark();
      byte[] data = new byte[bytes.remaining()];
      bytes.get(data);
      bytes.reset(); // put the buffer back the way we got it
      s = new String(data, "ISO-8859-1");
      char[] quoted = JsonStringEncoder.getInstance().quoteAsString(s);
      s = "\"" + new String(quoted) + "\"";
    } else {
      s = GenericData.get().toString(o);
    }
    return new ObjectMapper().readTree(s);
  } catch (IOException e) {
    throw new SchemaBuilderException(e);
  }
}
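This method converts an Object into a JsonNode. A java.nio.ByteBuffer is handled as a special case: its bytes are decoded as ISO-8859-1 and quoted as a JSON string; any other value is first rendered by GenericData.toString() and then parsed by ObjectMapper.
The JsonNode here is org.codehaus.jackson.JsonNode.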
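The mark/get/reset sequence in the ByteBuffer branch reads the buffer's contents without consuming them. A minimal sketch using only the JDK follows; ByteBufferPeekDemo and peekLatin1 are hypothetical names. Note that mark() replaces any mark the caller had already set.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferPeekDemo {
    // Copies the remaining bytes into a String without moving the caller's position.
    public static String peekLatin1(ByteBuffer bytes) {
        bytes.mark();                          // remember the current position
        byte[] data = new byte[bytes.remaining()];
        bytes.get(data);                       // advances the position to the limit
        bytes.reset();                         // restore the remembered position
        // ISO-8859-1 maps each byte 0..255 to exactly one char, so no byte is lost.
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {0x41, (byte) 0xFF, 0x42});
        String s = peekLatin1(buf);
        System.out.println(s.length() + " chars, " + buf.remaining() + " bytes still readable");
    }
}
```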
The remaining packages under avro are
data, file, generic, io, ipc, reflect, specific, tool and util.
package data:
It contains
Json,
which includes a Writer and a Reader
RecordBuilder
public interface RecordBuilder<T> {
  T build();
}
RecordBuilderBase
public abstract class RecordBuilderBase<T extends IndexedRecord>
  implements RecordBuilder<T>
This base class provides template methods for validation.
ErrorBuilder
A builder interface that extends RecordBuilder:
public interface ErrorBuilder<T> extends RecordBuilder<T> {
  /** Gets the value */
  Object getValue();
  /** Sets the value */
  ErrorBuilder<T> setValue(Object value);
  /** Checks whether the value has been set */
  boolean hasValue();
  /** Clears the value */
  ErrorBuilder<T> clearValue();
  /** Gets the error cause */
  Throwable getCause();
  /** Sets the error cause */
  ErrorBuilder<T> setCause(Throwable cause);
  /** Checks whether the cause has been set */
  boolean hasCause();
  /** Clears the cause */
  ErrorBuilder<T> clearCause();
}
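The file package contains the following class hierarchy.
The abstract class Codec (Codec.java) declares compress and decompress, along with getName, equals, hashCode and toString: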
public abstract class Codec {
  /** Name of the codec; written to the file's metadata. */
  public abstract String getName();
  /** Compresses the input data */
  public abstract ByteBuffer compress(ByteBuffer uncompressedData) throws IOException;
  /** Decompress the data */
  public abstract ByteBuffer decompress(ByteBuffer compressedData) throws IOException;
  /**
   * Codecs must implement an equals() method. Two codecs, A and B are equal
   * if: the result of A and B decompressing content compressed by A is the same
   * AND the result of A and B decompressing content compressed by B is the same
   **/
  @Override
  public abstract boolean equals(Object other);
  /**
   * Codecs must implement a hashCode() method that is consistent with equals().*/
  @Override
  public abstract int hashCode();
  @Override
  public String toString() {
    return getName();
  }
}
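To make the compress/decompress contract concrete, here is a stand-alone round-trip sketch over ByteBuffer using java.util.zip. DeflateSketch is a hypothetical class, not Avro's DeflateCodec; it uses the JDK's default zlib framing for simplicity, which is an assumption of the sketch rather than a statement about how any Avro codec frames its data. It also assumes array-backed buffers such as those produced by ByteBuffer.wrap.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateSketch {
    // Compresses the readable bytes of an array-backed buffer.
    public static ByteBuffer compress(ByteBuffer in) {
        Deflater deflater = new Deflater();
        deflater.setInput(in.array(), in.arrayOffset() + in.position(), in.remaining());
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return ByteBuffer.wrap(out.toByteArray());
    }

    // Reverses compress(); fails with IOException on truncated or corrupt input.
    public static ByteBuffer decompress(ByteBuffer in) throws IOException {
        Inflater inflater = new Inflater();
        inflater.setInput(in.array(), in.arrayOffset() + in.position(), in.remaining());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IOException("Truncated or invalid compressed data");
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IOException(e);
        } finally {
            inflater.end();
        }
        return ByteBuffer.wrap(out.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer raw = ByteBuffer.wrap(new byte[1000]);   // 1000 zero bytes
        ByteBuffer packed = compress(raw);
        System.out.println(packed.remaining() < 1000);
        System.out.println(decompress(packed).remaining());
    }
}
```

Because compress reads through the backing array, the input buffer's position is left untouched, which matches the spirit of treating the argument as read-only data.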
Its subclasses are:
public class BZip2Codec extends Codec: implements bzip2 compression and decompression.
Internally it depends on org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
and org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream.
class DeflateCodec extends Codec: implements DEFLATE (RFC1951) compression and decompression.
final class NullCodec extends Codec: implements the "null" (pass-through) codec.
class SnappyCodec extends Codec: implements Snappy compression and decompression.
Internally it uses CRC32 crc32 = new CRC32();
Note the modifiers on these four subclasses: one is public (BZip2Codec), two declare no access modifier and are therefore package-private (DeflateCodec and SnappyCodec), and one is final (NullCodec).
There is also an abstract factory for Codec:
public abstract class CodecFactory
Its createInstance is the abstract factory method:
/** Creates internal Codec. */
protected abstract Codec createInstance();
Factory registration:
public static CodecFactory addCodec(String name, CodecFactory c) {
  return REGISTERED.put(name, c);
}
Factory lookup by name:
public static CodecFactory fromString(String s) {
  CodecFactory o = REGISTERED.get(s);
  if (o == null) {
    throw new AvroRuntimeException("Unrecognized codec: " + s);
  }
  return o;
}
And the corresponding concrete factory instances:
public static CodecFactory nullCodec() {
  return NullCodec.OPTION;
}
/** Deflate codec, with specific compression.
 * compressionLevel should be between 1 and 9, inclusive. */
public static CodecFactory deflateCodec(int compressionLevel) {
  return new DeflateCodec.Option(compressionLevel);
}
/** Snappy codec.*/
public static CodecFactory snappyCodec() {
  return new SnappyCodec.Option();
}
/** bzip2 codec.*/
public static CodecFactory bzip2Codec() {
  return new BZip2Codec.Option();
}
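The register-then-look-up flow of addCodec and fromString can be sketched with a plain map. CodecRegistryDemo and its Factory type are hypothetical, and createInstance returns a String here purely to keep the sketch dependency-free. One detail worth noticing in the excerpt above: since addCodec returns the result of Map.put, the caller gets back the previously registered factory (or null), not the one just added.

```java
import java.util.HashMap;
import java.util.Map;

public class CodecRegistryDemo {
    // Hypothetical factory type; each subclass knows how to build one thing.
    public abstract static class Factory {
        public abstract String createInstance();
    }

    private static final Map<String, Factory> REGISTERED = new HashMap<String, Factory>();

    // Like Map.put: returns the factory previously registered under name, or null.
    public static Factory addCodec(String name, Factory f) {
        return REGISTERED.put(name, f);
    }

    // Fails loudly on unknown names instead of returning null.
    public static Factory fromString(String name) {
        Factory f = REGISTERED.get(name);
        if (f == null) {
            throw new IllegalArgumentException("Unrecognized codec: " + name);
        }
        return f;
    }

    public static void main(String[] args) {
        addCodec("null", new Factory() {
            @Override public String createInstance() { return "pass-through codec"; }
        });
        System.out.println(fromString("null").createInstance());
    }
}
```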
There are two file-related interfaces.
SeekableInput
public interface SeekableInput extends Closeable {
  /** Set the position for the next {@link java.io.InputStream#read(byte[],int,int) read()}. */
  void seek(long p) throws IOException;
  /** Return the position of the next {@link java.io.InputStream#read(byte[],int,int) read()}. */
  long tell() throws IOException;
  /** Return the length of the file. */
  long length() throws IOException;
  /** Equivalent to {@link java.io.InputStream#read(byte[],int,int)}. */
  int read(byte[] b, int off, int len) throws IOException;
}
It declares four methods:
seek, tell, length and read.
One implementation is SeekableFileInput:
public class SeekableFileInput
  extends FileInputStream implements SeekableInput {
  public SeekableFileInput(File file) throws IOException { super(file); }
  public SeekableFileInput(FileDescriptor fd) throws IOException { super(fd); }
  public void seek(long p) throws IOException { getChannel().position(p); }
  public long tell() throws IOException { return getChannel().position(); }
  public long length() throws IOException { return getChannel().size(); }
}
The other implementation is SeekableByteArrayInput:
public class SeekableByteArrayInput extends ByteArrayInputStream implements SeekableInput {
  public SeekableByteArrayInput(byte[] data) {
    super(data);
  }
  public long length() throws IOException {
    return this.count;
  }
  public void seek(long p) throws IOException {
    this.reset();
    this.skip(p);
  }
  public long tell() throws IOException {
    return this.pos;
  }
}
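The seek/tell/length trio over an in-memory array can be tried out with the JDK alone. The sketch below is a hypothetical variant (SeekDemo.SeekableBytes), not the Avro class: instead of reset() followed by skip(p), which relies on the stream's mark still being at offset 0, it assigns the protected pos field directly and clamps it to the valid range.

```java
import java.io.ByteArrayInputStream;

public class SeekDemo {
    // Stand-alone analogue built on the protected pos/count fields
    // of ByteArrayInputStream.
    public static class SeekableBytes extends ByteArrayInputStream {
        public SeekableBytes(byte[] data) { super(data); }

        public long length() { return count; }

        public long tell() { return pos; }

        // Sets the absolute read position, clamped to [0, length].
        public void seek(long p) {
            pos = (int) Math.max(0, Math.min(p, count));
        }
    }

    public static void main(String[] args) {
        SeekableBytes in = new SeekableBytes(new byte[] {10, 20, 30, 40});
        in.seek(2);
        int value = in.read();
        System.out.println(value + " read, now at offset " + in.tell());
    }
}
```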
The other interface is FileReader, which declares getSchema, next, sync, pastSync and tell:
public interface FileReader<D> extends Iterator<D>, Iterable<D>, Closeable {
  /** Return the schema for data in this file. */
  Schema getSchema();
  D next(D reuse) throws IOException;
  void sync(long position) throws IOException;
  boolean pastSync(long position) throws IOException;
  long tell() throws IOException;
}
Its implementing classes include:
DataFileReader
public class DataFileReader<D>
  extends DataFileStream<D> implements FileReader<D> {}
and a second version, DataFileReader12:
/** Read files written by Avro version 1.2. */
public class DataFileReader12<D> implements FileReader<D>, Closeable {}
Several methods in this class are worth noting:
@Override
public synchronized D next(D reuse) throws IOException {
  while (blockCount == 0) {                     // at start of block
    if (in.tell() == in.length())               // at eof
      return null;
    skipSync();                                 // skip a sync
    blockCount = vin.readLong();                // read blockCount
    if (blockCount == FOOTER_BLOCK) {
      seek(vin.readLong()+in.tell());           // skip a footer
    }
  }
  blockCount--;
  return reader.read(reuse, vin);
}
public synchronized void seek(long position) throws IOException {
  in.seek(position);
  blockCount = 0;
  blockStart = position;
  vin = DecoderFactory.get().binaryDecoder(in, vin);
}
/** Move to the next synchronization point after a position. */
@Override
public synchronized void sync(long position) throws IOException {
  if (in.tell()+SYNC_SIZE >= in.length()) {
    seek(in.length());
    return;
  }
  in.seek(position);
  vin.readFixed(syncBuffer);
  for (int i = 0; in.tell() < in.length(); i++) {
    int j = 0;
    for (; j < sync.length; j++) {
      if (sync[j] != syncBuffer[(i+j)%sync.length])
        break;
    }
    if (j == sync.length) {                     // position before sync
      seek(in.tell() - SYNC_SIZE);
      return;
    }
    syncBuffer[i%sync.length] = (byte)in.read();
  }
  seek(in.length());
}
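The inner loop of sync() scans for the sync marker with a circular buffer: the last sync.length bytes are kept in syncBuffer, the comparison starts at slot i % sync.length and wraps around, and each miss overwrites the oldest byte with the next one read. The same idea over a plain byte array can be sketched as follows; SyncScanDemo and findSync are hypothetical names, and the method returns an offset instead of repositioning a stream.

```java
public class SyncScanDemo {
    // Returns the offset of the first occurrence of marker at or after start, or -1.
    public static int findSync(byte[] data, int start, byte[] marker) {
        int n = marker.length;
        if (start < 0 || n == 0 || start + n > data.length) {
            return -1;
        }
        byte[] window = new byte[n];           // circular buffer of the last n bytes
        System.arraycopy(data, start, window, 0, n);
        int next = start + n;                  // index of the next unread byte
        for (int i = 0; ; i++) {
            int j = 0;
            // The logical window begins at slot i % n and wraps around.
            while (j < n && marker[j] == window[(i + j) % n]) {
                j++;
            }
            if (j == n) {
                return next - n;               // window covers [next - n, next)
            }
            if (next == data.length) {
                return -1;                     // input exhausted without a match
            }
            window[i % n] = data[next++];      // overwrite the oldest byte
        }
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 9, 8, 7, 3};
        byte[] marker = {9, 8, 7};
        System.out.println(findSync(data, 0, marker));
    }
}
```

Each input byte is read exactly once and the buffer never grows, which is why this shape suits a stream where backing up is costly.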
and the constructor:
public DataFileReader12(SeekableInput sin, DatumReader<D> reader)
  throws IOException {
  this.in = new DataFileReader.SeekableInputStream(sin);
  byte[] magic = new byte[4];
  in.read(magic);
  if (!Arrays.equals(MAGIC, magic))
    throw new IOException("Not a data file.");
  long length = in.length();
  in.seek(length-4);
  int footerSize=(in.read()<<24)+(in.read()<<16)+(in.read()<<8)+in.read();
  seek(length-footerSize);
  long l = vin.readMapStart();
  if (l > 0) {
    do {
      for (long i = 0; i < l; i++) {
        String key = vin.readString(null).toString();
        ByteBuffer value = vin.readBytes(null);
        byte[] bb = new byte[value.remaining()];
        value.get(bb);
        meta.put(key, bb);
      }
    } while ((l = vin.mapNext()) != 0);
  }
  this.sync = getMeta(SYNC);
  this.count = getMetaLong(COUNT);
  String codec = getMetaString(CODEC);
  if (codec != null && ! codec.equals(NULL_CODEC)) {
    throw new IOException("Unknown codec: " + codec);
  }
  this.schema = Schema.parse(getMetaString(SCHEMA));
  this.reader = reader;
  reader.setSchema(schema);
  seek(MAGIC.length);                         // seek to start
}
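The constructor locates the footer by reading the last four bytes of the file as a big-endian integer and seeking to length - footerSize. That arithmetic can be checked in isolation; FooterSizeDemo and both of its methods are hypothetical helpers operating on a byte array instead of a SeekableInput.

```java
public class FooterSizeDemo {
    // Assembles four bytes, most significant first, into an int.
    // Masking with 0xFF undoes Java's sign extension of byte values.
    public static int readBigEndianInt(byte[] data, int offset) {
        return ((data[offset] & 0xFF) << 24)
             | ((data[offset + 1] & 0xFF) << 16)
             | ((data[offset + 2] & 0xFF) << 8)
             |  (data[offset + 3] & 0xFF);
    }

    // Mirrors seek(length - footerSize): the size is stored in the last 4 bytes.
    public static int footerStart(byte[] file) {
        int footerSize = readBigEndianInt(file, file.length - 4);
        return file.length - footerSize;
    }

    public static void main(String[] args) {
        byte[] file = new byte[10];
        file[9] = 6;                     // last four bytes encode the value 6
        System.out.println(footerStart(file));
    }
}
```

The original code needs no mask because InputStream.read() already returns each byte as an int in the range 0 to 255.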
There is also DataFileStream, which implements Iterator:
public class DataFileStream<D> implements Iterator<D>, Iterable<D>, Closeable {
Its core method is:
@Override
public boolean hasNext() {
  try {
    if (blockRemaining == 0) {
      // check that the previous block was finished
      if (null != datumIn) {
        boolean atEnd = datumIn.isEnd();
        if (!atEnd) {
          throw new IOException("Block read partially, the data may be corrupt");
        }
      }
      if (hasNextBlock()) {
        block = nextRawBlock(block);
        block.decompressUsing(codec);
        blockBuffer = block.getAsByteBuffer();
        datumIn = DecoderFactory.get().binaryDecoder(
          blockBuffer.array(), blockBuffer.arrayOffset() +
          blockBuffer.position(), blockBuffer.remaining(), datumIn);
      }
    }
    return blockRemaining != 0;
  } catch (EOFException e) {                    // at EOF
    return false;
  } catch (IOException e) {
    throw new AvroRuntimeException(e);
  }
}
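The shape of hasNext() is lazy, block-at-a-time iteration: a new block is loaded only once the remaining-item counter of the current block reaches zero. A reduced sketch over in-memory int[] blocks follows; BlockIteratorDemo is a hypothetical class with no decompression or decoding step, and it assumes Java 8 or later so that Iterator.remove() need not be implemented.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class BlockIteratorDemo implements Iterator<Integer> {
    private final Iterator<int[]> blocks;
    private int[] block = new int[0];
    private int blockRemaining = 0;      // items not yet returned from block

    public BlockIteratorDemo(List<int[]> blocks) {
        this.blocks = blocks.iterator();
    }

    @Override
    public boolean hasNext() {
        // Load the next block only when the current one is used up;
        // the loop also skips over empty blocks.
        while (blockRemaining == 0 && blocks.hasNext()) {
            block = blocks.next();
            blockRemaining = block.length;
        }
        return blockRemaining != 0;
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int value = block[block.length - blockRemaining];
        blockRemaining--;
        return value;
    }

    public static void main(String[] args) {
        List<int[]> data = Arrays.asList(new int[] {1, 2}, new int[0], new int[] {3});
        Iterator<Integer> it = new BlockIteratorDemo(data);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```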
And a DataFileWriter:
public class DataFileWriter<D> implements Closeable, Flushable {
Its core methods are:
/** Flush the current state of the file. */
@Override
public void flush() throws IOException {
  sync();
  vout.flush();
}
public void close() throws IOException {
  if (isOpen) {
    flush();
    out.close();
    isOpen = false;
  }
}
As well as the LengthLimitedInputStream class:
class LengthLimitedInputStream extends FilterInputStream {}
More analysis to follow...