【Hadoop】Why Writable Interface


In the earlier posts on Hadoop's RPC system, we saw that the client and server communicate by serializing Writable types. In fact, the Writable interface is Hadoop's built-in serialization mechanism, and every key and value in MapReduce must be a Writable.


At first I did not understand why Hadoop defines its own serialization mechanism rather than reusing Java's built-in one. Hadoop: The Definitive Guide gives the following explanation:

The problem is that Java Serialization doesn't meet the criteria for a serialization format listed earlier: compact, fast, extensible, and interoperable.


As discussed in http://blog.csdn.net/tragicjun/article/details/8897096, a serialization mechanism has two aspects: primitive type serialization and constructed type serialization. The following looks at both aspects to understand why Writable is more compact than Java Serialization.


Primitive type serialization

In Java, primitive types can be serialized through the java.io.DataOutput interface. For example, to serialize an int:

  ByteArrayOutputStream out = new ByteArrayOutputStream();
  DataOutputStream dataOut = new DataOutputStream(out);
  dataOut.writeInt(163);
  dataOut.close();

Here out.toByteArray().length is 4. Now look at how Hadoop's IntWritable encodes itself:

  public void write(DataOutput out) throws IOException {
    out.writeInt(value);
  }

Clearly, it does nothing more than use DataOutput itself. So at the primitive-type level, Writable is no different from Java's own serialization; in fact, the former is implemented on top of the latter.
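The point can be sketched without a Hadoop dependency. The class below is a hypothetical stand-in for IntWritable (the name SimpleIntWritable and the stripped-down shape are my assumptions, not Hadoop's actual code), showing that the encoded form is exactly the four bytes of the int:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for Hadoop's IntWritable: encoding and decoding are
// just DataOutput.writeInt / DataInput.readInt, so the wire format is exactly
// 4 bytes, with no class name, signature, or header.
public class SimpleIntWritable {
    private int value;

    public SimpleIntWritable(int value) { this.value = value; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    public int get() { return value; }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new SimpleIntWritable(163).write(new DataOutputStream(bytes));
        System.out.println(bytes.size()); // 4

        // Decoding is symmetric: readFields restores the value in place.
        SimpleIntWritable decoded = new SimpleIntWritable(0);
        decoded.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(decoded.get()); // 163
    }
}
```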

Constructed type serialization

In Java, constructed types can be serialized through the java.io.ObjectOutput interface. For example, to serialize a CustomObject:

  ByteArrayOutputStream out = new ByteArrayOutputStream();
  ObjectOutputStream objectOut = new ObjectOutputStream(out);
  objectOut.writeObject(customObject);
  objectOut.close();
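To make the size difference concrete before looking at why it arises, here is a small JDK-only comparison. The class name and helper methods are illustrative; only the relative sizes matter:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSizeDemo {
    // Size of one object under Java Serialization: stream header,
    // class descriptor, and block-data records on top of the field values.
    static int javaSerializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream objectOut = new ObjectOutputStream(bytes);
        objectOut.writeObject(obj);
        objectOut.close();
        return bytes.size();
    }

    // Size of the same value written directly through DataOutput,
    // which is all a Writable emits.
    static int dataOutputSize(int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(bytes);
        dataOut.writeInt(value);
        dataOut.close();
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        // The ObjectOutputStream encoding of a boxed Integer is many times
        // larger than the 4 raw bytes DataOutput writes.
        System.out.println("ObjectOutputStream: "
                + javaSerializedSize(Integer.valueOf(163)) + " bytes");
        System.out.println("DataOutputStream:   "
                + dataOutputSize(163) + " bytes");
    }
}
```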

However, the Writable interface only involves DataOutput, never ObjectOutput:

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}
This is exactly where the difference lies. Consider the javadoc description of ObjectOutputStream:

The class of each serializable object is encoded including the class name and signature of the class, the values of the object's fields and arrays, and the closure of any other objects referenced from the initial objects.

Primitive data, excluding serializable fields and externalizable data, is written to the ObjectOutputStream in block-data records. A block data record is composed of a header and data. The block data header consists of a marker and the number of bytes to follow the header. Consecutive primitive data writes are merged into one block-data record.


The extra overhead of serializing an object with ObjectOutput can thus be understood in two parts: first, the stream includes the class name and class signature; second, the primitive data inside the object is wrapped in block-data records, each adding a marker and a byte-length header.


By contrast, when a Writable serializes an object, it writes the contained primitive data directly through DataOutput, with no extra bookkeeping. What Writable sacrifices is generality: ObjectOutput and ObjectInput are generic interfaces that can encode and decode any Java object implementing Serializable (a marker interface requiring no methods), whereas a Writable must encode and decode itself, so every Writable class has to override write and readFields. For Hadoop, the serialized size of Writables affects the performance of the whole system, so this sacrifice is well worth making.
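A sketch of this trade-off, again JDK-only: the two Point classes below are hypothetical (WritablePoint mimics the Writable style without the Hadoop interface), and the contrast is that the Writable-style encoding of two ints is exactly 8 bytes, while Java Serialization of an equivalent object carries the class descriptor and headers on top:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class WritableVsSerializable {
    // Writable style: the class encodes its own two ints -> exactly 8 bytes.
    static class WritablePoint {
        int x, y;
        WritablePoint(int x, int y) { this.x = x; this.y = y; }
        void write(DataOutput out) throws IOException {
            out.writeInt(x);
            out.writeInt(y);
        }
        void readFields(DataInput in) throws IOException {
            x = in.readInt();
            y = in.readInt();
        }
    }

    // Serializable style: nothing to implement, but the stream carries the
    // class name, signature, and block-data headers on top of the two ints.
    static class SerializablePoint implements Serializable {
        int x, y;
        SerializablePoint(int x, int y) { this.x = x; this.y = y; }
    }

    static int writableSize(WritablePoint p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));
        return bytes.size();
    }

    static int serializableSize(SerializablePoint p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(p);
        out.close();
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Writable:     "
                + writableSize(new WritablePoint(1, 2)) + " bytes");
        System.out.println("Serializable: "
                + serializableSize(new SerializablePoint(1, 2)) + " bytes");
    }
}
```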


Composite Pattern

One last question: if Writable's primitive-type serialization is essentially implemented with Java serialization anyway, why wrap int and long in IntWritable and LongWritable rather than using them directly? The answer is the composite design pattern, whose benefit is to "let clients treat individual objects and compositions of objects uniformly". Whether the data is a primitive type or a constructed type, it is encoded and decoded through the same write()/readFields() methods.
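The uniformity can be sketched with JDK types only. The interface below mirrors Hadoop's Writable, but IntBox and PairBox are hypothetical names of mine: the leaf wraps a primitive, the composite delegates to its children through the very same interface, and callers serialize either one identically:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class CompositeWritableDemo {
    // The uniform interface: leaves and compositions both implement it.
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Leaf: wraps a primitive int, like IntWritable does.
    static class IntBox implements Writable {
        int value;
        IntBox(int value) { this.value = value; }
        public void write(DataOutput out) throws IOException {
            out.writeInt(value);
        }
        public void readFields(DataInput in) throws IOException {
            value = in.readInt();
        }
    }

    // Composite: a constructed type that delegates encoding and decoding
    // to its children through the same Writable interface.
    static class PairBox implements Writable {
        final Writable first, second;
        PairBox(Writable first, Writable second) {
            this.first = first;
            this.second = second;
        }
        public void write(DataOutput out) throws IOException {
            first.write(out);
            second.write(out);
        }
        public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second.readFields(in);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // The caller sees only Writable, whether leaf or composition.
        Writable pair = new PairBox(new IntBox(1), new IntBox(2));
        pair.write(new DataOutputStream(bytes));
        System.out.println(bytes.size()); // 8: just the two ints, no wrapper overhead
    }
}
```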


