A Brief Look at protobuf (to be continued)

Source: Internet  Editor: 程序博客网  Time: 2024/05/21 20:23

First, let's look at what protobuf is and what it is used for.
The official documentation already states this clearly:

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the “old” format.

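In short: protobuf plays the same role as XML for serializing structured data, but it is smaller, faster, and easier to use. You describe the structure of your data once, and the generated source code, available for several languages, reads and writes that data from and to various data streams. You can even change the data structure without breaking deployed programs that were compiled against the old format.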

Basic protobuf data types and field modifiers

Data types supported by protobuf

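Type      Notes
double
float
int32     Uses variable-length encoding. Inefficient for negative numbers; if the field may hold negative values, use sint32 instead.
int64     Uses variable-length encoding. Inefficient for negative numbers; if the field may hold negative values, use sint64 instead.
uint32    Uses variable-length encoding.
uint64    Uses variable-length encoding.
sint32    Uses variable-length encoding. Encodes negative numbers more efficiently than int32.
sint64    Uses variable-length encoding. Encodes negative numbers more efficiently than int64.
fixed32   Always 4 bytes. More efficient than uint32 when the field's value is greater than 2²⁸.
fixed64   Always 8 bytes. More efficient than uint64 when the field's value is greater than 2⁵⁶.
sfixed32  Always 4 bytes.
sfixed64  Always 8 bytes.
bool
string    The value must be UTF-8 encoded or 7-bit ASCII text.
bytes     May contain any arbitrary sequence of bytes.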
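The size remarks in the table follow from how varints and ZigZag work. Here is a minimal sketch for checking them by hand. The function names are my own, not part of the protobuf API, and the sizes cover only the value, not the field tag that precedes it on the wire.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed to store a value as a base-128 varint (7 payload bits per byte).
std::size_t VarintSize(std::uint64_t value) {
    std::size_t bytes = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++bytes;
    }
    return bytes;
}

// int32 fields: a negative value is sign-extended to 64 bits before encoding,
// so every negative number costs the maximum varint length.
std::size_t Int32ValueSize(std::int32_t v) {
    return VarintSize(static_cast<std::uint64_t>(static_cast<std::int64_t>(v)));
}

// ZigZag mapping used by sint32: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
std::uint32_t ZigZagEncode32(std::int32_t v) {
    const std::uint32_t sign_mask = (v < 0) ? 0xFFFFFFFFu : 0u;
    return (static_cast<std::uint32_t>(v) << 1) ^ sign_mask;
}

// sint32 fields: the ZigZag result is what gets varint-encoded.
std::size_t SInt32ValueSize(std::int32_t v) {
    return VarintSize(ZigZagEncode32(v));
}
```

With these helpers, -1 takes 10 bytes as an int32 but 1 byte as a sint32. A uint32 value of 2²⁸ or more needs 5 varint bytes, one more than the 4 bytes fixed32 always uses.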

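Modifiers
- required: a well-formed message must have exactly one of this field.
- optional: a well-formed message may have zero or one of this field, but not more than one.
- repeated: the field may be repeated any number of times (including zero), and the order of the repeated values is preserved.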

protobuf API analysis

First, generate the .pb.h and .pb.cc files from the .proto file used in the official tutorial.

.proto

syntax = "proto2";

package tutorial;

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phones = 4;
}

message AddressBook {
  repeated Person people = 1;
}

CopyFrom

The generated Person class provides a number of APIs. Let's start with the CopyFrom interface.
It is defined as follows:

void Person::CopyFrom(const ::google::protobuf::Message& from) {
    if (&from == this) return;
    Clear();
    MergeFrom(from);
}

void Person::CopyFrom(const Person& from) {
    if (&from == this) return;
    Clear();
    MergeFrom(from);
}

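As the code shows, when from and this refer to the same object, CopyFrom returns immediately. Otherwise it first calls Clear to reset every field, then calls MergeFrom to merge the values of from into this. That is all there is to CopyFrom.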
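To make the difference between copying and merging concrete, here is a toy sketch of the same Clear-then-MergeFrom pattern. MiniPerson and its members are hypothetical stand-ins I made up for illustration. They are not the generated class, and the real generated code stores its fields differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a generated message; not protobuf's real class.
struct MiniPerson {
    std::string name;
    bool has_name = false;
    std::vector<std::string> phones;  // plays the role of a repeated field

    // Reset every field to its "unset" state.
    void Clear() {
        name.clear();
        has_name = false;
        phones.clear();
    }

    // Merge: repeated values are appended, set singular fields overwrite.
    void MergeFrom(const MiniPerson& from) {
        assert(&from != this);  // merging into itself is not supported here
        phones.insert(phones.end(), from.phones.begin(), from.phones.end());
        if (from.has_name) {
            name = from.name;
            has_name = true;
        }
    }

    // Copy: same shape as the generated CopyFrom quoted above.
    void CopyFrom(const MiniPerson& from) {
        if (&from == this) return;  // otherwise Clear() would wipe the source
        Clear();
        MergeFrom(from);
    }
};
```

In this sketch, merging a person with one phone into another person with one phone leaves two phones, while CopyFrom leaves exactly the source's single phone. The self-check in CopyFrom matters because Clear would otherwise erase the very data that is about to be copied.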

MergeFrom (Part 1)

Let's set Clear aside for now and look at how MergeFrom is implemented.

void Person::MergeFrom(const Person& from) {
    ...
    phones_.MergeFrom(from.phones_);
    cached_has_bits = from._has_bits_[0];
    if (cached_has_bits & 7u) {
        if (cached_has_bits & 0x00000001u) {
            set_has_name();
            name_.AssignWithDefault(&::google::protobuf::internal::GetEmptyStringAlreadyInited(), from.name_);
        }
        if (cached_has_bits & 0x00000002u) {
            set_has_email();
            email_.AssignWithDefault(&::google::protobuf::internal::GetEmptyStringAlreadyInited(), from.email_);
        }
        if (cached_has_bits & 0x00000004u) {
            id_ = from.id_;
        }
        _has_bits_[0] |= cached_has_bits;
    }
}

The first thing we see is the merge of the phones field.

phones_.MergeFrom(from.phones_);

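phones is a repeated field of PhoneNumber messages, held in the phones_ container, so the merge is simply delegated to that container's own MergeFrom.

Everything after that is bitwise tests and updates on cached_has_bits, which is a copy of from._has_bits_[0]. So what is from._has_bits_?

The header file shows: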

class Person : public ::google::protobuf::Message {
    ...
private:
    ::google::protobuf::internal::HasBits<1> _has_bits_;
};

So _has_bits_ is a HasBits<1> object. What is HasBits, then? The protobuf source defines it as follows:

template<size_t doublewords>
class HasBits {
public:
    ...
    ::google::protobuf::uint32& operator[](int index) GOOGLE_PROTOBUF_ATTRIBUTE_ALWAYS_INLINE {
        return has_bits_[index];
    }
    const ::google::protobuf::uint32& operator[](int index) const GOOGLE_PROTOBUF_ATTRIBUTE_ALWAYS_INLINE {
        return has_bits_[index];
    }
    bool operator==(const HasBits<doublewords>& rhs) const {
        return memcmp(has_bits_, rhs.has_bits_, sizeof(has_bits_)) == 0;
    }
    bool operator!=(const HasBits<doublewords>& rhs) const {
        return !(*this == rhs);
    }
    bool empty() const;
private:
    ::google::protobuf::uint32 has_bits_[doublewords];
};

The definition shows that HasBits is just a thin wrapper around an array of uint32. For now, _has_bits_ can simply be treated as a uint32 array of length 1.
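If you want to experiment with this wrapper without pulling in the protobuf headers, a standalone approximation is easy to write. The sketch below is my own simplification: the name ToyHasBits, the zeroing constructor, and the body of empty() are assumptions, since the corresponding parts are elided in the excerpt quoted here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Simplified, standalone approximation of a has-bits wrapper (illustrative only).
template <std::size_t kWords>
class ToyHasBits {
public:
    // Start with every bit cleared.
    ToyHasBits() { std::memset(words_, 0, sizeof(words_)); }

    // Direct access to one 32-bit word of flags.
    std::uint32_t& operator[](int index) { return words_[index]; }
    const std::uint32_t& operator[](int index) const { return words_[index]; }

    bool operator==(const ToyHasBits& rhs) const {
        return std::memcmp(words_, rhs.words_, sizeof(words_)) == 0;
    }
    bool operator!=(const ToyHasBits& rhs) const { return !(*this == rhs); }

    // True when no bit is set in any word.
    bool empty() const {
        for (std::size_t i = 0; i < kWords; ++i) {
            if (words_[i] != 0) return false;
        }
        return true;
    }

private:
    std::uint32_t words_[kWords];
};
```

Since the only data member is the array, ToyHasBits<1> occupies exactly the space of one uint32, which matches the idea that the wrapper adds convenience and no storage.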

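Recap

After jumping between so many scattered definitions you may be feeling a little dizzy, so let's pause for a recap and a prediction.

To summarize so far: we generated the header and source files from the .proto file. Reading those two files, we found that CopyFrom is implemented by calling Clear and then MergeFrom. MergeFrom in turn contains many operations on _has_bits_, so to understand MergeFrom any further we first have to understand what _has_bits_ is for.

So what is _has_bits_ for? My prediction is that _has_bits_, being a uint32 array, marks whether each field has been set, with each field using 1 bit as its flag. In other words, it is a bitmap.

Of course, this is only my guess. Let's verify it together.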

MergeFrom (Part 2)

Back to MergeFrom:

if (cached_has_bits & 7u) { ... }

The binary representation of 7 is 0111, so under the hypothesis above this line checks whether any one of three particular fields of from has been set.

if (cached_has_bits & 0x00000001u) {
    set_has_name();
    ...
}

This tests whether the lowest bit is 1 and, if so, calls set_has_name(). Here is the definition of set_has_name:

inline void Person::set_has_name() {
    _has_bits_[0] |= 0x00000001u;
}

The guess looks likely to be right. A message also exposes has_ accessors for its fields, so let's look at the definition of has_name:

inline bool Person::has_name() const {
    return (_has_bits_[0] & 0x00000001u) != 0;
}

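That settles it: protobuf automatically assigns each field one bit that records whether the field has been set. The bit lives in the _has_bits_ array, and you can find out which bit belongs to which field by reading the generated header file.

It is now clear that MergeFrom first tests a run of consecutive bits in a single operation to find out whether any of the corresponding fields is set. Only if one is does it test those bits one by one, and only the fields that are actually set get merged into this.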
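The whole mechanism fits in a small standalone sketch. BitmapPerson is a hypothetical miniature written for this post, not the generated class. Its bit assignment (name = bit 0, email = bit 1, id = bit 2) follows the masks seen in the generated MergeFrom, and it uses a plain uint32 and std::string in place of the internal protobuf types.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical miniature of a message that tracks field presence in a bitmap.
class BitmapPerson {
public:
    bool has_name() const { return (has_bits_ & kNameBit) != 0; }
    bool has_email() const { return (has_bits_ & kEmailBit) != 0; }
    bool has_id() const { return (has_bits_ & kIdBit) != 0; }

    // Each setter stores the value and raises the field's presence bit.
    void set_name(const std::string& v) { name_ = v; has_bits_ |= kNameBit; }
    void set_email(const std::string& v) { email_ = v; has_bits_ |= kEmailBit; }
    void set_id(std::int32_t v) { id_ = v; has_bits_ |= kIdBit; }

    const std::string& name() const { return name_; }
    const std::string& email() const { return email_; }
    std::int32_t id() const { return id_; }

    void MergeFrom(const BitmapPerson& from) {
        assert(&from != this);
        const std::uint32_t cached_has_bits = from.has_bits_;
        // One test covers all three fields; skip the per-field work if none is set.
        if (cached_has_bits & 7u) {
            if (cached_has_bits & kNameBit) name_ = from.name_;
            if (cached_has_bits & kEmailBit) email_ = from.email_;
            if (cached_has_bits & kIdBit) id_ = from.id_;
            // Every field that was set in from is now set in this as well.
            has_bits_ |= cached_has_bits;
        }
    }

private:
    static constexpr std::uint32_t kNameBit = 0x00000001u;
    static constexpr std::uint32_t kEmailBit = 0x00000002u;
    static constexpr std::uint32_t kIdBit = 0x00000004u;

    std::uint32_t has_bits_ = 0;
    std::string name_;
    std::string email_;
    std::int32_t id_ = 0;
};
```

Merging a source that has only name and id set into a target that has email and id set overwrites id, adds name, and leaves email untouched, because the source's email bit is 0. Merging a completely unset source changes nothing, since the grouped test fails on the first check.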

(To be continued)
