Apache Kafka Log Storage Format





Log Format




A log for a topic named "my_topic" with two partitions consists of two directories (namely my_topic_0 and my_topic_1) populated with data files containing the messages for that topic.


The format of the log files is a sequence of "log entries": each log entry is a 4-byte integer N storing the message length, followed by the N message bytes.


Each message is uniquely identified by a 64-bit integer offset giving the byte position of the start of this message in the stream of all messages ever sent to that topic on that partition.


The on-disk format of each message is given below.
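The layout referenced here (the legacy, pre-key message format from the era of the Kafka documentation this post quotes; see the link at the end) is roughly:

    message length : 4 bytes (value: 1 + 4 + n)
    "magic" value  : 1 byte
    crc            : 4 bytes
    payload        : n bytes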


Each log file is named with the offset of the first message it contains.

So the first file created will be 00000000000.kafka, and each additional file will have an integer name roughly S bytes from the previous file, where S is the max log file size given in the configuration.
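As a minimal sketch of the naming scheme (segmentFileName is a hypothetical helper, not Kafka's actual code; later Kafka versions use 20-digit names with a .log extension):

    // Hypothetical helper: a segment file is named after the offset of its
    // first message, zero-padded as in "00000000000.kafka".
    static String segmentFileName(long baseOffset) {
        return String.format("%011d.kafka", baseOffset);
    }
    // segmentFileName(0L)      -> "00000000000.kafka"
    // segmentFileName(368769L) -> "00000368769.kafka"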
  
  
  
The use of the message offset as the message id is unusual.
  
Our original idea was to use a GUID generated by the producer, and maintain a mapping from GUID to offset on each broker.

But since a consumer must maintain an ID for each server, the global uniqueness of the GUID provides no value.
   
   
   
Furthermore, the complexity of maintaining the mapping from a random id to an offset requires a heavyweight index structure which must be synchronized with disk, essentially requiring a full persistent random-access data structure.
    
    
Thus, to simplify the lookup structure, we decided to use a simple per-partition atomic counter which could be coupled with the partition id and node id to uniquely identify a message; this makes the lookup structure simpler, though multiple seeks per consumer request are still likely.
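A minimal sketch of the per-partition counter idea (illustrative only, not Kafka's actual implementation; PartitionLog and append are invented names):

    import java.util.concurrent.atomic.AtomicLong;

    // Each partition assigns offsets from its own monotonically increasing
    // counter, so (node id, partition id, offset) identifies a message
    // uniquely with no GUID-to-offset index at all.
    class PartitionLog {
        private final AtomicLong nextOffset = new AtomicLong(0);

        long append(byte[] message) {
            long offset = nextOffset.getAndIncrement();
            // ... append the 4-byte length N and the N message bytes ...
            return offset;
        }
    }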
    
However, once we settled on a counter, the jump to directly using the offset seemed natural; both, after all, are monotonically increasing integers unique to a partition.

Since the offset is hidden from the consumer API, this decision is ultimately an implementation detail, and we went with the more efficient approach.
    
    
    Writes
    
The log allows serial appends which always go to the last file.

This file is rolled over to a fresh file when it reaches a configurable size (say 1GB).

The log takes two configuration parameters: M, which gives the number of messages to write before forcing the OS to flush the file to disk, and S, which gives a number of seconds after which a flush is forced.

This gives a durability guarantee of losing at most M messages or S seconds of data in the event of a system crash.
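A minimal sketch of the M/S flush rule (the names follow the text above; these are not Kafka's actual configuration keys or classes):

    // Flush after M unflushed messages or after S seconds, whichever comes
    // first; at most M messages or S seconds of data are lost on a crash.
    class FlushPolicy {
        private final int m;          // M: messages between forced flushes
        private final long sMillis;   // S: seconds between forced flushes, in ms
        private int unflushed = 0;
        private long lastFlush = System.currentTimeMillis();

        FlushPolicy(int m, long sSeconds) {
            this.m = m;
            this.sMillis = sSeconds * 1000;
        }

        void onAppend() { unflushed++; }

        boolean shouldFlush() {
            return unflushed >= m
                || System.currentTimeMillis() - lastFlush >= sMillis;
        }

        void onFlush() {
            unflushed = 0;
            lastFlush = System.currentTimeMillis();
        }
    }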
   
   
   Reads
   
Reads are done by giving the 64-bit logical offset of a message and an S-byte max chunk size.

This will return an iterator over the messages contained in the S-byte buffer.

S is intended to be larger than any single message, but in the event of an abnormally large message, the read can be retried multiple times, each time doubling the buffer size, until the message is read successfully.
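A minimal sketch of the retry-with-doubling read (readChunk is a hypothetical stand-in for the actual disk read):

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    // Read one message at `offset`: if the 4-byte length prefix says the
    // message does not fit in the current chunk, double the chunk and retry.
    static byte[] readOneMessage(long offset, int initialChunkSize) {
        int size = initialChunkSize;
        while (true) {
            byte[] chunk = readChunk(offset, size);  // hypothetical: `size` bytes at `offset`
            int n = ByteBuffer.wrap(chunk).getInt(); // message length from the prefix
            if (4 + n <= chunk.length) {
                return Arrays.copyOfRange(chunk, 4, 4 + n);
            }
            size *= 2;                               // abnormally large message: retry
        }
    }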
    
    
    A maximum message and buffer size can be specified to make the server reject messages larger than some size, and to give a bound to the client on the maximum it needs to ever read to get a complete message.
    
It is likely that the read buffer ends with a partial message; this is easily detected by the size delimiting.
   
   
   
The actual process of reading from an offset requires first locating the log segment file in which the data is stored, calculating the file-specific offset from the global offset value, and then reading from that file offset.

The search is done as a simple binary search variation against an in-memory range maintained for each file.
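A minimal sketch of that lookup (illustrative; the sorted base offsets of the segment files are the in-memory range, and the answer is the last segment whose base offset is at most the requested offset):

    // Binary-search variant: returns the index of the segment containing
    // `target`, i.e. the largest baseOffsets[i] <= target. The position to
    // read within that segment is then derived from (target - baseOffsets[i]).
    static int findSegment(long[] baseOffsets, long target) {
        int lo = 0, hi = baseOffsets.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;  // bias up so `lo` converges on the answer
            if (baseOffsets[mid] <= target) lo = mid;
            else hi = mid - 1;
        }
        return lo;
    }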
   
The log provides the capability of getting the most recently written message to allow clients to start subscribing as of "right now".
   
   This is also useful in the case the consumer fails to consume its data within its SLA-specified number of days. 
   
In this case, when the client attempts to consume a non-existent offset, it is given an OutOfRangeException and can either reset itself or fail as appropriate to the use case.
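A minimal consumer-side sketch of that choice (fetch, earliestAvailableOffset, and the exception name here follow the quoted docs and are hypothetical, not a real client API):

    try {
        messages = fetch(partition, offset);          // hypothetical fetch call
    } catch (OutOfRangeException e) {
        offset = earliestAvailableOffset(partition);  // reset to a valid offset...
        // ...or fail, as appropriate to the use case
    }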
    
    
    
    
    http://kafka.apache.org/documentation/#log
   
    
     
    
    
       
    
