Kafka Introduction: Study Notes on the Official Documentation

Introduction

Apache Kafka™ is a distributed streaming platform. What exactly does that mean?

We think of a streaming platform as having three key capabilities:

  1. It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
  2. It lets you store streams of records in a fault-tolerant way.
  3. It lets you process streams of records as they occur.

What is Kafka good for?

It gets used for two broad classes of application:

  1. Building real-time streaming data pipelines that reliably get data between systems or applications
  2. Building real-time streaming applications that transform or react to the streams of data

To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.

First a few concepts:

  • Kafka is run as a cluster on one or more servers.
  • The Kafka cluster stores streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp.
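
To make the record structure concrete, here is a minimal sketch using the Java client (it assumes the kafka-clients library on the classpath; the topic name "page-views" and the key/value shown are made up for illustration):

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordAnatomy {
    public static void main(String[] args) {
        // A record carries a key, a value, and a timestamp. The partition is
        // left null here so the producer will choose one.
        ProducerRecord<String, String> record = new ProducerRecord<>(
                "page-views",               // topic
                null,                       // partition (null = let the producer decide)
                System.currentTimeMillis(), // timestamp
                "user-42",                  // key
                "clicked /home");           // value
        System.out.printf("key=%s value=%s timestamp=%d%n",
                record.key(), record.value(), record.timestamp());
    }
}
```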

Kafka has four core APIs:

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
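
As a rough illustration of the first two APIs, here is a minimal publish-and-consume round trip with the Java client. The broker address localhost:9092, the topic demo-topic, and the group id demo-group are assumptions for this sketch, not details from the text:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConsumerSketch {
    public static void main(String[] args) {
        // Producer API: publish a stream of records to a topic.
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", StringSerializer.class.getName());
        p.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello"));
        }

        // Consumer API: subscribe to the topic and process its records.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group");
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", StringDeserializer.class.getName());
        c.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("demo-topic"));
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        r.partition(), r.offset(), r.key(), r.value());
            }
        }
    }
}
```
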

In Kafka the communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.

Topics and Logs

Let's first dive into the core abstraction Kafka provides for a stream of records—the topic.

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
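
Topics are usually created explicitly with Kafka's admin tooling. As a hedged sketch, the Java Admin client can create a multi-partition topic like this (the broker address and the topic name "events" are assumptions):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // "events": 3 partitions, replication factor 1 (a single-broker dev setup).
            admin.createTopics(List.of(new NewTopic("events", 3, (short) 1)))
                 .all().get(); // block until the creation completes
        }
    }
}
```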

For each topic, the Kafka cluster maintains a partitioned log that looks like this:

(Figure omitted.) Note: in the figure, the numbers 0, 1, 2, ..., 10, 11, 12 running along each partition (for example Partition 0) are the offsets of the messages within that partition.

Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.
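
Retention is a per-topic configuration (retention.ms). As a sketch, the two-day policy from the example could be applied with the Admin client's incrementalAlterConfigs, which assumes a reasonably recent client and broker; the topic name "events" is made up:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            // 2 days = 172,800,000 ms; records older than this become eligible for deletion.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "172800000"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> update =
                    Map.of(topic, List.of(setRetention));
            admin.incrementalAlterConfigs(update).all().get();
        }
    }
}
```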

In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
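
Since the position belongs to the consumer, replaying or skipping ahead is just a seek. A minimal sketch with the Java consumer; the topic "events" and partition 0 are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition p0 = new TopicPartition("events", 0);
            consumer.assign(List.of(p0));          // manual assignment, no group needed
            consumer.seekToBeginning(List.of(p0)); // reprocess from the oldest retained record
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
            consumer.seekToEnd(List.of(p0));       // or skip ahead and consume from "now"
        }
    }
}
```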

This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
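
One way to see the leader/follower layout is to describe the topic with the Admin client. A sketch (topic name "events" assumed; in newer clients all() is deprecated in favor of allTopicNames(), but both work):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("events"))
                    .all().get().get("events");
            // Each partition reports its current leader, its replicas, and the
            // in-sync replica set (ISR).
            for (TopicPartitionInfo p : desc.partitions()) {
                System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```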

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
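
A sketch of both strategies with the Java producer: records sharing a key are hashed to the same partition (preserving per-key order), while records with a null key are spread across partitions (round-robin, or sticky batching in newer clients). The topic and keys are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitioningSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition, so these two stay in order relative to each other.
            RecordMetadata m1 = producer.send(
                    new ProducerRecord<>("events", "user-42", "login")).get();
            RecordMetadata m2 = producer.send(
                    new ProducerRecord<>("events", "user-42", "logout")).get();
            System.out.printf("same key went to partitions %d and %d%n",
                    m1.partition(), m2.partition());
            // Null key -> the producer balances records across partitions.
            producer.send(new ProducerRecord<>("events", null, "anonymous event"));
        }
    }
}
```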

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.

[Figure] A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.


More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
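
In the Java client a "logical subscriber" is expressed entirely through group.id. A sketch, where the topic "events" and the group names "billing" and "audit" are hypothetical: the two billing consumers split the partitions between them, while the audit consumer independently sees every record.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupSketch {
    // Consumers sharing a group.id divide the topic's partitions among themselves
    // (queue semantics); distinct group.ids each get all records (pub-sub semantics).
    static KafkaConsumer<String, String> consumerInGroup(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("events"));
        return consumer;
    }

    public static void main(String[] args) {
        KafkaConsumer<String, String> billing1 = consumerInGroup("billing");
        KafkaConsumer<String, String> billing2 = consumerInGroup("billing");
        KafkaConsumer<String, String> audit = consumerInGroup("audit");
        // ... poll() each consumer on its own thread (KafkaConsumer is not thread-safe) ...
        billing1.close();
        billing2.close();
        audit.close();
    }
}
```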

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
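
The Java client surfaces this dynamic assignment through a ConsumerRebalanceListener passed to subscribe(). A sketch of watching partitions move as instances join and leave (topic and group names are assumptions):

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "billing");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                    System.out.println("handing back: " + parts); // another member takes over
                }
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                    System.out.println("now responsible for: " + parts);
                }
            });
            consumer.poll(Duration.ofSeconds(5)); // rebalancing happens inside poll()
        }
    }
}
```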

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

Guarantees

At a high level, Kafka gives the following guarantees:

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  • A consumer instance sees records in the order they are stored in the log.
  • For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

More details on these guarantees are given in the design section of the documentation.

Kafka as a Messaging System

How does Kafka's notion of streams compare to a traditional enterprise messaging system?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber: once one process reads the data it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.

The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

The advantage of Kafka's model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.

Kafka has stronger ordering guarantees than a traditional messaging system, too.

A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.

Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.

Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
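
With the Java producer, "waiting on acknowledgement" corresponds roughly to setting acks=all and blocking on the returned future; a sketch (topic name assumed):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableWriteSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all"); // broker replies only after all in-sync replicas have the write
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Blocking on the future means the write is not treated as complete
            // until the broker has acknowledged the fully replicated record.
            RecordMetadata meta = producer.send(
                    new ProducerRecord<>("events", "key", "durable value")).get();
            System.out.printf("acknowledged at partition=%d offset=%d%n",
                    meta.partition(), meta.offset());
        }
    }
}
```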

The disk structures Kafka uses scale well—Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.

As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.

Kafka for Stream Processing

It isn't enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.

In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.

It is possible to do simple processing directly using the producer and consumer APIs. However for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.


The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
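
A minimal Streams topology showing the input-topic-to-output-topic shape described above. It assumes the kafka-streams library; the application id and topic names are made up:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Consume "events", transform each value, produce to "events-uppercased".
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("events");
        input.mapValues(v -> v.toUpperCase()).to("events-uppercased");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```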

Putting the Pieces Together

This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka's role as a streaming platform.

A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.

A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.

Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.

By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is, a single application can process historical, stored data, and rather than ending when it reaches the last record, it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.

Likewise for streaming data pipelines, the combination of subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably makes it possible to use it for critical data where the delivery of data must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.

For more information on the guarantees, APIs, and capabilities Kafka provides, see the rest of the documentation.
