kafka

Source: Internet; Posted: linux配置ant; Editor: 程序博客网; Time: 2024/05/18 01:14

Kafka follows a traditional design shared by most messaging systems: producers push messages to the broker, and consumers pull messages from the broker.

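A drawback of pull is that if the broker has no messages available, the consumer ends up polling in a loop until new messages arrive. To avoid this, Kafka has a parameter that lets the consumer block until new messages arrive (it can also block until a given amount of data has accumulated, so that messages can be transferred in batches).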
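The difference between busy polling and a blocking poll can be sketched with a plain in-memory queue. This is only an illustration, not the Kafka client: the class BlockingPollSketch and its methods are hypothetical names, and a java.util.concurrent.BlockingQueue stands in for the broker.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a broker partition; not a Kafka class.
final class BlockingPollSketch {
    private final BlockingQueue<String> partition = new LinkedBlockingQueue<>();

    void publish(String message) {
        partition.add(message);
    }

    // Wait up to maxWaitMs for a message instead of spinning.
    // Returns null when the wait timed out (or the thread was interrupted).
    String poll(long maxWaitMs) {
        try {
            return partition.poll(maxWaitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        BlockingPollSketch broker = new BlockingPollSketch();
        broker.publish("m0");
        System.out.println(broker.poll(100)); // m0
        System.out.println(broker.poll(100)); // null, after waiting about 100 ms
    }
}
```

The consumer thread sleeps inside poll while the queue is empty, so no CPU is spent on repeated empty requests.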

Kafka stores its log organized by topic; the partitions of a topic are kept as separate partition logs under the same log directory.

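A topic is divided into a number of partitions, and at any given time each partition is consumed by only one consumer. This means the consumer's position in a partition's log is just a single integer: the offset. Tracking consumption state is therefore very cheap, since it needs only one integer per partition.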

This brings another benefit: a consumer can reset the offset to an older value and re-consume old messages.
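The offset idea can be sketched as follows. This is a minimal in-memory model, not Kafka code: PartitionLogSketch and its methods are hypothetical, and a List stands in for the partition's log file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for one partition's log; not a Kafka class.
final class PartitionLogSketch {
    private final List<String> log = new ArrayList<>();
    private int offset = 0; // the consumer's entire consumption state

    void append(String message) {
        log.add(message);
    }

    // Return the next unread message and advance the offset, or null if caught up.
    String poll() {
        return offset < log.size() ? log.get(offset++) : null;
    }

    // Rewind (or skip ahead) by overwriting the single integer.
    void seek(int newOffset) {
        if (newOffset < 0 || newOffset > log.size()) {
            throw new IllegalArgumentException("offset out of range: " + newOffset);
        }
        offset = newOffset;
    }

    int position() {
        return offset;
    }

    public static void main(String[] args) {
        PartitionLogSketch p = new PartitionLogSketch();
        p.append("m0");
        p.append("m1");
        p.append("m2");
        System.out.println(p.poll()); // m0
        System.out.println(p.poll()); // m1
        p.seek(0);                    // rewind to re-consume old messages
        System.out.println(p.poll()); // m0
    }
}
```

Because the log itself is never modified by reading, rewinding costs nothing more than changing the integer.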

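Typically there are far more partitions than brokers, and the partition leaders are evenly distributed among the brokers. Kafka tries to spread all partitions evenly across all nodes in the cluster rather than concentrating them on a few nodes. It also tries to balance the leader/follower roles, so that each node serves as leader for a proportional share of the partitions.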

Fundamentally, Kafka supports only topics. Each consumer belongs to one consumer group; conversely, each group can contain multiple consumers.

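A message sent to a topic is consumed by exactly one consumer in each group that subscribes to the topic.
If all consumers belong to the same group, this behaves much like a queue: messages are load-balanced across the consumers.
If all consumers belong to different groups, this is publish-subscribe: each message is broadcast to all consumers.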
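The two modes can be illustrated with a small model of group delivery. This is a sketch under simplifying assumptions, not Kafka's coordinator: GroupDeliverySketch is a hypothetical class, and picking the consumer by partition % group size is an assumed stand-in for Kafka's real partition assignment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of consumer groups; not a Kafka class.
final class GroupDeliverySketch {
    // group name -> consumers in that group, in join order
    private final Map<String, List<String>> groups = new LinkedHashMap<>();

    void join(String group, String consumer) {
        groups.computeIfAbsent(group, g -> new ArrayList<>()).add(consumer);
    }

    // Every group gets the message once; inside a group, exactly one
    // consumer is picked for the given (non-negative) partition.
    List<String> receivers(int partition) {
        List<String> out = new ArrayList<>();
        for (List<String> members : groups.values()) {
            out.add(members.get(partition % members.size()));
        }
        return out;
    }

    public static void main(String[] args) {
        GroupDeliverySketch queue = new GroupDeliverySketch();
        queue.join("g", "c1");
        queue.join("g", "c2");
        System.out.println(queue.receivers(0)); // [c1]
        System.out.println(queue.receivers(1)); // [c2]

        GroupDeliverySketch pubSub = new GroupDeliverySketch();
        pubSub.join("g1", "c1");
        pubSub.join("g2", "c2");
        System.out.println(pubSub.receivers(0)); // [c1, c2]
    }
}
```

With one shared group the two consumers split the partitions (queue behavior); with one group per consumer every message reaches both (publish-subscribe).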

In Kafka, the messages in a partition are consumed by only one consumer within a group; consumption in one group is independent of consumption in the other groups.

We can think of a group as a single "subscriber": each partition of a topic is consumed by only one consumer within a "subscriber", although one consumer can consume messages from multiple partitions.

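Kafka only guarantees that the messages of a single partition are in order when they are consumed by a consumer. From the perspective of the whole topic, messages are still not ordered.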

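Note that the number of consumers in a group should not exceed the number of partitions; the number of partitions is the maximum consumption parallelism. Kafka's design means that, for a given topic, a group cannot have more consumers than partitions consuming at the same time; otherwise some consumers will receive no messages.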
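The limit can be seen by assigning partitions to the consumers of one group. The sketch below uses a simple round-robin rule, which is an assumption made for illustration and not necessarily the assignment strategy a given Kafka version uses; AssignmentSketch is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical round-robin assignment inside one consumer group; not a Kafka class.
final class AssignmentSketch {
    // Assign partitions 0..numPartitions-1 to consumers; each partition gets exactly one owner.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String c : consumers) {
            result.put(c, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return result;
        }
        for (int p = 0; p < numPartitions; p++) {
            result.get(consumers.get(p % consumers.size())).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        // 3 partitions, 4 consumers: the fourth consumer stays idle.
        System.out.println(assign(Arrays.asList("c1", "c2", "c3", "c4"), 3));
        // {c1=[0], c2=[1], c3=[2], c4=[]}

        // 3 partitions, 2 consumers: one consumer reads two partitions.
        System.out.println(assign(Arrays.asList("c1", "c2"), 3));
        // {c1=[0, 2], c2=[1]}
    }
}
```

Since a partition never has two owners in the same group, adding consumers beyond the partition count only adds idle members.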

[2016-05-19 15:14:39,952] INFO Completed load of log cpslogtopic-0 with log end offset 0 (kafka.log.Log)
[2016-05-19 15:14:39,954] INFO Created log for partition [cpslogtopic,0] in /wlztest/tmp/kafkalogs with properties {compression.type -> producer, file.delete.delay.ms -> 60000, max.message.bytes -> 1000012, min.insync.replicas -> 1, segment.jitter.ms -> 0, preallocate -> false, min.cleanable.dirty.ratio -> 0.5, index.interval.bytes -> 4096, unclean.leader.election.enable -> true, retention.bytes -> -1, delete.retention.ms -> 86400000, cleanup.policy -> delete, flush.ms -> 9223372036854775807, segment.ms -> 604800000, segment.bytes -> 1073741824, retention.ms -> 604800000, segment.index.bytes -> 10485760, flush.messages -> 9223372036854775807}. (kafka.log.LogManager)
[2016-05-19 15:14:39,955] INFO Partition [cpslogtopic,0] on broker 2: No checkpointed highwatermark is found for partition [cpslogtopic,0] (kafka.cluster.Partition)

[2016-05-19 15:14:39,935] INFO Completed load of log cpslogtopic-1 with log end offset 0 (kafka.log.Log)
[2016-05-19 15:14:39,935] INFO Created log for partition [cpslogtopic,1] in /wlztest/tmp/kafkalogs with properties {compression.type -> producer, file.delete.delay.ms -> 60000, max.message.bytes -> 1000012, min.insync.replicas -> 1, segment.jitter.ms -> 0, preallocate -> false, min.cleanable.dirty.ratio -> 0.5, index.interval.bytes -> 4096, unclean.leader.election.enable -> true, retention.bytes -> -1, delete.retention.ms -> 86400000, cleanup.policy -> delete, flush.ms -> 9223372036854775807, segment.ms -> 604800000, segment.bytes -> 1073741824, retention.ms -> 604800000, segment.index.bytes -> 10485760, flush.messages -> 9223372036854775807}. (kafka.log.LogManager)
[2016-05-19 15:14:39,935] INFO Partition [cpslogtopic,1] on broker 0: No checkpointed highwatermark is found for partition [cpslogtopic,1] (kafka.cluster.Partition)

[2016-05-19 15:14:39,937] INFO Completed load of log cpslogtopic-2 with log end offset 0 (kafka.log.Log)
[2016-05-19 15:14:39,937] INFO Created log for partition [cpslogtopic,2] in /wlztest/tmp/kafkalogs with properties {compression.type -> producer, file.delete.delay.ms -> 60000, max.message.bytes -> 1000012, min.insync.replicas -> 1, segment.jitter.ms -> 0, preallocate -> false, min.cleanable.dirty.ratio -> 0.5, index.interval.bytes -> 4096, unclean.leader.election.enable -> true, retention.bytes -> -1, delete.retention.ms -> 86400000, cleanup.policy -> delete, flush.ms -> 9223372036854775807, segment.ms -> 604800000, segment.bytes -> 1073741824, retention.ms -> 604800000, segment.index.bytes -> 10485760, flush.messages -> 9223372036854775807}. (kafka.log.LogManager)
[2016-05-19 15:14:39,938] INFO Partition [cpslogtopic,2] on broker 1: No checkpointed highwatermark is found for partition [cpslogtopic,2] (kafka.cluster.Partition)

Producer Configs

acks:This controls the durability of records that are sent. The following settings are common:

acks=0 If set to zero then the producer will not wait for any acknowledgment from the server at all. The record will be immediately added to the socket buffer and considered sent. No guarantee can be made that the server has received the record in this case, and the retries configuration will not take effect.
acks=1 This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost.
acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee.
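As an illustration, the acks setting is passed to the producer as an ordinary string property. The sketch below only builds the java.util.Properties object and does not open a connection; ProducerAcksSketch is a hypothetical helper, and the address localhost:9092 is a placeholder, not a value from these notes.

```java
import java.util.Properties;

// Hypothetical helper that builds producer settings; it does not create a producer.
final class ProducerAcksSketch {
    static Properties producerProps(String acks) {
        // Only the three values described above are accepted in this sketch.
        if (!acks.equals("0") && !acks.equals("1") && !acks.equals("all")) {
            throw new IllegalArgumentException("unsupported acks value: " + acks);
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        props.setProperty("acks", acks);
        return props;
    }

    public static void main(String[] args) {
        Properties durable = producerProps("all"); // wait for all in-sync replicas
        Properties fast = producerProps("0");      // fire and forget
        System.out.println(durable.getProperty("acks")); // all
        System.out.println(fast.getProperty("acks"));    // 0
    }
}
```

A Properties object like this is what the Java producer client is normally constructed from; choosing between the values is a trade-off between latency and the durability described above.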

compression.type:Compression is of full batches of data, so the efficacy of batching will also impact the compression ratio (more batching means better compression).

batch.size:

client.id:The purpose of this is to be able to track the source of requests beyond just ip/port by allowing a logical application name to be included in server-side request logging.

Consumer Configs

group.id:A unique string that identifies the consumer group this consumer belongs to. This property is required if the consumer uses either the group management functionality by using subscribe(topic) or the Kafka-based offset management strategy.

client.id:The purpose of this is to be able to track the source of requests beyond just ip/port by allowing a logical application name to be included in server-side request logging.

max.partition.fetch.bytes:

The Producer API provides the following features:

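It provides ZooKeeper-based automatic broker discovery, enabled through the zk.connect parameter [this did not work when I tried it?]. If ZooKeeper is not used, a static list of brokers can be specified with the broker.list parameter instead; messages are then sent to a randomly chosen broker, and if the chosen broker fails, the send fails as well.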

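Multiple messages can be buffered in a local queue and then sent to the broker asynchronously in batches; this is enabled with producer.type=async. The buffering is controlled by parameters such as queue.time and batch.size. A background thread (kafka.producer.async.ProducerSendThread) takes data off the queue and has kafka.producer.EventHandler send the messages to the broker. The handler can be customized through the event.handler parameter. Handlers can also be registered at different stages of producer-side processing, for example to log or monitor the pipeline: implement the kafka.producer.async.CallbackHandler interface and configure it in callback.handler.
You can write your own Encoder to serialize messages by implementing the interface below. The default Encoder is kafka.serializer.DefaultEncoder.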
interface Encoder<T> {
    public Message toMessage(T data);
}
Messages are assigned to partitions by a partition function, the kafka.producer.Partitioner class.
interface Partitioner<T> {
    int partition(T key, int numPartitions);
}
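The partition function takes two arguments, the key and the number of available partitions; it picks one partition from the partition list and returns its id. The default partitioning strategy is hash(key) % numPartitions. If the key is null, a partition is chosen at random. The partition function can be customized with the partitioner.class parameter.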
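The default strategy described above can be sketched in a few lines. This is a standalone illustration rather than the code of Kafka's own default partitioner: HashPartitionerSketch is a hypothetical class, and using Math.floorMod on Object.hashCode() is an assumption chosen so that negative hash codes still map to a valid partition id.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of hash(key) % numPartitions; not Kafka's DefaultPartitioner.
final class HashPartitionerSketch {
    static int partition(Object key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        if (key == null) {
            // No key: pick any partition at random.
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(partition(7, 4));  // 3, since Integer 7 hashes to 7
        System.out.println(partition(-1, 4)); // 3, since floorMod(-1, 4) is 3
        // The same key always maps to the same partition.
        System.out.println(partition("user-1", 4) == partition("user-1", 4)); // true
    }
}
```

Mapping a key deterministically to one partition is what lets all messages with the same key stay in order relative to each other.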

Other notes:

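To have the producer send data to a specific partition, set the partitioner.class property. If you specify your own, you must write the corresponding partitioner class. The default is kafka.producer.DefaultPartitioner, which partitions by hashing the key.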

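public Map<String, List<KafkaStream<byte[], byte[]>>> createMessageStreams(Map<String, Integer> topicCountMap)

In the Map argument of this method, the key is the topic name and the value is the number of partitions for that topic. For example, if the topic does not exist in Kafka, a topic is created with that many partitions; if it already exists, the value has no effect.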

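In the consumer API, numeric parameters such as the Integer in Map<String,Integer> or numStream all mean the same thing: when the topic does not exist, a topic is created with that number of partitions. Note that if the number is greater than the num.partitions property in the broker configuration, the number of partitions created follows num.partitions.
In the producer API, calling send for a topic that does not exist also creates the topic. This method has no parameter for the partition count; the number of partitions is determined by the num.partitions property in the broker configuration.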

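With linear reads and writes, roughly two things hurt disk performance: too many small I/O operations and too many byte copies. The I/O problem occurs both between client and server and in the server's own persistence operations. To improve efficiency, Kafka does the following: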

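Message set
Kafka introduces the concept of a "message set", which groups messages together as the unit of processing. Handling messages one set at a time performs considerably better than handling them one at a time. The producer sends a message set to the server in one request instead of sending messages one by one; the server appends the whole message set to the log file in one operation, which reduces the number of small I/O operations. A consumer can likewise request a whole message set at once.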
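The effect of grouping messages can be sketched by counting append operations. This is an in-memory illustration only: MessageSetSketch is a hypothetical class, each element of the appended list stands for one I/O operation, and the batch size of 4 in the example is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batching producer; each entry in "appended" models one append to the log.
final class MessageSetSketch {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    private final List<List<String>> appended = new ArrayList<>();

    MessageSetSketch(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    // Buffer the message; write the whole set once the batch is full.
    void send(String message) {
        pending.add(message);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Append everything buffered so far as one message set.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        appended.add(new ArrayList<>(pending));
        pending.clear();
    }

    int appendCount() {
        return appended.size();
    }

    public static void main(String[] args) {
        MessageSetSketch producer = new MessageSetSketch(4);
        for (int i = 0; i < 10; i++) {
            producer.send("m" + i);
        }
        producer.flush(); // write the final partial set
        System.out.println(producer.appendCount()); // 3 appends instead of 10
    }
}
```

Ten messages cost three appends here (4 + 4 + 2); a larger batch size lowers the count further at the price of holding messages longer before they are written.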

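Another performance optimization concerns byte copying. Under low load this is not a problem, but under high load its impact is significant. To avoid it, Kafka uses a standardized binary message format that is shared by the producer, the broker, and the consumer, so data can pass between them without modification.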

zero copy
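The message log maintained by the broker is just a directory of files. Message sets are written to the log files in a fixed format, and this format is shared by the producer and the consumer. This lets Kafka optimize one very important operation: the transfer of messages over the network. Modern Unix operating systems provide a highly optimized system call for sending data from the page cache to a socket; on Linux this call is sendfile.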

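The usual data path for sending data from a file to a socket is:

The operating system copies the data from the file into the page cache in kernel space.
The application copies the data from the page cache into its own user-space buffer.
The application writes the data into the socket buffer in kernel space.
The operating system copies the data from the socket buffer to the NIC buffer, from which it is sent over the network.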

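This is clearly inefficient: there are four copies and two system calls. sendfile avoids the redundant copies by sending the data directly from the page cache to the NIC buffer, which greatly improves performance.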
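In Java, the standard call for this kind of transfer is FileChannel.transferTo, which lets the operating system move bytes from the file to the target channel without passing them through an application buffer where the platform supports it. The sketch below is an assumption-laden illustration, not broker code: ZeroCopySketch is a hypothetical class, and the in-memory sink in main is used only so the example runs without a network (with such a sink the JDK falls back to an ordinary copy; the zero-copy path applies to targets such as sockets).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of sending a log file to a channel with transferTo.
final class ZeroCopySketch {
    // Send the whole file to the target; returns the number of bytes transferred.
    static long transfer(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = in.size();
            long sent = 0;
            while (sent < size) {
                long n = in.transferTo(sent, size - sent, target);
                if (n <= 0) {
                    break; // target accepted nothing; stop instead of spinning
                }
                sent += n;
            }
            return sent;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        try {
            Files.write(file, "m0\nm1\nm2\n".getBytes(StandardCharsets.UTF_8));
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            long sent = transfer(file, Channels.newChannel(sink));
            System.out.println(sent + " bytes transferred"); // 9 bytes transferred
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The application code never allocates a buffer for the message data; it only tells the operating system which byte range of the file to send.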

Network bandwidth

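Kafka uses end-to-end compression: because of the "message set" concept, a client's messages can be compressed together before being sent to the server, written to the log file in compressed form, and delivered to the consumer still compressed. Messages stay compressed from the moment they leave the producer until the consumer receives them, and are decompressed only when the consumer uses them; hence the name "end-to-end compression".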
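The flow can be sketched with the JDK's GZIP streams. This is an illustration under stated assumptions, not Kafka's wire format: EndToEndCompressionSketch is a hypothetical class, the length-prefixed layout (a count followed by writeUTF strings) is invented for the example, and GZIP is just one possible codec.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: the producer compresses a whole batch, the broker stores
// the bytes untouched, and only the consumer decompresses.
final class EndToEndCompressionSketch {
    // Producer side: compress the whole message set at once.
    static byte[] compressBatch(List<String> messages) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(buf))) {
            out.writeInt(messages.size());
            for (String m : messages) {
                out.writeUTF(m);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Consumer side: decompress only when the messages are actually used.
    static List<String> decompressBatch(byte[] compressed) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(compressed)))) {
            int n = in.readInt();
            List<String> messages = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                messages.add(in.readUTF());
            }
            return messages;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] onTheWire = compressBatch(Arrays.asList("m0", "m1", "m2"));
        byte[] storedByBroker = onTheWire; // the broker never decompresses
        System.out.println(decompressBatch(storedByBroker)); // [m0, m1, m2]
    }
}
```

Compressing the set as a whole, rather than each message alone, is what lets repeated content across messages be exploited.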

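Kafka does not provide the enterprise features found in JMS, such as "transactions", "message acknowledgment", and "message grouping". Kafka can only serve as a "regular" messaging system, and to some extent it does not yet guarantee that sending and receiving are absolutely reliable (for example, messages may be resent, or lost while being sent).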

0 0