Kafka #2: Message Queue


Kafka series:

  • Kafka #1: QuickStart

Questions

  • Message protocol
  • Message subscription
  • Message storage
  • Message delivery
  • Message ordering
  • Message cleanup
  • Message priority
  • Message filtering
  • Message backlog
  • Transactional messages?

Message Protocol

Kafka's message protocol is described in the articles Wire Format and Writing a Driver for Kafka on the official Confluence wiki.

## Request Header
   REQUEST_LENGTH : int32
   REQUEST_TYPE   : int16
   TOPIC_LENGTH   : int16
   TOPIC          : bytes (variable length)
   PARTITION      : int32

## Multi-Request Header
   REQUEST_LENGTH       : int32
   REQUEST_TYPE         : int16
   TOPICPARTITION_COUNT : int16

## Message
   LENGTH      : int32  // Length in bytes of entire message (excluding this field)
   MAGIC       : int8   // 0 = COMPRESSION attribute byte does not exist (v0.6 and below)
                        // 1 = COMPRESSION attribute byte exists (v0.7 and above)
   COMPRESSION : int8   // 0 = none; 1 = gzip; 2 = snappy; only exists at all if MAGIC == 1
   CHECKSUM    : int32  // CRC32 checksum of the PAYLOAD
   PAYLOAD     : bytes  // Message content (variable length)

## Produce Request
   REQUEST HEADER
   MESSAGES_LENGTH : int32
   MESSAGES

## Multi-Produce Request
   MULTI-REQUEST HEADER
   then, per topic-partition (repeated n times):
     TOPIC_LENGTH    : int16
     TOPIC           : bytes (variable length)
     PARTITION       : int32
     MESSAGES_LENGTH : int32
     MESSAGES

## Fetch Request
   REQUEST HEADER
   OFFSET   : int64  // Offset in topic and partition to start from
   MAX_SIZE : int32

## Multi-Fetch Request
   MULTI-REQUEST HEADER
   then, per topic-partition fetch request (repeated n times):
     TOPIC_LENGTH : int16
     TOPIC        : bytes (variable length)
     PARTITION    : int32
     OFFSET       : int64
     MAX_SIZE     : int32
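
For intuition, here is a minimal Scala sketch (not Kafka code; names are made up) that serializes a single message in the MAGIC = 1 layout above, computing the CRC32 over the payload:

import java.nio.ByteBuffer
import java.util.zip.CRC32

object MessageCodecSketch {
  // Encode one message in the MAGIC = 1 layout:
  // LENGTH (int32) | MAGIC (int8) | COMPRESSION (int8) | CHECKSUM (int32) | PAYLOAD
  def encode(payload: Array[Byte], compression: Byte = 0): Array[Byte] = {
    val crc = new CRC32
    crc.update(payload)
    val length = 1 + 1 + 4 + payload.length        // everything after the LENGTH field
    val buf = ByteBuffer.allocate(4 + length)
    buf.putInt(length)
    buf.put(1.toByte)                              // MAGIC = 1 => compression byte present
    buf.put(compression)                           // 0 = none, 1 = gzip, 2 = snappy
    buf.putInt(crc.getValue.toInt)                 // CRC32 of the payload
    buf.put(payload)
    buf.array()
  }

  def main(args: Array[String]): Unit = {
    val bytes = encode("hello kafka".getBytes("UTF-8"))
    println(s"encoded ${bytes.length} bytes")
  }
}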

Message Subscription

Because consumers pull messages, the broker does not need to maintain subscription state; Kafka keeps the subscription information in ZooKeeper instead. See Consumer registration:

## /consumers/[groupId]/ids/[consumerId] ->
Schema:
{ "fields":
    [ {"name": "version", "type": "int", "doc": "version id"},
      {"name": "pattern", "type": "string", "doc": "can be of static, white_list or black_list"},
      {"name": "subscription", "type" : {"type": "map", "values": {"type": "int"},
                                         "doc": "a map from a topic or a wildcard pattern to the number of streams"}
      }
    ]
}

Example: a static subscription:
{
  "version": 1,
  "pattern": "static",
  "subscription": {"topic1": 1, "topic2": 2}
}

A whitelist subscription:
{
  "version": 1,
  "pattern": "white_list",
  "subscription": {"abc": 1}
}

A blacklist subscription:
{
  "version": 1,
  "pattern": "black_list",
  "subscription": {"abc": 1}
}
## /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset)

More recently, official documentation has stated that consumer offsets (the /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset) nodes above) will be migrated from ZooKeeper into Kafka itself:

ZooKeeper does not scale extremely well (especially for writes) when there are a large number of offsets (i.e., consumer-count * partition-count).

The new approach stores each offset as a message in a dedicated topic:

Fortunately, Kafka now provides an ideal mechanism for storing consumer offsets. Consumers can commit their offsets in Kafka by writing them to a durable (replicated) and highly available topic. Consumers can fetch offsets by reading from this topic (although we provide an in-memory offsets cache for faster access). i.e., offset commits are regular producer requests (which are inexpensive) and offset fetches are fast memory look ups.
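
As a mental model of that design, here is a toy Scala sketch (entirely hypothetical, not the broker's offset manager): commits are appended like ordinary keyed messages, fetches hit an in-memory cache, and the cache can be rebuilt by replaying the log, where the latest entry per key wins, which is exactly what log compaction preserves.

import scala.collection.mutable

case class GroupTopicPartition(group: String, topic: String, partition: Int)

class OffsetStoreSketch {
  private val log   = mutable.ArrayBuffer[(GroupTopicPartition, Long)]() // the "offsets topic"
  private val cache = mutable.Map[GroupTopicPartition, Long]()           // fast in-memory lookups

  def commit(key: GroupTopicPartition, offset: Long): Unit = {
    log += (key -> offset)   // a cheap producer-style append
    cache(key) = offset
  }

  def fetch(key: GroupTopicPartition): Option[Long] = cache.get(key)

  // On startup, replay the log; the last entry per key wins.
  def recover(): Unit = { cache.clear(); log.foreach { case (k, o) => cache(k) = o } }
}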

Message Storage

As the earlier QuickStart showed, physically each topic has several partition directories, and a partition directory holds the message files, called log files; a log file is named after the offset of the first message it contains. Each message is a log entry with the following format:

## On-disk format of a message
message length : 4 bytes (value: 1+4+n)
magic value    : 1 byte
crc            : 4 bytes
payload        : n bytes
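
Since each log file is named after the offset of its first message, locating a message starts by picking the segment whose base offset is the largest one not greater than the target offset. A small standalone sketch of that lookup (not the Kafka implementation; the example offsets are made up):

object SegmentLookupSketch {
  // Given the base offsets taken from the *.log file names in a partition directory,
  // pick the segment a target offset falls into: the largest base offset <= target.
  def segmentFor(baseOffsets: Seq[Long], target: Long): Option[Long] =
    baseOffsets.sorted.takeWhile(_ <= target).lastOption

  def main(args: Array[String]): Unit = {
    val segments = Seq(0L, 368769L, 737337L)   // e.g. 00000000000000368769.log, ...
    println(segmentFor(segments, 500000L))     // Some(368769)
  }
}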

Besides the log files, the directory also contains index files. What are these for? I could not find official documentation, but LogSegment explains it:

/**
 * A segment of the log. Each segment has two components: a log and an index. The log is a FileMessageSet containing
 * the actual messages. The index is an OffsetIndex that maps from logical offsets to physical file positions. Each
 * segment has a base offset which is an offset <= the least offset of any message in this segment and > any offset in
 * any previous segment.
 *
 * A segment with a base offset of [base_offset] would be stored in two files, a [base_offset].index and a [base_offset].log file.
 *
 * @param log The message set containing log entries
 * @param index The offset index
 * @param baseOffset A lower bound on the offsets in this segment
 * @param indexIntervalBytes The approximate number of bytes between entries in the index
 * @param time The time instance
 */
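
In other words, the index is sparse: an entry is written roughly every indexIntervalBytes, mapping a logical offset to a physical file position, and a lookup finds the greatest indexed offset <= the target, then scans the log forward from that position. A self-contained sketch of the idea (not the real OffsetIndex):

import scala.collection.mutable.ArrayBuffer

class SparseOffsetIndexSketch(indexIntervalBytes: Int = 4096) {
  private val entries = ArrayBuffer[(Long, Long)]()   // (logical offset, physical position)
  private var bytesSinceLastEntry = 0L

  // Called as messages are appended to the segment.
  def maybeAppend(offset: Long, position: Long, messageSize: Int): Unit = {
    if (entries.isEmpty || bytesSinceLastEntry >= indexIntervalBytes) {
      entries += (offset -> position)
      bytesSinceLastEntry = 0
    }
    bytesSinceLastEntry += messageSize
  }

  // Greatest indexed offset <= target; the log is then scanned forward from that position.
  def lookup(target: Long): Option[(Long, Long)] = {
    var lo = 0; var hi = entries.size - 1; var best = -1
    while (lo <= hi) {
      val mid = (lo + hi) / 2
      if (entries(mid)._1 <= target) { best = mid; lo = mid + 1 } else hi = mid - 1
    }
    if (best >= 0) Some(entries(best)) else None
  }
}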

Message Delivery

pull or push

The Kafka consumer works by issuing "fetch" requests to the brokers leading the partitions it wants to consume. The consumer specifies its offset in the log with each request and receives back a chunk of log beginning from that position. The consumer thus has significant control over this position and can rewind it to re-consume data if need be.

The official documentation gives two reasons for choosing pull over push:

  • Rate control. With push, the broker dictates the transfer rate; if a consumer's processing capacity falls below that rate, the consumer can be overwhelmed (a denial of service attack, in essence). With pull, the consumer can simply catch up later once its capacity recovers.
  • Batch control. With push, the broker must either send one message at a time or accumulate a certain number of messages and send them in one batch. The former is wasteful, the latter adds latency and complicates the broker design. With pull, the consumer decides, based on its own situation, how much data to fetch in each request.

With pull, the polling-efficiency problem has to be addressed. How does Kafka handle it? RTFSC. Starting from the official Consumer Example and following the code, fetch requests are sent in ConsumerFetcherThread, where ConsumerFetcherThread extends AbstractFetcherThread extends ShutdownableThread extends java.lang.Thread. Look at ShutdownableThread#run:

override def run(): Unit = {
  info("Starting ")
  try {
    // keep issuing fetch requests in a plain while loop
    while (isRunning.get()) {
      doWork()
    }
  } catch {
    case e: Throwable =>
      if (isRunning.get())
        error("Error due to ", e)
  }
  shutdownLatch.countDown()
  info("Stopped ")
}

AbstractFetcherThread#doWork:

override def doWork() {
  inLock(partitionMapLock) {
    // partitionMap is initially populated by #addPartitions;
    // unless #removePartitions is called, partitionMap will not be empty
    if (partitionMap.isEmpty)
      partitionMapCond.await(200L, TimeUnit.MILLISECONDS)
    partitionMap.foreach {
      case((topicAndPartition, offset)) =>
        fetchRequestBuilder.addFetch(topicAndPartition.topic, topicAndPartition.partition,
                                     offset, fetchSize)
    }
  }
  val fetchRequest = fetchRequestBuilder.build()
  if (!fetchRequest.requestInfo.isEmpty)
    // actually request messages from the broker
    processFetchRequest(fetchRequest)
}

With such a design, wouldn't it be inefficient when the broker has no new messages? No need to worry: Kafka does implement a long-polling mechanism:

The deficiency of a naive pull-based system is that if the broker has no data the consumer may end up polling in a tight loop, effectively busy-waiting for data to arrive. To avoid this we have parameters in our pull request that allow the consumer request to block in a "long poll" waiting until data arrives (and optionally waiting until a given number of bytes is available to ensure large transfer sizes).

The parameters referred to above live in FetchRequest; they are maxWait and minBytes:

case class FetchRequest private[kafka] (versionId: Short = FetchRequest.CurrentVersion,
                                        override val correlationId: Int = FetchRequest.DefaultCorrelationId,
                                        clientId: String = ConsumerConfig.DefaultClientId,
                                        replicaId: Int = Request.OrdinaryConsumerId,
                                        maxWait: Int = FetchRequest.DefaultMaxWait,
                                        minBytes: Int = FetchRequest.DefaultMinBytes,
                                        requestInfo: Map[TopicAndPartition, PartitionFetchInfo])
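
To make the semantics of those two parameters concrete, here is a toy model of a broker-side long poll (the real broker parks the request rather than sleeping in a loop; availableBytes is a stand-in for however many bytes the partition currently has past the fetch offset):

object LongPollSketch {
  // Block until availableBytes() reaches minBytes or maxWaitMs elapses, then answer the fetch.
  def awaitFetch(availableBytes: () => Int, minBytes: Int, maxWaitMs: Long): Int = {
    val deadline = System.currentTimeMillis + maxWaitMs
    var avail = availableBytes()
    while (avail < minBytes && System.currentTimeMillis < deadline) {
      Thread.sleep(10)                  // the real broker parks the request, it does not poll
      avail = availableBytes()
    }
    avail                               // respond with whatever is available at this point
  }
}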

load balance

Another very important point: "Each broker partition is consumed by a single consumer within a given consumer group." In other words, within a group each partition has exactly one fixed consumer, called its owner. If there are more consumers than partitions, some consumers will never receive any messages. So within a single consumer group Kafka cannot multicast a message to multiple consumers (separate groups, of course, each still receive the full stream). The owner is recorded in ZooKeeper as /consumers/[group_id]/owners/[topic]/[broker_id-partition_id] -> consumer_node_id. So how is the owner chosen?

1. For each topic T that Ci subscribes to
2.   let PT be all partitions producing topic T
3.   let CG be all consumers in the same group as Ci that consume topic T
4.   sort PT (so partitions on the same broker are clustered together)
5.   sort CG
6.   let i be the index position of Ci in CG and let N = size(PT)/size(CG)
7.   assign partitions from i*N to (i+1)*N - 1 to consumer Ci  // i*N to i*N+N-1
8.   remove current entries owned by Ci from the partition owner registry
9.   add newly assigned partitions to the partition owner registry
     (we may need to re-try this until the original partition owner releases its ownership)

For example:

let PT = [0, 1, 2, 3, 4, 5, 6, 7];
let CG = [0, 1, 2, 3];
then N = PT.len / CG.len = 2;
then C0 => [0,1], C1 => [2,3], C2 => [4,5], C3 => [6,7];
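
The same arithmetic as a standalone sketch, including the rule from the real code below that the first size(PT) % size(CG) consumers pick up one extra partition when the division is not exact:

object RangeAssignSketch {
  // Assign sorted partitions to sorted consumers range-style; the first
  // (nPartitions % nConsumers) consumers pick up one extra partition.
  def assign(partitions: Seq[Int], consumers: Seq[String]): Map[String, Seq[Int]] = {
    val sortedParts  = partitions.sorted
    val sortedCons   = consumers.sorted
    val nPerConsumer = sortedParts.size / sortedCons.size
    val nExtra       = sortedParts.size % sortedCons.size
    sortedCons.zipWithIndex.map { case (c, i) =>
      val start  = nPerConsumer * i + math.min(i, nExtra)
      val nParts = nPerConsumer + (if (i < nExtra) 1 else 0)
      c -> sortedParts.slice(start, start + nParts)
    }.toMap
  }

  def main(args: Array[String]): Unit =
    println(assign(0 to 7, Seq("C0", "C1", "C2", "C3"))) // C0->[0,1], C1->[2,3], C2->[4,5], C3->[6,7]
}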

The actual code is in ZookeeperConsumerConnector#rebalance:

for (consumerThreadId <- consumerThreadIdSet) {
  val myConsumerPosition = curConsumers.indexOf(consumerThreadId)
  assert(myConsumerPosition >= 0)
  val startPart = nPartsPerConsumer*myConsumerPosition + myConsumerPosition.min(nConsumersWithExtraPart)
  val nParts = nPartsPerConsumer + (if (myConsumerPosition + 1 > nConsumersWithExtraPart) 0 else 1)

  /**
   *  Range-partition the sorted partitions to consumers for better locality.
   *  The first few consumers pick up an extra partition, if any.
   */
  if (nParts <= 0)
    warn("No broker partitions consumed by consumer thread " + consumerThreadId + " for topic " + topic)
  else {
    for (i <- startPart until startPart + nParts) {
      val partition = curPartitions(i)
      info(consumerThreadId + " attempting to claim partition " + partition)
      // each partition is consumed by exactly one thread, so there is no concurrency issue
      addPartitionTopicInfo(currentTopicRegistry, topicDirs, partition, topic, consumerThreadId)
      // record the partition ownership decision
      partitionOwnershipDecision += ((topic, partition) -> consumerThreadId)
    }
  }
}

The logic that generates consumerThreadId is in TopicCount:

protected def makeConsumerThreadIdsPerTopic(consumerIdString: String,
                                            topicCountMap: Map[String, Int]) = {
  val consumerThreadIdsPerTopicMap = new mutable.HashMap[String, Set[String]]()
  for ((topic, nConsumers) <- topicCountMap) {
    val consumerSet = new mutable.HashSet[String]
    assert(nConsumers >= 1)
    for (i <- 0 until nConsumers)
      consumerSet += consumerIdString + "-" + i
    consumerThreadIdsPerTopicMap.put(topic, consumerSet)
  }
  consumerThreadIdsPerTopicMap
}

Going one step further, a consumer pulls messages only from the leader broker of the partitions assigned to it. The code is in ConsumerFetcherManager:

def startConnections(topicInfos: Iterable[PartitionTopicInfo], cluster: Cluster) {
  leaderFinderThread = new LeaderFinderThread(consumerIdString + "-leader-finder-thread")
  leaderFinderThread.start()
  inLock(lock) {
    partitionMap = topicInfos.map(tpi => (TopicAndPartition(tpi.topic, tpi.partitionId), tpi)).toMap
    this.cluster = cluster
    noLeaderPartitionSet ++= topicInfos.map(tpi => TopicAndPartition(tpi.topic, tpi.partitionId))
    cond.signalAll()
  }
}

private class LeaderFinderThread(name: String) extends ShutdownableThread(name) {
  // thread responsible for adding the fetcher to the right broker when leader is available
  override def doWork() {
    val leaderForPartitionsMap = new HashMap[TopicAndPartition, Broker]
    lock.lock()
    try {
      while (noLeaderPartitionSet.isEmpty) {
        trace("No partition for leader election.")
        cond.await()
      }
      trace("Partitions without leader %s".format(noLeaderPartitionSet))
      val brokers = getAllBrokersInCluster(zkClient)
      val topicsMetadata = ClientUtils.fetchTopicMetadata(noLeaderPartitionSet.map(m => m.topic).toSet,
                                                          brokers,
                                                          config.clientId,
                                                          config.socketTimeoutMs,
                                                          correlationId.getAndIncrement).topicsMetadata
      if(logger.isDebugEnabled) topicsMetadata.foreach(topicMetadata => debug(topicMetadata.toString()))
      topicsMetadata.foreach { tmd =>
        val topic = tmd.topic
        tmd.partitionsMetadata.foreach { pmd =>
          val topicAndPartition = TopicAndPartition(topic, pmd.partitionId)
          if(pmd.leader.isDefined && noLeaderPartitionSet.contains(topicAndPartition)) {
            val leaderBroker = pmd.leader.get
            leaderForPartitionsMap.put(topicAndPartition, leaderBroker)
            noLeaderPartitionSet -= topicAndPartition
          }
        }
      }
    } catch {
      case t: Throwable => {
          if (!isRunning.get())
            throw t /* If this thread is stopped, propagate this exception to kill the thread. */
          else
            warn("Failed to find leader for %s".format(noLeaderPartitionSet), t)
        }
    } finally {
      lock.unlock()
    }

    try {
      addFetcherForPartitions(leaderForPartitionsMap.map{
        case (topicAndPartition, broker) =>
          topicAndPartition -> BrokerAndInitialOffset(broker, partitionMap(topicAndPartition).getFetchOffset())}
      )
    } catch {
      case t: Throwable => {
        if (!isRunning.get())
          throw t /* If this thread is stopped, propagate this exception to kill the thread. */
        else {
          warn("Failed to add leader for partitions %s; will retry".format(leaderForPartitionsMap.keySet.mkString(",")), t)
          lock.lock()
          noLeaderPartitionSet ++= leaderForPartitionsMap.keySet
          lock.unlock()
        }
      }
    }

    shutdownIdleFetcherThreads()
    Thread.sleep(config.refreshLeaderBackoffMs)
  }
}

Not just the consumer: the producer, too, sends messages only to a partition's leader broker. The code is in DefaultEventHandler#partitionAndCollate:

def partitionAndCollate(messages: Seq[KeyedMessage[K,Message]]): Option[Map[Int, collection.mutable.Map[TopicAndPartition, Seq[KeyedMessage[K,Message]]]]] = {
  val ret = new HashMap[Int, collection.mutable.Map[TopicAndPartition, Seq[KeyedMessage[K,Message]]]]
  try {
    for (message <- messages) {
      val topicPartitionsList = getPartitionListForTopic(message)
      // determine the partition for this message
      val partitionIndex = getPartition(message.topic, message.partitionKey, topicPartitionsList)
      // determine the leader broker of that partition
      val brokerPartition = topicPartitionsList(partitionIndex)

      // postpone the failure until the send operation, so that requests for other brokers are handled correctly
      val leaderBrokerId = brokerPartition.leaderBrokerIdOpt.getOrElse(-1)

      var dataPerBroker: HashMap[TopicAndPartition, Seq[KeyedMessage[K,Message]]] = null
      ret.get(leaderBrokerId) match {
        case Some(element) =>
          dataPerBroker = element.asInstanceOf[HashMap[TopicAndPartition, Seq[KeyedMessage[K,Message]]]]
        case None =>
          dataPerBroker = new HashMap[TopicAndPartition, Seq[KeyedMessage[K,Message]]]
          ret.put(leaderBrokerId, dataPerBroker)
      }

      val topicAndPartition = TopicAndPartition(message.topic, brokerPartition.partitionId)
      var dataPerTopicPartition: ArrayBuffer[KeyedMessage[K,Message]] = null
      dataPerBroker.get(topicAndPartition) match {
        case Some(element) =>
          dataPerTopicPartition = element.asInstanceOf[ArrayBuffer[KeyedMessage[K,Message]]]
        case None =>
          dataPerTopicPartition = new ArrayBuffer[KeyedMessage[K,Message]]
          dataPerBroker.put(topicAndPartition, dataPerTopicPartition)
      }
      dataPerTopicPartition.append(message)
    }
    Some(ret)
  } catch {
    // Swallow recoverable exceptions and return None so that they can be retried.
    case ute: UnknownTopicOrPartitionException => warn("Failed to collate messages by topic,partition due to: " + ute.getMessage); None
    case lnae: LeaderNotAvailableException => warn("Failed to collate messages by topic,partition due to: " + lnae.getMessage); None
    case oe: Throwable => error("Failed to collate messages by topic, partition due to: " + oe.getMessage); None
  }
}

One question not touched on so far: which partition does a message end up in?

/**
 * Retrieves the partition id and throws an UnknownTopicOrPartitionException if
 * the value of partition is not between 0 and numPartitions-1
 * @param topic The topic
 * @param key the partition key
 * @param topicPartitionList the list of available partitions
 * @return the partition id
 */
private def getPartition(topic: String, key: Any, topicPartitionList: Seq[PartitionAndLeader]): Int = {
  val numPartitions = topicPartitionList.size
  if(numPartitions <= 0)
    throw new UnknownTopicOrPartitionException("Topic " + topic + " doesn't exist")
  val partition =
    if(key == null) { // a message key is optional
      // If the key is null, we don't really need a partitioner
      // So we look up in the send partition cache for the topic to decide the target partition
      val id = sendPartitionPerTopicCache.get(topic)
      id match {
        case Some(partitionId) =>
          // directly return the partitionId without checking availability of the leader,
          // since we want to postpone the failure until the send operation anyways
          partitionId
        case None =>
          val availablePartitions = topicPartitionList.filter(_.leaderBrokerIdOpt.isDefined)
          if (availablePartitions.isEmpty)
            throw new LeaderNotAvailableException("No leader for any partition in topic " + topic)
          val index = Utils.abs(Random.nextInt) % availablePartitions.size
          val partitionId = availablePartitions(index).partitionId
          sendPartitionPerTopicCache.put(topic, partitionId)
          partitionId
      }
    } else // otherwise use the configured Partitioner
      partitioner.partition(key, numPartitions)
  if(partition < 0 || partition >= numPartitions)
    throw new UnknownTopicOrPartitionException("Invalid partition id: " + partition + " for topic " + topic +
      "; Valid values are in the inclusive range of [0, " + (numPartitions-1) + "]")
  trace("Assigning message of topic %s and key %s to a selected partition %d".format(topic, if (key == null) "[none]" else key.toString, partition))
  partition
}

Seen this way, every Kafka message is a KeyedMessage:

/**
 * A topic, key, and value.
 * If a partition key is provided it will override the key for the purpose of partitioning but will not be stored.
 */
case class KeyedMessage[K, V](val topic: String, val key: K, val partKey: Any, val message: V) {
  if(topic == null)
    throw new IllegalArgumentException("Topic cannot be null.")

  def this(topic: String, message: V) = this(topic, null.asInstanceOf[K], null, message)

  def this(topic: String, key: K, message: V) = this(topic, key, key, message)

  def partitionKey = {
    if(partKey != null)
      partKey
    else if(hasKey)
      key
    else
      null
  }

  def hasKey = key != null
}
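
When a key is present, the partition is chosen by a pluggable partitioner, configured on the producer via the partitioner.class property. Something along these lines could be used; the Partitioner trait, its signature and the VerifiableProperties constructor follow the 0.8 producer API, so treat the details as assumptions to verify against the Kafka version in use:

import kafka.producer.Partitioner
import kafka.utils.VerifiableProperties

// A hash partitioner sketch in the style of the 0.8 producer API: messages with
// the same key always land in the same partition.
class SimpleHashPartitionerSketch(props: VerifiableProperties = null) extends Partitioner {
  override def partition(key: Any, numPartitions: Int): Int =
    (key.hashCode & 0x7fffffff) % numPartitions   // mask keeps the value non-negative
}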

Message Semantics

Clearly there are multiple possible message delivery guarantees that could be provided:

  • At most once—Messages may be lost but are never redelivered.
  • At least once—Messages are never lost but may be redelivered.
  • Exactly once—this is what people actually want, each message is delivered once and only once.

It's worth noting that this breaks down into two problems: the durability guarantees for publishing a message and the guarantees when consuming a message.

First, from the producer's point of view. Imagine this scenario: the producer sends a message, the broker persists it, but then the network between producer and broker fails and the producer never receives the broker's ack. What should the producer do? It cannot tell whether the broker actually saved the message. The current version of Kafka does not solve this. How could it be solved? One approach is for the producer to attach a primary key (PK) to each request for idempotence: in the scenario above it could ask the broker whether the request with that PK succeeded, or simply resend the request and let the broker deduplicate by PK.
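
A toy model of that primary-key idea (nothing like this exists in this Kafka version; all names are made up): the broker-side handler remembers which PKs it has already applied, so a blind retry after a lost ack becomes safe.

import scala.collection.mutable

// Hypothetical idempotent receive: the handler applies a request only if its
// primary key (PK) has not been seen before, so the producer can simply resend.
class IdempotentAppendSketch {
  private val applied = mutable.Set[String]()     // PKs already persisted

  def append(pk: String, payload: Array[Byte], store: Array[Byte] => Unit): Boolean = {
    if (applied.contains(pk)) return false        // duplicate retry: ack again, do not store twice
    store(payload)
    applied += pk
    true
  }
}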

Setting that extreme case aside, getting exactly-once on the send path is comparatively easy. Kafka exposes configurations such as request.required.acks and producer.type so users can choose what fits their application. Once the producer receives the broker's ack, the message is committed, and as long as the in-sync replicas (ISR) holding that partition do not all fail, the message remains visible to consumers. request.required.acks means the following (a small producer configuration sketch follows the list):

  • 0: do not wait for any ack from the broker. Lowest latency, but no delivery guarantee.
  • 1: wait only for the leader's ack (whether that implies a flush to disk is unclear).
  • -1: wait for acks from all in-sync replicas.
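
Here is a minimal producer configuration sketch using the Scala client of this era; the property names follow the 0.8 producer, and the broker list and topic are placeholders:

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

object AckedProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092,broker2:9092") // placeholder brokers
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    props.put("request.required.acks", "-1")   // wait for all in-sync replicas
    props.put("producer.type", "sync")         // block until the broker acks

    val producer = new Producer[String, String](new ProducerConfig(props))
    producer.send(new KeyedMessage[String, String]("my-topic", "key-1", "hello"))
    producer.close()
  }
}

With request.required.acks=-1 the send returns only after all in-sync replicas have the message; dropping it to 0 trades that guarantee for latency.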

Waiting for the ack is implemented with BlockingChannel, a blocking wrapper around SocketChannel. See SyncProducer:

/**
 * Common functionality for the public send methods
 */
private def doSend(request: RequestOrResponse, readResponse: Boolean = true): Receive = {
  lock synchronized {
    verifyRequest(request)
    getOrMakeConnection()
    var response: Receive = null
    try {
      blockingChannel.send(request)
      if(readResponse)
        // blocking read of the broker's response
        response = blockingChannel.receive()
      else
        trace("Skipping reading response")
    } catch {
      case e: java.io.IOException =>
        // no way to tell if write succeeded. Disconnect and re-throw exception to let client handle retry
        disconnect()
        throw e
      case e: Throwable => throw e
    }
    response
  }
}

producer.type is actually unrelated to message semantics; it is purely a performance knob: "This parameter specifies whether the messages are sent asynchronously in a background thread." It is implemented in Producer:

/**
 * Sends the data, partitioned by key to the topic using either the
 * synchronous or the asynchronous producer
 * @param messages the producer data object that encapsulates the topic, key and message data
 */
def send(messages: KeyedMessage[K,V]*) {
  lock synchronized {
    if (hasShutdown.get)
      throw new ProducerClosedException
    recordStats(messages)
    sync match {
      case true => eventHandler.handle(messages)
      case false => asyncSend(messages)
    }
  }
}

Now from the consumer's point of view. At heart this is a transaction problem: there are two operations, consuming the message and recording the consumption progress. Transactions are familiar territory, so let's go straight to what Kafka does. Kafka again offers a fairly flexible mechanism: you can call ConsumerConnector#commitOffsets manually, or set auto.commit.enable and auto.commit.interval.ms to commit automatically. But as the official ConsumerGroupExample points out, with auto commit messages may be replayed, i.e. you get at-least-once semantics:

The ‘auto.commit.interval.ms’ setting is how often updates to the consumed offsets are written to ZooKeeper. Note that since the commit frequency is time based instead of # of messages consumed, if an error occurs between updates to ZooKeeper on restart you will get replayed messages.
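
A matching consumer-side sketch with auto commit disabled and an explicit commit after processing, using the high-level consumer API of the same era; treat the property names and calls as assumptions to verify:

import java.util.Properties
import kafka.consumer.{Consumer, ConsumerConfig}

object ManualCommitConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("zookeeper.connect", "zk1:2181")      // placeholder ZooKeeper address
    props.put("group.id", "demo-group")
    props.put("auto.commit.enable", "false")        // commit offsets ourselves

    val connector = Consumer.create(new ConsumerConfig(props))
    val streams = connector.createMessageStreams(Map("my-topic" -> 1))
    for (msg <- streams("my-topic").head) {
      println(new String(msg.message))              // the payload bytes; real processing goes here
      connector.commitOffsets                       // record progress only after processing
    }
  }
}

Committing only after processing gives at-least-once behaviour; committing before processing would give at-most-once instead.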

The fetchedOffset used when issuing fetch requests and the consumedOffset of messages already handed to the application are both kept in PartitionTopicInfo; commitOffsets writes consumedOffset to ZooKeeper. The figure below is a simplified sketch of the consumption flow.

                             +--------------------+
                             | PartitionTopicInfo |
 +------------------+  poll  |  +---------------+ |  enqueue  +-----------------------+
 | ConsumerIterator | -----> |  | BlockingQueue | | <-------- | ConsumerFetcherThread |
 +------------------+        |  +---------------+ |           +-----------------------+
                             +--------------------+                      ||
                                                                  BlockingChannel
                                                                          ||
                                                                  +-------------+
                                                                  | KafkaServer |
                                                                  +-------------+

Message Ordering

Kafka only provides a total order over messages within a partition, i.e. ordering is partial at the topic level. If you need a total order over a topic, configure that topic with a single partition.

Message Cleanup

Kafka offers two strategies for dealing with message files, delete and compact, configured via cleanup.policy or log.cleanup.policy; the former is a per-topic setting, the latter a broker-wide default, and the topic-level setting overrides the broker-level one. When cleanup kicks in is controlled by settings such as log.retention.{minutes,hours} and log.retention.bytes. Note that cleanup pays no attention to whether messages have been consumed.
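
The delete policy is essentially "drop whole segments once they are too old, or the oldest ones until the partition fits under the size limit". A toy sketch of that decision (not broker code):

object RetentionSketch {
  case class Segment(baseOffset: Long, sizeBytes: Long, lastModifiedMs: Long)

  // Segments eligible for deletion under a delete cleanup policy:
  // older than retentionMs, or needed to bring the partition under retentionBytes.
  def deletable(segments: Seq[Segment], nowMs: Long,
                retentionMs: Long, retentionBytes: Long): Seq[Segment] = {
    val byAge  = segments.filter(s => nowMs - s.lastModifiedMs > retentionMs)
    var excess = segments.map(_.sizeBytes).sum - retentionBytes
    val bySize = segments.sortBy(_.baseOffset).takeWhile { s =>
      val take = excess > 0; excess -= s.sizeBytes; take   // drop oldest first until under the limit
    }
    (byAge ++ bySize).distinct
  }
}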

Message Backlog

As described above, messages are cleaned up automatically, so backlog is bounded by the retention settings and message accumulation is not a problem in itself.

Message Priority

Not supported. Message priority feels like a rather unreasonable requirement anyway.

Message Filtering

Not supported. It could be implemented by adding a message tag: the fetch request would carry a tag, and the broker would filter by tag before returning messages to the consumer. A sketch of the idea follows.
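
A crude client-side version of that idea, with a hypothetical tag embedded in the payload (broker-side filtering would need a protocol change to carry the tag in the fetch request):

object TagFilterSketch {
  // Hypothetical tagged payload format: "tag:actual message"; non-matching messages are dropped after the fetch.
  def filterByTag(messages: Seq[String], wantedTag: String): Seq[String] =
    messages.flatMap { m =>
      m.split(":", 2) match {
        case Array(tag, body) if tag == wantedTag => Some(body)
        case _                                    => None
      }
    }

  def main(args: Array[String]): Unit =
    println(filterByTag(Seq("order:created", "user:login", "order:paid"), "order"))
}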

Transactional Messages

Not supported. Implementing transactional messages generally requires the producer to expose at least two operations, precommitMessage and commitMessage (two-phase commit); the producer flow is typically: 1. precommitMessage, 2. do the business operation, 3. commitMessage. Kafka only offers the single commit/send step, so it cannot support this.
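
The missing interface would look roughly like the following; this is purely hypothetical, since Kafka at this point exposes only the single send step:

// Hypothetical transactional producer interface: prepare the message, do the
// business operation, then make the prepared message visible (or discard it).
trait TransactionalProducerSketch {
  def precommitMessage(topic: String, payload: Array[Byte]): Long // returns a txn handle
  def commitMessage(txnHandle: Long): Unit                        // make it visible to consumers
  def rollbackMessage(txnHandle: Long): Unit                      // discard on business failure
}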

References

  • https://kafka.apache.org/documentation.html
  • http://sites.computer.org/debull/A12june/pipeline.pdf
  • http://alibaba.github.io/RocketMQ-docs/document/design/RocketMQ_design.pdf