Big Data Learning [10]: Kafka for Beginners



Abstract: Study notes on the Kafka documentation, based on a read-through and translation of the official Quickstart.
Source: http://kafka.apache.org/quickstart

Quickstart

This tutorial assumes you are starting fresh and have no existing Kafka or ZooKeeper data. Since Kafka console scripts are different for Unix-based and Windows platforms, on Windows platforms use bin\windows\ instead of bin/, and change the script extension to .bat.
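As a sketch of what that looks like (the flags stay the same; only the script path and extension change), the topic-creation command used later in this guide would be run on Windows as:

> bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test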

Step 1: Download the code

Download the 0.11.0.1 release and un-tar it.

> tar -xzf kafka_2.11-0.11.0.1.tgz
> cd kafka_2.11-0.11.0.1

Step 2: Start the server

Kafka uses ZooKeeper, so you need to first start a ZooKeeper server if you don't already have one. You can use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance.

> bin/zookeeper-server-start.sh config/zookeeper.properties
[2013-04-22 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...

Now start the Kafka server:

> bin/kafka-server-start.sh config/server.properties
[2013-04-22 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
...

Step 3: Create a topic

Let's create a topic named "test" with a single partition and only one replica:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

We can now see that topic if we run the list topic command:

> bin/kafka-topics.sh --list --zookeeper localhost:2181
test

Alternatively, instead of manually creating topics you can also configure your brokers to auto-create topics when a non-existent topic is published to.
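The relevant switches live in the broker configuration; as a minimal sketch (property names from the broker configuration reference, values only illustrative), you would set something like the following in config/server.properties before starting the broker:

auto.create.topics.enable=true
num.partitions=1
default.replication.factor=1

With auto-creation enabled, the first publish to an unknown topic creates it using these defaults.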

Step 4: Send some messages

Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default, each line will be sent as a separate message.

Run the producer and then type a few messages into the console to send to the server.

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
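Because the producer reads standard input, the same client can also send the lines of a file, one message per line; for example (messages.txt is a hypothetical file):

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test < messages.txt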

Step 5: Start a consumer

Kafka also has a command line consumer that will dump out messages to standard output.

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a message
This is another message

If you have each of the above commands running in a different terminal, then you should now be able to type messages into the producer terminal and see them appear in the consumer terminal.

All of the command line tools have additional options; running a command with no arguments will display usage information documenting them in more detail.
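For example, invoking the topics tool with no arguments prints its full option list (--create, --list, --describe and so on):

> bin/kafka-topics.sh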

Step 6: Setting up a multi-broker cluster

So far we have been running against a single broker, but that's no fun. For Kafka, a single broker is just a cluster of size one, so nothing much changes other than starting a few more broker instances. But just to get a feel for it, let's expand our cluster to three nodes (still all on our local machine).

First we make a config file for each of the brokers (on Windows use the copy command instead):

> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties

Now edit these new files and set the following properties:

config/server-1.properties:
    broker.id=1
    listeners=PLAINTEXT://:9093
    log.dir=/tmp/kafka-logs-1

config/server-2.properties:
    broker.id=2
    listeners=PLAINTEXT://:9094
    log.dir=/tmp/kafka-logs-2

The broker.id property is the unique and permanent name of each node in the cluster. We have to override the port and log directory only because we are running these all on the same machine, and we want to keep the brokers from all trying to register on the same port or overwrite each other's data.

We already have ZooKeeper and our single node started, so we just need to start the two new nodes:

> bin/kafka-server-start.sh config/server-1.properties &
...
> bin/kafka-server-start.sh config/server-2.properties &
...
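Once they come up, one way to check that all three brokers registered is to list their ids in ZooKeeper with the shell that ships with Kafka (a sketch; /brokers/ids is the path under which Kafka registers broker ids, and the reply should contain 0, 1 and 2):

> bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids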

Now create a new topic with a replication factor of three:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic

Okay, but now that we have a cluster, how can we know which broker is doing what? To see that, run the "describe topics" command:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic   PartitionCount:1    ReplicationFactor:3 Configs:
    Topic: my-replicated-topic  Partition: 0    Leader: 1   Replicas: 1,2,0 Isr: 1,2,0

Here is an explanation of the output. The first line gives a summary of all the partitions; each additional line gives information about one partition. Since we have only one partition for this topic there is only one line.

● "leader" is the node responsible for all reads and writes for the given partition. Each node will be the leader for a randomly selected portion of the partitions.
● "replicas" is the list of nodes that replicate the log for this partition regardless of whether they are the leader or even if they are currently alive.
● "isr" is the set of "in-sync" replicas. This is the subset of the replicas list that is currently alive and caught-up to the leader.

Note that in my example node 1 is the leader for the only partition of the topic.
We can run the same command on the original topic we created to see where it is:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
Topic:test  PartitionCount:1    ReplicationFactor:1 Configs:
    Topic: test Partition: 0    Leader: 0   Replicas: 0 Isr: 0

So there is no surprise there: the original topic has no replicas and is on server 0, the only server in our cluster when we created it.
Let's publish a few messages to our new topic:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic
...
my test message 1
my test message 2
^C

Now let's consume these messages:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
...
my test message 1
my test message 2
^C

Now let's test out fault-tolerance. Broker 1 was acting as the leader, so let's kill it:

> ps aux | grep server-1.properties
7564 ttys002    0:15.91 /System/Library/Frameworks/JavaVM.framework/Versions/1.8/Home/bin/java...
> kill -9 7564

On Windows use:

> wmic process get processid,caption,commandline | find "java.exe" | find "server-1.properties"
java.exe    java  -Xmx1G -Xms1G -server -XX:+UseG1GC ... build\libs\kafka_2.11-0.11.0.1.jar"  kafka.Kafka config\server-1.properties    644
> taskkill /pid 644 /f

Leadership has switched to one of the slaves, and node 1 is no longer in the in-sync replica set:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic   PartitionCount:1    ReplicationFactor:3 Configs:
    Topic: my-replicated-topic  Partition: 0    Leader: 2   Replicas: 1,2,0 Isr: 2,0

But the messages are still available for consumption even though the leader that took the writes originally is down:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
...
my test message 1
my test message 2
^C

Step 7: Use Kafka Connect to import/export data

Writing data from the console and writing it back to the console is a convenient place to start, but you'll probably want to use data from other sources or export data from Kafka to other systems. For many systems, instead of writing custom integration code you can use Kafka Connect to import or export data.

Kafka Connect is a tool included with Kafka that imports and exports data to Kafka. It is an extensible tool that runs connectors, which implement the custom logic for interacting with an external system. In this quickstart we'll see how to run Kafka Connect with simple connectors that import data from a file to a Kafka topic and export data from a Kafka topic to a file.
First, we'll start by creating some seed data to test with:

> echo -e "foo\nbar" > test.txt

Next, we'll start two connectors running in standalone mode, which means they run in a single, local, dedicated process. We provide three configuration files as parameters. The first is always the configuration for the Kafka Connect process, containing common configuration such as the Kafka brokers to connect to and the serialization format for data. The remaining configuration files each specify a connector to create. These files include a unique connector name, the connector class to instantiate, and any other configuration required by the connector.

> bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties
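For reference, the two file-connector configuration files shipped with Kafka contain roughly the following (quoted from memory of the packaged defaults, so treat the exact values as illustrative):

config/connect-file-source.properties:
    name=local-file-source
    connector.class=FileStreamSource
    tasks.max=1
    file=test.txt
    topic=connect-test

config/connect-file-sink.properties:
    name=local-file-sink
    connector.class=FileStreamSink
    tasks.max=1
    file=test.sink.txt
    topics=connect-test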

These sample configuration files, included with Kafka, use the default local cluster configuration you started earlier and create two connectors: the first is a source connector that reads lines from an input file and produces each to a Kafka topic, and the second is a sink connector that reads messages from a Kafka topic and produces each as a line in an output file.

During startup you'll see a number of log messages, including some indicating that the connectors are being instantiated. Once the Kafka Connect process has started, the source connector should start reading lines from test.txt and producing them to the topic connect-test, and the sink connector should start reading messages from the topic connect-test and write them to the file test.sink.txt. We can verify the data has been delivered through the entire pipeline by examining the contents of the output file:

> cat test.sink.txt
foo
bar

Note that the data is being stored in the Kafka topic connect-test, so we can also run a console consumer to see the data in the topic (or use custom consumer code to process it):

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
{"schema":{"type":"string","optional":false},"payload":"foo"}
{"schema":{"type":"string","optional":false},"payload":"bar"}
...

The connectors continue to process data, so we can add data to the file and see it move through the pipeline:

> echo "Another line" >> test.txt

You should see the line appear in the console consumer output and in the sink file.
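One convenient way to watch that happen is to keep the sink file open in yet another terminal, for example:

> tail -f test.sink.txt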

Step 8: Use Kafka Streams to process data

Kafka Streams is a client library for building mission-critical real-time applications and microservices, where the input and/or output data is stored in Kafka clusters. Kafka Streams combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology to make these applications highly scalable, elastic, fault-tolerant, distributed, and much more. The streams quickstart (http://kafka.apache.org/0110/documentation/streams/quickstart) demonstrates how to run a streaming application coded in this library.
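As a taste of what that looks like, the streams quickstart linked above runs the WordCount demo application bundled with Kafka; the demo class can be launched with kafka-run-class.sh (a sketch; see the linked guide for preparing the input and output topics the demo expects):

> bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo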

[Author: happyprince, http://blog.csdn.net/ld326/article/details/78118441]