Reading Kafka data with Presto
- 1-
- 1-1 Configuration
- 1-1-1 Catalog configuration
- 1-1-2 Schema configuration
- 1-2 Startup and usage
- 1-3 Source code analysis
- 1-3-1 metadata
- 1-3-2 Split generation
- 1-3-3 Reading data
1-
This post analyzes the main principles and source code of Presto's Kafka connector.
1-1 Configuration
1-1-1 Catalog configuration
connector.name=kafka
# three brokers started on a single machine, separated by commas
kafka.nodes=localhost:9090,localhost:9091,localhost:9092
# all of the relevant table names; each one is a topic in Kafka
kafka.table-names=tpch2.customer,tpch2.orders,tpch2.lineitem,tpch2.part,tpch2.partsupp,tpch2.supplier,tpch2.nation,tpch2.region
# the connector adds a number of Kafka-related internal columns; this controls whether they are shown to clients
kafka.hide-internal-columns=false
1-1-2 Schema configuration
Taking the customer table as an example, place the following file in the etc/kafka directory; this is the directory KafkaConnectorConfig points at, and the table description is read from there.
The content of etc/kafka/tpch2.customer.json is shown below; it is long, so only part of it is listed:
{ "tableName": "customer", "schemaName": "tpch", "topicName": "tpch.customer", "key": { "dataFormat": "raw", "fields": [ { "name": "kafka_key", "dataFormat": "LONG", "type": "BIGINT", "hidden": "false" } ] }, "message": { "dataFormat": "json", "fields": [ { "name": "row_number", "mapping": "rowNumber", "type": "BIGINT" }, { "name": "customer_key", "mapping": "customerKey", "type": "BIGINT" }, { "name": "name", "mapping": "name", "type": "VARCHAR" }, { "name": "address", "mapping": "address", "type": "VARCHAR" }, { "name": "nation_key", "mapping": "nationKey", "type": "BIGINT" }, { "name": "phone", "mapping": "phone", "type": "VARCHAR" },
1-2 Startup and usage
Start Presto and connect with the client, specifying kafka as the catalog.
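For example, with the Presto CLI (the executable name and server address here just assume a default local setup and are only illustrative):

./presto --server localhost:8080 --catalog kafka --schema tpch2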
show schemas
       Schema
--------------------
 information_schema
 tpch2
(2 rows)
use tpch2;
show tables;
  Table
----------
 customer
 lineitem
 nation
 orders
 part
 partsupp
 region
 supplier
(8 rows)
desc customer;
      Column       |  Type   |                   Comment
-------------------+---------+---------------------------------------------
 kafka_key         | bigint  |
 row_number        | bigint  |
 customer_key      | bigint  |
 name              | varchar |
 address           | varchar |
 nation_key        | bigint  |
 phone             | varchar |
 account_balance   | double  |
 market_segment    | varchar |
 comment           | varchar |
 _partition_id     | bigint  | Partition Id
 _partition_offset | bigint  | Offset for the message within the partition
 _segment_start    | bigint  | Segment start offset
 _segment_end      | bigint  | Segment end offset
 _segment_count    | bigint  | Running message count per segment
 _key              | varchar | Key text
 _key_corrupt      | boolean | Key data is corrupt
 _key_length       | bigint  | Total number of key bytes
 _message          | varchar | Message text
 _message_corrupt  | boolean | Message data is corrupt
 _message_length   | bigint  | Total number of message bytes
(21 rows)
If no JSON description file has been configured for a table, its data columns are not shown.
Query:
select count(*), nation_key from customer group by nation_key;
1-3 Source code analysis
1-3-1 metadata
1-3-2 Split generation
The getSplits method of KafkaSplitManager:
public ConnectorSplitSource getSplits(ConnectorTransactionHandle transaction, ConnectorSession session, ConnectorTableLayoutHandle layout)
{
    KafkaTableHandle kafkaTableHandle = convertLayout(layout).getTable();

    // pick a random node from the configured brokers to fetch the topic metadata
    SimpleConsumer simpleConsumer = consumerManager.getConsumer(selectRandom(nodes));

    TopicMetadataRequest topicMetadataRequest = new TopicMetadataRequest(ImmutableList.of(kafkaTableHandle.getTopicName()));
    // on the broker side this hits Kafka's case RequestKeys.MetadataKey => handleTopicMetadataRequest(request);
    // that method is analyzed in a separate article
    TopicMetadataResponse topicMetadataResponse = simpleConsumer.send(topicMetadataRequest);

    ImmutableList.Builder<ConnectorSplit> splits = ImmutableList.builder();

    for (TopicMetadata metadata : topicMetadataResponse.topicsMetadata()) {
        // partition metadata: mainly the partition id, the partition leader, and the nodes holding its replicas
        for (PartitionMetadata part : metadata.partitionsMetadata()) {
            log.debug("Adding Partition %s/%s", metadata.topic(), part.partitionId());

            Broker leader = part.leader();
            if (leader == null) { // Leader election going on...
                log.warn("No leader for partition %s/%s found!", metadata.topic(), part.partitionId());
                continue;
            }

            HostAddress partitionLeader = HostAddress.fromParts(leader.host(), leader.port());

            // get a consumer for the partition's leader node. In other words, any broker can serve the metadata
            // (is it fetched from ZooKeeper? why not read ZooKeeper directly?), but the partition's offsets
            // have to be fetched from the leader
            SimpleConsumer leaderConsumer = consumerManager.getConsumer(partitionLeader);

            // Kafka contains a reverse list of "end - start" pairs for the splits
            // connect to the leader and fetch the partition's offsets; for example, partition 0 of tpch2.customer is [0, 375]
            long[] offsets = findAllOffsets(leaderConsumer, metadata.topic(), part.partitionId());

            for (int i = offsets.length - 1; i > 0; i--) {
                // use this partition's offset range on the leader to build a Presto split; when the split is read,
                // the worker connects to the leader and reads the messages in this range
                KafkaSplit split = new KafkaSplit(
                        connectorId,
                        metadata.topic(),
                        kafkaTableHandle.getKeyDataFormat(),
                        kafkaTableHandle.getMessageDataFormat(),
                        part.partitionId(),
                        offsets[i],
                        offsets[i - 1],
                        partitionLeader);
                splits.add(split);
            }
        }
    }

    return new FixedSplitSource(splits.build());
}
The findAllOffsets method:
private static long[] findAllOffsets(SimpleConsumer consumer, String topicName, int partitionId)
{
    TopicAndPartition topicAndPartition = new TopicAndPartition(topicName, partitionId);

    // The API implies that this will always return all of the offsets. So it seems a partition can not have
    // more than Integer.MAX_VALUE-1 segments.
    //
    // This also assumes that the lowest value returned will be the first segment available. So if segments have been
    // dropped off, this value should not be 0.
    //
    // kafka.api.OffsetRequest.LatestTime() means the most recent offsets are requested
    PartitionOffsetRequestInfo partitionOffsetRequestInfo = new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.LatestTime(), Integer.MAX_VALUE);
    OffsetRequest offsetRequest = new OffsetRequest(ImmutableMap.of(topicAndPartition, partitionOffsetRequestInfo), kafka.api.OffsetRequest.CurrentVersion(), consumer.clientId());
    // on the broker side this hits Kafka's case RequestKeys.OffsetsKey => handleOffsetRequest(request)
    OffsetResponse offsetResponse = consumer.getOffsetsBefore(offsetRequest);

    if (offsetResponse.hasError()) {
        short errorCode = offsetResponse.errorCode(topicName, partitionId);
        log.warn("Offset response has error: %d", errorCode);
        throw new PrestoException(KAFKA_SPLIT_ERROR, "could not fetch data from Kafka, error code is '" + errorCode + "'");
    }

    return offsetResponse.offsets(topicName, partitionId);
}
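For a latest-time offset request, Kafka returns the partition's offsets in descending order: the log-end offset first, followed by the start offset of each older segment. getSplits then walks that array backwards, turning each adjacent pair into one half-open offset range. A minimal, self-contained sketch of that pairing (the offsets array here is made up for illustration):

public class SplitRangeDemo
{
    public static void main(String[] args)
    {
        // descending order, as returned for a latest-time offset request:
        // log-end offset first, then the start offset of each older segment
        long[] offsets = {375, 200, 0};

        // the same backwards walk as in getSplits: each adjacent pair becomes one split
        for (int i = offsets.length - 1; i > 0; i--) {
            System.out.printf("split over offsets [%d, %d)%n", offsets[i], offsets[i - 1]);
        }
        // prints:
        // split over offsets [0, 200)
        // split over offsets [200, 375)
    }
}

Because everything is computed up front and handed back in a FixedSplitSource, the set of splits for a table is fixed at planning time: one split per segment range per partition.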
1-3-3 Reading data
The advanceNextPosition method of the record cursor created by KafkaRecordSet:
@Override
public boolean advanceNextPosition()
{
    while (true) {
        if (cursorOffset >= split.getEnd()) {
            return endOfData(); // Split end is exclusive.
        }
        // Create a fetch request
        openFetchRequest();

        while (messageAndOffsetIterator.hasNext()) {
            MessageAndOffset currentMessageAndOffset = messageAndOffsetIterator.next();
            long messageOffset = currentMessageAndOffset.offset();

            if (messageOffset >= split.getEnd()) {
                return endOfData(); // Past our split end. Bail.
            }

            if (messageOffset >= cursorOffset) {
                return nextRow(currentMessageAndOffset);
            }
        }
        messageAndOffsetIterator = null;
    }
}
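openFetchRequest is not shown above; roughly, it lazily issues a fetch to the partition leader for the bytes starting at the current cursor offset and resets messageAndOffsetIterator. A sketch of what it might look like against the old Kafka 0.8 SimpleConsumer API (KAFKA_READ_BUFFER_SIZE and clientId are assumed fields of the cursor, shown only to illustrate the flow, not the exact Presto source):

// sketch only - field and constant names are assumptions
private void openFetchRequest()
{
    if (messageAndOffsetIterator == null) {
        // ask the partition leader for up to KAFKA_READ_BUFFER_SIZE bytes starting at the cursor offset
        kafka.api.FetchRequest req = new kafka.api.FetchRequestBuilder()
                .clientId(clientId)
                .addFetch(split.getTopicName(), split.getPartitionId(), cursorOffset, KAFKA_READ_BUFFER_SIZE)
                .build();

        kafka.javaapi.consumer.SimpleConsumer leaderConsumer = consumerManager.getConsumer(split.getLeader());
        kafka.javaapi.FetchResponse fetchResponse = leaderConsumer.fetch(req);
        if (fetchResponse.hasError()) {
            throw new PrestoException(KAFKA_SPLIT_ERROR,
                    "could not fetch data from Kafka, error code is '"
                            + fetchResponse.errorCode(split.getTopicName(), split.getPartitionId()) + "'");
        }

        // the iterator that advanceNextPosition drains; setting it to null triggers the next fetch
        messageAndOffsetIterator = fetchResponse.messageSet(split.getTopicName(), split.getPartitionId()).iterator();
    }
}

Presumably nextRow advances cursorOffset past the message it just consumed, so when the iterator is exhausted the outer while (true) loop nulls it out and the next call to openFetchRequest fetches the remainder of the split.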