Spark Streaming Kafka data loss problem
To keep Spark Streaming from losing data, we manage the offsets ourselves,
using a scheme that manually commits offsets to ZooKeeper:
2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.streaming.kafka.MyKafkaRDD INFO:Computing topic datamining, partition 8 offsets 3883 -> 3903
The offset error appeared here once, and by the time of the next error the offset had already advanced, which means the Kafka data in between was lost.
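One way to keep the offset from advancing past records that were never successfully processed is to commit to ZooKeeper only *after* the batch's output action completes. The sketch below is an assumption-laden illustration, not the code from this post: it assumes a direct-stream setup on the Kafka 0.8-era `spark-streaming-kafka` API, and `stream`, `zkClient`, `groupId`, and `processBatch` are hypothetical placeholders for the application's own objects.

```scala
import kafka.utils.{ZKGroupTopicDirs, ZkUtils}
import org.I0Itec.zkclient.ZkClient
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.HasOffsetRanges

// Commit offsets to ZK only after the batch succeeds. `processBatch`
// stands in for the application's real output action.
def runWithManualCommit[T](stream: InputDStream[T],
                           zkClient: ZkClient,
                           groupId: String)
                          (processBatch: RDD[T] => Unit): Unit = {
  stream.foreachRDD { rdd =>
    // Capture the per-partition offset ranges before any transformation.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // If this throws, no commit happens and the same range is retried
    // on the next attempt, instead of being silently skipped.
    processBatch(rdd)

    // Only after success, persist each partition's untilOffset under the
    // 0.8-style path /consumers/<group>/offsets/<topic>/<partition>.
    offsetRanges.foreach { or =>
      val dirs = new ZKGroupTopicDirs(groupId, or.topic)
      ZkUtils.updatePersistentPath(zkClient,
        s"${dirs.consumerOffsetDir}/${or.partition}",
        or.untilOffset.toString)
    }
  }
}
```

With this ordering a network failure like the one in the log below fails the task, the commit never runs, and the range 3883 -> 3903 would be reprocessed (at-least-once) rather than lost.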
TODO: test whether the built-in checkpoint mechanism exhibits the same problem.
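For that comparison test, the built-in approach would look roughly like this: with `StreamingContext.getOrCreate`, a restarted driver rebuilds the context (including pending batches and their offset ranges) from the checkpoint directory instead of starting fresh. The checkpoint path and app name below are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("datamining-streaming")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... build the Kafka direct stream and output operations here ...
  ssc
}

// On a clean start this calls createContext(); after a crash it
// restores the previous context from the checkpoint instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```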
A log excerpt is attached below. While this data was being processed, the network connection dropped and the consumer's fetch connection was broken; the data was lost, yet the offset still advanced past it.
2017-10-26 11:46:22 Executor task launch worker-1 kafka.utils.VerifiableProperties INFO:Property zookeeper.connect is overridden to
2017-10-26 11:46:22 task-result-getter-3 org.apache.spark.scheduler.TaskSetManager WARN:Lost task 2.0 in stage 494.0 (TID 1972, localhost): java.nio.channels.ClosedChannelException
    at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
    at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:98)
    at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:83)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:132)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:131)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:130)
    at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.fetchBatch(MyKafkaRDD.scala:192)
    at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.getNext(MyKafkaRDD.scala:208)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.executor.Executor ERROR:Exception in task 3.0 in stage 494.0 (TID 1973)
java.nio.channels.ClosedChannelException
    at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
    at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:98)
    at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:83)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:132)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:131)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:130)
    at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.fetchBatch(MyKafkaRDD.scala:192)
    at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.getNext(MyKafkaRDD.scala:208)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
2017-10-26 11:46:22 dispatcher-event-loop-3 org.apache.spark.scheduler.TaskSetManager INFO:Starting task 2.0 in stage 496.0 (TID 1977, localhost, partition 2,ANY, 2004 bytes)
2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.executor.Executor INFO:Running task 2.0 in stage 496.0 (TID 1977)
2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.streaming.kafka.MyKafkaRDD INFO:Computing topic datamining, partition 8 offsets 3883 -> 3903
2017-10-26 11:46:22 Executor task launch worker-3 kafka.utils.VerifiableProperties INFO:Verifying properties