Spark + HBase bulk-load problem: java.io.IOException: Non-increasing Bloom keys
Source: Internet · Editor: 程序博客网 · Time: 2024/06/14 05:28
1 Problem description
While bulk-loading data into HBase with Spark, the job failed with the following error:
17/05/19 14:47:26 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 12.0 (TID 79, bydslave5, executor 3): java.io.IOException: Non-increasing Bloom keys: 80a01055HAXMTXG10100001KEY_VOLTAGE_T_C_PWR after af401055HAXMTXG10100001KEY_VOLTAGE_TEC_PWR
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.appendGeneralBloomfilter(StoreFile.java:911)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:947)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:199)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:152)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
When does it occur? It appeared after running the following statement:
val rdd = sc.textFile("/data/produce/2015/service.log.2017-04-24-08")
  .map(_.split("@"))
  .map { x => (DigestUtils.md5Hex(x(0) + x(1)).substring(0, 3) + x(0) + x(1), x(2)) }
  .map { x =>
    val kv: KeyValue = new KeyValue(Bytes.toBytes(x._1), Bytes.toBytes("v"),
      Bytes.toBytes("value"), Bytes.toBytes(x._2 + ""))
    (new ImmutableBytesWritable(kv.getKey), kv)
  }

rdd.saveAsNewAPIHadoopFile("/tmp/data1", classOf[ImmutableBytesWritable],
  classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())
The error message tells us the rowkeys were not written in increasing order: the HFile writer (including its Bloom filter) requires keys to be appended in sorted byte order. The keys come out unordered because, unlike MapReduce, Spark does not force a sort during its shuffle stage. To fix this we must sort the data ourselves, which only requires calling sortBy on the RDD's key before writing.
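To see why the writer rejected this pair, the two rowkeys from the error message can be compared with an unsigned lexicographic byte comparison, which is the ordering HBase uses for rowkeys (a standalone sketch mimicking the comparison, not HBase's actual code):

```java
import java.nio.charset.StandardCharsets;

public class BloomKeyOrder {
    // Unsigned lexicographic comparison, analogous to HBase's Bytes.compareTo
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff;
            if (x != y) return x - y;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // The key that was already written, and the key appended after it
        byte[] written = "af401055HAXMTXG10100001KEY_VOLTAGE_TEC_PWR".getBytes(StandardCharsets.UTF_8);
        byte[] next    = "80a01055HAXMTXG10100001KEY_VOLTAGE_T_C_PWR".getBytes(StandardCharsets.UTF_8);
        // '8' (0x38) sorts before 'a' (0x61), so next < written:
        // appending it after "written" is exactly the "Non-increasing" case
        System.out.println(compareUnsigned(next, written) < 0); // prints "true"
    }
}
```

Because the MD5 salt prefix is effectively random, unsorted partitions will routinely produce such decreasing pairs.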
2 Solution
The statement below adds the sort; it has been tested and runs successfully:
val rdd = sc.textFile("/data/produce/2015/service.log.2017-04-24-08")
  .map(_.split("@"))
  .map { x => (DigestUtils.md5Hex(x(0) + x(1)).substring(0, 3) + x(0) + x(1), x(2)) }
  .sortBy(x => x._1)
  .map { x =>
    val kv: KeyValue = new KeyValue(Bytes.toBytes(x._1), Bytes.toBytes("v"),
      Bytes.toBytes("value"), Bytes.toBytes(x._2 + ""))
    (new ImmutableBytesWritable(kv.getKey), kv)
  }

rdd.saveAsNewAPIHadoopFile("/tmp/data1", classOf[ImmutableBytesWritable],
  classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())
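The salt-then-sort logic can be illustrated outside Spark. The sketch below (hypothetical input IDs, not the author's data) builds keys the same way — a 3-character MD5-hex prefix plus the original key — and then sorts them, which plays the role of `sortBy(x => x._1)` above:

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.nio.charset.StandardCharsets;

public class SaltedKeySort {
    // MD5-hex prefix salt, mirroring DigestUtils.md5Hex(...).substring(0, 3)
    static String salt(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8)))
                hex.append(String.format("%02x", b));
            return hex.substring(0, 3) + s;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical device/timestamp IDs standing in for x(0) + x(1)
        List<String> keys = new ArrayList<>();
        for (String id : new String[]{"dev1", "dev2", "dev3", "dev4"})
            keys.add(salt(id));

        Collections.sort(keys); // same effect as rdd.sortBy(x => x._1)

        // After sorting, every adjacent pair is non-decreasing, so the
        // HFile writer's "increasing keys" requirement is satisfied
        for (int i = 1; i < keys.size(); i++)
            assert keys.get(i - 1).compareTo(keys.get(i)) <= 0;
        System.out.println(keys);
    }
}
```

Note that `String.compareTo` matches byte order here only because the keys are ASCII; real rowkeys should be compared as bytes. Also, `sortBy` in Spark gives a global sort across partitions, which is what the HFileOutputFormat writer needs within each output file.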