Spark error: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE


17/08/25 18:35:42 WARN scheduler.TaskSetManager: Lost task 1.1 in stage 176.0 (TID 25544, 192.168.3.20, executor 290): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
        at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
        at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
        at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:462)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:698)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

The exception means that the data in a single partition exceeded Integer.MAX_VALUE (2147483647 bytes, i.e. 2 GB).
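
The 2 GB figure comes from the JVM rather than from Spark's own bookkeeping: as the trace shows, DiskStore.getBytes memory-maps the on-disk block via FileChannel.map, and the JDK rejects any mapping request larger than Integer.MAX_VALUE bytes. A minimal Scala sketch of that JVM-level check (the temp file is only a stand-in to obtain a channel; it is not part of the original job):

    import java.io.{File, RandomAccessFile}
    import java.nio.channels.FileChannel

    object MapLimitDemo {
      def main(args: Array[String]): Unit = {
        // Scratch file purely to obtain a FileChannel; its contents are
        // irrelevant because the size check fires before the file is touched.
        val file = File.createTempFile("block-demo", ".bin")
        file.deleteOnExit()
        val channel = new RandomAccessFile(file, "r").getChannel
        try {
          // sun.nio.ch.FileChannelImpl.map rejects any requested length above
          // Integer.MAX_VALUE -- the same check that fails in DiskStore.getBytes.
          channel.map(FileChannel.MapMode.READ_ONLY, 0, Integer.MAX_VALUE.toLong + 1)
        } catch {
          case e: IllegalArgumentException =>
            println(e.getMessage) // prints "Size exceeds Integer.MAX_VALUE"
        } finally {
          channel.close()
        }
      }
    }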

The job had been running fine and then suddenly started failing. Nothing else had changed except that the input data had grown, so the larger data volume was the prime suspect. A review of the code showed that, to reduce the number of small output files, the job called .repartition(20) before writing. Further reading confirmed that Spark places a limit on RDDs: a single partition may not exceed 2 GB. Increasing the partition count solved the problem.
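
A minimal sketch of the change (the HDFS paths and the value 200 are illustrative assumptions; the original post does not show the actual job code):

    import org.apache.spark.sql.SparkSession

    object RepartitionFix {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("repartition-fix").getOrCreate()
        val data = spark.sparkContext.textFile("hdfs:///path/to/input") // hypothetical path

        // Old version: 20 partitions kept the output file count low, but once
        // the input grew, a single partition exceeded 2 GB and the job failed.
        // data.repartition(20).saveAsTextFile("hdfs:///path/to/output")

        // Fix: raise the partition count so each partition stays well below
        // 2 GB. 200 is illustrative; choose it from the actual data volume.
        data.repartition(200).saveAsTextFile("hdfs:///path/to/output")

        spark.stop()
      }
    }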

This limit is somewhat reasonable. Operations on an RDD's partitions run in parallel, so too few partitions means too little parallelism, which in turn caps computing efficiency. Given this limit, Spark application developers tend to proactively increase the partition count, i.e. raise the level of parallelism, and ultimately improve performance (my personal understanding, which may be somewhat one-sided).
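
One way to put that sizing into numbers (the helper and the 256 MB target below are my own assumptions, not from the original post): pick the partition count from an estimate of the total data volume so that each partition lands far below the 2 GB block limit.

    // Choose a partition count from an estimated total size and a target
    // per-partition size that stays far below the 2 GB limit.
    def partitionsFor(totalBytes: Long,
                      targetBytesPerPartition: Long = 256L * 1024 * 1024): Int =
      math.max(1, math.ceil(totalBytes.toDouble / targetBytesPerPartition).toInt)

    // Example: ~1 TB of input at ~256 MB per partition -> 4096 partitions.
    val numPartitions = partitionsFor(1024L * 1024 * 1024 * 1024)
    // data.repartition(numPartitions)   // hypothetical use in the job above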

