Streaming Image Matting with Spark Streaming

1. Approach

The matting algorithm is written in C++, so it is compiled into a .so, packed into the jar, and loaded dynamically at runtime; that part is already implemented. Matting already runs on Hadoop 2.2.0 and Spark 0.9, and the goal now is to do it on a stream of images. So far I can think of two approaches:

1) Whenever a background image is needed, read it from HDFS. This is the easier way to implement, but the repeated I/O is expensive.

2) Keep the background images in memory and broadcast them, so they never need to be re-read. This is mostly implemented; a sketch follows this list.
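A minimal sketch of approach 2, given a SparkContext sc. Qimage is the image class from the existing matting job; QimageInputFormat and the HDFS path are assumptions for illustration:

    import org.apache.hadoop.io.Text

    // Read every background image from HDFS once (QimageInputFormat is assumed
    // to be the custom input format already used by the batch matting job).
    val bgRdd = sc.newAPIHadoopFile[Text, Qimage, QimageInputFormat](
      "hdfs://master:9000/user/qimage/backgrounds")

    // Pull the (small) background set to the driver, converting the Text keys
    // to String so nothing non-serializable ends up in the broadcast.
    val bgMap = bgRdd.map { case (k, v) => (k.toString, v) }.collectAsMap()

    // Ship one read-only copy to every executor; Qimage itself must be
    // serializable for the broadcast to work.
    val bgBroadcast = sc.broadcast(bgMap)

    // In the streaming job, fetch a background locally instead of from HDFS:
    // val bg = bgBroadcast.value("bg_8")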

2. Problems

2.1 Approach 1

With the first approach, as soon as the number of images grows, some images fail to be read. What I found is that inside the image-reading function, healthy records all report a non-zero size, while the failing ones get a split of size 0, as shown below:

key = MSingle_1_123 , InputSplit.length = 984977 , FileSplit.length = 984977
key = MSingle_0_123 , InputSplit.length = 993543 , FileSplit.length = 993543
key = MSingle_1_116 , InputSplit.length = 988187 , FileSplit.length = 988187
key = MSingle_1_119 , InputSplit.length = 983012 , FileSplit.length = 983012
key = MSingle_1_122 , InputSplit.length = 984769 , FileSplit.length = 984769
key = MSingle_2_116 , InputSplit.length = 0 , FileSplit.length = 0
key = MSingle_2_116 , InputSplit.length = 0 , FileSplit.length = 0
key = MSingle_2_116 , InputSplit.length = 0 , FileSplit.length = 0
key = MSingle_2_116 , InputSplit.length = 0 , FileSplit.length = 0
key = MSingle_2_116 , InputSplit.length = 0 , FileSplit.length = 0

If the src image is read in with size 0, the matting step is missing its source image and inevitably throws a NullPointerException; a defensive workaround is sketched below.
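Until the root cause of the empty splits is found, a defensive filter can keep bad records from killing the task. This is only a sketch: imageStream stands for the DStream[(Text, Qimage)] of incoming images, getLength for an assumed size accessor on Qimage, and matting for the JNI call into the C++ library:

    // Drop records whose payload came back empty (split length 0) instead of
    // letting the native matting call dereference a null image.
    val validImages = imageStream.filter { case (key, img) =>
      val ok = img != null && img.getLength > 0
      if (!ok) println("skipping empty image: " + key)
      ok
    }

    // Only well-formed source images reach the matting step.
    val matted = validImages.map { case (key, img) =>
      (key, matting(img, bgBroadcast.value("bg_0"))) // bg key is illustrative
    }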

 

The error log is as follows:

14/05/09 19:49:24 WARN TaskSetManager: Loss was due to java.lang.NullPointerException
java.lang.NullPointerException

14/05/09 19:49:24 INFO TaskSetManager: Starting task 17.0:1 as TID 135 on executor 2: slave3 (PROCESS_LOCAL)
14/05/09 19:49:24 INFO TaskSetManager: Serialized task 17.0:1 as 11920 bytes in 0 ms
14/05/09 19:49:25 WARN TaskSetManager: Lost TID 135 (task 17.0:1)
14/05/09 19:49:25 INFO TaskSetManager: Loss was due to java.lang.NullPointerException [duplicate 1]
14/05/09 19:49:25 INFO TaskSetManager: Starting task 17.0:1 as TID 136 on executor 2: slave3 (PROCESS_LOCAL)
14/05/09 19:49:25 INFO TaskSetManager: Serialized task 17.0:1 as 11920 bytes in 0 ms
14/05/09 19:49:25 WARN TaskSetManager: Lost TID 136 (task 17.0:1)
14/05/09 19:49:25 INFO TaskSetManager: Loss was due to java.lang.NullPointerException [duplicate 2]
14/05/09 19:49:25 INFO TaskSetManager: Starting task 17.0:1 as TID 137 on executor 2: slave3 (PROCESS_LOCAL)
14/05/09 19:49:25 INFO TaskSetManager: Serialized task 17.0:1 as 11920 bytes in 0 ms
14/05/09 19:49:25 WARN TaskSetManager: Lost TID 137 (task 17.0:1)
14/05/09 19:49:25 INFO TaskSetManager: Loss was due to java.lang.NullPointerException [duplicate 3]
14/05/09 19:49:25 ERROR TaskSetManager: Task 17.0:1 failed 4 times; aborting job
14/05/09 19:49:25 INFO TaskSchedulerImpl: Remove TaskSet 17.0 from pool
14/05/09 19:49:25 INFO DAGScheduler: Failed to run saveAsNewAPIHadoopFile at SparkStreamQimage.scala:52
b = 16
Rdd.size is MappedRDD[257] at map at SparkStreamQimage.scala:50
14/05/09 19:49:25 INFO JobScheduler: Starting job streaming job 1399636130000 ms.0 from job set of time 1399636130000 ms
14/05/09 19:49:25 ERROR JobScheduler: Error running job streaming job 1399636128000 ms.0
org.apache.spark.SparkException: Job aborted: Task 17.0:1 failed 4 times (most recent failure: Exception failure: java.lang.NullPointerException)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2.2 Approach 2

With this approach, loading the background images is slow only the first time, at roughly 20 seconds; every later read takes only about 6 seconds.
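The drop from roughly 20 seconds to 6 suggests the background data is cached after the first read. If it is not already, caching can be made explicit; a sketch under the same assumptions as before:

    // Keep the background RDD in executor memory so only the first action
    // pays the full HDFS read cost.
    val bgRdd = sc.newAPIHadoopFile[Text, Qimage, QimageInputFormat](
      "hdfs://master:9000/user/qimage/backgrounds").cache()

    bgRdd.count() // force the slow first materialization up front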

The background images are stored in an RDD[(Text, Qimage)], but calling the lookup function on it throws: java.io.NotSerializableException: org.apache.hadoop.io.Text.

The lookup prototype, from the official Spark API docs:

def lookup(key: K): Seq[V]

Return the list of values in the RDD for key `key`. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
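A minimal usage example (keys and values are illustrative):

    import org.apache.spark.SparkContext._ // brings lookup() onto pair RDDs

    val pairs = sc.parallelize(Seq(("bg_0", 1), ("bg_8", 2), ("bg_8", 3)))
    pairs.lookup("bg_8") // Seq(2, 3): every value stored under that key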

The lookup source code, from Spark 0.9:

/**
 * Return the list of values in the RDD for key `key`. This operation is done efficiently if the
 * RDD has a known partitioner by only searching the partition that the key maps to.
 */
def lookup(key: K): Seq[V] = {
  self.partitioner match {
    case Some(p) =>
      val index = p.getPartition(key)
      def process(it: Iterator[(K, V)]): Seq[V] = {
        val buf = new ArrayBuffer[V]
        for ((k, v) <- it if k == key) {
          buf += v
        }
        buf
      }
      val res = self.context.runJob(self, process _, Array(index), false)
      res(0)
    case None =>
      self.filter(_._1 == key).map(_._2).collect()
  }
}

Here I converted the RDD[(Text, Qimage)] into an RDD[(String, Qimage)] and then called lookup. That does work, but it costs a fair amount of time at runtime.
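The conversion itself is a one-line map over the keys; a sketch with bgRdd: RDD[(Text, Qimage)] as before:

    import org.apache.spark.SparkContext._
    import org.apache.spark.HashPartitioner

    // hadoop Text is not java.io.Serializable, so convert the keys before
    // lookup() has to ship the key inside a task closure.
    val bgByName = bgRdd.map { case (k, v) => (k.toString, v) }
    val bg = bgByName.lookup("bg_8")

Note that bgByName has no partitioner, so by the source above lookup falls into the case None branch and filter-scans every partition, which may account for the extra time. Pre-partitioning the pairs, e.g. bgByName.partitionBy(new HashPartitioner(8)).cache(), lets lookup search only the single partition the key maps to.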

When running this approach, even more images go unprocessed than with the first approach, and processing is slower; with small batches there is no problem at all. The failure from the first approach also reappears: the source image is not read in. The debug output is as follows:

key = bg_8 , InputSplit.length = 962073 , FileSplit.length = 962073
key = MSingle_1_123 , InputSplit.length = 0 , FileSplit.length = 0
image_src is null
key = MSingle_1_123 , InputSplit.length = 0 , FileSplit.length = 0
image_src is null
key = MSingle_1_123 , InputSplit.length = 0 , FileSplit.length = 0
image_src is null
key = MSingle_1_123 , InputSplit.length = 0 , FileSplit.length = 0
image_src is null
key = MSingle_2_117 , InputSplit.length = 0 , FileSplit.length = 0
key = MSingle_2_116 , InputSplit.length = 963401 , FileSplit.length = 963401
key = bg_0 , InputSplit.length = 987983 , FileSplit.length = 987983

This problem is still unsolved; pointers from anyone more experienced would be much appreciated.

3. Attached screenshots

3.1 Run timings

This screenshot is sorted by time consumed; writing the output files is by far the most expensive step!

3.2 Results

The input contained 200 PNG images, but the output holds only 197 images; the one remaining file is _SUCCESS. In other words, 3 images were lost.

3.3 Errors

