Nested loops or two passes inside Spark mapPartitions (duplicate)

Source: Internet | Published by: centos关机命令 | Editor: 程序博客网 | Time: 2024/06/06 19:35

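In Spark, computation is often done inside mapPartitions, so that each partition's data can be processed locally without any network transfer.

The function passed to mapPartitions receives the partition's data as a Scala Iterator.

A Scala Iterator can be traversed only once. After one pass it is exhausted, so the data cannot be traversed twice inside mapPartitions.

For example, computing both the maximum and the minimum within a single mapPartitions call: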

    val it = Iterator(20, 40, 2, 50, 69, 90)
    println("Maximum valued element " + it.max) // 90
    println("Minimum valued element " + it.min) // fails: it.max already exhausted the iterator

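Nested loops behave the same way: if both loops run over the same Iterator, the first run of the inner loop consumes all remaining elements, so the outer loop exits right after its first iteration.

(This matters a great deal when the computation needs every pair of elements.)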
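The effect can be reproduced in plain Scala, without Spark. The sketch below is only an illustration: `pairsShared` and `pairsMaterialized` are hypothetical names, not part of any library. The first runs both loops over the same Iterator, the second copies the elements into a List before looping.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical helper: outer and inner loop pull from the SAME iterator.
def pairsShared(it: Iterator[Int]): List[(Int, Int)] = {
  val out = ListBuffer.empty[(Int, Int)]
  while (it.hasNext) {
    val a = it.next()
    // The inner loop drains everything that is left in `it`,
    // so the outer condition is false when control returns.
    while (it.hasNext) {
      out += ((a, it.next()))
    }
  }
  out.toList
}

// Hypothetical helper: materialize first, then loop as often as needed.
def pairsMaterialized(it: Iterator[Int]): List[(Int, Int)] = {
  val xs = it.toList
  for (a <- xs; b <- xs) yield (a, b)
}

val shared = pairsShared(Iterator(1, 2, 3))
val full = pairsMaterialized(Iterator(1, 2, 3))
println(shared.size) // 2: only (1,2) and (1,3) were produced
println(full.size)   // 9: all 3 x 3 pairs
```

With three elements the shared-iterator version yields only the pairs that start with the first element, while the materialized version yields all nine.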

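So whenever two passes or nested loops are needed, the Iterator has to be copied first.

This is done with the Iterator method duplicate, or by materializing the data with iter.toList.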
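As a rough sketch, both options can be applied to the max/min case from earlier. The names `minMaxWithDuplicate` and `minMaxWithList` are hypothetical. Each function has the shape `Iterator[Int] => Iterator[(Int, Int)]`, which is what mapPartitions expects for an RDD of Int, but the snippet uses plain Scala iterators so it runs without Spark.

```scala
// Hypothetical helper using duplicate: it returns two independent iterators
// over the same elements. The original `it` must not be used afterwards.
def minMaxWithDuplicate(it: Iterator[Int]): Iterator[(Int, Int)] =
  if (!it.hasNext) Iterator.empty // empty partition: nothing to report
  else {
    val (forMin, forMax) = it.duplicate
    Iterator((forMin.min, forMax.max))
  }

// Hypothetical helper using toList: a List can be traversed any number of times.
def minMaxWithList(it: Iterator[Int]): Iterator[(Int, Int)] = {
  val xs = it.toList
  if (xs.isEmpty) Iterator.empty else Iterator((xs.min, xs.max))
}

val viaDuplicate = minMaxWithDuplicate(Iterator(20, 40, 2, 50, 69, 90)).toList
val viaList = minMaxWithList(Iterator(20, 40, 2, 50, 69, 90)).toList
println(viaDuplicate) // List((2,90))
println(viaList)      // List((2,90))
```

Neither option is free: duplicate buffers the elements that one copy has consumed and the other has not, so draining one copy completely before touching the other holds the whole partition in memory, much like toList does.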

A complete example follows (computing a KNN Gaussian kernel density):

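    // Imports assume Breeze, which the generic DenseVector[Double] type suggests.
    import breeze.linalg.{DenseVector, sum}
    import breeze.numerics.{exp, pow}
    import scala.collection.mutable.ArrayBuffer

    // C is the kernel bandwidth constant. The original post does not show its
    // definition, so it must be defined elsewhere in scope.
    def gaussianKernel(iterator: Iterator[DenseVector[Double]]): Iterator[Tuple2[DenseVector[Double], Double]] = {
      var res = List[(DenseVector[Double], Double)]()
      // Copy the partition iterator once per traversal needed below.
      val (bakiter, curiter) = iterator.duplicate
      val (sizeiter, tmpiter) = bakiter.duplicate
      val tmplist = tmpiter.toList      // points for the inner loop
      val curlist = curiter.toList      // points for the outer loop
      val size = sizeiter.size          // number of points in this partition
      val k = math.sqrt(size).toInt     // number of nearest neighbours to sum
      curlist.foreach { cur =>
        var sumtmp = 0.0
        val abfDist = ArrayBuffer[Double]()
        // Gaussian kernel value between cur and every point in the partition
        tmplist.foreach { tmp =>
          abfDist += exp(-sum(pow(cur - tmp, 2)) / (2.0 * C))
        }
        // Ascending sort: the k largest kernel values (closest points) sit at the end
        val abfDistSorted = abfDist.sorted
        for (i <- 0 until k) {
          sumtmp += abfDistSorted(size - 1 - i)
        }
        // Prepend the (point, density) pair; results come out in reverse input order
        res ::= ((cur, sumtmp))
      }
      res.iterator
    }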

0 0