Spark MLlib Linear Regression Source Code Study


Understanding Regression

Regression is essentially estimating the unknown parameters of a known formula (gradient descent is an iterative approach; least squares can also be solved this way). You can think of it like this: given training sample points and a known formula with one or more unknown parameters, the machine searches the possible values of the parameters (for several parameters, their different combinations) until it finds the value, or combination, that best matches the distribution of the sample points. (In practice, of course, optimization algorithms are used; nothing actually enumerates every value.) Note that regression presupposes that the form of the formula is known; without it, regression cannot proceed. In real life, known formulas are rare (even G = m*g only occurred to Newton after the apple hit his head, right?), so the formulas used in regression are usually hypothesized by analysts after inspecting large amounts of data (in truth, often just an educated guess). Depending on the form of the formula, regression splits into linear and nonlinear regression. In linear regression the formula is of first degree (a linear equation in one variable, in two variables, and so on), while nonlinear regression can take many forms (degree-N equations in N variables, log equations, and so on).
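A quick illustration of that taxonomy, with the unknown parameters written as θ, a, b:

    y = \theta_0 + \theta_1 x_1 + \theta_2 x_2    % linear: first degree in each variable
    y = \theta_0 + \theta_1 x + \theta_2 x^2      % nonlinear: second degree
    y = a \log x + b                              % nonlinear: log form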

Mathematical Model

[Figure: a typical example]
[Figure: mathematical formulation]
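Concretely, for a feature vector x = (x_1, ..., x_n) and parameter vector θ, the standard linear regression hypothesis is:

    h_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^T x

Fitting the model means choosing θ so that h_\theta(x) matches the observed labels as closely as possible.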

Batch Gradient Descent

Loss function
The problem of minimizing J(θ) leads to gradient descent:
[Figure: gradient descent]
[Figure: derivation]
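The quantities behind the figures are the standard least-squares loss over m training samples (x^{(i)}, y^{(i)}) and its gradient step with learning rate α:

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2

    \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}
              = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr) x_j^{(i)}

Every update of every parameter θ_j sums over all m samples, which is where the per-iteration cost discussed next comes from.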

Stochastic Gradient Descent

When the number of samples m is large, each iteration of batch gradient descent costs O(mn), which is expensive.
[Figure: stochastic gradient descent pseudocode]
That is, the parameters θ are updated after reading each single sample, so one update costs only O(n).
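The update the pseudocode describes touches one sample at a time: for a single sample (x^{(i)}, y^{(i)}), each parameter is adjusted by that sample's gradient alone:

    \theta_j := \theta_j - \alpha \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr) x_j^{(i)}, \quad j = 0, 1, \ldots, n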

Source Code Analysis

MLlib's linear regression model uses stochastic gradient descent to optimize the objective function. MLlib implements a distributed version of SGD, which works as follows: in each iteration, a fixed fraction of the samples is drawn at random as the computation set for that iteration; the gradient of each sample in that set is computed (the per-sample gradients are computed in a distributed fashion); the gradients are then summed by an aggregation function to obtain the average gradient and loss over the mini-batch; finally, the weights are updated from this latest gradient and the weights of the previous iteration.
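As a minimal local sketch of that flow (plain Scala without Spark; Sample, dot, and sgdStep are illustrative names, not MLlib APIs), one step samples a mini-batch, averages the per-sample least-squares gradients, and moves the weights:

import scala.util.Random

object MiniBatchSketch {
  final case class Sample(label: Double, features: Array[Double])

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // One mini-batch SGD step: sample a fraction of the data, average the
  // per-sample gradients of the least-squares loss, and step against them.
  def sgdStep(
      data: Seq[Sample],
      weights: Array[Double],
      stepSize: Double,
      miniBatchFraction: Double,
      rng: Random): Array[Double] = {
    val batch = data.filter(_ => rng.nextDouble() < miniBatchFraction)
    if (batch.isEmpty) weights
    else {
      // gradient of (w.x - y)^2 / 2 with respect to w is (w.x - y) * x
      val gradSum = batch.map { s =>
        val err = dot(weights, s.features) - s.label
        s.features.map(_ * err)
      }.reduce((g1, g2) => g1.zip(g2).map { case (a, b) => a + b })
      weights.zip(gradSum).map { case (w, g) => w - stepSize * g / batch.size }
    }
  }
}

In MLlib, the map over the batch and the summation run distributed via treeAggregate, as runMiniBatchSGD below shows.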
Computation Flow

LinearRegressionWithSGD // companion object: the companion object for linear regression based on stochastic gradient descent
train // a static method on the LinearRegressionWithSGD companion object; it builds a linear regression instance from the given parameters and calls run to train it
LinearRegressionWithSGD // the LinearRegressionWithSGD class
run // run is inherited from the generalized linear algorithm class GeneralizedLinearAlgorithm; it mainly calls optimizer.optimize to optimize the weights
runMiniBatchSGD // weight optimization: GradientDescent extends Optimizer, and its optimize method actually calls runMiniBatchSGD on the GradientDescent companion object; runMiniBatchSGD iterates over the training samples running stochastic gradient computations to obtain the optimal weights, computing the sample gradients and updating the weights in each iteration
gradient.compute // gradient computation: calls LeastSquaresGradient.compute, which computes a sample's gradient
updater.compute // weight update: calls SimpleUpdater.compute, which performs the weight update
LinearRegressionModel // the linear regression model
predict // the predict method of GeneralizedLinearModel; it computes sample predictions from the linear regression model
object LinearRegressionWithSGD {
  def train(
      input: RDD[LabeledPoint],     // training samples
      numIterations: Int,           // number of iterations
      stepSize: Double,             // step size of each iteration
      miniBatchFraction: Double,    // fraction of the samples used in each iteration
      initialWeights: Vector        // initial weights
    ): LinearRegressionModel = {
    new LinearRegressionWithSGD(stepSize, numIterations, 0.0, miniBatchFraction)
      .run(input, initialWeights)
  }
}
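A usage sketch against this signature (it assumes an existing SparkContext sc, and the toy data is made up for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// toy data roughly following y = 2 * x
val training = sc.parallelize(Seq(
  LabeledPoint(2.0, Vectors.dense(1.0)),
  LabeledPoint(4.0, Vectors.dense(2.0)),
  LabeledPoint(6.0, Vectors.dense(3.0))))

val model = LinearRegressionWithSGD.train(
  training,             // training samples
  100,                  // numIterations
  0.1,                  // stepSize
  1.0,                  // miniBatchFraction: use all samples each iteration
  Vectors.dense(0.0))   // initialWeights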
class LinearRegressionWithSGD private[mllib] (
    private var stepSize: Double,
    private var numIterations: Int,
    private var regParam: Double,
    private var miniBatchFraction: Double)
  extends GeneralizedLinearAlgorithm[LinearRegressionModel] with Serializable {

  // gradient descent with the least-squares loss, used for linear regression
  private val gradient = new LeastSquaresGradient()
  // simple gradient update, no regularization
  private val updater = new SimpleUpdater()
  // build the gradient-descent optimizer from the gradient and the updater
  @Since("0.8.0")
  override val optimizer = new GradientDescent(gradient, updater)
    .setStepSize(stepSize)
    .setNumIterations(numIterations)
    .setRegParam(regParam)
    .setMiniBatchFraction(miniBatchFraction)

The GeneralizedLinearAlgorithm class:

    // feature dimension
    if (numFeatures < 0) {
      numFeatures = input.map(_.features.size).first()
    }
    // warn if the input is not cached
    if (input.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data is not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }
    // input validation
    // Check the data properties before running the optimizer
    if (validateData && !validators.forall(func => func(input))) {
      throw new SparkException("Input validation failed.")
    }
    ...
    val initialWeightsWithIntercept = if (addIntercept && numOfLinearPredictor == 1) {
      appendBias(initialWeights)
    } else {
      /** If `numOfLinearPredictor > 1`, initialWeights already contains intercepts. */
      initialWeights
    }
    val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)
    val intercept = if (addIntercept && numOfLinearPredictor == 1) {
      weightsWithIntercept(weightsWithIntercept.size - 1)
    } else {
      0.0
    }
def runMiniBatchSGD(
    data: RDD[(Double, Vector)],
    gradient: Gradient,
    updater: Updater,
    stepSize: Double,
    numIterations: Int,
    regParam: Double,
    miniBatchFraction: Double,
    initialWeights: Vector,
    convergenceTol: Double): (Vector, Array[Double]) = {

  // core part: iterate until convergence or the iteration limit
  while (!converged && i <= numIterations) {
    val bcWeights = data.context.broadcast(weights)
    // Sample a subset (fraction miniBatchFraction) of the total data
    // compute and sum up the subgradients on this subset (this is one map-reduce)
    val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
      .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
        seqOp = (c, v) => {
          // c: (grad, loss, count), v: (label, features)
          val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
          (c._1, c._2 + l, c._3 + 1)
        },
        combOp = (c1, c2) => {
          // c: (grad, loss, count)
          (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
        })

    if (miniBatchSize > 0) {
      /**
       * lossSum is computed using the weights from the previous iteration
       * and regVal is the regularization value computed in the previous iteration as well.
       */
      stochasticLossHistory += lossSum / miniBatchSize + regVal
      val update = updater.compute(
        weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
        stepSize, i, regParam)
      weights = update._1
      regVal = update._2

      previousWeights = currentWeights
      currentWeights = Some(weights)
      if (previousWeights != None && currentWeights != None) {
        converged = isConverged(previousWeights.get,
          currentWeights.get, convergenceTol)
      }
    } else {
      logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
    }
    i += 1
  }
}

The core: given an initial weight vector, gradient descent iterates until convergence, and the converged weights define our fitted function.
The steps above yield the linear regression model.

class LinearRegressionModel @Since("1.1.0") (
    @Since("1.0.0") override val weights: Vector,    // (a1, a2, ..., an)
    @Since("0.8.0") override val intercept: Double)  // bias term, a0
  extends GeneralizedLinearModel(weights, intercept) with RegressionModel with Serializable
  with Saveable with PMMLExportable

Once we have the model, we can make predictions:

def predict(testData: RDD[Vector]): RDD[Double] = {
  // A small optimization to avoid serializing the entire model. Only the weightsMatrix
  // and intercept is needed.
  val localWeights = weights                                 // weights
  val bcWeights = testData.context.broadcast(localWeights)   // broadcast the constant weights
  val localIntercept = intercept                             // bias
  // predict every row of the input
  testData.mapPartitions { iter =>
    val w = bcWeights.value
    iter.map(v => predictPoint(v, w, localIntercept))
  }
}
// Y = w * X + b
override protected def predictPoint(
    dataMatrix: Vector,
    weightMatrix: Vector,
    intercept: Double): Double = {
  weightMatrix.asBreeze.dot(dataMatrix.asBreeze) + intercept
}
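To check the arithmetic, a model can be built directly from weights and an intercept (using the public constructor shown above) and asked for a prediction:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

val model = new LinearRegressionModel(Vectors.dense(2.0, 3.0), 1.0)
// Y = w * X + b = 2.0 * 1.0 + 3.0 * 2.0 + 1.0 = 9.0
val y = model.predict(Vectors.dense(1.0, 2.0))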

Example (test code from Spark's LinearRegressionSuite):

def validatePrediction(predictions: Seq[Double], input: Seq[LabeledPoint]) {
  val numOffPredictions = predictions.zip(input).count { case (prediction, expected) =>
    // A prediction is off if the prediction is more than 0.5 away from expected value.
    math.abs(prediction - expected.label) > 0.5
  }
  // At least 80% of the predictions should be on.
  assert(numOffPredictions < input.length / 5)
}

// Test if we can correctly learn Y = 10*X1 + 10*X2
test("linear regression without intercept") {
  val testRDD = sc.parallelize(LinearDataGenerator.generateLinearInput(
    0.0, Array(10.0, 10.0), 100, 42), 2).cache()
  val linReg = new LinearRegressionWithSGD().setIntercept(false)
  linReg.optimizer.setNumIterations(1000).setStepSize(1.0)

  val model = linReg.run(testRDD)
  assert(model.intercept === 0.0)

  val weights = model.weights
  assert(weights.size === 2)
  assert(weights(0) >= 9.0 && weights(0) <= 11.0)
  assert(weights(1) >= 9.0 && weights(1) <= 11.0)

  val validationData = LinearDataGenerator.generateLinearInput(
    0.0, Array(10.0, 10.0), 100, 17)
  val validationRDD = sc.parallelize(validationData, 2).cache()

  // Test prediction on RDD.
  validatePrediction(model.predict(validationRDD.map(_.features)).collect(), validationData)

  // Test prediction on Array.
  validatePrediction(validationData.map(row => model.predict(row.features)), validationData)
}

// Test if we can correctly learn Y = 3 + 10*X1 + 10*X2
test("linear regression") {
  val testRDD = sc.parallelize(LinearDataGenerator.generateLinearInput(
    3.0, Array(10.0, 10.0), 100, 42), 2).cache()
  val linReg = new LinearRegressionWithSGD().setIntercept(true)
  linReg.optimizer.setNumIterations(1000).setStepSize(1.0)

  val model = linReg.run(testRDD)
  assert(model.intercept >= 2.5 && model.intercept <= 3.5)

  val weights = model.weights
  assert(weights.size === 2)
  assert(weights(0) >= 9.0 && weights(0) <= 11.0)
  assert(weights(1) >= 9.0 && weights(1) <= 11.0)

  val validationData = LinearDataGenerator.generateLinearInput(
    3.0, Array(10.0, 10.0), 100, 17)
  val validationRDD = sc.parallelize(validationData, 2).cache()

  // Test prediction on RDD.
  validatePrediction(model.predict(validationRDD.map(_.features)).collect(), validationData)

  // Test prediction on Array.
  validatePrediction(validationData.map(row => model.predict(row.features)), validationData)
}

test("model save/load") {
  val model = LinearRegressionSuite.model
  val tempDir = Utils.createTempDir()
  val path = tempDir.toURI.toString

  // Save model, load it back, and compare.
  try {
    model.save(sc, path)
    val sameModel = LinearRegressionModel.load(sc, path)
    assert(model.weights == sameModel.weights)
    assert(model.intercept == sameModel.intercept)
  } finally {
    Utils.deleteRecursively(tempDir)
  }
}