Spark----管道的概念与例子

来源：互联网发布：网络诈骗的危害性编辑：程序博客网时间：2024/05/29 20:00

一、管道的概念
学习来源:Spark官网对管道的描述
1、管道的几个基本概念：
(1)DataFrame:其实就是DataSet的集合，可以理解为，dataset为某张表里面的一行，那么行的集合就是一张表，所以dataFrame就是一张表，但是表的field比较丰富，可以有向量，在很多算法里面，向量的使用是必不可少的;
(2)Transformer:作用就是将DataFrame A 变成 DataFrame B，可以是增加一列或者是少了某些数据等的操作；
(3)Estimator:利用算法将DataFrame进行训练，得到模型的操作;
(4)Pipeline:包含一系列的有序的Transformer和Estimator 操作的工作流。

2、管道工作的流程
管道工作的流程由两个过程来组成,下面以文本分类处理的管道工作流程为例子，
(1)形成预测模型
如下图:蓝色为Transformer，红色为Estimator；
第一行，解释了文本分类在管道里的三个状态的转换。先经过Tokenizer 进行Transform，将得到的dataframe送入下一个操作进行hashingTF进行Transform，最终将输出的dataframe送到下个阶段来产生期望的预测算法
第二行，以数据流的角度解释了第一行。Tokenizer的Transform操作，其实就是将文本中的每一行变成我们想要的单词而已，接着hashingTF的Transform操作其实就是给每个单词赋予独一无二的特征向量，最后利用算法结合数据得到我们想要的预测模型。
这里写图片描述
(2)利用预测模型
如下图，蓝色为Transformer，红色为Estimator；
在管道的上一个阶段我们已经得到预测模型了，下一个阶段的目标自然是进行预言。
第一行，同样解释了管道的状态转换。对进来的文本进行转换的过程。
第二行，以数据流的角度解释。待检查的数据进来之后同样被转换成向量送进了模型中预测，最终生成dataframe。
这里写图片描述

需要注意的是，上面的例子是基于线性的状态，若是非线性的，还需要改天再研究下。

3、个人肤浅的问题

1、什么时候用到管道，因为从上面的解释来说，管道是比算法更上一层的抽象，平时直接用ml package和mllib package下的算法已经很方便了。我学习管道的原因是因为，在学习K-Means算法，ALS算法，TF-IDF算法，Word2Vec的时候没有发现demo用了管道，但是开始学Ramdom forest 的时候出现了管道。
2、怎么用管道

二、例子
随机森林的例子如下：

import org.apache.spark.ml.Pipeline;import org.apache.spark.ml.PipelineModel;import org.apache.spark.ml.PipelineStage;import org.apache.spark.ml.classification.RandomForestClassificationModel;import org.apache.spark.ml.classification.RandomForestClassifier;import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;import org.apache.spark.ml.feature.*;import org.apache.spark.sql.Dataset;import org.apache.spark.sql.Row;import org.apache.spark.sql.SparkSession;// Load and parse the data file, converting it to a DataFrame.Dataset<Row> data = spark.read().format("libsvm").load("data/mllib/sample_libsvm_data.txt");// Index labels, adding metadata to the label column.// Fit on whole dataset to include all labels in index.StringIndexerModel labelIndexer = new StringIndexer()  .setInputCol("label")  .setOutputCol("indexedLabel")  .fit(data);// Automatically identify categorical features, and index them.// Set maxCategories so features with > 4 distinct values are treated as continuous.VectorIndexerModel featureIndexer = new VectorIndexer()  .setInputCol("features")  .setOutputCol("indexedFeatures")  .setMaxCategories(4)  .fit(data);// Split the data into training and test sets (30% held out for testing)Dataset<Row>[] splits = data.randomSplit(new double[] {0.7, 0.3});Dataset<Row> trainingData = splits[0];Dataset<Row> testData = splits[1];// Train a RandomForest model.RandomForestClassifier rf = new RandomForestClassifier()  .setLabelCol("indexedLabel")  .setFeaturesCol("indexedFeatures");// Convert indexed labels back to original labels.IndexToString labelConverter = new IndexToString()  .setInputCol("prediction")  .setOutputCol("predictedLabel")  .setLabels(labelIndexer.labels());// Chain indexers and forest in a PipelinePipeline pipeline = new Pipeline()  .setStages(new PipelineStage[] {labelIndexer, featureIndexer, rf, labelConverter});// Train model. This also runs the indexers.PipelineModel model = pipeline.fit(trainingData);// Make predictions.Dataset<Row> predictions = model.transform(testData);// Select example rows to display.predictions.select("predictedLabel", "label", "features").show(5);// Select (prediction, true label) and compute test errorMulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()  .setLabelCol("indexedLabel")  .setPredictionCol("prediction")  .setMetricName("accuracy");double accuracy = evaluator.evaluate(predictions);System.out.println("Test Error = " + (1.0 - accuracy));RandomForestClassificationModel rfModel = (RandomForestClassificationModel)(model.stages()[2]);System.out.println("Learned classification forest model:\n" + rfModel.toDebugString());

使用ml工具时，数据会先经过清洗和转换，对于分类来说，将实验数据转化成 libsvm数据能极大的方便开发，因为格式本身已经固定好了。下面是关于libsvm的相关:
LIBSVM使用的数据格式
该软件使用的训练数据和检验数据文件格式如下：
: : …
其中是训练数据集的目标值，对于分类，它是标识某类的整数（支持多个类）；对于回归，是任意实数。是以1开始的整数，可以是不连续的；；为实数，也就是我们常说的自变量。检验数据文件中的label只用于计算准确度或误差，如果它是未知的，只需用一个数填写这一栏，也可以空着不填。在程序包中，还包括有一个训练数据实例：heart_scale，方便参考数据文件格式以及练习使用软件。

阅读全文

0 0