Using a Pipeline in Spark MLlib to Implement a Simple Logistic Regression


MLlib's standard API for machine-learning algorithms makes it easy to combine multiple algorithms into a single pipeline; the whole process can be pictured as a machine-learning workflow.

The Pipeline here consists of three stages (a standalone sketch of the first stage follows the list):


Stage 1: a Tokenizer splits each document into words.

Stage 2: the words are converted into feature vectors.

Stage 3: a model is trained on the feature vectors and labels.
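
Conceptually, the first two stages (Tokenizer, HashingTF) are Transformers that append new columns to a DataFrame, while LogisticRegression is an Estimator whose fit() produces a model. Here is a minimal, self-contained sketch of stage 1 in isolation; the class name and sample sentences are illustrative, not part of the original example:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TokenizerDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TokenizerDemo").master("local[1]").getOrCreate();

        // Build a tiny DataFrame with a single "text" column
        List<Row> data = Arrays.asList(
                RowFactory.create("Hi I heard about Spark"),
                RowFactory.create("Logistic regression models are neat"));
        StructType schema = new StructType(new StructField[]{
                new StructField("text", DataTypes.StringType, false, Metadata.empty())});
        Dataset<Row> sentences = spark.createDataFrame(data, schema);

        // Tokenizer is a Transformer: it lowercases each text value,
        // splits it on whitespace, and appends a "words" column
        Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
        tokenizer.transform(sentences).show(false);

        spark.stop();
    }
}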

The following uses a simple logistic regression example to show how to develop with a Pipeline.

First, define a JavaBean class for the input records:

import java.io.Serializable;

public class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    private long id;
    private String text;
    private double label;

    public Person() {}

    public Person(long id, String text) {
        this.id = id;
        this.text = text;
    }

    public Person(long id, String text, double label) {
        this.id = id;
        this.text = text;
        this.label = label;
    }

    public long getId() { return id; }
    public void setId(long id) { this.id = id; }
    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
    public double getLabel() { return label; }
    public void setLabel(double label) { this.label = label; }

    @Override
    public String toString() {
        return "Person [id=" + id + ", text=" + text + ", label=" + label + "]";
    }
}
Then the main method builds, trains, and applies the pipeline:

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public static void main(String[] args) {
    SparkSession sparkSession = SparkSession
            .builder()
            .appName("loi")
            .master("local[1]")
            .getOrCreate();

    // Create an RDD of Person objects from a text file
    JavaRDD<String> dataRDD = sparkSession
            .read()
            .textFile("E:/sparkMlib/sparkMlib/src/mllib/testLine.txt")
            .javaRDD();
    JavaRDD<Person> dataPerson = dataRDD.map(new Function<String, Person>() {
        public Person call(String line) throws Exception {
            String[] parts = line.split(",");
            Person person = new Person();
            person.setId(Long.parseLong(parts[0]));
            person.setText(parts[1]);
            person.setLabel(Double.parseDouble(parts[2]));
            return person;
        }
    });

    // Apply a schema to the RDD of JavaBeans to get a DataFrame
    Dataset<Row> training = sparkSession.createDataFrame(dataPerson, Person.class);

    // Stage 1: split each document into words
    Tokenizer tokenizer = new Tokenizer()
            .setInputCol("text")
            .setOutputCol("words");

    // Stage 2: turn the words into feature vectors
    HashingTF hashingTF = new HashingTF()
            .setNumFeatures(1000)
            .setInputCol(tokenizer.getOutputCol())
            .setOutputCol("features");

    // Stage 3: choose the learning algorithm
    LogisticRegression lr = new LogisticRegression()
            .setMaxIter(30)
            .setRegParam(0.001);

    // Chain the three stages into a Pipeline
    Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{tokenizer, hashingTF, lr});

    // Train the model
    PipelineModel model = pipeline.fit(training);

    // Prepare unlabeled test documents
    Dataset<Row> test = sparkSession.createDataFrame(Arrays.asList(
            new Person(4, "spark i j k"),
            new Person(5, "l m n"),
            new Person(6, "spark hadoop spark"),
            new Person(7, "apache hadoop")), Person.class);

    // Make predictions on the test documents
    Dataset<Row> predictions = model.transform(test);
    for (Row r : predictions.select("id", "text", "probability", "prediction").collectAsList()) {
        System.out.println(r.get(0) + ":" + r.get(1) + ":" + r.get(2) + ":" + r.get(3));
    }
}
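
The parser above expects each line of testLine.txt to be a comma-separated record of the form id,text,label. The original file is not shown; training data along the lines of the official Spark Pipeline example would fit, so these exact rows are an assumption inferred from the split(",") logic:

0,a b c d e spark,1.0
1,b d,0.0
2,spark f g h,1.0
3,hadoop mapreduce,0.0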


That completes a simple logistic regression model implemented with a Pipeline.
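
As a natural follow-up, the fitted PipelineModel can be saved to disk and reloaded, so predictions can be served without retraining. A minimal sketch using Spark's ML persistence API; the save path here is an assumption:

// Persist the fitted pipeline (assumed path)
model.write().overwrite().save("E:/sparkMlib/model/lr-pipeline");

// Later, possibly in a different application, reload and reuse it
PipelineModel sameModel = PipelineModel.load("E:/sparkMlib/model/lr-pipeline");
Dataset<Row> scored = sameModel.transform(test);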




