Spark Bits and Pieces: Developing and Running a Scala Program
Overview
Environment: Spark 2.0.1
Run mode: Spark on YARN
How do we submit and run a program written in Scala on a Spark cluster? For a Java program, we already know the spark-submit command looks like this:
spark-submit --class className --name jobName --master yarn-cluster ./xxx-SNAPSHOT.jar
Here className is the class containing the main entry function, i.e. the program entry point;
jobName is the name of the submitted job, and ./xxx-SNAPSHOT.jar is the jar file built from the program.
Preparation
To submit a Scala program to Spark, we first need to be able to write Scala programs in an IDE.
For Scala programs, the basic syntax and development workflow are as follows:
1. The main entry function is defined in an object (declared with the object keyword); for background see "scala基础7 —— scala的静态类型(object)". A minimal sketch is shown right after this list.
2. For developing and packaging a Scala project with Maven, see "scala —— maven scala项目开发".
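As a quick illustration, here is a minimal sketch of a Scala entry point; the object name HelloSpark and its body are placeholders, not part of the LR example used below:

object HelloSpark {
  // The JVM looks up main on an object, so the entry point must live in an
  // object rather than a class.
  def main(args: Array[String]): Unit = {
    println("hello, spark")  // application logic goes here
  }
}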
Hands-on example
We use the logistic regression (LR) example from the official documentation to demonstrate development and job submission; the original code is at http://spark.apache.org/docs/latest/ml-pipeline.html.
Creating the project
Create a project in IntelliJ; for the detailed steps and configuration see item 2 of the Preparation section above. The relevant pom.xml configuration is as follows:
<properties>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.0.1</spark.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-reflect</artifactId>
        <version>${scala.version}</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.scala-tools</groupId>
            <artifactId>maven-scala-plugin</artifactId>
            <version>2.15.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.3</version>
            <configuration>
                <source>1.7</source>
                <target>1.7</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
    </plugins>
</build>
Pay attention to the Spark and Scala versions: they must match the versions running on the Spark cluster, or you will run into problems.
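If you are unsure which versions the cluster ships, one option (a hedged sketch; the VersionCheck object is only an illustration, not part of the original example) is to print the versions your job actually sees at runtime and compare them with spark.version and scala.version in pom.xml:

import org.apache.spark.sql.SparkSession

object VersionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("version_check").getOrCreate()
    println("Spark version: " + spark.version)                        // e.g. 2.0.1
    println("Scala version: " + scala.util.Properties.versionString)  // e.g. version 2.11.8
    spark.stop()
  }
}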
We will demonstrate the code and submission for both the Java and the Scala version. The project contains two entry classes:
LRJava is the Java class that runs the Java version of the code; LRScala is the Scala object that runs the Scala version.
Java code
The LRJava code is as follows:
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;
import java.util.List;

public class LRJava {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("lr_java_test")
            .getOrCreate();

        // Prepare training data.
        List<Row> dataTraining = Arrays.asList(
            RowFactory.create(1.0, Vectors.dense(0.0, 1.1, 0.1)),
            RowFactory.create(0.0, Vectors.dense(2.0, 1.0, -1.0)),
            RowFactory.create(0.0, Vectors.dense(2.0, 1.3, 1.0)),
            RowFactory.create(1.0, Vectors.dense(0.0, 1.2, -0.5))
        );
        StructType schema = new StructType(new StructField[]{
            new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("features", new VectorUDT(), false, Metadata.empty())
        });
        Dataset<Row> training = spark.createDataFrame(dataTraining, schema);

        // Create a LogisticRegression instance. This instance is an Estimator.
        LogisticRegression lr = new LogisticRegression();
        // Print out the parameters, documentation, and any default values.
        System.out.println("LogisticRegression parameters:\n" + lr.explainParams() + "\n");
        // We may set parameters using setter methods.
        lr.setMaxIter(10).setRegParam(0.01);
        // Learn a LogisticRegression model. This uses the parameters stored in lr.
        LogisticRegressionModel model1 = lr.fit(training);
        // Since model1 is a Model (i.e., a Transformer produced by an Estimator),
        // we can view the parameters it used during fit().
        // This prints the parameter (name: value) pairs, where names are unique IDs for this
        // LogisticRegression instance.
        System.out.println("Model 1 was fit using parameters: " + model1.parent().extractParamMap());

        // We may alternatively specify parameters using a ParamMap.
        ParamMap paramMap = new ParamMap()
            .put(lr.maxIter().w(20))   // Specify 1 Param.
            .put(lr.maxIter(), 30)     // This overwrites the original maxIter.
            .put(lr.regParam().w(0.1), lr.threshold().w(0.55));  // Specify multiple Params.

        // One can also combine ParamMaps.
        ParamMap paramMap2 = new ParamMap()
            .put(lr.probabilityCol().w("myProbability"));  // Change output column name
        ParamMap paramMapCombined = paramMap.$plus$plus(paramMap2);

        // Now learn a new model using the paramMapCombined parameters.
        // paramMapCombined overrides all parameters set earlier via lr.set* methods.
        LogisticRegressionModel model2 = lr.fit(training, paramMapCombined);
        System.out.println("Model 2 was fit using parameters: " + model2.parent().extractParamMap());

        // Prepare test documents.
        List<Row> dataTest = Arrays.asList(
            RowFactory.create(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
            RowFactory.create(0.0, Vectors.dense(3.0, 2.0, -0.1)),
            RowFactory.create(1.0, Vectors.dense(0.0, 2.2, -1.5))
        );
        Dataset<Row> test = spark.createDataFrame(dataTest, schema);

        // Make predictions on test documents using the Transformer.transform() method.
        // LogisticRegression.transform will only use the 'features' column.
        // Note that model2.transform() outputs a 'myProbability' column instead of the usual
        // 'probability' column since we renamed the lr.probabilityCol parameter previously.
        Dataset<Row> results = model2.transform(test);
        Dataset<Row> rows = results.select("features", "label", "myProbability", "prediction");
        for (Row r : rows.collectAsList()) {
            System.out.println("(" + r.get(0) + ", " + r.get(1) + ") -> prob=" + r.get(2)
                + ", prediction=" + r.get(3));
        }

        spark.stop();
    }
}
This is the LR code from the official documentation; main is the entry point.
Scala code
The LRScala code is as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

object LRScala {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lr_scala_test")
      .getOrCreate()

    // Prepare training data from a list of (label, features) tuples.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Create a LogisticRegression instance. This instance is an Estimator.
    val lr = new LogisticRegression()
    // Print out the parameters, documentation, and any default values.
    println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

    // We may set parameters using setter methods.
    lr.setMaxIter(10)
      .setRegParam(0.01)

    // Learn a LogisticRegression model. This uses the parameters stored in lr.
    val model1 = lr.fit(training)
    // Since model1 is a Model (i.e., a Transformer produced by an Estimator),
    // we can view the parameters it used during fit().
    // This prints the parameter (name: value) pairs, where names are unique IDs for this
    // LogisticRegression instance.
    println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

    // We may alternatively specify parameters using a ParamMap,
    // which supports several methods for specifying parameters.
    val paramMap = ParamMap(lr.maxIter -> 20)
      .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
      .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

    // One can also combine ParamMaps.
    val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
    val paramMapCombined = paramMap ++ paramMap2

    // Now learn a new model using the paramMapCombined parameters.
    // paramMapCombined overrides all parameters set earlier via lr.set* methods.
    val model2 = lr.fit(training, paramMapCombined)
    println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

    // Prepare test data.
    val test = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
      (0.0, Vectors.dense(3.0, 2.0, -0.1)),
      (1.0, Vectors.dense(0.0, 2.2, -1.5))
    )).toDF("label", "features")

    // Make predictions on test data using the Transformer.transform() method.
    // LogisticRegression.transform will only use the 'features' column.
    // Note that model2.transform() outputs a 'myProbability' column instead of the usual
    // 'probability' column since we renamed the lr.probabilityCol parameter previously.
    model2.transform(test)
      .select("features", "label", "myProbability", "prediction")
      .collect()
      .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
        println(s"($features, $label) -> prob=$prob, prediction=$prediction")
      }

    spark.stop()
  }
}
Note in particular that the Scala main entry function is defined inside an object.
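If you want to debug the program locally before going to YARN, one common option (an assumption on my part, not part of the original workflow; the LRScalaLocal object name is hypothetical) is to set the master explicitly when building the SparkSession. When submitting to the cluster, leave the master unset so spark-submit can supply it:

import org.apache.spark.sql.SparkSession

// Hypothetical local-debug variant of the entry object.
object LRScalaLocal {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lr_scala_test")
      .master("local[*]")  // run locally with all available cores
      .getOrCreate()
    // ... same pipeline code as in LRScala ...
    spark.stop()
  }
}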
Build and submit
Run mvn package to build the project; on success the generated jar (under the target directory) is spark.2.0.test-1.0-SNAPSHOT.jar.
With the jar in hand, we can submit the jobs.
The command to submit the Java version (LRJava) to Spark is:
spark-submit --class LRJava --name LR_java_name --master yarn-cluster --executor-memory 8G --driver-cores 2 ./spark.2.0.test-1.0-SNAPSHOT.jar
The command to submit the Scala version (LRScala) to Spark is:
spark-submit --class LRScala --name LR_scala_name --master yarn-cluster --executor-memory 8G --driver-cores 2 ./spark.2.0.test-1.0-SNAPSHOT.jar
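Note that in Spark 2.x the yarn-cluster master string is deprecated (it still works but prints a warning); if you prefer the newer flags, the equivalent form should be:
spark-submit --class LRScala --name LR_scala_name --master yarn --deploy-mode cluster --executor-memory 8G --driver-cores 2 ./spark.2.0.test-1.0-SNAPSHOT.jar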
As you can see, submitting a Scala program is almost identical to submitting the Java version; the only difference is the class name passed to --class: for the Java version it is the Java main entry class, and for the Scala version it is the Scala main entry object.
With that, we have developed a Scala program and submitted it to the Spark cluster.