Spark 2.x Study Notes: 9. Spark Programming Examples
9. Spark Programming Examples
9.1 SparkPi
package cn.hadron

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import scala.math.random

object SparkPi {
  def main(args: Array[String]): Unit = {
    val masterUrl = "local[1]"
    val conf = new SparkConf().setMaster(masterUrl).setAppName("SparkPi")
    val sc = new SparkContext(conf)
    // number of tasks (partitions) to launch; defaults to 2
    val slices = if (args.length > 0) args(0).toInt else 2
    // n is the number of sample points (200,000 with the default 2 slices);
    // taking the min with Int.MaxValue guards against overflow
    val n = math.min(100000L * slices, Int.MaxValue).toInt
    // with the default 2 partitions the range splits into roughly [1,100000] and [100001,200000]
    val count = sc.parallelize(1 until n, slices).map { i =>
      // sample a random point in the square [-1,1] x [-1,1] centered at the origin
      val x = random * 2 - 1
      val y = random * 2 - 1
      // count 1 if the point falls inside the unit circle, otherwise 0
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / (n - 1))
  }
}
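The Monte Carlo estimate behind SparkPi can be checked without a cluster. A minimal sketch of the same sampling logic on a plain Scala range (a hypothetical local stand-in for the RDD; the object name and fixed seed are illustrative, added for reproducibility):

```scala
import scala.util.Random

object LocalPi {
  // same sampling logic as the RDD pipeline, run on a local range instead of an RDD
  def estimatePi(n: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val count = (1 until n).map { _ =>
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.sum
    // area ratio: circle (pi) over square (4), so pi is about 4 * inside / total
    4.0 * count / (n - 1)
  }

  def main(args: Array[String]): Unit =
    println("Pi is roughly " + estimatePi(200000, 42L))
}
```

With 200,000 samples the estimate typically lands within a few thousandths of pi; the RDD version simply distributes the same per-point computation across partitions.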
9.2 Average
(1) Generate the data
[root@node1 data]# vi genAge.sh
[root@node1 data]# cat genAge.sh
#!/bin/sh
for i in {1..1000000};do
    echo -e $i'\t'$(($RANDOM%100))
done;
[root@node1 data]# sh genAge.sh > age.txt
[root@node1 data]# tail -10 age.txt
999991	53
999992	63
999993	62
999994	14
999995	62
999996	27
999997	15
999998	99
999999	62
1000000	79
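The same test data can also be produced from Scala. A sketch (the object name, seed, and output path are illustrative) mirroring the shell script's `id<TAB>age` format, where `$RANDOM%100` yields ages in [0, 100):

```scala
import java.io.PrintWriter
import scala.util.Random

object GenAge {
  // build the "id\tage" lines the shell script produces; ages fall in [0, 100)
  def ageLines(n: Int, seed: Long): Seq[String] = {
    val rng = new Random(seed)
    (1 to n).map(i => s"$i\t${rng.nextInt(100)}")
  }

  def main(args: Array[String]): Unit = {
    val out = new PrintWriter("age.txt")
    try ageLines(1000000, 7L).foreach(out.println)
    finally out.close()
  }
}
```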
(2) Upload to HDFS
[root@node1 data]# hdfs dfs -put age.txt input
(3) Write the code
AvgAge.scala
package cn.hadron

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object AvgAge {
  def main(args: Array[String]) {
    if (args.length < 1) {
      println("Usage: AvgAge datafile")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("Spark Exercise: Average Age Calculator")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile(args(0), 5)
    val count = rdd.count()
    // take the second (tab-separated) field of each line as the age and sum it on the cluster;
    // reducing on the RDD avoids pulling the whole data set to the driver with collect()
    val totalAge = rdd.map(line => line.split("\t")(1).toInt)
      .reduce(_ + _)
    println("Total Age:" + totalAge + ";Number of People:" + count)
    val avgAge: Double = totalAge.toDouble / count.toDouble
    println("Average Age is " + avgAge)
  }
}
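The parse-and-average step can be exercised locally on a few sample lines. A sketch using a Scala collection in place of the RDD (the object and method names are illustrative):

```scala
object LocalAvgAge {
  // same transformation as the RDD pipeline: split on the tab, parse the age, then average
  def averageAge(lines: Seq[String]): Double = {
    val totalAge = lines.map(_.split("\t")(1).toInt).sum
    totalAge.toDouble / lines.size
  }
}
```

For example, `LocalAvgAge.averageAge(Seq("1\t53", "2\t63", "3\t14"))` averages the ages 53, 63, and 14.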
(4) Compile and package
(5) Submit the job
spark-submit --master yarn --deploy-mode client \
  --class cn.hadron.AvgAge \
  /root/simpleSpark-1.0-SNAPSHOT.jar input/age.txt
[root@node1 ~]# spark-submit --master yarn --deploy-mode client --class cn.hadron.AvgAge /root/simpleSpark-1.0-SNAPSHOT.jar input/age.txt
17/09/22 10:30:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/22 10:30:56 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Total Age:49536760;Number of People:1000000
Average Age is 49.53676
[root@node1 ~]#
9.3 TopK
(1) Problem description
Find the K most frequent words in a text file.
For example, given a plain-text copy of Hamlet (Hamlet.txt), find the 10 most frequent words in the file.
(2) Upload the data
[root@node1 data]# hdfs dfs -put Hamlet.txt input
[root@node1 data]# hdfs dfs -ls input
Found 3 items
-rw-r--r--   3 root supergroup     281498 2017-09-20 10:11 input/Hamlet.txt
-rw-r--r--   3 root supergroup         71 2017-08-27 09:18 input/books.txt
drwxr-xr-x   - root supergroup          0 2017-08-13 09:33 input/emp.bak
[root@node1 data]#
(3) Debugging in spark-shell
[root@node1 data]# spark-shell
17/09/20 10:12:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/20 10:13:01 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/09/20 10:13:02 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/09/20 10:13:04 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.80.131:4040
Spark context available as 'sc' (master = local[*], app id = local-1505916766832).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val rdd1=sc.textFile("input/Hamlet.txt")
rdd1: org.apache.spark.rdd.RDD[String] = input/Hamlet.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> rdd1.count
res0: Long = 6878

scala> val rdd2=rdd1.flatMap(x=>x.split(" ")).filter(_.size>1)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at filter at <console>:26

scala> rdd2.take(2)
res1: Array[String] = Array(Hamlet, by)

scala> val rdd3=rdd2.map(x=>(x,1))
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:28

scala> rdd3.take(2)
res2: Array[(String, Int)] = Array((Hamlet,1), (by,1))

scala> val rdd4=rdd3.reduceByKey(_+_)
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:30

scala> rdd4.take(3)
res3: Array[(String, Int)] = Array((rises.,1), (Let,35), (lug,1))

scala> val rdd5=rdd4.map{case(x,y)=>(y,x)}
rdd5: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[6] at map at <console>:32

scala> rdd5.take(2)
res4: Array[(Int, String)] = Array((1,rises.), (35,Let))

scala> val rdd6=rdd5.sortByKey(false)
rdd6: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[7] at sortByKey at <console>:34

scala> rdd6.take(2)
res5: Array[(Int, String)] = Array((988,the), (693,and))

scala> val rdd7=rdd6.map{case(a,b)=>(b,a)}
rdd7: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[8] at map at <console>:36

scala> rdd7.take(10)
res6: Array[(String, Int)] = Array((the,988), (and,693), (of,621), (to,604), (my,441), (in,387), (HAMLET,378), (you,356), (is,291), (his,277))

scala> rdd7.take(10).foreach(println)
(the,988)
(and,693)
(of,621)
(to,604)
(my,441)
(in,387)
(HAMLET,378)
(you,356)
(is,291)
(his,277)

scala>
(4) Complete program
package cn.hadron

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object TopK {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      println("Usage: TopK KeyWordsFile K")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("TopK Key Words")
    val sc = new SparkContext(conf)
    val k = args(1).toInt
    val rdd1 = sc.textFile(args(0))
    val result = rdd1.flatMap(x => x.split(" "))
      .filter(_.size > 1)
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .map { case (x, y) => (y, x) }
      .sortByKey(false)
      .map { case (a, b) => (b, a) }
    // take the K most frequent words (a hard-coded take(10) would ignore the K argument)
    result.take(k).foreach(println)
  }
}
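The swap, sortByKey, swap-back pattern can be traced on plain Scala collections. A local sketch of the same top-K logic (object and method names are illustrative; on a real RDD, `rdd.top(k)` with an `Ordering` on the count reaches the same result without sorting the whole data set):

```scala
object LocalTopK {
  // same pipeline as the Spark job, on a local Seq:
  // split into words, drop single characters, count, sort by count descending, take K
  def topK(lines: Seq[String], k: Int): Seq[(String, Int)] =
    lines.flatMap(_.split(" "))
      .filter(_.size > 1)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy { case (_, n) => -n }
      .take(k)
}
```

For example, `LocalTopK.topK(Seq("aa bb aa cc aa bb"), 2)` yields `(aa,3)` and `(bb,2)`.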
(5) Package and upload
mvn package
(6) Submit and run
spark-submit --master yarn --deploy-mode client \
  --class cn.hadron.TopK \
  /root/simpleSpark-1.0-SNAPSHOT.jar input/Hamlet.txt 10
[root@node1 ~]# spark-submit --master yarn --deploy-mode client --class cn.hadron.TopK /root/simpleSpark-1.0-SNAPSHOT.jar input/Hamlet.txt 10
17/09/21 09:48:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(the,988)
(and,693)
(of,621)
(to,604)
(my,441)
(in,387)
(HAMLET,378)
(you,356)
(is,291)
(his,277)
[root@node1 ~]#