Spark WordCount: Counting Word Frequencies


(1) Prepare a file named test.txt with the following content:

Apple Apple Orange
Banana Grape Grape

(2) Upload the file

Use SecureCRT to upload test.txt to the Linux system. Once the upload finishes, check that the file is there:

zhang@Desktop1:~$ ls | grep 'test.txt'
test.txt

(3) View the content

zhang@Desktop1:~$ cat test.txt 
Apple Apple Orange
Banana Grape Grape

This confirms the file was uploaded successfully.

(4) Run start-dfs.sh to start Hadoop. (Strictly speaking, HDFS is only needed when reading or writing hdfs:// paths; this walkthrough uses local file:/ paths throughout.)

(5) Run spark-shell to enter the Spark interactive shell

zhang@Desktop1:~$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/10 20:57:25 WARN util.Utils: Your hostname, Desktop1 resolves to a loopback address: 127.0.1.1; using 192.168.8.3 instead (on interface ens33)
17/05/10 20:57:25 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://192.168.8.3:4040
Spark context available as 'sc' (master = local[*], app id = local-1494421049820).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.


scala>

(6) Read the test.txt text file

As the startup log above notes, spark-shell has already created a SparkContext available as sc, which we use to load the file.

// To read a file on a distributed file system instead, write sc.textFile("hdfs://......")

scala> val textfile=sc.textFile("file:/home/zhang/test.txt")
17/05/10 20:59:52 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
textfile: org.apache.spark.rdd.RDD[String] = file:/home/zhang/test.txt MapPartitionsRDD[1] at textFile at <console>:24


scala>
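textFile is lazy: Spark does not actually read the file until an action runs. Continuing the session above, a quick action such as count or first confirms the data loaded (a sketch; the resN numbers in your shell may differ):

scala> textfile.count()
res0: Long = 2

scala> textfile.first()
res1: String = Apple Apple Orange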

(7) Use flatMap to split each line on spaces and extract the individual words

scala> val stringRDD=textfile.flatMap(t=>t.split(" "))
stringRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26


scala>
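The choice of flatMap matters here: map(t => t.split(" ")) would yield an RDD[Array[String]] with one array per line, whereas flatMap flattens those arrays into a single RDD[String] of words. On a file this small it is safe to verify with collect, which pulls the entire RDD back to the driver (avoid it on large data); continuing the session, the result should look like:

scala> stringRDD.collect()
res2: Array[String] = Array(Apple, Apple, Orange, Banana, Grape, Grape)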

(8) Count the occurrences of each word with map and reduceByKey

scala> val countsRDD=stringRDD.map(word=>(word,1)).reduceByKey(_ + _)
countsRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:28


scala>
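Here map turns each word into a (word, 1) pair, and reduceByKey merges the values of identical keys with _ + _, so the two ("Apple", 1) pairs combine into ("Apple", 2). Because reduceByKey pre-aggregates within each partition before shuffling, it is generally cheaper than grouping all pairs first. Inspecting the result (tuple order after a shuffle is not guaranteed):

scala> countsRDD.collect()
res3: Array[(String, Int)] = Array((Grape,2), (Orange,1), (Apple,2), (Banana,1))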

(9) Save the results

scala> countsRDD.saveAsTextFile("file:/home/zhang/output")


scala>
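saveAsTextFile writes one part-NNNNN file per partition of the RDD, plus an empty _SUCCESS marker once the job completes; here everything lands in a single part-00000 file, as shown below. To force exactly one output file regardless of partitioning, coalesce to a single partition first (a sketch; the output_single path is illustrative):

scala> countsRDD.coalesce(1).saveAsTextFile("file:/home/zhang/output_single")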

(10) Exit spark-shell

scala> :q

(11) View the output

zhang@Desktop1:~$ ls
derby.log         log4j-slf4j-impl-2.4.1.jar           test.txt                          公共的  图片  音乐
examples.desktop  mysql-connector-java-5.1.41-bin.jar  VMwareTools-9.6.2-1688356.tar.gz  模板    文档  桌面
filemacsn.txt     output                               vmware-tools-distrib              视频    下载

An output directory now exists in the user's home directory. cd into it and list its files:

zhang@Desktop1:~/output$ ls
part-00000  _SUCCESS

The part-00000 file holds the computed results; view it with cat:

zhang@Desktop1:~/output$ cat part-00000 
(Grape,2)
(Orange,1)
(Apple,2)
(Banana,1)

As you can see, the output agrees exactly with the contents of test.txt:

Grape appears 2 times
Orange appears 1 time
Apple appears 2 times
Banana appears 1 time
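
For reference, the same pipeline can be packaged as a standalone Scala application instead of being typed into spark-shell. The sketch below assumes Spark 2.x as used above; the object name WordCount and both paths are illustrative:

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // spark-shell pre-creates these as 'spark' and 'sc'; a standalone app builds its own
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    sc.textFile("file:/home/zhang/test.txt")      // read the input lines
      .flatMap(_.split(" "))                      // split each line into words
      .map(word => (word, 1))                     // pair every word with a count of 1
      .reduceByKey(_ + _)                         // sum the counts per word
      .saveAsTextFile("file:/home/zhang/output")  // write (word,count) tuples as text

    spark.stop()
  }
}

After packaging it into a jar, it could be launched with something like spark-submit --class WordCount wordcount.jar.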
