Testing a Distributed Scala WordCount Program on Spark over HDFS
Source: Internet · Editor: 程序博客网 · Time: 2024/06/05 14:41
This article runs a Scala WordCount test on top of a previously deployed Hadoop distributed cluster and a Spark cluster running over HDFS, first in spark-shell and then in IntelliJ IDEA.
The cluster layout is:
| Node | Address        | HDFS                        | YARN            | Spark          |
|------|----------------|-----------------------------|-----------------|----------------|
| VM1  | 196.168.168.11 | NameNode                    | ResourceManager | Master, Worker |
| VM2  | 196.168.168.22 | DataNode, SecondaryNameNode | NodeManager     | Worker         |
| VM3  | 196.168.168.33 | DataNode                    | NodeManager     | Worker         |
Start the HDFS services
[root@vm1 sbin]# start-dfs.sh
Starting namenodes on [vm1]
vm1: namenode running as process 3861. Stop it first.
vm2: datanode running as process 3656. Stop it first.
vm3: datanode running as process 3623. Stop it first.
Starting secondary namenodes [vm2]
vm2: secondarynamenode running as process 3739. Stop it first.
Start the YARN services
[root@vm1 sbin]# start-yarn.sh
starting yarn daemons
resourcemanager running as process 4152. Stop it first.
vm2: nodemanager running as process 3878. Stop it first.
vm3: nodemanager running as process 3776. Stop it first.
Start the Spark Master and Worker services
[root@vm1 sbin]# start-master.sh
org.apache.spark.deploy.master.Master running as process 4454. Stop it first.
[root@vm1 sbin]# start-slaves.sh
vm1: org.apache.spark.deploy.worker.Worker running as process 4565. Stop it first.
localhost: org.apache.spark.deploy.worker.Worker running as process 4565. Stop it first.
vm3: starting org.apache.spark.deploy.worker.Worker, logging to /usr/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-vm3.out
vm2: starting org.apache.spark.deploy.worker.Worker, logging to /usr/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-vm2.out
vm2: failed to launch: nice -n 0 /usr/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8080 spark://vm1:7077
vm2: full log in /usr/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-vm2.out
vm3: failed to launch: nice -n 0 /usr/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8080 spark://vm1:7077
vm3: full log in /usr/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-vm3.out
The "running as process N. Stop it first" messages above simply mean that those daemons were already running from an earlier start; they are not errors.
Verify that Hadoop and Spark are running
On the master node:
[root@vm1 sbin]#jps
3861 NameNode
4565 Worker
4454 Master
4152 ResourceManager
54025 Jps
On the other nodes:
[root@vm2 conf]#jps
3878 NodeManager
3656 DataNode
3739 SecondaryNameNode
57243 Worker
57791 Jps
Start spark-shell and write the Scala program
[root@vm1 bin]# spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/31 01:59:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/31 01:59:10 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://196.168.168.11:4040
Spark context available as 'sc' (master = local[*], app id = local-1504115943045).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
[root@vm1 ~]# hdfs dfs -ls /user/hadoop
Found 1 items
-rw-r--r--   2 root supergroup       3809 2017-08-22 19:56 /user/hadoop/README.md
scala> val rdd = sc.textFile("hdfs://vm1:9000/user/hadoop/README.md")
rdd: org.apache.spark.rdd.RDD[String] = hdfs://vm1:9000/user/hadoop/README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> val wordcount = rdd.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
wordcount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:26
scala> wordcount.count()
res0: Long = 287
scala> wordcount.take(10)
res1: Array[(String, Int)] = Array((package,1), (For,3), (Programs,1), (processing.,1), (Because,1), (The,1), (page](http://spark.apache.org/documentation.html).,1), (cluster.,1), (its,1), ([run,1))
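To see what the three RDD transformations compute, the same pipeline can be mirrored with plain Scala collections — a sketch that assumes nothing beyond the standard library (no Spark on the classpath), where `groupBy` plus a per-key sum plays the role of `reduceByKey(_ + _)`:

```scala
// Plain-Scala mirror of the spark-shell pipeline above:
//   flatMap(_.split(" "))  -> tokenize each line
//   map(x => (x, 1))       -> pair each word with a count of 1
//   reduceByKey(_ + _)     -> here: group by word, then sum the 1s
object WordCountLocal {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // Tiny illustrative input (not the README.md from the post)
    val counts = countWords(Seq("a b a", "b c"))
    println(counts) // Map with a -> 2, b -> 2, c -> 1 (iteration order unspecified)
  }
}
```

On a single machine this is all WordCount needs; the RDD version exists so the same three steps can run partitioned across the vm1–vm3 workers.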
Scala program in IntelliJ IDEA
Final result of the WordSort program:
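The post does not include the IDEA source itself, but a standalone driver equivalent to the shell session might look like the sketch below. The object name `WordSort`, the master URL, and the descending sort by count are assumptions inferred from the program name; it requires the spark-core 2.2.x dependency in the IDEA project:

```scala
// Hypothetical standalone version of the spark-shell session, as it might be
// written in IntelliJ IDEA. Names and the sort step are illustrative, not
// taken from the original post. Needs spark-core 2.2.x on the classpath.
import org.apache.spark.{SparkConf, SparkContext}

object WordSort {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordSort")
      .setMaster("spark://vm1:7077") // or "local[*]" for a local IDE run
    val sc = new SparkContext(conf)

    // Same input and pipeline as in spark-shell
    val rdd = sc.textFile("hdfs://vm1:9000/user/hadoop/README.md")
    val wordcount = rdd
      .flatMap(_.split(" "))
      .map(x => (x, 1))
      .reduceByKey(_ + _)

    // "Word sort": order the (word, count) pairs by count, descending
    val sorted = wordcount.sortBy(_._2, ascending = false)
    sorted.take(10).foreach(println)

    sc.stop()
  }
}
```

When run against the same README.md, the top of this output would be the most frequent tokens rather than the arbitrary ten pairs that `take(10)` returned in the shell.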