Spark Learning, Part 1


Tags (space-separated): Spark


  • Spark Learning, Part 1
    • 1. Overview
    • 2. Installing Spark
    • 3. First Steps with Spark
    • 4. Configuring Spark Standalone Mode

1. Overview

Comparison with MapReduce: Spark keeps intermediate results in memory across the stages of a job, while MapReduce writes them back to disk between every map and reduce phase.

  • What is Spark?
    Apache Spark™ is a fast and general engine for large-scale data processing.
    1. Speed: runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
    2. Ease of use: write applications quickly in Java, Scala, or Python. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, and R shells.
    3. Generality: combine SQL, streaming, and complex analytics.
    4. Runs everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud, and can access diverse data sources including HDFS, Cassandra, HBase, and S3.

Resources for learning Spark:
  • The official site: http://spark.apache.org/
  • The Spark source code: https://github.com/apache/spark
  • The Databricks blog: https://databricks.com/blog

2. Installing Spark

  • Unpack the Scala archive
[hadoop001@xingyunfei001 app]$ chmod u+x scala-2.10.4.tgz
[hadoop001@xingyunfei001 app]$ tar -zxf scala-2.10.4.tgz -C /opt/app
  • Edit the /etc/profile configuration file
export SCALA_HOME=/opt/app/scala-2.10.4
export PATH=$PATH:$SCALA_HOME/bin
[hadoop001@xingyunfei001 app]$ source /etc/profile
[hadoop001@xingyunfei001 app]$ scala -version

  • Unpack the Spark archive
[hadoop001@xingyunfei001 app]$ chmod u+x spark-1.3.0-bin-2.5.tar.gz
[hadoop001@xingyunfei001 app]$ tar -zxf spark-1.3.0-bin-2.5.tar.gz -C /opt/app
  • Configure Spark (copy spark-env.sh.template to spark-env.sh)
JAVA_HOME=/opt/app/jdk1.7.0_67
SCALA_HOME=/opt/app/scala-2.10.4
HADOOP_CONF_DIR=/opt/app/hadoop_2.5.0_cdh
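JAVA_HOME and SCALA_HOME point at the JDK and the Scala install from the steps above; HADOOP_CONF_DIR tells Spark where to find the Hadoop client configuration so it can read from and write to HDFS.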
  • Start Spark
bin/spark-shell

3. First Steps with Spark

  • A first example
val rdd = sc.textFile("/opt/datas/beifeng.log")
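textFile returns an RDD[String] with one element per line of the file. It is lazy: nothing is actually read until an action such as count or first below is invoked.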

rdd.count    // count the total number of lines

rdd.first    // return the first line

rdd.take(2)    // return the first 2 lines

rdd.filter(x => x.contains("yarn")).collect    // keep only the lines containing "yarn"
rdd.filter(_.contains("yarn")).collect         // the same filter, using placeholder syntax

rdd.cache    // mark the RDD to be kept in memory
rdd.count    // this action materializes the cached data
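cache is itself lazy: it only marks the RDD, and the data is actually stored in memory when the next action computes it. Subsequent actions, such as the word count below, then read from memory instead of re-scanning the file.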

rdd.flatMap(x => x.split(" ")).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect    // word count in one chain
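The chain above is easier to follow broken into steps. The following is just a restatement of the same pipeline, using the same input file, with the intermediate RDD types spelled out:

val lines  = sc.textFile("/opt/datas/beifeng.log")    // RDD[String], one element per line
val words  = lines.flatMap(x => x.split(" "))         // RDD[String], one element per word
val pairs  = words.map(x => (x, 1))                   // RDD[(String, Int)], each word paired with 1
val counts = pairs.reduceByKey((x, y) => x + y)       // sums the 1s for each distinct word
counts.collect                                        // Array[(String, Int)] returned to the driver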

4. Configuring Spark Standalone Mode

SparkContext:
1. Requests resources for the application.
2. Reads the data and creates the RDDs.
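In spark-shell the SparkContext is pre-created as sc; in a standalone application you create it yourself. A minimal sketch, assuming the standalone master configured below (the object name and file path are placeholders for illustration):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program; the names here are illustrative only.
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SimpleApp")                          // shown in the web UI
      .setMaster("spark://xingyunfei001.com.cn:7077")   // step 1: request resources from the master
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("/opt/datas/input.txt")       // step 2: read data, create an RDD
    println(rdd.count)
    sc.stop()
  }
}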

  • Edit the spark-env.sh configuration file
SPARK_MASTER_IP=xingyunfei001.com.cn
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8081
SPARK_WORKER_INSTANCES=1
  • Edit the slaves configuration file
# A Spark Worker will be started on each of the machines listed below.
xingyunfei001.com.cn
  • Start standalone mode
[hadoop001@xingyunfei001 spark-1.3.0-bin-2.5.0]$ sbin/start-master.sh 

[hadoop001@xingyunfei001 spark-1.3.0-bin-2.5.0]$ sbin/start-slaves.sh
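Once the master and the worker are up, the cluster state can be checked in the master web UI at http://xingyunfei001.com.cn:8080/ (the SPARK_MASTER_WEBUI_PORT set above).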


  • Submit an application
bin/spark-shell --master spark://xingyunfei001.com.cn:7077
val rdd = sc.textFile("/opt/datas/input.txt")
val wordcount = rdd.flatMap(x => x.split(" ")).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
sc.stop
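A packaged application can also be submitted to the same cluster with bin/spark-submit --class <main class> --master spark://xingyunfei001.com.cn:7077 <application jar>; the class and jar here are placeholders, not files from this walkthrough.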

[hadoop001@xingyunfei001 spark-1.3.0-bin-2.5.0]$ bin/spark-shell --master local[2]    # local mode with 2 worker threads
[hadoop001@xingyunfei001 spark-1.3.0-bin-2.5.0]$ bin/spark-shell --master local[*]    # one thread per logical core on the machine
  • Web monitoring UI: http://xingyunfei001.com.cn:4040/jobs/