SparkCore1

Source: Internet. Editor: 程序博客网. Time: 2024/06/11 11:44

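Merging small files:
Hive, MapReduce, and Spark can all merge small files.
In the RDD API, the key operator for merging small files is:
coalesce
That covers Spark Core;
Spark SQL can merge small files as well.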
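A minimal sketch of how coalesce reduces the number of partitions, and therefore the number of output files. It assumes a spark-shell session where `sc` is already defined; the partition counts and the commented-out output path are made up for illustration.

```scala
// 100 small partitions stand in for 100 small input files (illustrative numbers).
val many = sc.parallelize(1 to 1000, 100)

// coalesce merges partitions without a shuffle when reducing the count.
val few = many.coalesce(4)

println(s"partitions before: ${many.getNumPartitions}, after: ${few.getNumPartitions}")

// Writing `few` would produce one part file per partition, i.e. 4 files.
// The path below is a hypothetical example and is left commented out.
// few.saveAsTextFile("file:///tmp/merged-output")
```

Each partition becomes one output file on save, so fewer partitions means fewer, larger files.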

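Core concepts of Spark Core:
Glossary:
Application: the user program that we develop ourselves.
The user's jar should never include Hadoop or Spark libraries.
Fat jar: a package of several hundred MB. It has to be uploaded to production and distributed to the nodes at run time, which is very resource-intensive.
Thin jar: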
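One common way to keep Spark out of the user jar is to mark it as a "provided" dependency in the build. The fragment below is only a sketch: the project name and the version numbers are assumptions, not taken from these notes.

```scala
// build.sbt (sketch; project name and versions are assumptions)
name := "spark-core-notes"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // "provided": available at compile time, supplied by the cluster at run time,
  // so it is not bundled when an assembly (fat-jar) plugin packages the application.
  "org.apache.spark" %% "spark-core" % "3.5.1" % "provided"
)
```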

One Application consists of one Driver Program and multiple Executors.
Driver Program: a process. The process running the main() function of the application and creating the SparkContext.

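Cluster Manager:
An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
With YARN, Spark acts only as a client that submits code.
A common mistake is to use YARN mode and still set up a Spark standalone cluster, even though jobs are submitted with --master yarn anyway.
1. In YARN mode there is no need to build a Spark cluster: no master or worker processes, and no master HA to configure.
2. We only need one machine with Spark deployed, acting as the client from which jobs are submitted.
Spark itself is installed on only about 2 machines:
yarn-client: the Driver runs locally on the submitting machine. The Driver occupies cores and memory; it is the process that runs the main method, so the load on that machine can be high. You can check it with the top command.
yarn-cluster: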
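Spark code can run on different run modes without being modified.
Storm is true real-time processing.
Spark Streaming is mini-batch.
Spark Streaming can be used for ETL and for online machine learning.
The DataFrame/Dataset API can be used to integrate all of these frameworks.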
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
In cluster mode on standalone, the driver runs on a Worker;
in cluster mode on YARN, it runs on a NodeManager.
The 1-2 machines with Spark installed can sit inside or outside the cluster; either works. Here Spark is placed inside the cluster.
Worker node: on standalone this is a Worker; on YARN it is a NodeManager.
Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Cached data is held on the Executors.
Task: A unit of work that will be sent to one executor.
An Executor is launched for an Application; it runs tasks and caches data. Tasks run on Executors.

Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action. Every action operator (save, collect, and so on) produces one job.
You'll see this term used in the driver's logs.
A Spark application can contain multiple jobs.

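Stage: Each job gets divided into smaller sets of tasks called stages.
For example, if a job has 100 tasks, some of them are placed in one stage and the rest in another.
The boundary between stages is a shuffle: a wide dependency splits the job into two stages.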
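The job/stage split can be seen with a small word count. This is a sketch for a spark-shell session (`sc` predefined); the input words are invented for illustration. The map is a narrow dependency, reduceByKey is a wide dependency that introduces a shuffle, and collect is the action that triggers the job.

```scala
// Tiny in-memory input in 2 partitions (illustrative data).
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2)

// Narrow dependency: stays in the same stage as the source RDD.
val pairs = words.map(w => (w, 1))

// Wide dependency: reduceByKey needs a shuffle, which starts a new stage.
val counts = pairs.reduceByKey(_ + _)

// The lineage shows a ShuffledRDD at the stage boundary.
println(counts.toDebugString)

// Action: triggers one job, made of two stages.
val result = counts.collect().toMap
println(result)
```

On the 4040 page this run should appear as a single job containing two stages.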

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). These processes may be the driver or the executors.
A task is not a process; it is a thread.
The Driver Program holds the SparkContext.
It acquires cluster resources through the Cluster Manager.
It then connects to the worker nodes; a worker node may be a standalone Worker or a YARN NodeManager.
Executors are launched on the worker nodes; they cache data and run tasks.

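One JVM is one process. Each application has its own separate Executors, so data cannot be shared across different Spark applications
without writing it to an external storage system. A third-party storage system such as Alluxio (formerly Tachyon) lets different Applications share data.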

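agnostic: independent of, unaware of.
Spark is agnostic to the underlying cluster manager.
The code of a Spark job does not need to change whichever run mode it runs on: local, YARN, or standalone.
With Storm, the code run locally differs from the code submitted to the cluster.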
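A sketch of what run-mode-agnostic code looks like in practice: the application never calls setMaster, so the master is supplied at submit time. The object name LineCountApp and the helper countLines are hypothetical names chosen for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example application; the names are illustrative only.
object LineCountApp {
  // Pure job logic: independent of where the job runs.
  def countLines(sc: SparkContext, path: String): Long =
    sc.textFile(path).count()

  def main(args: Array[String]): Unit = {
    // No setMaster here: the master comes from `--master` on spark-submit
    // or from spark.master in spark-defaults.conf.
    val sc = new SparkContext(new SparkConf().setAppName("LineCountApp"))
    try {
      println(s"lines = ${countLines(sc, args(0))}")
    } finally {
      sc.stop()
    }
  }
}
```

The same jar can then be submitted with --master local[2], --master yarn, or a standalone master URL, without recompiling.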
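The Driver listens for and accepts incoming connections from its Executors. The Driver sends tasks to the Executors.
The Driver and the Executors must be able to reach each other over the network.
Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same network segment. This keeps scheduling close to the data.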

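Every application has its own Driver and Executor processes, which live for the whole lifetime of the application. An Executor runs tasks in multiple threads. Data cannot be shared directly between different applications; a third-party system such as Alluxio can be used to share it.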

Is it better to deploy Spark inside the cluster or outside it?
It should be run close to the worker nodes.
The working mode of the Spark ThriftServer:

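Cluster Manager Types:
Standalone cluster: the configuration must be identical on every node, so changing one parameter means changing it everywhere and redistributing the configuration.
Spark applications can also be deployed on Kubernetes.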

--master yarn-client
--master yarn
--master yarn --deploy-mode client
All three submit the job in the same way (the yarn-client form is deprecated since Spark 2.0).

./spark-shell --master local[2]
cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf, and add:
spark.master local[2]

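The Storage tab of the Spark UI on port 4040:

cache/persist: persistence.
The data is usually on HDFS; here a local test file is read:
val lines = sc.textFile("file:///home/hadoop/data/test.data")
lines.count()
Computation is triggered by an Action operator; one Action triggers one job.
An Application consists of multiple jobs, and a job consists of multiple stages.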

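On the 4040 page, Input: Bytes and records read from Hadoop or from Spark storage.
Click through to the stage: Input Size / Records shows the bytes and the number of records (lines) per task.
If these values differ widely between tasks, the data is skewed.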

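lines.cache
After calling cache, nothing shows up yet in the Storage tab on 4040.
cache on an RDD is lazy, just as transformations are lazy.
Computation is only triggered when an Action is encountered.
RDD Name: the file name
Storage Level: Memory Deserialized 1x Replicated
Cached Partitions: 2 (there are only two input splits).
Size in Memory: 200 B. The original file is only 57 B; with the default level the data is stored as deserialized objects, which take more space.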

Running lines.count again executes another job.
Input: 200 B. This time the data is read from memory.
In the task table, Input Size / Records = 184 B (memory).
The "(memory)" in parentheses confirms it.
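The lazy behaviour of cache can also be observed without the UI by counting how often the upstream function actually runs. This is a sketch for a spark-shell session in local mode; the accumulator name and the sample lines are invented, and the counts in the comments assume no task is retried and no cached partition is evicted.

```scala
// Counts how many records the map function really processes.
val evaluations = sc.longAccumulator("evaluations")

val lines = sc.parallelize(Seq("hello spark", "hello hadoop", "hello alluxio"), 2)
  .map { line => evaluations.add(1); line }

lines.cache()                     // only marks the RDD; nothing is computed yet
val afterCache = evaluations.sum  // still 0 at this point

lines.count()                     // first action: computes and caches the partitions
val afterFirst = evaluations.sum  // 3 records evaluated

lines.count()                     // second action: served from the cache
val afterSecond = evaluations.sum // unchanged, the map function did not run again

println(s"after cache: $afterCache, first count: $afterFirst, second count: $afterSecond")
```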

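If the data source holds 1 TB and is read from HDFS repeatedly, every read pays the HDFS I/O cost and performance is very poor. Caching the data avoids this.
Cached RDD data is stored in the BlockManager, which manages blocks.
The data is first read from the source (HDFS) and, once cached, kept in memory.
In Spark Core, cache is lazy.
In Spark SQL, CACHE TABLE is eager (DataFrame/Dataset.cache() is still lazy).
Interview question: cache vs persist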

def cache(): this.type = persist()
Persist this RDD with the default storage level (MEMORY_ONLY).
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
class StorageLevel private(private var _useDisk: Boolean, private var _useMemory: Boolean, private var _useOffHeap: Boolean, private var _deserialized: Boolean, private var _replication: Int = 1) extends Externalizable {}

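_deserialized: whether the data is kept as deserialized Java objects. With serialized storage, each partition is stored as one byte array.
To see how the level is used, search the BlockManager source for level.useMemory.
Serialization increases CPU overhead, so it suits cases where memory is tight and CPU is plentiful.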

lines.unpersist() removes the cached data manually.
cache on an RDD is lazy;
unpersist on an RDD is eager.

It is a trade-off between memory and CPU:
MEMORY_ONLY_SER
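A short sketch contrasting the two levels mentioned above, for a spark-shell session (`sc` predefined). The data is invented; the point is only that cache() is persist() with MEMORY_ONLY, while MEMORY_ONLY_SER stores serialized bytes.

```scala
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000, 4)

// cache() == persist(StorageLevel.MEMORY_ONLY): deserialized objects in memory.
val asObjects = data.map(_ * 2).cache()

// MEMORY_ONLY_SER: serialized bytes in memory, smaller but costs CPU to read back.
val asBytes = data.map(_ * 2).persist(StorageLevel.MEMORY_ONLY_SER)

val objectLevel = asObjects.getStorageLevel
val bytesLevel = asBytes.getStorageLevel

println(objectLevel.description)
println(bytesLevel.description)

// Both are still lazy: an action is needed before anything is stored.
asObjects.count()
asBytes.count()
```

After the two counts, the Storage tab on 4040 should list both RDDs, which makes the size difference between the levels easy to compare.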

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion.
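To make the LRU idea concrete, here is a toy cache in plain Scala. It is not Spark's implementation, only an illustration of the eviction order; the class name LruCache, the capacity of 2, and the partition keys are all made up.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Toy LRU cache: accessOrder = true keeps entries ordered from least to most recently used.
class LruCache[K, V](capacity: Int) extends JLinkedHashMap[K, V](16, 0.75f, true) {
  // Called after each put; returning true drops the least recently used entry.
  override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > capacity
}

val cache = new LruCache[String, Array[Byte]](2)
cache.put("partition-0", new Array[Byte](8))
cache.put("partition-1", new Array[Byte](8))
cache.get("partition-0")                      // touch partition-0; partition-1 is now the oldest
cache.put("partition-2", new Array[Byte](8))  // over capacity: partition-1 is dropped

println(cache.keySet())
```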
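For routine jobs submitted with spark-submit,
that is, offline batch processing, sc.stop() at the end clears the cached data.
A long-running service with a single SparkContext, such as a ThriftServer, is different:
clients keep sending requests to the server, so cached data should be released by hand:
rdd.cache()
rdd.transformation
rdd.action
rdd.unpersist (a manual unpersist)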
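In a long-running service, the cache / use / unpersist sequence above can be wrapped so the unpersist is never forgotten, even when the work throws. This is a sketch for a spark-shell session; the helper name withCached is hypothetical and not part of Spark.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: caches the RDD for the duration of `body`, then always releases it.
def withCached[T, R](rdd: RDD[T])(body: RDD[T] => R): R = {
  rdd.cache()
  try body(rdd)
  finally rdd.unpersist(blocking = true) // wait until the blocks are actually removed
}

// Two actions reuse the same cached data; the cache is released afterwards.
val doubledNumbers = sc.parallelize(1 to 100, 4).map(_ * 2)
val total = withCached(doubledNumbers) { r =>
  r.count()  // first action fills the cache
  r.sum()    // second action reads from it
}
println(total)
```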

Homework: Tachyon
Alluxio
Integrate Spark with Alluxio and cache data into Alluxio. Alluxio is a memory-based file system. Deploy a single-node Alluxio.
val s = sc.textFile("alluxio://localhost:19998/LICENSE")
val doubled = s.map(line => line + line)
doubled.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
alluxio-ft is the fault-tolerant mode.
For test data, use page_views.dat.

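Write up in Markdown: compiling Alluxio
Installing Alluxio
Using Alluxio; integrating MapReduce with Alluxio
Integrating Spark with Alluxio
Provide a test report with the test results
Publish the Markdown on GitHub.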
ls -lh
