Spark: RDDs and Basic Operations

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 10:48

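RDD: Spark's core abstraction for data is the Resilient Distributed Dataset (RDD). An RDD is simply a distributed collection of elements. In Spark, all work on data comes down to creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result. Behind the scenes, Spark automatically distributes the data in an RDD across the cluster and parallelizes the operations performed on it.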


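An RDD in Spark is an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.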
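To make the idea of an immutable, partitioned collection concrete, here is a toy model in plain Python. It does not use PySpark and is not how Spark is implemented; `ToyRDD`, `split_into_partitions`, and the contiguous chunking rule are hypothetical names and choices made only for illustration. What it tries to show is that a transformation returns a new collection and leaves the original partitions untouched.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous chunks (hypothetical helper)."""
    items = list(items)
    n = len(items)
    return tuple(
        tuple(items[i * n // num_partitions:(i + 1) * n // num_partitions])
        for i in range(num_partitions)
    )


class ToyRDD(object):
    """Toy stand-in for an RDD: an immutable tuple of partitions."""

    def __init__(self, partitions):
        self._partitions = tuple(tuple(p) for p in partitions)

    def partitions(self):
        return self._partitions

    def map(self, f):
        # Each partition is processed independently; on a real cluster this
        # per-partition work is what can be spread over different nodes.
        return ToyRDD(tuple(f(x) for x in p) for p in self._partitions)

    def collect(self):
        # Flatten all partitions back into a single list.
        return [x for p in self._partitions for x in p]


nums = ToyRDD(split_into_partitions([1, 2, 3, 4, 5], 2))
squared = nums.map(lambda x: x * x)  # returns a new ToyRDD

print(nums.partitions())   # ((1, 2), (3, 4, 5))
print(squared.collect())   # [1, 4, 9, 16, 25]
print(nums.collect())      # [1, 2, 3, 4, 5], the original is unchanged
```

In real Spark the partitioning and the placement of partitions on nodes are decided by the framework, so the exact split shown here should not be read as what `sc.parallelize` would produce.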



# coding: utf-8
"""SimpleApp"""
from __future__ import division, print_function

from pyspark import SparkContext, StorageLevel

logFile = "/usr/pro/spark-2.0.0-bin-hadoop2.7/README.md"
sc = SparkContext("local", "Simple App")  # initialize Spark from Python
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a : %i, lines with b : %i" % (numAs, numBs))

# Alternative way to initialize Spark from Python:
# from pyspark import SparkConf
# conf = SparkConf().setMaster("local").setAppName("My App")
# sc = SparkContext(conf = conf)

inputRDD = sc.textFile("/usr/pro/projects/vminst.log")  # create an RDD
errorRDD = inputRDD.filter(lambda x: "ERROR" in x)
warningRDD = inputRDD.filter(lambda x: "WARNING" in x)
badLinesRDD = errorRDD.union(warningRDD)
# print(errorRDD.first())

# Use actions to count the bad lines and show a few of them
print("Input had " + str(badLinesRDD.count()) + " concerning lines")
print("Here are 10 examples:")
for line in badLinesRDD.take(10):
    print(line)

# Square each value in the RDD
nums = sc.parallelize([1, 2, 3, 4, 5])
squared = nums.map(lambda x: x * x).collect()
for num in squared:
    print("%i " % (num))

# flatMap() splits each line into words and flattens the result
lines = sc.parallelize(["hello world", "hi"])
lines.persist(StorageLevel.MEMORY_ONLY)  # persist in memory
words = lines.flatMap(lambda line: line.split(" ")).collect()
words2 = lines.map(lambda line: line.split(" ")).collect()
for word in words:
    print(word)
    # hello
    # world
    # hi
for word in words2:
    print(word)
    # ['hello', 'world']
    # ['hi']

# Compute the mean with aggregate(), carrying a (sum, count) pair
sumcount = nums.aggregate(
    (0, 0),  # initial value
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),  # within a partition: (sum, count)
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))  # merge across partitions
print(sumcount[0] / sumcount[1])
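The `aggregate()` call at the end of the listing takes three arguments: an initial value, a function that folds each element into a per-partition accumulator, and a function that merges accumulators from different partitions. The plain-Python sketch below mimics that two-phase flow so each step can be inspected without a cluster. `toy_aggregate` and the hand-picked split into two partitions are hypothetical and only for illustration; this is not PySpark's implementation, and Python 3 division is assumed.

```python
from functools import reduce


def toy_aggregate(partitions, zero, seq_op, comb_op):
    """Mimic the two phases of RDD.aggregate on a list of partitions."""
    # Phase 1: fold every partition separately, each starting from zero.
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    # Phase 2: merge the per-partition accumulators into one result.
    return reduce(comb_op, per_partition, zero)


# Assumed split of [1, 2, 3, 4, 5] into two partitions (illustrative only).
partitions = [[1, 2], [3, 4, 5]]

sum_count = toy_aggregate(
    partitions,
    (0, 0),                                               # (sum, count) start value
    lambda acc, value: (acc[0] + value, acc[1] + 1),      # within a partition
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]),  # across partitions
)
mean = sum_count[0] / sum_count[1]

print(sum_count)  # (15, 5)
print(mean)       # 3.0
```

Carrying a (sum, count) pair is what lets the mean be computed in a single pass: the accumulator type differs from the element type, which is the situation `aggregate()` is suited to.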

