pyspark-RDD API
Source: Internet. Published by: 图片数字化软件. Editor: 程序博客网. Time: 2024/06/01 10:31
References:
1、http://spark.apache.org/docs/latest/quick-start.html
2、https://github.com/mahmoudparsian/pyspark-tutorial
3、https://github.com/jkthompson/pyspark-pictures
4、http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD
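Installation reference: http://blog.csdn.net/wc781708249/article/details/78223371
1. Start Spark
Run /home/wu/down/spark/sbin/start-all.sh
If you are prompted with root@localhost's password:
and no password has been set, or you have forgotten it,
run: passwd root # reset the password
Then run ssh root@localhost and enter the new password. If you get Permission denied, please try again., edit the SSH server configuration (and restart the SSH service afterwards so the changes take effect):
vim /etc/ssh/sshd_config
- Set PasswordAuthentication yes
- Set RSAAuthentication yes
- Set PubkeyAuthentication yes
- Set PermitRootLogin yes
PermitRootLogin prohibit-password ---> change to PermitRootLogin yes
Run ssh root@localhost again; if the login succeeds, the problem is solved.
Run /home/wu/down/spark/sbin/start-all.sh # start Spark
The matching command is /home/wu/down/spark/sbin/stop-all.sh # stop Spark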
2. Upload a file
vim person.json
Enter (one JSON object per line):
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
If Hadoop is installed (see the cluster installation guide),
run:
hdfs dfs -mkdir /data
hdfs dfs -put person.json /data
Read the data from HDFS:
df = spark.read.json("hdfs://localhost:9000/data/person.json")
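Supplement: reading data
sc.pickleFile()    # <class 'pyspark.rdd.RDD'>
sc.textFile()      # <class 'pyspark.rdd.RDD'>
spark.read.json()  # <class 'pyspark.sql.dataframe.DataFrame'>
spark.read.text()  # <class 'pyspark.sql.dataframe.DataFrame'>
Reading local data (a plain-text file is read with textFile; pickleFile only reads files written by saveAsPickleFile):
sc.textFile("file:///home/mparsian/dna_seq.txt")
sc.textFile("/home/mparsian/dna_seq.txt")  # without a scheme, the path is resolved against the default file system
3. RDD API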
map
# map
x = sc.parallelize([1,2,3])  # sc = spark context, parallelize creates an RDD from the passed object
y = x.map(lambda x: (x,x**2))
print(x.collect())  # collect copies RDD elements to a list on the driver
print(y.collect())

Output:
[1, 2, 3]
[(1, 1), (2, 4), (3, 9)]
Return a new RDD by applying a function to each element of this RDD.
flatMap
# flatMap
x = sc.parallelize([1,2,3])
y = x.flatMap(lambda x: (x, 100*x, x**2))
print(x.collect())
print(y.collect())

Output:
[1, 2, 3]
[1, 100, 1, 2, 200, 4, 3, 300, 9]
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
mapPartitions
# mapPartitions
x = sc.parallelize([1,2,3], 2)
def f(iterator):
    yield sum(iterator)

y = x.mapPartitions(f)
print(x.glom().collect())  # glom() groups the elements of each partition into a list
print(y.glom().collect())

Output:
[[1], [2, 3]]
[[1], [5]]
Return a new RDD by applying a function to each partition of this RDD.
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)  # [[1,2],[3,4]]
>>> def f(iterator): yield sum(iterator)
...
>>> rdd.mapPartitions(f).collect()
[3, 7]
>>> rdd.collect()
[1, 2, 3, 4]
>>> rdd = sc.parallelize([1, 2, 3, 4], 4)  # [[1],[2],[3],[4]]
>>> rdd.mapPartitions(f).collect()
[1, 2, 3, 4]
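The per-partition behaviour can also be tried without a cluster. The following is a minimal pure-Python sketch, not Spark code: the helper name map_partitions is hypothetical, and the partition layouts are copied from the comments in the session above rather than computed.

```python
def map_partitions(partitions, f):
    """Call f once per partition; f receives an iterator and yields results."""
    return [list(f(iter(part))) for part in partitions]


def partition_sum(iterator):
    # Same generator as f in the examples above: one value per partition.
    yield sum(iterator)


# Partition layouts taken from the comments in the session above.
two_parts = [[1, 2], [3, 4]]
four_parts = [[1], [2], [3], [4]]

result_two = map_partitions(two_parts, partition_sum)
result_four = map_partitions(four_parts, partition_sum)

print(result_two)   # [[3], [7]]
print(result_four)  # [[1], [2], [3], [4]]
```

With one element per partition the function sees each element alone, which is why the second result equals the input.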
mapPartitionsWithIndex
# mapPartitionsWithIndex
x = sc.parallelize([1,2,3], 2)
def f(partitionIndex, iterator):
    yield (partitionIndex, sum(iterator))

y = x.mapPartitionsWithIndex(f)
print(x.glom().collect())  # glom() groups the elements of each partition into a list
print(y.glom().collect())

Output:
[[1], [2, 3]]
[[(0, 1)], [(1, 5)]]
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>>> def f(splitIndex, iterator): yield splitIndex
...
>>> rdd.mapPartitionsWithIndex(f).sum()
6
>>> rdd.collect()
[1, 2, 3, 4]
getNumPartitions
# getNumPartitions
x = sc.parallelize([1,2,3], 2)
y = x.getNumPartitions()
print(x.glom().collect())
print(y)

Output:
[[1], [2, 3]]
2
Returns the number of partitions in this RDD.
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> rdd.getNumPartitions()
2
>>> rdd = sc.parallelize([1, 2, 3, 4], 3)
>>> rdd.getNumPartitions()
3
>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>>> rdd.getNumPartitions()
4
filter
# filter
x = sc.parallelize([1,2,3])
y = x.filter(lambda x: x%2 == 1)  # filters out even elements
print(x.collect())
print(y.collect())

Output:
[1, 2, 3]
[1, 3]
Return a new RDD containing only the elements that satisfy a predicate.
distinct
# distinct
x = sc.parallelize(['A','A','B'])
y = x.distinct()
print(x.collect())
print(y.collect())

Output:
['A', 'A', 'B']
['A', 'B']
Return a new RDD containing the distinct elements in this RDD.
sample
# sample
x = sc.parallelize(range(7))
ylist = [x.sample(withReplacement=False, fraction=0.5) for i in range(5)]  # call 'sample' 5 times
print('x = ' + str(x.collect()))
for cnt,y in zip(range(len(ylist)), ylist):
    print('sample:' + str(cnt) + ' y = ' + str(y.collect()))

Output (samples are random and differ between runs):
x = [0, 1, 2, 3, 4, 5, 6]
sample:0 y = [0, 6]
sample:1 y = [4]
sample:2 y = [1, 2, 3]
sample:3 y = [2, 3, 5, 6]
sample:4 y = [1, 2]
Return a sampled subset of this RDD (relies on numpy and falls back on default random generator if numpy is unavailable).
takeSample
# takeSample
x = sc.parallelize(range(7))
ylist = [x.takeSample(withReplacement=False, num=3) for i in range(5)]  # call 'takeSample' 5 times
print('x = ' + str(x.collect()))
for cnt,y in zip(range(len(ylist)), ylist):
    print('sample:' + str(cnt) + ' y = ' + str(y))  # no collect on y, takeSample already returns a list

Output (samples are random and differ between runs):
x = [0, 1, 2, 3, 4, 5, 6]
sample:0 y = [5, 4, 3]
sample:1 y = [4, 0, 2]
sample:2 y = [1, 2, 4]
sample:3 y = [5, 6, 0]
sample:4 y = [3, 1, 6]
Return a fixed-size sampled subset of this RDD (currently requires numpy).
union
# union
x = sc.parallelize(['A','A','B'])
y = sc.parallelize(['D','C','A'])
z = x.union(y)
print(x.collect())
print(y.collect())
print(z.collect())

Output:
['A', 'A', 'B']
['D', 'C', 'A']
['A', 'A', 'B', 'D', 'C', 'A']
Return the union of this RDD and another one.
intersection
# intersection
x = sc.parallelize(['A','A','B'])
y = sc.parallelize(['A','C','D'])
z = x.intersection(y)
print(x.collect())
print(y.collect())
print(z.collect())

Output:
['A', 'A', 'B']
['A', 'C', 'D']
['A']
Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.
Note that this method performs a shuffle internally.
sortByKey
# sortByKey
x = sc.parallelize([('B',1),('A',2),('C',3)])
y = x.sortByKey()
print(x.collect())
print(y.collect())

Output:
[('B', 1), ('A', 2), ('C', 3)]
[('A', 2), ('B', 1), ('C', 3)]
Sorts this RDD, which is assumed to consist of (key, value) pairs.
sortBy
# sortBy
x = sc.parallelize(['Cat','Apple','Bat'])
def keyGen(val):
    return val[0]

y = x.sortBy(keyGen)
print(y.collect())

Output:
['Apple', 'Bat', 'Cat']
Sorts this RDD by the given keyfunc.
glom
# glom
x = sc.parallelize(['C','B','A'], 2)
y = x.glom()
print(x.collect())
print(y.collect())

Output:
['C', 'B', 'A']
[['C'], ['B', 'A']]
Return an RDD created by coalescing all elements within each partition into a list.
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> sorted(rdd.glom().collect())
[[1, 2], [3, 4]]
>>> rdd = sc.parallelize([1, 2, 3, 4], 3)
>>> sorted(rdd.glom().collect())
[[1], [2], [3, 4]]
>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>>> sorted(rdd.glom().collect())
[[1], [2], [3], [4]]
cartesian
# cartesian
x = sc.parallelize(['A','B'])
y = sc.parallelize(['C','D'])
z = x.cartesian(y)
print(x.collect())
print(y.collect())
print(z.collect())

Output:
['A', 'B']
['C', 'D']
[('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.
groupBy
# groupBy
x = sc.parallelize([1,2,3])
y = x.groupBy(lambda x: 'A' if (x%2 == 1) else 'B' )
print(x.collect())
print([(j[0],[i for i in j[1]]) for j in y.collect()])  # y is nested, this iterates through it

Output:
[1, 2, 3]
[('A', [1, 3]), ('B', [2])]
Return an RDD of grouped items.
pipe
# pipe
x = sc.parallelize(['A', 'Ba', 'C', 'AD'])
y = x.pipe('grep -i "A"')  # calls out to grep, may fail under Windows
print(x.collect())
print(y.collect())

Output:
['A', 'Ba', 'C', 'AD']
[u'A', u'Ba', u'AD']
Return an RDD created by piping elements to a forked external process.
foreach
# foreach
from __future__ import print_function
x = sc.parallelize([1,2,3])
def f(el):
    '''side effect: append the current RDD element to a file'''
    with open("./foreachExample.txt", 'a+') as f1:  # the file is closed after each write
        print(el, file=f1)

open('./foreachExample.txt', 'w').close()  # first clear the file contents
y = x.foreach(f)  # writes into foreachExample.txt
print(x.collect())
print(y)  # foreach returns 'None'
# print the contents of foreachExample.txt
with open("./foreachExample.txt", "r") as foreachExample:
    print(foreachExample.read())

Output (the order of the lines written to the file may vary between runs):
[1, 2, 3]
None
1
3
2
Applies a function to all elements of this RDD.
foreachPartition
# foreachPartition
from __future__ import print_function
x = sc.parallelize([1,2,3],5)
def f(partition):
    '''side effect: append the current RDD partition contents to a file'''
    with open("./foreachPartitionExample.txt", 'a+') as f1:  # the file is closed after each write
        print([el for el in partition], file=f1)

open('./foreachPartitionExample.txt', 'w').close()  # first clear the file contents
y = x.foreachPartition(f)  # writes into foreachPartitionExample.txt
print(x.glom().collect())
print(y)  # foreachPartition returns 'None'
# print the contents of foreachPartitionExample.txt
with open("./foreachPartitionExample.txt", "r") as foreachExample:
    print(foreachExample.read())

Output (the order of the lines written to the file may vary between runs):
[[], [1], [], [2], [3]]
None
[]
[1]
[]
[2]
[3]
Applies a function to each partition of this RDD.
collect
# collect
x = sc.parallelize([1,2,3])
y = x.collect()
print(x)  # distributed
print(y)  # not distributed

Output:
ParallelCollectionRDD[84] at parallelize at PythonRDD.scala:423
[1, 2, 3]
Return a list that contains all of the elements in this RDD.
reduce
# reduce
x = sc.parallelize([1,2,3])
y = x.reduce(lambda obj, accumulated: obj + accumulated)  # computes a cumulative sum
print(x.collect())
print(y)

Output:
[1, 2, 3]
6
Reduces the elements of this RDD using the specified commutative and associative binary operator. Currently reduces partitions locally.
fold
# fold
x = sc.parallelize([1,2,3])
neutral_zero_value = 0  # 0 for sum, 1 for multiplication
y = x.fold(neutral_zero_value,lambda obj, accumulated: accumulated + obj)  # computes cumulative sum
print(x.collect())
print(y)

Output:
[1, 2, 3]
6
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral “zero value.”
The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
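Why the zero value has to be neutral can be seen in a small pure-Python sketch. It assumes the zero value seeds the fold of every partition and is used once more when the per-partition results are merged; fold_sim and the partition layout are hypothetical names and data for illustration, not Spark code.

```python
from functools import reduce
from operator import add


def fold_sim(partitions, zero, op):
    """Fold each partition starting from zero, then fold the partial results."""
    per_partition = [reduce(op, part, zero) for part in partitions]
    return reduce(op, per_partition, zero)


parts = [[1], [2, 3]]  # assumed layout of [1, 2, 3] over 2 partitions

neutral = fold_sim(parts, 0, add)      # 0 is neutral for addition
not_neutral = fold_sim(parts, 1, add)  # 1 is counted once per partition and once in the merge

print(neutral)      # 6
print(not_neutral)  # 9
```

With a non-neutral zero value the result depends on the number of partitions, so the same call can give different answers on different clusters.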
aggregate
# aggregate
x = sc.parallelize([2,3,4])
neutral_zero_value = (0,1)  # sum: x+0 = x, product: 1*x = x
seqOp = (lambda aggregated, el: (aggregated[0] + el, aggregated[1] * el))
combOp = (lambda aggregated, el: (aggregated[0] + el[0], aggregated[1] * el[1]))
y = x.aggregate(neutral_zero_value,seqOp,combOp)  # computes (cumulative sum, cumulative product)
print(x.collect())
print(y)

Output:
[2, 3, 4]
(9, 24)
Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral “zero value.”
The functions op(t1, t2) are allowed to modify t1 and return it as their result value to avoid object allocation; however, they should not modify t2.
The first function (seqOp) can return a different result type, U, than the type of this RDD. Thus, we need one operation for merging a T into a U and one operation for merging two U's.
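The split between the two functions can be sketched in plain Python. This is an illustration under the assumption that seqOp runs inside each partition and combOp merges the partial results; aggregate_sim and the partition layout are hypothetical, not part of PySpark.

```python
from functools import reduce


def aggregate_sim(partitions, zero, seq_op, comb_op):
    """seq_op merges an element (T) into an accumulator (U); comb_op merges two U's."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


parts = [[2], [3, 4]]  # assumed layout of [2, 3, 4] over 2 partitions

# Same (sum, product) aggregation as in the example above.
sum_and_product = aggregate_sim(
    parts,
    (0, 1),
    lambda acc, el: (acc[0] + el, acc[1] * el),
    lambda a, b: (a[0] + b[0], a[1] * b[1]),
)

# The accumulator type (a tuple) differs from the element type (an int):
# here it carries (sum, count) so that the mean can be derived afterwards.
total, count = aggregate_sim(
    parts,
    (0, 0),
    lambda acc, el: (acc[0] + el, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
mean = total / count

print(sum_and_product)  # (9, 24)
print(mean)             # 3.0
```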
max
# max
x = sc.parallelize([1,3,2])
y = x.max()
print(x.collect())
print(y)

Output:
[1, 3, 2]
3
Find the maximum item in this RDD.
min
# min
x = sc.parallelize([1,3,2])
y = x.min()
print(x.collect())
print(y)

Output:
[1, 3, 2]
1
Find the minimum item in this RDD.
sum
# sum
x = sc.parallelize([1,3,2])
y = x.sum()
print(x.collect())
print(y)

Output:
[1, 3, 2]
6
Add up the elements in this RDD.
count
# count
x = sc.parallelize([1,3,2])
y = x.count()
print(x.collect())
print(y)

Output:
[1, 3, 2]
3
Return the number of elements in this RDD.
>>> sc.parallelize([2,3,4]).count()
3
histogram
# histogram (example #1)
x = sc.parallelize([1,3,1,2,3])
y = x.histogram(buckets = 2)
print(x.collect())
print(y)

Output:
[1, 3, 1, 2, 3]
([1, 2, 3], [2, 3])

# histogram (example #2)
x = sc.parallelize([1,3,1,2,3])
y = x.histogram([0,0.5,1,1.5,2,2.5,3,3.5])
print(x.collect())
print(y)

Output:
[1, 3, 1, 2, 3]
([0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5], [0, 0, 2, 0, 1, 0, 2])
Compute a histogram using the provided buckets. The buckets are all open to the right except for the last which is closed. e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50], which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1 and 50 we would have a histogram of 1,0,1.
If your histogram is evenly spaced (e.g. [0, 10, 20, 30]), this can be switched from an O(log n) insertion to O(1) per element (where n = # buckets).
Buckets must be sorted, must not contain any duplicates, and must have at least two elements.
If buckets is a number, it will generate buckets which are evenly spaced between the minimum and maximum of the RDD. For example, if the min value is 0 and the max is 100, given buckets as 2, the resulting buckets will be [0,50) [50,100]. buckets must be at least 1. If the RDD contains infinity, NaN throws an exception. If the elements in the RDD do not vary (max == min), it always returns a single bucket.
It will return a tuple of buckets and histogram.
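The bucket rule (right-open intervals, last one closed) can be checked with a short pure-Python sketch. histogram_sim is a hypothetical helper, not Spark's implementation; it assumes that values outside the bucket range are simply not counted, and it uses a binary search, which corresponds to the O(log n) case mentioned above.

```python
from bisect import bisect_right


def histogram_sim(values, buckets):
    """Count values per bucket; buckets are [lo, hi) except the last, which is [lo, hi]."""
    counts = [0] * (len(buckets) - 1)
    for v in values:
        if v < buckets[0] or v > buckets[-1]:
            continue  # assumption: out-of-range values are ignored
        if v == buckets[-1]:
            idx = len(counts) - 1  # the last bucket is closed on the right
        else:
            idx = bisect_right(buckets, v) - 1  # binary search over the edges
        counts[idx] += 1
    return counts


doc_example = histogram_sim([1, 50], [1, 10, 20, 50])
example_1 = histogram_sim([1, 3, 1, 2, 3], [1, 2, 3])
example_2 = histogram_sim([1, 3, 1, 2, 3], [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5])

print(doc_example)  # [1, 0, 1]
print(example_1)    # [2, 3]
print(example_2)    # [0, 0, 2, 0, 1, 0, 2]
```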
mean
# mean
x = sc.parallelize([1,3,2])
y = x.mean()
print(x.collect())
print(y)

Output:
[1, 3, 2]
2.0
Compute the mean of this RDD’s elements.
variance
# variance
x = sc.parallelize([1,3,2])
y = x.variance()  # divides by N
print(x.collect())
print(y)

Output:
[1, 3, 2]
0.666666666667
Compute the variance of this RDD’s elements.
stdev
# stdev
x = sc.parallelize([1,3,2])
y = x.stdev()  # divides by N
print(x.collect())
print(y)

Output:
[1, 3, 2]
0.816496580928
Compute the standard deviation of this RDD’s elements.
sampleStdev
# sampleStdev
x = sc.parallelize([1,3,2])
y = x.sampleStdev()  # divides by N-1
print(x.collect())
print(y)

Output:
[1, 3, 2]
1.0
Compute the sample standard deviation of this RDD’s elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N).
sampleVariance
# sampleVariance
x = sc.parallelize([1,3,2])
y = x.sampleVariance()  # divides by N-1
print(x.collect())
print(y)

Output:
[1, 3, 2]
1.0
Compute the sample variance of this RDD’s elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N).
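The difference between dividing by N and by N-1 can be reproduced by hand for the data used above. This is plain arithmetic in Python, not a call into Spark; all variable names are chosen for illustration.

```python
import math

data = [1, 3, 2]
n = len(data)
mean = sum(data) / n
squared_deviations = sum((v - mean) ** 2 for v in data)  # 1 + 1 + 0 = 2

population_variance = squared_deviations / n    # divides by N, like variance()
sample_variance = squared_deviations / (n - 1)  # divides by N-1, like sampleVariance()
population_stdev = math.sqrt(population_variance)
sample_stdev = math.sqrt(sample_variance)

print(population_variance, population_stdev)
print(sample_variance, sample_stdev)
```

The values agree with the four outputs shown above (about 0.667 and 0.816 for the population versions, 1.0 and 1.0 for the sample versions).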
countByValue
# countByValue
x = sc.parallelize([1,3,1,2,3])
y = x.countByValue()
print(x.collect())
print(y)

Output:
[1, 3, 1, 2, 3]
defaultdict(<type 'int'>, {1: 2, 2: 1, 3: 2})
Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.
top
# top
x = sc.parallelize([1,3,1,2,3])
y = x.top(num = 3)
print(x.collect())
print(y)

Output:
[1, 3, 1, 2, 3]
[3, 3, 2]
Get the top N elements from a RDD.
Note: It returns the list sorted in descending order.
takeOrdered
# takeOrdered
x = sc.parallelize([1,3,1,2,3])
y = x.takeOrdered(num = 3)
print(x.collect())
print(y)

Output:
[1, 3, 1, 2, 3]
[1, 1, 2]
Get the N elements from a RDD ordered in ascending order or as specified by the optional key function.
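For a local list, the standard-library heapq module gives the same results as top and takeOrdered in the examples above. The sketch below is only a local analogue for experimenting with the optional key function; it makes no claim about how Spark computes these results on a cluster.

```python
import heapq

data = [1, 3, 1, 2, 3]

largest = heapq.nlargest(3, data)    # analogue of top(3): descending order
smallest = heapq.nsmallest(3, data)  # analogue of takeOrdered(3): ascending order
# A key function changes the ordering; negating the value reverses it.
smallest_by_negated_key = heapq.nsmallest(3, data, key=lambda v: -v)

print(largest)                  # [3, 3, 2]
print(smallest)                 # [1, 1, 2]
print(smallest_by_negated_key)  # [3, 3, 2]
```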
take
# take
x = sc.parallelize([1,3,1,2,3])
y = x.take(num = 3)
print(x.collect())
print(y)

Output:
[1, 3, 1, 2, 3]
[1, 3, 1]
Take the first num elements of the RDD.
It works by first scanning one partition, and then using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
Translated from the Scala implementation in RDD#take().
first
# first
x = sc.parallelize([1,3,1,2,3])
y = x.first()
print(x.collect())
print(y)

Output:
[1, 3, 1, 2, 3]
1
Return the first element in this RDD.
collectAsMap
# collectAsMap
x = sc.parallelize([('C',3),('A',1),('B',2)])
y = x.collectAsMap()
print(x.collect())
print(y)

Output:
[('C', 3), ('A', 1), ('B', 2)]
{'A': 1, 'C': 3, 'B': 2}
Return the key-value pairs in this RDD to the master as a dictionary.
keys
# keys
x = sc.parallelize([('C',3),('A',1),('B',2)])
y = x.keys()
print(x.collect())
print(y.collect())

Output:
[('C', 3), ('A', 1), ('B', 2)]
['C', 'A', 'B']
Return an RDD with the keys of each tuple.
values
# values
x = sc.parallelize([('C',3),('A',1),('B',2)])
y = x.values()
print(x.collect())
print(y.collect())

Output:
[('C', 3), ('A', 1), ('B', 2)]
[3, 1, 2]
Return an RDD with the values of each tuple.
reduceByKey
# reduceByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
y = x.reduceByKey(lambda agg, obj: agg + obj)
print(x.collect())
print(y.collect())

Output:
[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
[('A', 12), ('B', 3)]
Merge the values for each key using an associative reduce function.
This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.
Output will be hash-partitioned with numPartitions partitions, or the default parallelism level if numPartitions is not specified.
reduceByKeyLocally
# reduceByKeyLocally
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
y = x.reduceByKeyLocally(lambda agg, obj: agg + obj)
print(x.collect())
print(y)

Output:
[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
{'A': 12, 'B': 3}
Merge the values for each key using an associative reduce function, but return the results immediately to the master as a dictionary.
This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.
countByKey
# countByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
y = x.countByKey()
print(x.collect())
print(y)

Output:
[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
defaultdict(<type 'int'>, {'A': 3, 'B': 2})
Count the number of elements for each key, and return the result to the master as a dictionary.
join
# join
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('A',8),('B',7),('A',6),('D',5)])
z = x.join(y)
print(x.collect())
print(y.collect())
print(z.collect())

Output:
[('C', 4), ('B', 3), ('A', 2), ('A', 1)]
[('A', 8), ('B', 7), ('A', 6), ('D', 5)]
[('A', (2, 8)), ('A', (2, 6)), ('A', (1, 8)), ('A', (1, 6)), ('B', (3, 7))]
Return an RDD containing all pairs of elements with matching keys in self and other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
Performs a hash join across the cluster.
leftOuterJoin
# leftOuterJoin
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('A',8),('B',7),('A',6),('D',5)])
z = x.leftOuterJoin(y)
print(x.collect())
print(y.collect())
print(z.collect())

Output:
[('C', 4), ('B', 3), ('A', 2), ('A', 1)]
[('A', 8), ('B', 7), ('A', 6), ('D', 5)]
[('A', (2, 8)), ('A', (2, 6)), ('A', (1, 8)), ('A', (1, 6)), ('C', (4, None)), ('B', (3, 7))]
Perform a left outer join of self and other.
For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k.
Hash-partitions the resulting RDD into the given number of partitions.
rightOuterJoin
# rightOuterJoin
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('A',8),('B',7),('A',6),('D',5)])
z = x.rightOuterJoin(y)
print(x.collect())
print(y.collect())
print(z.collect())

Output:
[('C', 4), ('B', 3), ('A', 2), ('A', 1)]
[('A', 8), ('B', 7), ('A', 6), ('D', 5)]
[('A', (2, 8)), ('A', (2, 6)), ('A', (1, 8)), ('A', (1, 6)), ('B', (3, 7)), ('D', (None, 5))]
Perform a right outer join of self and other.
For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (v, w)) for v in this, or the pair (k, (None, w)) if no elements in self have key k.
Hash-partitions the resulting RDD into the given number of partitions.
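The three join variants above differ only in which keys are kept and where None is filled in. A minimal single-machine sketch, assuming plain lists of (key, value) pairs; join_sim and its how parameter are hypothetical names, not PySpark API.

```python
from collections import defaultdict


def join_sim(left, right, how="inner"):
    """Join two lists of (key, value) pairs by building a hash index per side."""
    left_idx, right_idx = defaultdict(list), defaultdict(list)
    for k, v in left:
        left_idx[k].append(v)
    for k, w in right:
        right_idx[k].append(w)

    if how == "inner":
        keys = left_idx.keys() & right_idx.keys()
    elif how == "left":
        keys = left_idx.keys()
    elif how == "right":
        keys = right_idx.keys()
    else:
        raise ValueError("how must be 'inner', 'left' or 'right'")

    result = []
    for k in keys:
        vs = left_idx.get(k) or [None]   # a missing side contributes a single None
        ws = right_idx.get(k) or [None]
        result.extend((k, (v, w)) for v in vs for w in ws)
    return result


x = [('C', 4), ('B', 3), ('A', 2), ('A', 1)]
y = [('A', 8), ('B', 7), ('A', 6), ('D', 5)]

inner = join_sim(x, y, "inner")
left_outer = join_sim(x, y, "left")
right_outer = join_sim(x, y, "right")
print(len(inner), len(left_outer), len(right_outer))  # 5 6 6
```

The order of the returned pairs is not meaningful here, just as the order of collect() after a join should not be relied on.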
partitionBy
# partitionBy
x = sc.parallelize([(0,1),(1,2),(2,3)],2)
y = x.partitionBy(numPartitions = 3, partitionFunc = lambda x: x)  # only key is passed to partitionFunc
print(x.glom().collect())
print(y.glom().collect())

Output:
[[(0, 1)], [(1, 2), (2, 3)]]
[[(0, 1)], [(1, 2)], [(2, 3)]]
Return a copy of the RDD partitioned using the specified partitioner.
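How a key ends up in a partition can be sketched as follows, assuming the target partition is partitionFunc(key) modulo numPartitions. partition_by_sim is a hypothetical helper, and Python's built-in hash stands in for the default partition function, which is an assumption made for this illustration only.

```python
def partition_by_sim(pairs, num_partitions, partition_func=hash):
    """Place each (key, value) pair into bucket partition_func(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_func(key) % num_partitions].append((key, value))
    return partitions


pairs = [(0, 1), (1, 2), (2, 3)]

three_parts = partition_by_sim(pairs, 3, lambda k: k)
two_parts = partition_by_sim(pairs, 2, lambda k: k)

print(three_parts)  # [[(0, 1)], [(1, 2)], [(2, 3)]]
print(two_parts)    # [[(0, 1), (2, 3)], [(1, 2)]]
```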
combineByKey
# combineByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
createCombiner = (lambda el: [(el,el**2)])
mergeVal = (lambda aggregated, el: aggregated + [(el,el**2)])  # append to aggregated
mergeComb = (lambda agg1,agg2: agg1 + agg2 )  # append agg1 with agg2
y = x.combineByKey(createCombiner,mergeVal,mergeComb)
print(x.collect())
print(y.collect())

Output:
[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
[('A', [(3, 9), (4, 16), (5, 25)]), ('B', [(1, 1), (2, 4)])]
Generic function to combine the elements for each key using a custom set of aggregation functions.
Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a “combined type” C. Note that V and C can be different – for example, one might group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]).
Users provide three functions:
- createCombiner, which turns a V into a C (e.g., creates a one-element list)
- mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
- mergeCombiners, to combine two C’s into a single one.
In addition, users can control the partitioning of the output RDD.
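When each of the three functions is called can be traced in a pure-Python sketch. It assumes createCombiner is called for the first value of a key within a partition, mergeValue for later values in the same partition, and mergeCombiners when partial results from different partitions meet. combine_by_key_sim and the partition layout are hypothetical.

```python
def combine_by_key_sim(partitions, create_combiner, merge_value, merge_combiners):
    """Combine values per key inside each partition, then merge across partitions."""
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)
            else:
                combiners[key] = create_combiner(value)  # first value of this key here
        per_partition.append(combiners)

    merged = {}
    for combiners in per_partition:
        for key, combiner in combiners.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged


# Assumed layout: key 'A' is split across both partitions.
parts = [[('B', 1), ('B', 2), ('A', 3)], [('A', 4), ('A', 5)]]

combined = combine_by_key_sim(
    parts,
    lambda el: [(el, el ** 2)],
    lambda agg, el: agg + [(el, el ** 2)],
    lambda agg1, agg2: agg1 + agg2,
)
print(combined['A'])  # [(3, 9), (4, 16), (5, 25)]
print(combined['B'])  # [(1, 1), (2, 4)]
```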
aggregateByKey
# aggregateByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
zeroValue = []  # empty list is 'zero value' for append operation
mergeVal = (lambda aggregated, el: aggregated + [(el,el**2)])
mergeComb = (lambda agg1,agg2: agg1 + agg2 )
y = x.aggregateByKey(zeroValue,mergeVal,mergeComb)
print(x.collect())
print(y.collect())

Output:
[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
[('A', [(3, 9), (4, 16), (5, 25)]), ('B', [(1, 1), (2, 4)])]
Aggregate the values of each key, using given combine functions and a neutral “zero value”. This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U’s. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
foldByKey
# foldByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
zeroValue = 1  # one is 'zero value' for multiplication
y = x.foldByKey(zeroValue,lambda agg,x: agg*x )  # computes cumulative product within each key
print(x.collect())
print(y.collect())

Output:
[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
[('A', 60), ('B', 2)]
Merge the values for each key using an associative function “func” and a neutral “zeroValue” which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication.).
groupByKey
# groupByKey
x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])
y = x.groupByKey()
print(x.collect())
print([(j[0],[i for i in j[1]]) for j in y.collect()])

Output:
[('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)]
[('A', [3, 2, 1]), ('B', [5, 4])]
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD into numPartitions partitions.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey will provide much better performance.
flatMapValues
# flatMapValues
x = sc.parallelize([('A',(1,2,3)),('B',(4,5))])
y = x.flatMapValues(lambda x: [i**2 for i in x])  # function is applied to entire value, then result is flattened
print(x.collect())
print(y.collect())

Output:
[('A', (1, 2, 3)), ('B', (4, 5))]
[('A', 1), ('A', 4), ('A', 9), ('B', 16), ('B', 25)]
Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD’s partitioning.
mapValues
# mapValues
x = sc.parallelize([('A',(1,2,3)),('B',(4,5))])
y = x.mapValues(lambda x: [i**2 for i in x])  # function is applied to entire value
print(x.collect())
print(y.collect())

Output:
[('A', (1, 2, 3)), ('B', (4, 5))]
[('A', [1, 4, 9]), ('B', [16, 25])]
Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning.
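The difference between mapValues and flatMapValues comes down to one extra loop over the function's result. A small pure-Python sketch on a local list; map_values_sim and flat_map_values_sim are hypothetical helper names.

```python
def map_values_sim(pairs, f):
    """Apply f to each value; keys and the number of pairs stay the same."""
    return [(key, f(value)) for key, value in pairs]


def flat_map_values_sim(pairs, f):
    """Apply f to each value and emit one (key, item) pair per item of the result."""
    return [(key, item) for key, value in pairs for item in f(value)]


pairs = [('A', (1, 2, 3)), ('B', (4, 5))]


def square_all(values):
    return [i ** 2 for i in values]


mapped = map_values_sim(pairs, square_all)
flat_mapped = flat_map_values_sim(pairs, square_all)

print(mapped)       # [('A', [1, 4, 9]), ('B', [16, 25])]
print(flat_mapped)  # [('A', 1), ('A', 4), ('A', 9), ('B', 16), ('B', 25)]
```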
groupWith
# groupWith
x = sc.parallelize([('C',4),('B',(3,3)),('A',2),('A',(1,1))])
y = sc.parallelize([('B',(7,7)),('A',6),('D',(5,5))])
z = sc.parallelize([('D',9),('B',(8,8))])
a = x.groupWith(y,z)
print(x.collect())
print(y.collect())
print(z.collect())
print("Result:")
for key,val in list(a.collect()):
    print(key, [list(i) for i in val])

Output:
[('C', 4), ('B', (3, 3)), ('A', 2), ('A', (1, 1))]
[('B', (7, 7)), ('A', 6), ('D', (5, 5))]
[('D', 9), ('B', (8, 8))]
Result:
D [[], [(5, 5)], [9]]
C [[4], [], []]
B [[(3, 3)], [(7, 7)], [(8, 8)]]
A [[2, (1, 1)], [6], []]
Alias for cogroup but with support for multiple RDDs.
cogroup
# cogroup
x = sc.parallelize([('C',4),('B',(3,3)),('A',2),('A',(1,1))])
y = sc.parallelize([('A',8),('B',7),('A',6),('D',(5,5))])
z = x.cogroup(y)
print(x.collect())
print(y.collect())
for key,val in list(z.collect()):
    print(key, [list(i) for i in val])

Output:
[('C', 4), ('B', (3, 3)), ('A', 2), ('A', (1, 1))]
[('A', 8), ('B', 7), ('A', 6), ('D', (5, 5))]
A [[2, (1, 1)], [8, 6]]
C [[4], []]
B [[(3, 3)], [7]]
D [[], [(5, 5)]]
For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
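The shape of the cogroup result (one list of values per input, per key) can be reproduced on local lists. cogroup_sim is a hypothetical helper that accepts any number of inputs, so it also mirrors the groupWith example; it is a sketch, not Spark's implementation.

```python
def cogroup_sim(*datasets):
    """For every key in any input, collect that key's values from each input."""
    keys = []
    for pairs in datasets:
        for key, _ in pairs:
            if key not in keys:
                keys.append(key)  # keep first-seen order so the result is deterministic
    return {
        key: tuple([value for k, value in pairs if k == key] for pairs in datasets)
        for key in keys
    }


x = [('C', 4), ('B', (3, 3)), ('A', 2), ('A', (1, 1))]
y = [('A', 8), ('B', 7), ('A', 6), ('D', (5, 5))]

grouped = cogroup_sim(x, y)
for key, value_lists in grouped.items():
    print(key, [list(values) for values in value_lists])
```

A key that is missing from one input gets an empty list for that input, which is what distinguishes cogroup from join.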
sampleByKey
# sampleByKey
x = sc.parallelize([('A',1),('B',2),('C',3),('B',4),('A',5)])
y = x.sampleByKey(withReplacement=False, fractions={'A':0.5, 'B':1, 'C':0.2})
print(x.collect())
print(y.collect())

Output (samples are random and differ between runs):
[('A', 1), ('B', 2), ('C', 3), ('B', 4), ('A', 5)]
[('A', 1), ('B', 2), ('B', 4)]
Return a subset of this RDD sampled by key (via stratified sampling). Create a sample of this RDD using variable sampling rates for different keys as specified by fractions, a key to sampling rate map.
subtractByKey
# subtractByKey
x = sc.parallelize([('C',1),('B',2),('A',3),('A',4)])
y = sc.parallelize([('A',5),('D',6),('A',7),('D',8)])
z = x.subtractByKey(y)
print(x.collect())
print(y.collect())
print(z.collect())

Output:
[('C', 1), ('B', 2), ('A', 3), ('A', 4)]
[('A', 5), ('D', 6), ('A', 7), ('D', 8)]
[('C', 1), ('B', 2)]
Return each (key, value) pair in self that has no pair with matching key in other.
subtract
# subtract
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('C',8),('A',2),('D',1)])
z = x.subtract(y)
print(x.collect())
print(y.collect())
print(z.collect())

Output:
[('C', 4), ('B', 3), ('A', 2), ('A', 1)]
[('C', 8), ('A', 2), ('D', 1)]
[('A', 1), ('C', 4), ('B', 3)]
Return each value in self that is not contained in other.
keyBy
# keyBy
x = sc.parallelize([1,2,3])
y = x.keyBy(lambda x: x**2)
print(x.collect())
print(y.collect())

Output:
[1, 2, 3]
[(1, 1), (4, 2), (9, 3)]
Creates tuples of the elements in this RDD by applying f; the result of f becomes the key and the original element becomes the value.
repartition
# repartition
x = sc.parallelize([1,2,3,4,5],2)
y = x.repartition(numPartitions=3)
print(x.glom().collect())
print(y.glom().collect())

Output:
[[1, 2], [3, 4, 5]]
[[], [1, 2, 3, 4], [5]]
Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
coalesce
# coalesce
x = sc.parallelize([1,2,3,4,5],2)
y = x.coalesce(numPartitions=1)
print(x.glom().collect())
print(y.glom().collect())

Output:
[[1, 2], [3, 4, 5]]
[[1, 2, 3, 4, 5]]
Return a new RDD that is reduced into numPartitions partitions.
zip
# zip
x = sc.parallelize(['B','A','A'])
y = x.map(lambda x: ord(x))  # zip expects x and y to have same #partitions and #elements/partition
z = x.zip(y)
print(x.collect())
print(y.collect())
print(z.collect())

Output:
['B', 'A', 'A']
[66, 65, 65]
[('B', 66), ('A', 65), ('A', 65)]
Zips this RDD with another one, returning key-value pairs made of the first element in each RDD, the second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).
zipWithIndex
# zipWithIndex
x = sc.parallelize(['B','A','A'],2)
y = x.zipWithIndex()
print(x.glom().collect())
print(y.collect())

Output:
[['B'], ['A', 'A']]
[('B', 0), ('A', 1), ('A', 2)]
Zips this RDD with its element indices.
The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.
This method needs to trigger a Spark job when this RDD contains more than one partition.
zipWithUniqueId
# zipWithUniqueId
x = sc.parallelize(['B','A','A'],2)
y = x.zipWithUniqueId()
print(x.glom().collect())
print(y.collect())

Output:
[['B'], ['A', 'A']]
[('B', 0), ('A', 1), ('A', 3)]
Zips this RDD with generated unique Long ids.
Items in the kth partition will get ids k, n+k, 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method won’t trigger a Spark job, which is different from zipWithIndex.
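Both numbering schemes can be compared side by side in plain Python. The helpers zip_with_index_sim and zip_with_unique_id_sim are hypothetical; the partition layout is the one printed by glom() in the examples above.

```python
def zip_with_index_sim(partitions):
    """Consecutive indices: needs to know how many items precede each partition."""
    result, index = [], 0
    for part in partitions:
        for item in part:
            result.append((item, index))
            index += 1
    return result


def zip_with_unique_id_sim(partitions):
    """The i-th item of partition k gets id i*n + k, where n is the number of partitions."""
    n = len(partitions)
    return [
        (item, i * n + k)
        for k, part in enumerate(partitions)
        for i, item in enumerate(part)
    ]


parts = [['B'], ['A', 'A']]

with_index = zip_with_index_sim(parts)
with_unique_id = zip_with_unique_id_sim(parts)

print(with_index)      # [('B', 0), ('A', 1), ('A', 2)]
print(with_unique_id)  # [('B', 0), ('A', 1), ('A', 3)]
```

The unique id of an item depends only on its own partition number and position, so no partition sizes have to be computed first; the price is the gap at id 2.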
Using a .py script
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from pyspark.context import SparkContext
from pyspark.conf import SparkConf
#from pyspark.sql import DataFrame,SQLContext
sc = SparkContext(conf=SparkConf().setAppName("The first example"))
# map
x = sc.parallelize([1,2,3])  # sc = spark context, parallelize creates an RDD from the passed object
y = x.map(lambda x: (x,x**2))
print(x.collect())  # collect copies RDD elements to a list on the driver
print(y.collect())
Run: spark/bin/spark-submit test.py