Writing WordCount in Python on Spark


WordCount is the classic MapReduce program.
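
Before looking at the Spark version, a tiny pure-Python sketch of the same map-then-reduce idea may help (this snippet is my own illustration and is not part of the original post):

# Pure-Python illustration of the WordCount map/reduce pattern.
# The Spark version later in this post does the same thing in parallel on an RDD.
from collections import defaultdict

def word_count(lines):
    counts = defaultdict(int)
    for line in lines:                 # "map" phase: split each line into words
        for word in line.split(" "):
            counts[word] += 1          # "reduce" phase: merge counts per word
    return dict(counts)

print(word_count(["hello spark", "hello python"]))  # counts: hello -> 2, spark -> 1, python -> 1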

Environment: Linux + Spark 1.6.2 + PyCharm

Reference documentation: http://spark.apache.org/docs/1.6.2/api/python/pyspark.html

Preparation: first install Java, Maven, and the other prerequisites, then download the latest Spark release and extract it to the /data/work/spark-1.6.2 directory (the file I downloaded was spark-1.6.2.tgz).

Build it with Maven:

cd spark-1.6.2
mvn -DskipTests clean package

The whole build is fairly painful and takes a long time.


Once the build finishes, set up the development environment in PyCharm.


Set the environment variables as follows (see the sketch below):


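The original post relied on a screenshot here. As an assumption about what it showed, PySpark 1.6 development in PyCharm usually needs SPARK_HOME in the environment and the Spark Python sources plus the bundled py4j zip on PYTHONPATH. The sketch below sets the equivalent values from Python before importing pyspark (the paths and the py4j-0.9-src.zip name are assumptions based on the spark-1.6.2 layout, not taken from the post):

# Hedged sketch: make PySpark 1.6.2 importable without editing the run configuration.
# The SPARK_HOME path and the py4j zip version are assumptions.
import os
import sys

SPARK_HOME = "/data/work/spark-1.6.2"   # the directory Spark was extracted to above
os.environ["SPARK_HOME"] = SPARK_HOME
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.9-src.zip"))

from pyspark import SparkContext        # should now import without errors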
The program source code is as follows:

#!/usr/bin/env python
# encoding: utf-8
# Code notes:
# Reference documentation:
# http://spark.apache.org/docs/1.6.2/api/python/pyspark.html
import logging
from operator import add
from pyspark import SparkContext

"""
@version:
@software: PyCharm
@file: test_python_word_count.py
@time: 16-7-4 10:39 a.m.
"""

logging.basicConfig(format='%(message)s', level=logging.INFO)

test_file_name = "/data/work/python-workspace/hualv/spark/test-data.txt"
out_file_name = "/data/work/python-workspace/hualv/spark/spark-out"

# Word Count
sc = SparkContext("local", "Simple App")
# text_file RDD object
text_file = sc.textFile(test_file_name)
# counts: split each line into words, map each word to (word, 1), then sum the counts per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(out_file_name)

# # flatMap: map first, then flatten. Return a new RDD by first applying a function
# # to all elements of this RDD, and then flattening the results.
# rdd = sc.parallelize([2, 3, 4])
# print(rdd.flatMap(lambda x: range(1, x)).collect())
# # map applies the function to each element directly
# rdd = sc.parallelize(["b", "a", "c"])
# print(rdd.map(lambda x: (x, 1)).collect())
# # reduceByKey: merge the values for each key using an associative reduce function.
# rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
# print(rdd.reduceByKey(add).collect())
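
As an optional follow-up that is not part of the original script, the counts RDD can also be inspected directly from the driver instead of only writing it out with saveAsTextFile; takeOrdered is a standard PySpark RDD method and returns the most frequent words here:

# Hedged extension of the script above: print the ten most frequent words.
# Assumes `counts` and `sc` from the script are still alive.
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])
for word, n in top10:
    print("%s\t%d" % (word, n))
sc.stop()   # release the local SparkContext when done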

The run output is as follows:

/usr/bin/python3.4 /data/work/python-workspace/hualv/test/test_python_word_count.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/07/04 19:49:36 INFO SparkContext: Running Spark version 1.6.2
16/07/04 19:49:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/04 19:49:36 WARN Utils: Your hostname, homolo14-PC resolves to a loopback address: 127.0.1.1; using 192.168.10.197 instead (on interface eth0)
16/07/04 19:49:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/07/04 19:49:36 INFO SecurityManager: Changing view acls to: homolo
16/07/04 19:49:36 INFO SecurityManager: Changing modify acls to: homolo
16/07/04 19:49:36 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(homolo); users with modify permissions: Set(homolo)
16/07/04 19:49:36 INFO Utils: Successfully started service 'sparkDriver' on port 43166.
16/07/04 19:49:36 INFO Slf4jLogger: Slf4jLogger started
16/07/04 19:49:37 INFO Remoting: Starting remoting
16/07/04 19:49:37 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.10.197:59345]
16/07/04 19:49:37 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 59345.
16/07/04 19:49:37 INFO SparkEnv: Registering MapOutputTracker
16/07/04 19:49:37 INFO SparkEnv: Registering BlockManagerMaster
16/07/04 19:49:37 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-d6e6a3ad-739d-4955-9888-5172d4df15a5
16/07/04 19:49:37 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/07/04 19:49:37 INFO SparkEnv: Registering OutputCommitCoordinator
16/07/04 19:49:37 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/07/04 19:49:37 INFO SparkUI: Started SparkUI at http://192.168.10.197:4040
16/07/04 19:49:37 INFO Executor: Starting executor ID driver on host localhost
16/07/04 19:49:37 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38041.
16/07/04 19:49:37 INFO NettyBlockTransferService: Server created on 38041
16/07/04 19:49:37 INFO BlockManagerMaster: Trying to register BlockManager
16/07/04 19:49:37 INFO BlockManagerMasterEndpoint: Registering block manager localhost:38041 with 511.1 MB RAM, BlockManagerId(driver, localhost, 38041)
16/07/04 19:49:37 INFO BlockManagerMaster: Registered BlockManager
16/07/04 19:49:37 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 107.7 KB, free 107.7 KB)
16/07/04 19:49:38 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.8 KB, free 117.5 KB)
16/07/04 19:49:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:38041 (size: 9.8 KB, free: 511.1 MB)
16/07/04 19:49:38 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
16/07/04 19:49:38 INFO FileInputFormat: Total input paths to process : 1
16/07/04 19:49:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/07/04 19:49:38 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/07/04 19:49:38 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/07/04 19:49:38 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/07/04 19:49:38 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/07/04 19:49:38 INFO SparkContext: Starting job: saveAsTextFile at NativeMethodAccessorImpl.java:-2
16/07/04 19:49:38 INFO DAGScheduler: Registering RDD 3 (reduceByKey at /data/work/python-workspace/hualv/test/test_python_word_count.py:31)
16/07/04 19:49:38 INFO DAGScheduler: Got job 0 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) with 1 output partitions
16/07/04 19:49:38 INFO DAGScheduler: Final stage: ResultStage 1 (saveAsTextFile at NativeMethodAccessorImpl.java:-2)
16/07/04 19:49:38 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/07/04 19:49:38 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/07/04 19:49:38 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /data/work/python-workspace/hualv/test/test_python_word_count.py:31), which has no missing parents
16/07/04 19:49:38 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.4 KB, free 125.9 KB)
16/07/04 19:49:38 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.4 KB, free 131.3 KB)
16/07/04 19:49:38 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:38041 (size: 5.4 KB, free: 511.1 MB)
16/07/04 19:49:38 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/07/04 19:49:38 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at /data/work/python-workspace/hualv/test/test_python_word_count.py:31)
16/07/04 19:49:38 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/07/04 19:49:38 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
16/07/04 19:49:38 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/07/04 19:49:38 INFO HadoopRDD: Input split: file:/data/work/python-workspace/hualv/spark/test-data.txt:0+306
16/07/04 19:49:38 INFO PythonRunner: Times: total = 439, boot = 432, init = 4, finish = 3
16/07/04 19:49:39 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2317 bytes result sent to driver
16/07/04 19:49:39 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 516 ms on localhost (1/1)
16/07/04 19:49:39 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/07/04 19:49:39 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at /data/work/python-workspace/hualv/test/test_python_word_count.py:31) finished in 0.529 s
16/07/04 19:49:39 INFO DAGScheduler: looking for newly runnable stages
16/07/04 19:49:39 INFO DAGScheduler: running: Set()
16/07/04 19:49:39 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/07/04 19:49:39 INFO DAGScheduler: failed: Set()
16/07/04 19:49:39 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[8] at saveAsTextFile at NativeMethodAccessorImpl.java:-2), which has no missing parents
16/07/04 19:49:39 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 52.3 KB, free 183.5 KB)
16/07/04 19:49:39 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.3 KB, free 202.8 KB)
16/07/04 19:49:39 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:38041 (size: 19.3 KB, free: 511.1 MB)
16/07/04 19:49:39 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/07/04 19:49:39 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[8] at saveAsTextFile at NativeMethodAccessorImpl.java:-2)
16/07/04 19:49:39 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/07/04 19:49:39 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,NODE_LOCAL, 1894 bytes)
16/07/04 19:49:39 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/07/04 19:49:39 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
16/07/04 19:49:39 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
16/07/04 19:49:39 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
16/07/04 19:49:39 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
16/07/04 19:49:39 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/07/04 19:49:39 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
16/07/04 19:49:39 INFO PythonRunner: Times: total = 39, boot = -76, init = 115, finish = 0
16/07/04 19:49:39 INFO FileOutputCommitter: Saved output of task 'attempt_201607041949_0001_m_000000_1' to file:/data/work/python-workspace/hualv/spark/spark-out/_temporary/0/task_201607041949_0001_m_000000
16/07/04 19:49:39 INFO SparkHadoopMapRedUtil: attempt_201607041949_0001_m_000000_1: Committed
16/07/04 19:49:39 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1229 bytes result sent to driver
16/07/04 19:49:39 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 94 ms on localhost (1/1)
16/07/04 19:49:39 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
16/07/04 19:49:39 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) finished in 0.095 s
16/07/04 19:49:39 INFO DAGScheduler: Job 0 finished: saveAsTextFile at NativeMethodAccessorImpl.java:-2, took 0.712578 s
16/07/04 19:49:39 INFO SparkContext: Invoking stop() from shutdown hook

Process finished with exit code 0

The contents of test-data.txt are as follows:

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

Run result screenshot (image not reproduced here):


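Since the screenshot is not reproduced in this text version, one way to check the result is to read the part files that saveAsTextFile wrote. A plain-Python sketch (the directory is the out_file_name used above):

# Hedged sketch: print the contents of the WordCount output directory.
import glob

out_dir = "/data/work/python-workspace/hualv/spark/spark-out"
for path in sorted(glob.glob(out_dir + "/part-*")):
    print("==>", path)
    with open(path) as f:
        print(f.read())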
