Spark command-line environment with Python


1. Install Python, then check the Python version

$ python --version
Python 2.7.6

As the snippet below from the pyspark startup script (bin/pyspark) shows, Python 2.7 is used by default when it is installed (the Spark version here is spark-1.6.0-bin-hadoop2.6):

if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi

2. Run pyspark

/usr/local/spark$ bin/pyspark

16/01/24 09:34:51 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-bf84dcd6-0789-4ceb-b950-288d6617955c
16/01/24 09:34:51 INFO MemoryStore: MemoryStore started with capacity 517.4 MB
16/01/24 09:34:51 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/24 09:34:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/24 09:34:52 INFO SparkUI: Started SparkUI at http://192.168.0.101:4040
16/01/24 09:34:52 INFO Executor: Starting executor ID driver on host localhost
16/01/24 09:34:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54466.
16/01/24 09:34:52 INFO NettyBlockTransferService: Server created on 54466
16/01/24 09:34:52 INFO BlockManagerMaster: Trying to register BlockManager
16/01/24 09:34:52 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54466 with 517.4 MB RAM, BlockManagerId(driver, localhost, 54466)
16/01/24 09:34:52 INFO BlockManagerMaster: Registered BlockManager
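Once the startup messages finish, the shell has already created a SparkContext and bound it to the name sc. A quick way to sanity-check the session (a minimal sketch; the exact values printed depend on your installation):

>>> sc                 # the SparkContext the pyspark shell created for you
>>> sc.version         # Spark version string, '1.6.0' for this install
>>> sc.master          # cluster URL; bin/pyspark without --master runs local[*]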
3. Read README.md into an RDD and count its lines

>>> lines = sc.textFile("README.md")

16/01/24 09:35:44 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
16/01/24 09:35:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 147.1 KB, free 147.1 KB)
16/01/24 09:35:44 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 14.3 KB, free 161.4 KB)
16/01/24 09:35:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54466 (size: 14.3 KB, free: 517.4 MB)
16/01/24 09:35:44 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2

>>> lines.count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/rdd.py", line 1004, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/local/spark/python/pyspark/rdd.py", line 995, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/usr/local/spark/python/pyspark/rdd.py", line 869, in fold
    vals = self.mapPartitions(func).collect()
  File "/usr/local/spark/python/pyspark/rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError

The error above is most likely because the local Hadoop cluster has not been started yet: the relative path "README.md" is resolved against Hadoop's default filesystem (HDFS), which is unreachable while the cluster is down.
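If you would rather not start HDFS at all, one workaround (a sketch, assuming Spark is unpacked under /usr/local/spark) is to point textFile at the local filesystem explicitly with a file:// URI:

>>> lines = sc.textFile("file:///usr/local/spark/README.md")  # file:// forces a local read instead of HDFS
>>> lines.count()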

4. Start the Hadoop cluster

Run exit() to quit pyspark.

Then start the Hadoop cluster:

/usr/local/hadoop/sbin/start-all.sh

jps now lists the Hadoop daemons:

3704 ResourceManager
3541 SecondaryNameNode
3194 NameNode
4155 Jps
3329 DataNode
3839 NodeManager

5. Start pyspark again. Note in the log below that README.md is now read from hdfs://namenode:9000/user/tizen/README.md, so the file must already exist in the HDFS home directory.

/usr/local/spark$ bin/pyspark
>>> lines = sc.textFile("README.md")
>>> lines.count()
16/01/24 09:41:44 INFO HadoopRDD: Input split: hdfs://namenode:9000/user/tizen/README.md:1679+1680
16/01/24 09:41:44 INFO HadoopRDD: Input split: hdfs://namenode:9000/user/tizen/README.md:0+1679
16/01/24 09:41:44 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/01/24 09:41:44 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/01/24 09:41:44 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/01/24 09:41:44 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/01/24 09:41:44 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/01/24 09:41:46 INFO PythonRunner: Times: total = 1903, boot = 1648, init = 254, finish = 1
16/01/24 09:41:46 INFO PythonRunner: Times: total = 51, boot = 5, init = 45, finish = 1
16/01/24 09:41:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2124 bytes result sent to driver
16/01/24 09:41:46 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2124 bytes result sent to driver
16/01/24 09:41:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2361 ms on localhost (1/2)
16/01/24 09:41:46 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2307 ms on localhost (2/2)
16/01/24 09:41:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/01/24 09:41:46 INFO DAGScheduler: ResultStage 0 (count at <stdin>:1) finished in 2.618 s
16/01/24 09:41:46 INFO DAGScheduler: Job 0 finished: count at <stdin>:1, took 2.970501 s
95

The count (95) is returned correctly.
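With the file reachable, other basic RDD operations behave the same way. A small sketch of a transformation plus a couple of actions on the same RDD (standard PySpark API; the values returned depend on the file contents):

>>> lines.first()                                        # action: first line of the file
>>> spark_lines = lines.filter(lambda l: "Spark" in l)   # transformation: lazy, no job runs yet
>>> spark_lines.count()                                  # action: triggers a job, like count() above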

6. IPython is a tool many Python programmers favor; it supports code completion and other conveniences. The steps are as follows:

a. Install IPython: apt-get install ipython

b. Add the environment variable export IPYTHON=1 to .bashrc (later Spark releases replace this variable with PYSPARK_DRIVER_PYTHON=ipython).

c. Start pyspark again and run the same commands:

/usr/local/spark$ bin/pyspark
>>> lines = sc.textFile("README.md")
>>> lines.count()
The result is the same as above; the difference is that pressing Tab now triggers code completion.

PS:

You may find that sc is not defined; in that case, import SparkContext and create it manually:

from pyspark import SparkContext

sc = SparkContext()
lines = sc.textFile("README.md")
lines.count()
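The same steps also work outside the interactive shell as a standalone script submitted with spark-submit; below is a minimal sketch (the file name count_lines.py and the app name are arbitrary placeholders):

# count_lines.py -- run with: /usr/local/spark/bin/spark-submit count_lines.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("count_lines")   # application name shown in the Spark UI
sc = SparkContext(conf=conf)

lines = sc.textFile("README.md")               # resolves against HDFS when a cluster is configured
print(lines.count())

sc.stop()                                      # shut the context down cleanly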



