Spark command-line environment with Python
Source: Internet  Editor: 程序博客网  Time: 2024/06/14 16:05
1. Install Python, then check its version
$ python --version
Python 2.7.6
As the excerpt from the pyspark launch script below shows, Python 2.7 is used by default when it is installed (the Spark version here is spark-1.6.0-bin-hadoop2.6):
if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
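The same selection logic can be sketched in Python for readers less familiar with shell. This is only an illustration of the excerpt above, not part of Spark: the function name `default_python` is hypothetical, and the sketch assumes Python 3 (for `shutil.which`).

```python
import shutil


def default_python(which=shutil.which):
    """Mirror the shell check: prefer python2.7 when it is on PATH."""
    # `hash python2.7 2>/dev/null` succeeds only if the command can be found;
    # shutil.which returns None when it cannot.
    if which("python2.7") is not None:
        return "python2.7"
    return "python"


if __name__ == "__main__":
    print(default_python())
```

The `which` parameter exists only so the lookup can be replaced in a quick check without touching the real PATH.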
2. Run pyspark
/usr/local/spark$ bin/pyspark
16/01/24 09:34:51 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-bf84dcd6-0789-4ceb-b950-288d6617955c
16/01/24 09:34:51 INFO MemoryStore: MemoryStore started with capacity 517.4 MB
16/01/24 09:34:51 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/24 09:34:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/24 09:34:52 INFO SparkUI: Started SparkUI at http://192.168.0.101:4040
16/01/24 09:34:52 INFO Executor: Starting executor ID driver on host localhost
16/01/24 09:34:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54466.
16/01/24 09:34:52 INFO NettyBlockTransferService: Server created on 54466
16/01/24 09:34:52 INFO BlockManagerMaster: Trying to register BlockManager
16/01/24 09:34:52 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54466 with 517.4 MB RAM, BlockManagerId(driver, localhost, 54466)
16/01/24 09:34:52 INFO BlockManagerMaster: Registered BlockManager
3. Read README.md and count its lines
>>> lines=sc.textFile("README.md")
16/01/24 09:35:44 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
16/01/24 09:35:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 147.1 KB, free 147.1 KB)
16/01/24 09:35:44 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 14.3 KB, free 161.4 KB)
16/01/24 09:35:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54466 (size: 14.3 KB, free: 517.4 MB)
16/01/24 09:35:44 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
>>> lines.count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/rdd.py", line 1004, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/local/spark/python/pyspark/rdd.py", line 995, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/usr/local/spark/python/pyspark/rdd.py", line 869, in fold
    vals = self.mapPartitions(func).collect()
  File "/usr/local/spark/python/pyspark/rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError
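The error above is probably caused by the local Hadoop cluster not having been started yet: the relative path README.md is resolved against HDFS (the log in step 5 shows hdfs://namenode:9000/user/tizen/README.md), so the read fails while HDFS is down.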
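If the goal is only to read a file from the local disk, an explicit file:// URI may avoid the dependency on HDFS, since `sc.textFile` accepts URIs as well as bare paths. The helper below is a minimal sketch: the name `local_file_uri` is hypothetical, it assumes POSIX-style paths, and it assumes the file really exists at that location on the machine running pyspark.

```python
import os


def local_file_uri(path):
    """Build an explicit file:// URI for a local path (POSIX paths assumed)."""
    # abspath turns a relative path into an absolute one based on the
    # current working directory; it does not check that the file exists.
    return "file://" + os.path.abspath(path)


# In the pyspark shell, where `sc` already exists, this could be used as:
#   lines = sc.textFile(local_file_uri("README.md"))
#   lines.count()
```

Whether this is preferable to starting Hadoop depends on where the data is meant to live; the steps that follow keep the file in HDFS.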
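4. Exit pyspark with exit(), then start the Hadoop cluster
/usr/local/hadoop/sbin/start-all.sh
The process list (as printed by jps) then shows the Hadoop daemons:
3704 ResourceManager
3541 SecondaryNameNode
3194 NameNode
4155 Jps
3329 DataNode
3839 NodeManager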
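Checking that list by eye is easy to get wrong, so the check can be sketched in Python. This is an illustration only: the function name `missing_daemons` and the `EXPECTED` set are assumptions based on the daemons shown in the listing above, and the sample text is that same listing.

```python
# Daemons that appear in the listing above for a single-node setup (assumed set).
EXPECTED = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def missing_daemons(jps_output, expected=EXPECTED):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # each line looks like "<pid> <name>"
            running.add(parts[1])
    return sorted(set(expected) - running)


sample = """3704 ResourceManager
3541 SecondaryNameNode
3194 NameNode
4155 Jps
3329 DataNode
3839 NodeManager"""

print(missing_daemons(sample))  # → []
```

An empty list means every expected daemon is running; any name in the result points at a daemon to investigate before restarting pyspark.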
5. Start pyspark again
/usr/local/spark$ bin/pyspark
>>> lines=sc.textFile("README.md")
>>> lines.count()
16/01/24 09:41:44 INFO HadoopRDD: Input split: hdfs://namenode:9000/user/tizen/README.md:1679+1680
16/01/24 09:41:44 INFO HadoopRDD: Input split: hdfs://namenode:9000/user/tizen/README.md:0+1679
16/01/24 09:41:44 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/01/24 09:41:44 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/01/24 09:41:44 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/01/24 09:41:44 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/01/24 09:41:44 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/01/24 09:41:46 INFO PythonRunner: Times: total = 1903, boot = 1648, init = 254, finish = 1
16/01/24 09:41:46 INFO PythonRunner: Times: total = 51, boot = 5, init = 45, finish = 1
16/01/24 09:41:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2124 bytes result sent to driver
16/01/24 09:41:46 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2124 bytes result sent to driver
16/01/24 09:41:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2361 ms on localhost (1/2)
16/01/24 09:41:46 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2307 ms on localhost (2/2)
16/01/24 09:41:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/01/24 09:41:46 INFO DAGScheduler: ResultStage 0 (count at <stdin>:1) finished in 2.618 s
16/01/24 09:41:46 INFO DAGScheduler: Job 0 finished: count at <stdin>:1, took 2.970501 s
95
The count now succeeds and returns 95.
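The traceback in step 3 shows how pyspark's rdd.py computes count(): each partition is reduced to a single number, and those numbers are then folded together with operator.add. A plain-Python sketch of that idea, without Spark, might look like the following; `count_like_rdd` is a hypothetical name and the two small lists merely stand in for the two input splits seen in the log.

```python
import operator
from functools import reduce


def count_like_rdd(partitions):
    """Count elements the way the traceback suggests: per partition, then fold."""
    # Step 1: each partition is reduced to one number (its element count).
    per_partition = [sum(1 for _ in part) for part in partitions]
    # Step 2: the per-partition numbers are folded together, starting from 0.
    return reduce(operator.add, per_partition, 0)


# Two made-up partitions standing in for the two input splits.
parts = [["a", "b", "c"], ["d", "e"]]
print(count_like_rdd(parts))  # → 5
```

In Spark the first step runs inside the executors (the two tasks in the log), and only the small per-partition numbers travel back to the driver.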
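6. IPython is a tool that many Python programmers favor; it supports code completion, among other features. The steps are:
a. apt-get install ipython
b. Add the environment variable export IPYTHON=1 to .bashrc (this applies to Spark 1.x; Spark 2.0 and later removed IPYTHON in favor of PYSPARK_DRIVER_PYTHON=ipython)
c. Start pyspark again and repeat the commands:
/usr/local/spark$ bin/pyspark
>>> lines=sc.textFile("README.md")
>>> lines.count()
The result is the same as above; the difference is that pressing the Tab key now completes code.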
P.S.
sc may turn out to be undefined in the shell; in that case it can be imported and created manually:
from pyspark import SparkContext
sc=SparkContext()
lines=sc.textFile("README.md")
lines.count()
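A slightly more defensive variant of the lines above is sketched here. It uses `SparkContext.getOrCreate()`, which returns the running context if one exists rather than trying to construct a second one. The helper names `get_context` and `count_lines` are hypothetical, and the sketch assumes pyspark is importable in the session.

```python
def get_context():
    """Return the active SparkContext, creating one only if none exists."""
    # Imported inside the function so this file can be read or loaded
    # on a machine where pyspark is not installed.
    from pyspark import SparkContext

    return SparkContext.getOrCreate()


def count_lines(sc, path):
    """Count the lines of a text file through an existing SparkContext."""
    return sc.textFile(path).count()


# Usage inside the pyspark or IPython shell:
#   sc = get_context()
#   count_lines(sc, "README.md")
```

Keeping `sc` as a parameter of `count_lines` means the same helper works whether the context came from the shell or was created by hand.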