A First Look at PySpark (1): Learning Spark


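Launching

pyspark
IPYTHON=1 pyspark
IPYTHON_OPTS="notebook" pyspark

On Windows, set the variable first and then launch:

set IPYTHON=1
pyspark

IPYTHON and IPYTHON_OPTS are Spark 1.x settings; Spark 2.0 and later use PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.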

Running a Python script

spark-submit my_script.py

Initializing the SparkContext

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("Myapp")
sc = SparkContext(conf=conf)

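Ch. 5: Reading CSV data

If no field contains an embedded newline

import csv
import io  # Python 3: io.StringIO replaces the Python 2 StringIO module
...
def loadRecord(line):
    """Parse a CSV line"""
    stream = io.StringIO(line)
    reader = csv.DictReader(stream, fieldnames=["name", "favoriteAnimal"])
    return next(reader)  # Python 3 form of reader.next()

# inputFile is the path to the CSV data, defined elsewhere
records = sc.textFile(inputFile).map(loadRecord)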
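To see what the per-line parser hands back to Spark, the same csv.DictReader pattern can be tried on a single string without a cluster. This is a minimal sketch: the function name parse_csv_line, the field names, and the sample line are all made up for illustration.

```python
import csv
import io

# Hypothetical field names, chosen only for this illustration.
FIELDS = ["name", "favoriteAnimal"]

def parse_csv_line(line):
    """Parse one CSV line into a dict keyed by FIELDS."""
    reader = csv.DictReader(io.StringIO(line), fieldnames=FIELDS)
    return next(reader)

# Made-up sample line; the comma inside the quotes stays in its field.
record = parse_csv_line('holden,"panda, red"')
print(record["name"])            # holden
print(record["favoriteAnimal"])  # panda, red
```

A function of this shape is what gets passed to map() on the RDD returned by sc.textFile, so each input line becomes one dict.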

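If fields contain embedded newlines

Each file has to be loaded in full, because a single record may span several lines.

import csv
import io

def loadRecords(fileNameContents):
    """Load all the records in a given file"""
    # fileNameContents is a (filename, contents) pair from wholeTextFiles
    stream = io.StringIO(fileNameContents[1])
    reader = csv.DictReader(stream, fieldnames=["name", "favoriteAnimal"])
    return reader

# inputFile is the path to the CSV data, defined elsewhere
fullFileData = sc.wholeTextFiles(inputFile).flatMap(loadRecords)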
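The difference between the two approaches shows up as soon as a quoted field contains a newline. The sketch below uses only the standard library; the file name, field names, and contents are invented for illustration, and parse_whole_file is a hypothetical stand-in for the function passed to flatMap.

```python
import csv
import io

FIELDS = ["name", "favoriteAnimal"]  # hypothetical field names

# Made-up file contents: the first record has a newline inside a quoted field.
contents = 'holden,"giant\npanda"\nbob,cat\n'

# Splitting on newlines first cuts the first record in half.
lines = contents.splitlines()

def parse_whole_file(file_name_contents):
    """Parse every record from a (filename, contents) pair."""
    _, text = file_name_contents
    return csv.DictReader(io.StringIO(text, newline=""), fieldnames=FIELDS)

records = list(parse_whole_file(("animals.csv", contents)))
print(len(lines))    # 3 physical lines
print(len(records))  # 2 logical records
```

Because the reader sees the whole file, it can keep the quoted newline inside the field instead of treating it as a record boundary. The trade-off is that each file is read as a single string and parsed by one task, so this suits many moderately sized files better than one very large file.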