PySpark HDFS Data Operations
Source: Internet | Published by: 车范根数据 | Editor: 程序博客网 | Date: 2024/06/02 23:48
References:
1. http://spark.apache.org/docs/1.2.0/api/python/pyspark.html
2. http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
I. SparkContext API
1. Reading HDFS data into NumPy arrays
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from pyspark import SparkContext, SparkConf
import numpy as np
import pickle

dirPath = 'hdfs://xxx/user/root/data_16/11/labels/part-00199'  # note: this data is in pickle format
sc = SparkContext(conf=SparkConf().setAppName("The first example"))
# textFiles = sc.textFile(dirPath)
textFiles = sc.pickleFile(dirPath)
data = textFiles.collect()
# print(data[:5])
print(type(data))        # <type 'list'>
print(data[0].dtype)     # float16
data = np.array(data, np.float32)  # convert to a NumPy array
np.save('123.npy', data)           # save the data locally
np.load('123.npy')                 # load the data
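Under the hood, pickleFile reads records that were serialized with Python's pickle module. The round-trip can be sketched locally with the standard library alone (no Spark required; the part-file name below is illustrative):

```python
import pickle
import tempfile
import os

# Write a list of records the way a pickle-based sink would, then read
# them back -- this mirrors the save/load round-trip that
# saveAsPickleFile / pickleFile perform against HDFS.
records = [[0.5, 1.5], [2.5, 3.5]]

path = os.path.join(tempfile.mkdtemp(), "part-00000")  # illustrative name
with open(path, "wb") as f:
    pickle.dump(records, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == records)  # True
```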
2. wholeTextFiles: read all files in a directory (local or HDFS)
wholeTextFiles(path, minPartitions=None, use_unicode=True)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from pyspark import SparkContext, SparkConf
import os
# from pyspark.context import SparkContext
# from pyspark.conf import SparkConf
# from pyspark.sql import DataFrame, SQLContext

sc = SparkContext(conf=SparkConf().setAppName("The first example"))
dirPath = os.path.join('./', "files")  # dirPath can also be a path on HDFS
os.mkdir(dirPath)
with open(os.path.join(dirPath, "1.txt"), "w") as file1:
    file1.write("10")
with open(os.path.join(dirPath, "2.txt"), "w") as file2:
    file2.write("20")
textFiles = sc.wholeTextFiles(dirPath)
# sorted(textFiles.collect())
print(type(textFiles))            # <class 'pyspark.rdd.RDD'>
print(textFiles.collect())
print(type(textFiles.collect()))  # list
# [(u'.../1.txt', u'10'), (u'.../2.txt', u'20')]
print(len(textFiles.collect()))   # 2
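The record shape wholeTextFiles produces, one (path, content) pair per file, can be approximated locally with the standard library (a sketch only, no Spark; the helper name is made up):

```python
import os
import tempfile

def read_dir_as_pairs(dir_path):
    """Return (file_path, file_content) pairs -- roughly the record
    shape that wholeTextFiles yields for a directory."""
    pairs = []
    for name in sorted(os.listdir(dir_path)):
        full = os.path.join(dir_path, name)
        with open(full) as f:
            pairs.append((full, f.read()))
    return pairs

d = tempfile.mkdtemp()
with open(os.path.join(d, "1.txt"), "w") as f1:
    f1.write("10")
with open(os.path.join(d, "2.txt"), "w") as f2:
    f2.write("20")

pairs = read_dir_as_pairs(d)
print([(os.path.basename(p), c) for p, c in pairs])
# [('1.txt', '10'), ('2.txt', '20')]
```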
3. addFile(path)
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
To access the file in Spark jobs, use SparkFiles.get(fileName) with the filename to find its download location.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from pyspark import SparkFiles
from pyspark import SparkContext, SparkConf
import os

sc = SparkContext(conf=SparkConf().setAppName("The first example"))
path = os.path.join('./', "test.txt")  # can also be an HDFS path
with open(path, "w") as testFile:
    testFile.write("100")
sc.addFile(path)  # add a file to be downloaded with this Spark job on every node

def func(iterator):
    with open(SparkFiles.get("test.txt")) as testFile:  # SparkFiles.get(filename)
        fileVal = int(testFile.readline())
    return [x * fileVal for x in iterator]

print(sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect())
# [100, 200, 300, 400]
Run: spark-submit test2.py
4. addPyFile(path)
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
import pyspark_csv as pycsv
sc.addPyFile('pyspark_csv.py')
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
import pyspark_csv as pycsv
import os

sc = SparkContext(conf=SparkConf().setAppName("The first example"))
sc.addPyFile('pyspark_csv.py')
# print(SparkFiles.get("pyspark_csv.py"))  # returns the absolute path of the file
os.popen("python " + SparkFiles.get("pyspark_csv.py"))  # run the script
5. binaryFiles(path, minPartitions=None)
:: Experimental
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file.
Note: small files are preferred; large files are also allowed, but may cause bad performance.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from pyspark import SparkFiles
from pyspark import SparkContext, SparkConf
import os

sc = SparkContext(conf=SparkConf().setAppName("The first example"))
dirPath = 'hdfs://xxx/user/root/data_16/11/labels/part-00199'
data = sc.binaryFiles(dirPath)  # read a directory of binary files from HDFS
print(data)  # org.apache.spark.api.java.JavaPairRDD@27a22ddc
6. clearFiles()
Clear the job’s list of files added by addFile or addPyFile so that they do not get downloaded to any new nodes.
II. RDD API
1. Saving files
saveAsPickleFile(path, batchSize=10)
Save this RDD as a SequenceFile of serialized objects. The serializer used is pyspark.serializers.PickleSerializer; the default batch size is 10.
>>> tmpFile = NamedTemporaryFile(delete=True)
>>> tmpFile.close()
>>> sc.parallelize([1, 2, 'spark', 'rdd']).saveAsPickleFile(tmpFile.name, 3)
>>> sorted(sc.pickleFile(tmpFile.name, 5).collect())
[1, 2, 'rdd', 'spark']
saveAsTextFile(path)
Save this RDD as a text file, using string representations of elements.
Empty lines are tolerated when saving to text files.
III. SparkFiles
Resolves paths to files added through SparkContext.addFile().
get(filename): get the absolute path of a file added through SparkContext.addFile().
getRootDirectory(): get the root directory that contains files added through SparkContext.addFile().
IV. DataFrameReader
csv
>>> df = spark.read.csv('python/test_support/sql/ages.csv')
>>> df.dtypes
[('_c0', 'string'), ('_c1', 'string')]
format(source)
Specifies the input data source format.
json
Loads JSON files and returns the results as a DataFrame.
>>> df1 = spark.read.json('python/test_support/sql/people.json')
>>> df1.dtypes
[('age', 'bigint'), ('name', 'string')]
>>> rdd = sc.textFile('python/test_support/sql/people.json')
>>> df2 = spark.read.json(rdd)
>>> df2.dtypes
[('age', 'bigint'), ('name', 'string')]
load(path=None, format=None, schema=None, **options)
orc(path)
Loads ORC files, returning the result as a DataFrame.
Note: currently ORC support is only available together with Hive support.
parquet(*paths)
Loads Parquet files, returning the result as a DataFrame.
>>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
>>> df.dtypes
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
text(paths)
>>> df = spark.read.text('python/test_support/sql/text-test.txt')
>>> df.collect()
[Row(value=u'hello'), Row(value=u'this')]
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from pyspark import SparkFiles
from pyspark import SparkContext, SparkConf
from pyspark.sql import DataFrame, SQLContext, DataFrameReader
import os
from pyspark.sql import SparkSession

# sc = SparkContext(conf=SparkConf().setAppName("The first example"))
path = os.path.join('./', "dna_seq.txt")  # can also be an HDFS path
spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
df = spark.read.text(path)
# spark.read.json("hdfs://localhost:9000/testdata/person.json")
# spark.read.csv()
print(type(df))  # <class 'pyspark.sql.dataframe.DataFrame'>
V. DataFrameWriter
Use DataFrame.write() to access this.
csv
>>> df.write.csv(os.path.join(tempfile.mkdtemp(), 'data'))
format(source)
Specifies the underlying output data source.
json
>>> df.write.json(os.path.join(tempfile.mkdtemp(), 'data'))
mode(saveMode)
Specifies the behavior when data or table already exists. Options include:
- append: Append contents of this DataFrame to existing data.
- overwrite: Overwrite existing data.
- error: Throw an exception if data already exists.
- ignore: Silently ignore this operation if data already exists.
save(path=None, format=None, mode=None, partitionBy=None, **options)
>>> df.write.mode('append').parquet(os.path.join(tempfile.mkdtemp(), 'data'))
text(path, compression=None)
Saves the content of the DataFrame in a text file at the specified path.
VI. DataStreamReader
Use spark.readStream() to access this.
csv
>>> csv_sdf = spark.readStream.csv(tempfile.mkdtemp(), schema=sdf_schema)
>>> csv_sdf.isStreaming
True
>>> csv_sdf.schema == sdf_schema
True
format(source)
Specifies the input data source format.
Note: Evolving.
json
>>> json_sdf = spark.readStream.json(tempfile.mkdtemp(), schema=sdf_schema)
>>> json_sdf.isStreaming
True
>>> json_sdf.schema == sdf_schema
True
load(path=None, format=None, schema=None, **options)
>>> json_sdf = spark.readStream.format("json") \
...     .schema(sdf_schema) \
...     .load(tempfile.mkdtemp())
>>> json_sdf.isStreaming
True
>>> json_sdf.schema == sdf_schema
True
text(path)
>>> text_sdf = spark.readStream.text(tempfile.mkdtemp())
>>> text_sdf.isStreaming
True
>>> "value" in str(text_sdf.schema)
True
VII. Appendix: Hadoop file operation commands
hdfs dfs -ls                    # list a directory
hdfs dfs -ls xxx/ | wc -l       # count the files and folders under xxx
hdfs dfs -mkdir xxx             # create a directory
hdfs dfs -rm -r xxx             # delete a file or directory
hdfs dfs -put xxx data          # upload xxx to the data directory on HDFS
hdfs dfs -get xxx ./            # copy xxx (file or folder) from HDFS to local
yarn application -kill application_1502181070712_0574  # kill an application
spark-submit test.py            # run the script test.py