Spark Notes 11_10_2017
1 Reading data in Spark
spark.read.csv(path, ...) accepts the following options:
- path: location of files. As with other Spark data sources, standard Hadoop globbing expressions are accepted.
- header: when set to true, the first line of the files is used to name the columns and is not included in the data. All column types are assumed to be string. Default value is false.
- delimiter: by default columns are delimited using `,`, but the delimiter can be set to any character.
- quote: by default the quote character is `"`, but it can be set to any character. Delimiters inside quotes are ignored.
- escape: by default the escape character is `\`, but it can be set to any character. Escaped quote characters are ignored.
- parserLib: by default it is "commons"; it can be set to "univocity" to use that library for CSV parsing.
- mode: determines the parsing mode. By default it is PERMISSIVE. Possible values are:
- PERMISSIVE: tries to parse all lines; nulls are inserted for missing tokens and extra tokens are ignored.
- DROPMALFORMED: drops lines that have fewer or more tokens than expected, or tokens that do not match the schema.
- FAILFAST: aborts with a RuntimeException if any malformed line is encountered.
- charset: defaults to 'UTF-8' but can be set to other valid charset names.
- inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default.
- comment: skip lines beginning with this character. Default is "#". Disable comments by setting this to null.
- nullValue: specifies a string that indicates a null value; any field matching this string will be set to null in the DataFrame.
- dateFormat: specifies a string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default it is null, which means dates and timestamps are parsed with java.sql.Timestamp.valueOf() and java.sql.Date.valueOf().
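The three parsing modes above can be illustrated without a Spark cluster. Below is a minimal pure-Python sketch (not Spark's actual implementation) that mimics how each mode treats a malformed row against an assumed fixed two-column schema; the sample rows and the `parse` helper are hypothetical:

```python
import csv
import io

def parse(text, num_cols, mode="PERMISSIVE"):
    """Mimic spark-csv parsing modes for a schema with num_cols columns."""
    out = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) == num_cols:
            out.append(row)
        elif mode == "PERMISSIVE":
            # insert nulls for missing tokens, ignore extra tokens
            out.append((row + [None] * num_cols)[:num_cols])
        elif mode == "DROPMALFORMED":
            continue  # silently drop the malformed line
        elif mode == "FAILFAST":
            raise RuntimeError("Malformed line: %r" % (row,))
    return out

data = "2012,Tesla\n2013\n2014,Ford,extra\n"
print(parse(data, 2))                   # pads / truncates malformed rows
print(parse(data, 2, "DROPMALFORMED"))  # keeps only well-formed rows
```

Under PERMISSIVE the short row gains a null and the long row loses its extra token; under DROPMALFORMED only the well-formed first row survives; FAILFAST raises on the first bad row.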
Spark 1.4+:
Automatically infer the schema (data types); otherwise everything is assumed to be string:

```python
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('cars.csv')
df.select('year', 'model').write \
    .format('com.databricks.spark.csv') \
    .save('newcars.csv')
```
You can manually specify schema:
```python
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)
customSchema = StructType([
    StructField("year", IntegerType(), True),
    StructField("make", StringType(), True),
    StructField("model", StringType(), True),
    StructField("comment", StringType(), True),
    StructField("blank", StringType(), True)])

df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .load('cars.csv', schema=customSchema)

df.select('year', 'model').write \
    .format('com.databricks.spark.csv') \
    .save('newcars.csv')
```
You can save with compressed output:
```python
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('cars.csv')
df.select('year', 'model').write \
    .format('com.databricks.spark.csv') \
    .options(codec="org.apache.hadoop.io.compress.GzipCodec") \
    .save('newcars.csv')
```
Programmatically Specifying the Schema
When a dictionary of kwargs cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.
Create an RDD of tuples or lists from the original RDD;
Create the schema, represented by a StructType, matching the structure of the tuples or lists in the RDD created in step 1;
Apply the schema to the RDD via the createDataFrame method provided by SparkSession.
For example:
```python
# Import data types
from pyspark.sql.types import *

sc = spark.sparkContext

# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True)
          for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)

# Creates a temporary view using the DataFrame.
schemaPeople.createOrReplaceTempView("people")

# SQL can be run over DataFrames that have been registered as a table.
results = spark.sql("SELECT name FROM people")
results.show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+
```
Reference: https://github.com/databricks/spark-csv