Spark: Parsing Multi-line Cell Values When Reading CSV

Source: Internet · Editor: 程序博客网 · Published: 2024/06/07 23:22

Sample CSV data

[hadoop@ip-10-0-52-52 ~]$ cat test.csv
id,name,address
1,zhang san,china shanghai
2,li si,"china
beijing"
3,tom,china shanghai

Reading CSV with Spark versions before 2.2

These versions parse the file incorrectly: the quoted cell containing a newline is split into two records.

scala> val df1 = spark.read.option("header", true).csv("file:///home/hadoop/test.csv")
df1: org.apache.spark.sql.DataFrame = [id: string, name: string ... 1 more field]

scala> df1.count
res4: Long = 4

scala> df1.show
+--------+---------+--------------+
|      id|     name|       address|
+--------+---------+--------------+
|       1|zhang san|china shanghai|
|       2|    li si|         china|
|beijing"|     null|          null|
|       3|      tom|china shanghai|
+--------+---------+--------------+
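The failure mode above can be reproduced locally without Spark: splitting the raw file on newlines (effectively what pre-2.2 Spark did when distributing line splits) tears the quoted cell apart, while a quote-aware CSV parser keeps it intact. A minimal sketch, with the sample data inlined as a string:

```python
import csv
import io

# The sample file: row 2 has a quoted cell containing a newline.
raw = ('id,name,address\n'
       '1,zhang san,china shanghai\n'
       '2,li si,"china\nbeijing"\n'
       '3,tom,china shanghai\n')

# Naive newline splitting, dropping the header line:
naive_rows = raw.strip().split('\n')[1:]
print(len(naive_rows))   # -> 4: the quoted multi-line cell became two "rows"

# A quote-aware CSV parser keeps the multi-line cell in one record:
proper_rows = list(csv.reader(io.StringIO(raw)))[1:]
print(len(proper_rows))  # -> 3
print(proper_rows[1])    # -> ['2', 'li si', 'china\nbeijing']
```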

One workaround is to read the files as binary and parse them yourself, though this is not a good solution. For example, the following PySpark implementation:

import csv
import io

def spark_read_csv_bf(spark, path, schema=None, encoding='utf8'):
    '''
    Read CSV files via binaryFiles so that quoted multi-line cells
    survive, then parse each whole file with the csv module.
    :param spark: SparkSession (Spark 2.0+)
    :param path: csv path
    :param encoding: file encoding
    :return: DataFrame
    '''
    rdd = spark.sparkContext.binaryFiles(path).values() \
        .flatMap(lambda data: csv.DictReader(
            io.TextIOWrapper(io.BytesIO(data), encoding=encoding)))
    if schema:
        return spark.createDataFrame(rdd, schema)
    else:
        return rdd.toDF()

Reading CSV with Spark 2.2 and later

Spark 2.2 fixed this bug (the implementation is worth a look); the problem is solved by passing the multiLine option when reading. References:

[SPARK-19610][SQL] Support parsing multiline CSV files

[SPARK-20980] [SQL] Rename wholeFile to multiLine for both CSV and JSON

scala> val df2 = spark.read.option("header", true).option("multiLine", true).csv("file:///home/hadoop/test.csv")
df2: org.apache.spark.sql.DataFrame = [id: string, name: string ... 1 more field]

scala> df2.count
res6: Long = 3

scala> df2.show
+---+---------+--------------+
| id|     name|       address|
+---+---------+--------------+
|  1|zhang san|china shanghai|
|  2|    li si|  chinabeijing|
|  3|      tom|china shanghai|
+---+---------+--------------+
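The same fix applies in PySpark, where the multiLine option is passed identically. A minimal sketch (the file path matches the sample above; `spark` is an existing SparkSession):

```python
# Requires Spark 2.2+; multiLine tells the CSV parser to honor
# quoted newlines instead of treating every line as a record.
df = (spark.read
      .option("header", True)
      .option("multiLine", True)
      .csv("file:///home/hadoop/test.csv"))
df.count()  # 3 rows, matching the Scala result above
```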