Reading Hadoop Parquet Files

Source: Internet  Editor: 程序博客网  Time: 2024/06/01 07:14

Generating Parquet data

Here Spark SQL is used to read data from a CSV file and then save that data to a Parquet file.

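    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    // Spark 1.x API; the CSV source comes from the spark-csv package.
    SparkContext context = new SparkContext(
            new SparkConf().setMaster("local").setAppName("parquet"));
    SQLContext sqlContext = new SQLContext(context);
    DataFrame dataFrame = sqlContext.read().format("com.databricks.spark.csv")
            .option("header", "true")      // "true" if the first CSV line holds the column names, otherwise "false"
            .option("inferSchema", "true") // infer the data type of each column automatically
            .load("/home/lake/world.csv");
    dataFrame.write().parquet("/home/lake/parquetfile");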

The code first reads the data from the CSV file and then writes it to a Parquet file. The CSV data is shown below; you can also define your own.

    name,age,sex
    shenlei,19,man
    shenlei,19,man
    shenlei,19,man

Reading the schema information of a Parquet file

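    // `path` is an org.apache.hadoop.fs.Path pointing at a single Parquet file.
    Configuration config = new Configuration();
    // The file footer holds the metadata, including key/value pairs added by the writer.
    ParquetMetadata readFooter = ParquetFileReader.readFooter(config, path);
    Map<String, String> schema = readFooter.getFileMetaData().getKeyValueMetaData();
    // Spark stores its own schema description under this key.
    String allFields = schema.get("org.apache.spark.sql.parquet.row.metadata");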

The value of allFields holds the name and type of every field, laid out as a single JSON document. Its content is:

{"type":"struct","fields":[{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"age","type":"integer","nullable":true,"metadata":{}}]}
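Because this metadata value is plain JSON, the field names and types can be extracted from it. Below is a minimal sketch that uses only the Java standard library and a regular expression. The class name `SchemaFields` and the method `parseFields` are hypothetical, and the pattern assumes the flat layout shown above (no nested structs, `name` immediately followed by `type`). A real application would more likely hand the string to a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Parquet or Spark.
class SchemaFields {
    // Matches {"name":"<name>","type":"<type>" for flat fields with simple string types.
    private static final Pattern FIELD =
            Pattern.compile("\\{\"name\":\"([^\"]+)\",\"type\":\"([^\"]+)\"");

    // Returns field name -> type name, in the order the fields appear in the JSON.
    static Map<String, String> parseFields(String schemaJson) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(schemaJson);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String json = "{\"type\":\"struct\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},"
                + "{\"name\":\"age\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}";
        for (Map.Entry<String, String> e : parseFields(json).entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```

A `LinkedHashMap` is used so that the fields come back in the same order as the columns in the file.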

Reading the data in a Parquet file

Below, each row is formatted as JSON and printed.

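    import java.util.List;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.format.converter.ParquetMetadataConverter;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.Type;

    // The original snippet starts mid-method; this method name and signature are an assumed reconstruction.
    // Prints up to the first 10 rows of the Parquet file at `path`, one JSON object per line.
    static void printRows(FileSystem fs, Path path) {
        try (ParquetReader<Group> reader =
                     ParquetReader.builder(new GroupReadSupport(), path)
                             .withConf(fs.getConf()).build()) {
            // The footer carries the schema, from which the column names are taken.
            ParquetMetadata readFooter = ParquetFileReader.readFooter(
                    fs.getConf(), path, ParquetMetadataConverter.NO_FILTER);
            MessageType schema = readFooter.getFileMetaData().getSchema();
            List<Type> columnInfos = schema.getFields();
            int count = 0;
            Group recordData = reader.read();
            while (count < 10 && recordData != null) {
                StringBuilder builder = new StringBuilder("{");
                for (int j = 0; j < columnInfos.size(); j++) {
                    if (j > 0) {
                        builder.append(",");
                    }
                    String columnName = columnInfos.get(j).getName();
                    // Index 0 reads the first value; assumes the column has a value in this row.
                    String value = recordData.getValueToString(j, 0);
                    builder.append("\"" + columnName + "\":\"" + value + "\"");
                }
                builder.append("}");
                System.out.println(builder.toString());
                count++;
                recordData = reader.read();
            }
        } catch (Exception e) {
            e.printStackTrace(); // report the failure instead of silently swallowing it
        }
    }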
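Concatenating raw values into a JSON string produces invalid output as soon as a value contains a double quote or a backslash. The sketch below shows one way to build a row with escaping, using only the Java standard library. `JsonRow`, `escape` and `toJson` are hypothetical names, every value is emitted as a string like in the code above, and only quotes, backslashes and control characters are escaped. It is meant as an illustration of the idea, not a full JSON writer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Parquet.
class JsonRow {
    // Escapes quotes, backslashes and control characters for use inside a JSON string.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Builds one JSON object from parallel lists; names and values must have the same size.
    static String toJson(List<String> names, List<String> values) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(escape(names.get(i))).append("\":\"")
              .append(escape(values.get(i))).append('"');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson(Arrays.asList("name", "age", "sex"),
                                  Arrays.asList("shenlei", "19", "man")));
    }
}
```

In the reading loop, the column names from the schema and the strings returned by `getValueToString` could be collected into two lists and passed to `toJson` in place of the manual `StringBuilder` concatenation.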