spark-parquet columnar storage: file write details: RowWriteSupport and RecordWriter
RowWriteSupport
RowWriteSupport extends WriteSupport (defined in the parquet-hadoop jar) and supplies the WriteContext used when writing data.
Fields:
writer: RecordConsumer
The consumer of the record data, responsible for actually writing values; it has three subclasses (the write protocol they share is sketched below):
ValidatingRecordConsumer: validates each call against the schema, then forwards it to the next consumer.
RecordConsumerLoggingWrapper: logs each call, then forwards it to the next consumer.
MessageColumnIORecordConsumer: does the actual writing of the data.
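All three consumers share RecordConsumer's event-style write protocol (startMessage/startField/add*/endField/endMessage). Below is a minimal sketch of that protocol for one row of a hypothetical two-column schema; the helper writeOneRow and the column names are illustrative only, not Spark code, and `consumer` stands for whichever implementation heads the chain.

import parquet.io.api.{Binary, RecordConsumer}  // org.apache.parquet.io.api in newer Parquet releases

// Hypothetical helper: emit one row of schema { required int32 id; optional binary name (UTF8) }.
def writeOneRow(consumer: RecordConsumer, id: Int, name: String): Unit = {
  consumer.startMessage()

  consumer.startField("id", 0)            // field name and its index within the schema
  consumer.addInteger(id)
  consumer.endField("id", 0)

  if (name != null) {                     // optional field: simply skip it when the value is null
    consumer.startField("name", 1)
    consumer.addBinary(Binary.fromString(name))
    consumer.endField("name", 1)
  }

  consumer.endMessage()
}

ValidatingRecordConsumer and RecordConsumerLoggingWrapper are decorators: each takes another RecordConsumer as its delegate, so they can be stacked in front of the MessageColumnIORecordConsumer that finally hits the column writers.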
attributes: Seq[Attribute]
Carries the schema information: field name, data type, and nullability (isNullable).
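As a rough illustration (not taken from the Spark source, and with imports that vary between 1.x releases), such an attribute sequence could be built like this:

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.types.{IntegerType, StringType}  // org.apache.spark.sql.types in later releases

// Two columns: a non-nullable Int and a nullable String.
val attributes = Seq(
  AttributeReference("id", IntegerType, nullable = false)(),
  AttributeReference("name", StringType, nullable = true)()
)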
Methods:
Initialization:
init(configuration: Configuration): WriteSupport.WriteContext
val origAttributesStr: String = configuration.get(RowWriteSupport.SPARK_ROW_SCHEMA)
val metadata = new JHashMap[String, String]()
metadata.put(RowReadSupport.SPARK_METADATA_KEY, origAttributesStr)
if (attributes == null) {
  attributes = ParquetTypesConverter.convertFromString(origAttributesStr)
}
log.debug(s"write support initialized for requested schema $attributes")
ParquetRelation.enableLogForwarding()
new WriteSupport.WriteContext(ParquetTypesConverter.convertFromAttributes(attributes), metadata)
It reads the schema of the data and hands it to the WriteContext.
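A minimal sketch of exercising init() by hand, assuming the schema string is placed under RowWriteSupport.SPARK_ROW_SCHEMA (the key the code above reads) and that ParquetTypesConverter.convertToString is the inverse of the convertFromString call used in init:

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.set(RowWriteSupport.SPARK_ROW_SCHEMA, ParquetTypesConverter.convertToString(attributes))

val writeSupport = new RowWriteSupport()
val writeContext = writeSupport.init(conf)
// writeContext.getSchema        -> the MessageType built by convertFromAttributes
// writeContext.getExtraMetaData -> carries the original attribute string under SPARK_METADATA_KEY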
The detailed process of obtaining the ParquetTypeInfo for each data type is as follows:
def fromPrimitiveDataType(ctype: DataType): Option[ParquetTypeInfo] = ctype match {
  case StringType => Some(ParquetTypeInfo(
    ParquetPrimitiveTypeName.BINARY, Some(ParquetOriginalType.UTF8)))
  case BinaryType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.BINARY))
  case BooleanType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.BOOLEAN))
  case DoubleType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.DOUBLE))
  case FloatType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FLOAT))
  case IntegerType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
  // There is no type for Byte or Short so we promote them to INT32.
  case ShortType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
  case ByteType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
  case LongType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
  case DecimalType.Fixed(precision, scale) if precision <= 18 =>
    // TODO: for now, our writer only supports decimals that fit in a Long
    Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
      Some(ParquetOriginalType.DECIMAL),
      Some(new DecimalMetadata(precision, scale)),
      Some(BYTES_FOR_PRECISION(precision))))
  case _ => None
}
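The decimal case sizes its FIXED_LEN_BYTE_ARRAY with BYTES_FOR_PRECISION(precision). A sketch of that computation (not the verbatim Spark source): pick the smallest byte count whose signed two's-complement range can hold any unscaled value with the given number of decimal digits.

// Smallest fixed length (in bytes) able to hold an unscaled decimal of each precision 0..18.
val BYTES_FOR_PRECISION: Array[Int] = Array.tabulate(19) { precision =>
  var length = 1
  while (math.pow(2.0, 8 * length - 1) < math.pow(10.0, precision)) {
    length += 1
  }
  length
}

// For example: BYTES_FOR_PRECISION(9) == 4, BYTES_FOR_PRECISION(18) == 8.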
The resulting Parquet field types are then wrapped into a MessageType:
def convertFromAttributes(attributes: Seq[Attribute]): MessageType = {
  val fields = attributes.map(
    attribute => fromDataType(attribute.dataType, attribute.name, attribute.nullable))
  new MessageType("root", fields)
}
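For the two-column attribute sequence sketched earlier, the resulting MessageType would print roughly as the Parquet schema below (nullable maps to optional, non-nullable to required):

val messageType = ParquetTypesConverter.convertFromAttributes(attributes)
println(messageType)
// message root {
//   required int32 id;
//   optional binary name (UTF8);
// }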
RecordWriter