Spark Parquet Columnar Storage: File Storage Details: RowWriteSupport and RecordWriter

Source: Internet | Editor: 程序博客网 | Date: 2024/05/21 05:40

RowWriteSupport

RowWriteSupport extends WriteSupport (located in the parquet-hadoop jar) and supplies the WriteContext used when writing data.

Fields:

writer: RecordConsumer

The consumer of the data, responsible for writing it out. It has three subclasses:
ValidatingRecordConsumer, which validates each operation and then passes it on to the next consumer in the chain
RecordConsumerLoggingWrapper, which logs each operation and then passes it on to the next consumer in the chain
MessageColumnIORecordConsumer, which actually writes the data
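The first two subclasses above form a decorator chain: each one performs its own task and then delegates to the next consumer. A minimal sketch of that pattern (the trait and class names below are simplified stand-ins, not the actual parquet API):

```scala
// Simplified stand-in for parquet's RecordConsumer interface (illustration only).
trait SimpleRecordConsumer {
  def addInteger(value: Int): Unit
}

// Terminal consumer: actually "writes" the data (here, into a buffer),
// playing the role of MessageColumnIORecordConsumer.
class WritingConsumer extends SimpleRecordConsumer {
  val written = scala.collection.mutable.ArrayBuffer[Int]()
  def addInteger(value: Int): Unit = written += value
}

// Decorator: validates, then delegates -- like ValidatingRecordConsumer.
class ValidatingConsumer(next: SimpleRecordConsumer) extends SimpleRecordConsumer {
  def addInteger(value: Int): Unit = {
    require(value >= 0, s"negative value not allowed: $value")
    next.addInteger(value)
  }
}

// Decorator: logs, then delegates -- like RecordConsumerLoggingWrapper.
class LoggingConsumer(next: SimpleRecordConsumer) extends SimpleRecordConsumer {
  def addInteger(value: Int): Unit = {
    println(s"addInteger($value)")
    next.addInteger(value)
  }
}

val sink = new WritingConsumer
val chain: SimpleRecordConsumer = new ValidatingConsumer(new LoggingConsumer(sink))
chain.addInteger(42) // validated, logged, then written by the terminal consumer
```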

attributes: Seq[Attribute]

Holds the schema information: field name (FieldName), data type (DataType), and nullability (isNullable).

Functions:

Initialization:

init(configuration: Configuration): WriteSupport.WriteContext


  val origAttributesStr: String = configuration.get(RowWriteSupport.SPARK_ROW_SCHEMA)
  val metadata = new JHashMap[String, String]()
  metadata.put(RowReadSupport.SPARK_METADATA_KEY, origAttributesStr)
  if (attributes == null) {
    attributes = ParquetTypesConverter.convertFromString(origAttributesStr)
  }
  log.debug(s"write support initialized for requested schema $attributes")
  ParquetRelation.enableLogForwarding()
  new WriteSupport.WriteContext(
    ParquetTypesConverter.convertFromAttributes(attributes), metadata)

The schema information is read from the configuration and passed to the WriteContext.
The detailed process of deriving the ParquetTypeInfo is as follows:
  def fromPrimitiveDataType(ctype: DataType): Option[ParquetTypeInfo] = ctype match {
    case StringType => Some(ParquetTypeInfo(
      ParquetPrimitiveTypeName.BINARY, Some(ParquetOriginalType.UTF8)))
    case BinaryType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.BINARY))
    case BooleanType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.BOOLEAN))
    case DoubleType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.DOUBLE))
    case FloatType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FLOAT))
    case IntegerType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
    // There is no type for Byte or Short so we promote them to INT32.
    case ShortType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
    case ByteType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
    case LongType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
    case DecimalType.Fixed(precision, scale) if precision <= 18 =>
      // TODO: for now, our writer only supports decimals that fit in a Long
      Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
        Some(ParquetOriginalType.DECIMAL),
        Some(new DecimalMetadata(precision, scale)),
        Some(BYTES_FOR_PRECISION(precision))))
    case _ => None
  }
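In the Decimal case, BYTES_FOR_PRECISION picks the width of the FIXED_LEN_BYTE_ARRAY: the smallest number of bytes whose signed two's-complement range can hold any unscaled value at the given precision. A rough sketch of that computation (my reconstruction, not Spark's exact code):

```scala
// Smallest numBytes such that a signed numBytes-byte integer can represent
// every unscaled decimal value with the given precision, i.e.
// 2^(8*numBytes - 1) - 1 >= 10^precision - 1.
def bytesForPrecision(precision: Int): Int = {
  var numBytes = 1
  while (BigInt(2).pow(8 * numBytes - 1) - 1 < BigInt(10).pow(precision) - 1) {
    numBytes += 1
  }
  numBytes
}
```

For example, precision 18 (the writer's current upper bound) needs 8 bytes, which is exactly why such decimals fit in a Long.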

The ParquetTypeInfo is then wrapped into a MessageType:
  def convertFromAttributes(attributes: Seq[Attribute]): MessageType = {
    val fields = attributes.map(
      attribute =>
        fromDataType(attribute.dataType, attribute.name, attribute.nullable))
    new MessageType("root", fields)
  }
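The conversion above can be illustrated with simplified stand-ins: each Catalyst attribute becomes one field under a single "root" message (the case classes below are invented for illustration; the real code uses org.apache.parquet.schema.MessageType):

```scala
// Simplified stand-ins for Catalyst attributes and Parquet schema types.
sealed trait SqlType
case object IntType extends SqlType
case object StrType extends SqlType

case class Attr(name: String, dataType: SqlType, nullable: Boolean)

// A Parquet field: primitive type name plus optional (nullable) vs required.
case class Field(name: String, primitive: String, optional: Boolean)
case class Message(name: String, fields: Seq[Field])

// Toy version of fromPrimitiveDataType: map a SQL type to a Parquet primitive.
def fromDataType(t: SqlType): String = t match {
  case IntType => "INT32"
  case StrType => "BINARY (UTF8)"
}

// Mirrors convertFromAttributes: one field per attribute, grouped under "root".
def convertAttrs(attrs: Seq[Attr]): Message =
  Message("root", attrs.map(a => Field(a.name, fromDataType(a.dataType), a.nullable)))

val schema = convertAttrs(Seq(Attr("id", IntType, false), Attr("name", StrType, true)))
```

Nullability maps to Parquet's required/optional repetition, which is why attribute.nullable is threaded through alongside the name and data type.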




RecordWriter
