spark-parquet columnar storage: file write details: RowWriteSupport and RecordWriter
RowWriteSupport
RowWriteSupport extends WriteSupport (defined in the parquet-hadoop jar) and supplies the WriteContext used when writing data.
Fields:
writer: RecordConsumer
The consumer of the record data, responsible for actually writing values; it has three subclasses (the write protocol they share is sketched below):
ValidatingRecordConsumer: validates each call against the schema, then forwards it to the next consumer.
RecordConsumerLoggingWrapper: logs each call, then forwards it to the next consumer.
MessageColumnIORecordConsumer: does the actual writing of the data.
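All three consumers share RecordConsumer's event-style write protocol (startMessage/startField/add*/endField/endMessage). Below is a minimal sketch of that protocol for one row of a hypothetical two-column schema; the helper writeOneRow and the column names are illustrative only, not Spark code, and `consumer` stands for whichever implementation heads the chain.

import parquet.io.api.{Binary, RecordConsumer}  // org.apache.parquet.io.api in newer Parquet releases

// Hypothetical helper: emit one row of schema { required int32 id; optional binary name (UTF8) }.
def writeOneRow(consumer: RecordConsumer, id: Int, name: String): Unit = {
  consumer.startMessage()

  consumer.startField("id", 0)            // field name and its index within the schema
  consumer.addInteger(id)
  consumer.endField("id", 0)

  if (name != null) {                     // optional field: simply skip it when the value is null
    consumer.startField("name", 1)
    consumer.addBinary(Binary.fromString(name))
    consumer.endField("name", 1)
  }

  consumer.endMessage()
}

ValidatingRecordConsumer and RecordConsumerLoggingWrapper are decorators: each takes another RecordConsumer as its delegate, so they can be stacked in front of the MessageColumnIORecordConsumer that finally hits the column writers.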
attributes: Seq[Attribute]
Carries the schema information: field name, data type, and nullability (isNullable).
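As a rough illustration (not taken from the Spark source, and with imports that vary between 1.x releases), such an attribute sequence could be built like this:

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.types.{IntegerType, StringType}  // org.apache.spark.sql.types in later releases

// Two columns: a non-nullable Int and a nullable String.
val attributes = Seq(
  AttributeReference("id", IntegerType, nullable = false)(),
  AttributeReference("name", StringType, nullable = true)()
)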
Methods:
Initialization:
init(configuration: Configuration): WriteSupport.WriteContext
val origAttributesStr: String = configuration.get(RowWriteSupport.SPARK_ROW_SCHEMA)
val metadata = new JHashMap[String, String]()
metadata.put(RowReadSupport.SPARK_METADATA_KEY, origAttributesStr)
if (attributes == null) {
  attributes = ParquetTypesConverter.convertFromString(origAttributesStr)
}
log.debug(s"write support initialized for requested schema $attributes")
ParquetRelation.enableLogForwarding()
new WriteSupport.WriteContext(ParquetTypesConverter.convertFromAttributes(attributes), metadata)
It reads the schema of the data and hands it to the WriteContext.
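A minimal sketch of exercising init() by hand, assuming the schema string is placed under RowWriteSupport.SPARK_ROW_SCHEMA (the key the code above reads) and that ParquetTypesConverter.convertToString is the inverse of the convertFromString call used in init:

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.set(RowWriteSupport.SPARK_ROW_SCHEMA, ParquetTypesConverter.convertToString(attributes))

val writeSupport = new RowWriteSupport()
val writeContext = writeSupport.init(conf)
// writeContext.getSchema        -> the MessageType built by convertFromAttributes
// writeContext.getExtraMetaData -> carries the original attribute string under SPARK_METADATA_KEY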
The detailed process of obtaining the ParquetTypeInfo for each data type is as follows:
def fromPrimitiveDataType(ctype: DataType): Option[ParquetTypeInfo] = ctype match {
  case StringType => Some(ParquetTypeInfo(
    ParquetPrimitiveTypeName.BINARY, Some(ParquetOriginalType.UTF8)))
  case BinaryType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.BINARY))
  case BooleanType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.BOOLEAN))
  case DoubleType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.DOUBLE))
  case FloatType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FLOAT))
  case IntegerType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
  // There is no type for Byte or Short so we promote them to INT32.
  case ShortType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
  case ByteType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
  case LongType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
  case DecimalType.Fixed(precision, scale) if precision <= 18 =>
    // TODO: for now, our writer only supports decimals that fit in a Long
    Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
      Some(ParquetOriginalType.DECIMAL),
      Some(new DecimalMetadata(precision, scale)),
      Some(BYTES_FOR_PRECISION(precision))))
  case _ => None
}
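The decimal case sizes its FIXED_LEN_BYTE_ARRAY with BYTES_FOR_PRECISION(precision). A sketch of that computation (not the verbatim Spark source): pick the smallest byte count whose signed two's-complement range can hold any unscaled value with the given number of decimal digits.

// Smallest fixed length (in bytes) able to hold an unscaled decimal of each precision 0..18.
val BYTES_FOR_PRECISION: Array[Int] = Array.tabulate(19) { precision =>
  var length = 1
  while (math.pow(2.0, 8 * length - 1) < math.pow(10.0, precision)) {
    length += 1
  }
  length
}

// For example: BYTES_FOR_PRECISION(9) == 4, BYTES_FOR_PRECISION(18) == 8.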
The resulting Parquet field types are then wrapped into a MessageType:
def convertFromAttributes(attributes: Seq[Attribute]): MessageType = {
  val fields = attributes.map(
    attribute => fromDataType(attribute.dataType, attribute.name, attribute.nullable))
  new MessageType("root", fields)
}
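For the two-column attribute sequence sketched earlier, the resulting MessageType would print roughly as the Parquet schema below (nullable maps to optional, non-nullable to required):

val messageType = ParquetTypesConverter.convertFromAttributes(attributes)
println(messageType)
// message root {
//   required int32 id;
//   optional binary name (UTF8);
// }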
RecordWriter