FileFormat — Data Sources to Read and Write Data in Files

FileFormat is the contract for data sources that read and write data stored in files.

Table 1. FileFormat Contract
Method Description

buildReader

buildReader(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

Builds a Catalyst data reader, i.e. a function that reads a single PartitionedFile as InternalRows.

buildReader throws an UnsupportedOperationException by default (and should therefore be overridden to work):

buildReader is not supported for [this]

Used exclusively when FileFormat is requested to buildReaderWithPartitionValues
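The default-throws contract can be sketched with a minimal stdlib-only model (the trait and object names below are hypothetical stand-ins, not the actual Spark classes): a base trait whose buildReader throws unless a concrete format overrides it.

```scala
// Simplified model of the FileFormat contract (names are illustrative only):
// buildReader throws by default, so concrete formats must override it.
trait SimpleFileFormat {
  def shortName: String

  // Returns a function that reads one "file" into rows (here: strings).
  def buildReader(options: Map[String, String]): String => Iterator[String] =
    throw new UnsupportedOperationException(
      s"buildReader is not supported for $shortName")
}

// A format that overrides buildReader: splits file content into lines.
object LineFormat extends SimpleFileFormat {
  val shortName = "lines"
  override def buildReader(options: Map[String, String]): String => Iterator[String] =
    content => content.linesIterator
}

// A format that keeps the default, throwing buildReader.
object BrokenFormat extends SimpleFileFormat {
  val shortName = "broken"
}
```

Calling LineFormat.buildReader yields a working reader function, while BrokenFormat.buildReader throws immediately, mirroring the error message above.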

buildReaderWithPartitionValues

buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

buildReaderWithPartitionValues builds a data reader with partition column values appended, i.e. a function that reads a single file (as a PartitionedFile) as an Iterator of InternalRows (like buildReader) with the partition values appended.

Used exclusively when FileSourceScanExec physical operator is requested for the inputRDD (when requested for the inputRDDs and execution)

inferSchema

inferSchema(
  sparkSession: SparkSession,
  options: Map[String, String],
  files: Seq[FileStatus]): Option[StructType]

Infers (returns) the schema of the given files (as Hadoop’s FileStatuses) if supported. Otherwise, None should be returned.

Used when DataSource is requested to getOrInferFileFormatSchema and resolveRelation

isSplitable

isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean

Controls whether the format (under the given path as a Hadoop Path) is splitable, i.e. whether a single file can be read in parallel by multiple tasks.

isSplitable is disabled (false) by default.

Used exclusively when FileSourceScanExec physical operator is requested to create an RDD for non-bucketed reads (when requested for the inputRDD and neither the optional bucketing specification of the HadoopFsRelation is defined nor bucketing is enabled)
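An isSplitable-style decision can be sketched as follows. The real TextBasedFileFormat asks Hadoop's CompressionCodecFactory for the codec of the path and only treats the file as splitable when there is no codec or the codec is a SplittableCompressionCodec; this stdlib-only stand-in (with an assumed extension list) keys off file extensions instead.

```scala
// Extensions of compression formats that cannot be split mid-stream
// (an illustrative list; the real check consults the Hadoop codec, not names).
val nonSplittableExtensions = Set(".gz", ".snappy", ".lz4", ".zst")

// A file is splitable here when it does not carry a non-splittable extension.
def isSplitable(path: String): Boolean =
  !nonSplittableExtensions.exists(ext => path.endsWith(ext))
```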

prepareWrite

prepareWrite(
  sparkSession: SparkSession,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory

Prepares a write job and returns an OutputWriterFactory

Used exclusively when FileFormatWriter is requested to write query result

supportBatch

supportBatch(
  sparkSession: SparkSession,
  dataSchema: StructType): Boolean

Flag that says whether the format supports columnar batch (i.e. vectorized decoding) or not.

supportBatch is off (false) by default.

Used exclusively when FileSourceScanExec physical operator is requested for the supportsBatch
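The kind of check a supportBatch override performs can be sketched with a stdlib-only model (the ColType ADT and parameter names are illustrative assumptions): ParquetFileFormat, for instance, only supports columnar batches when the relevant configuration is enabled and every column has an atomic, non-nested type.

```scala
// Illustrative column-type ADT standing in for Spark's DataType hierarchy.
sealed trait ColType
case object IntCol extends ColType
case object StringCol extends ColType
case class NestedCol(fields: Seq[ColType]) extends ColType

// Columnar batches are supported only when the feature is enabled and
// no column has a nested type (nested types disable vectorized decoding).
def supportBatch(vectorizedEnabled: Boolean, schema: Seq[ColType]): Boolean =
  vectorizedEnabled && schema.forall {
    case NestedCol(_) => false
    case _            => true
  }
```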

vectorTypes

vectorTypes(
  requiredSchema: StructType,
  partitionSchema: StructType,
  sqlConf: SQLConf): Option[Seq[String]]

Defines the fully-qualified class names (types) of the concrete ColumnVectors for every column in the input requiredSchema and partitionSchema schemas that are used in a columnar batch.

Default: undefined (None)

Used exclusively when FileSourceScanExec leaf physical operator is requested for the vectorTypes
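The shape of a vectorTypes override can be sketched like this (the function signature is a simplified stand-in, though the two ColumnVector class names are the real Spark classes that ParquetFileFormat returns): one class name per column across requiredSchema and partitionSchema, chosen by whether off-heap memory is enabled.

```scala
// Sketch of the vectorTypes pattern: one ColumnVector class name per column
// of requiredSchema ++ partitionSchema. The column counts and offHeap flag
// replace the StructType and SQLConf parameters of the real method.
def vectorTypes(requiredCols: Int, partitionCols: Int, offHeap: Boolean): Option[Seq[String]] = {
  val cls =
    if (offHeap) "org.apache.spark.sql.execution.vectorized.OffHeapColumnVector"
    else "org.apache.spark.sql.execution.vectorized.OnHeapColumnVector"
  Some(Seq.fill(requiredCols + partitionCols)(cls))
}
```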

Table 2. FileFormats (Direct Implementations and Extensions)
FileFormat Description

AvroFileFormat

Avro data source

HiveFileFormat

Writes hive tables

OrcFileFormat

ORC data source

ParquetFileFormat

Parquet data source

TextBasedFileFormat

Base class for splitable text-based FileFormats

Building Data Reader With Partition Column Values Appended — buildReaderWithPartitionValues Method

buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function from buildReader.

Internally, buildReaderWithPartitionValues builds a data reader with the input parameters and gives a data reader function (of a PartitionedFile to an Iterator[InternalRow]) that does the following:

  1. Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema

  2. Applies the data reader to a PartitionedFile and converts the result using the converter on the joined row with the partition column values appended.

Note
buildReaderWithPartitionValues is used exclusively when FileSourceScanExec physical operator is requested for the input RDDs.
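The steps above boil down to wrapping the base reader so that every row it produces gets the file's partition values appended. A stdlib-only sketch (with a simplified PartitionedFile and rows as plain Seq[Any] instead of InternalRows; plain concatenation stands in for the JoinedRow plus generated UnsafeProjection):

```scala
// A file split together with the partition column values it carries
// (simplified stand-in for Spark's PartitionedFile).
case class PartitionedFile(path: String, partitionValues: Seq[Any])

// Wraps a base reader so that each row gets the partition values appended,
// mirroring what buildReaderWithPartitionValues adds on top of buildReader.
def withPartitionValues(
    baseReader: PartitionedFile => Iterator[Seq[Any]]
): PartitionedFile => Iterator[Seq[Any]] =
  file => baseReader(file).map(row => row ++ file.partitionValues)
```

Applying the wrapped reader to a file whose partition value is, say, the partition column's value 1 yields the base rows with that value appended to each.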
