FileFormat — Data Sources to Read and Write Data in Files

FileFormat is the contract for data sources that read and write data stored in files.

Table 1. FileFormat Contract
Method Description

buildReader

buildReader(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

Builds a Catalyst data reader, i.e. a function that reads a single file (as a PartitionedFile) as InternalRows.

buildReader throws an UnsupportedOperationException by default (and should therefore be overridden for a file format to be readable):

buildReader is not supported for [this]

Used exclusively when FileFormat is requested to buildReaderWithPartitionValues

buildReaderWithPartitionValues

buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

buildReaderWithPartitionValues builds a data reader with partition column values appended, i.e. a function that reads a single file (as a PartitionedFile) as an Iterator of InternalRows (like buildReader), with the partition values appended to every row.

Used exclusively when FileSourceScanExec physical operator is requested for the inputRDD (which happens when FileSourceScanExec is requested for inputRDDs and to execute)

inferSchema

inferSchema(
  sparkSession: SparkSession,
  options: Map[String, String],
  files: Seq[FileStatus]): Option[StructType]

Infers the schema of the given files (as Hadoop FileStatuses) if supported; returns None otherwise.

Used when DataSource is requested to getOrInferFileFormatSchema (e.g. while resolving a relation for a file-based data source)

isSplitable

isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean

Controls whether the format (under the given path as a Hadoop Path) is splitable, i.e. whether a single file can be split into multiple partitions and read in parallel.

isSplitable is disabled (false) by default.

Used exclusively when FileSourceScanExec physical operator is requested to create an RDD for a non-bucketed read (when requested for the inputRDD and either the HadoopFsRelation has no bucketing specification defined or bucketing is disabled)

prepareWrite

prepareWrite(
  sparkSession: SparkSession,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory

Prepares a write job and returns an OutputWriterFactory

Used exclusively when FileFormatWriter is requested to write query result

supportBatch

supportBatch(
  sparkSession: SparkSession,
  dataSchema: StructType): Boolean

Flag that indicates whether the format supports vectorized decoding (aka columnar batch) for the given data schema.

Default: false

Used exclusively when FileSourceScanExec physical operator is requested for the supportsBatch flag

vectorTypes

vectorTypes(
  requiredSchema: StructType,
  partitionSchema: StructType,
  sqlConf: SQLConf): Option[Seq[String]]

Defines the fully-qualified class names (types) of the concrete ColumnVectors used in a columnar batch for every column in the input requiredSchema and partitionSchema.

Default: undefined (None)

Used exclusively when FileSourceScanExec leaf physical operator is requested for the vectorTypes
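
To make the contract concrete, the following is a minimal sketch of a custom file format (MyTextFileFormat, its com.example package and the ??? bodies are hypothetical placeholders, not part of Spark). Only inferSchema and prepareWrite are abstract; buildReader is overridden so the format is readable.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory, PartitionedFile}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

class MyTextFileFormat extends FileFormat {

  // No schema inference; users have to specify the schema explicitly
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = None

  // Placeholder; a real format returns an OutputWriterFactory that
  // creates one OutputWriter per write task
  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = ???

  // Overridden to avoid the default UnsupportedOperationException;
  // this stub simply produces no rows per file
  override def buildReader(
      sparkSession: SparkSession,
      dataSchema: StructType,
      partitionSchema: StructType,
      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String],
      hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] =
    (_: PartitionedFile) => Iterator.empty
}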

Table 2. FileFormats (Direct Implementations and Extensions)
FileFormat Description

AvroFileFormat

Avro data source

HiveFileFormat

Writes Hive tables

OrcFileFormat

ORC data source

ParquetFileFormat

Parquet data source

TextBasedFileFormat

Base class for splitable text-based FileFormats
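
A custom FileFormat can be used like any built-in file-based data source by its fully-qualified class name. A minimal usage sketch, assuming the hypothetical MyTextFileFormat above is on the classpath (the class name and path are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("FileFormat demo")
  .getOrCreate()

// inferSchema returns None, so the schema has to be given explicitly
val lines = spark.read
  .format("com.example.MyTextFileFormat") // hypothetical FileFormat class
  .schema(StructType(StructField("value", StringType) :: Nil))
  .load("/tmp/input") // hypothetical path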

Building Data Reader With Partition Column Values Appended — buildReaderWithPartitionValues Method

buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function from buildReader.

Internally, buildReaderWithPartitionValues builds a data reader with the input parameters and returns a data reader function (from a PartitionedFile to an Iterator[InternalRow]) that does the following (see the sketch after the note below):

  1. Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema

  2. Applies the data reader to a PartitionedFile and converts every data row using the converter, applied to the joined row of the data row and the partition column values of the file.

Note
buildReaderWithPartitionValues is used exclusively when FileSourceScanExec physical operator is requested for the input RDDs.
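
The following is a simplified sketch (not the exact Spark source) of the two steps above, assuming a hypothetical withPartitionValues helper that wraps the reader from buildReader. When the partition schema is empty there is nothing to append, so the rows from buildReader are returned unchanged.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.JoinedRow
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.types.StructType

// Hypothetical helper that mirrors buildReaderWithPartitionValues
def withPartitionValues(
    dataReader: PartitionedFile => Iterator[InternalRow],
    requiredSchema: StructType,
    partitionSchema: StructType): PartitionedFile => Iterator[InternalRow] = {
  // Step 1: an UnsafeProjection over the data and partition attributes
  val fullSchema = requiredSchema.toAttributes ++ partitionSchema.toAttributes
  val converter = GenerateUnsafeProjection.generate(fullSchema, fullSchema)
  val joinedRow = new JoinedRow()
  (file: PartitionedFile) =>
    if (partitionSchema.isEmpty) {
      // No partition columns to append
      dataReader(file)
    } else {
      // Step 2: append the partition values of the file to every data row
      dataReader(file).map { dataRow =>
        converter(joinedRow(dataRow, file.partitionValues))
      }
    }
}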
