buildReader(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
FileFormat — Data Sources to Read and Write Data In Files
FileFormat is the contract for data sources that read and write data stored in files.
Method | Description |
---|---|
| Builds a Catalyst data reader, i.e. a function that reads a PartitionedFile file as InternalRows. Used exclusively when |
| Infers (returns) the schema of the given files (as Hadoop’s FileStatuses) if supported. Otherwise, Used when: |
| Controls whether the format (under the given path as Hadoop Path) can be split or not. Used exclusively when |
| Prepares a write job and returns an Used exclusively when |
| Flag that says whether the format supports vectorized decoding (aka columnar batch) or not. Default: Used exclusively when |
| Defines the fully-qualified class names (types) of the concrete ColumnVectors for every column in the input Default: undefined ( Used exclusively when |
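The data reader described in the table above can be pictured with a small, Spark-free model. The sketch below is only an illustration under stated assumptions: `SimpleFile`, `Row`, `files`, `dataSchema` and `buildLineReader` are hypothetical names (not Spark types or methods), and an in-memory map plays the role of the file system. It shows the shape of the contract: the build step runs once and returns a function that is later applied to every file.

```scala
object ReaderContractSketch {
  // Hypothetical stand-ins for Spark's PartitionedFile and InternalRow.
  final case class SimpleFile(path: String)
  type Row = Vector[String]

  // In-memory "file system" so the sketch runs without any I/O (assumption).
  val files: Map[String, Seq[String]] = Map(
    "part-0.csv" -> Seq("1,alice,NL", "2,bob,PL"),
    "part-1.csv" -> Seq("3,carol,DE"))

  // Assumed column layout of every file (the "data schema" of this model).
  val dataSchema: Vector[String] = Vector("id", "name", "country")

  // Build step: resolve the required columns once, then return a function
  // that turns one file into an iterator of rows holding only those columns.
  def buildLineReader(requiredSchema: Seq[String]): SimpleFile => Iterator[Row] = {
    val indices = requiredSchema.map(name => dataSchema.indexOf(name)).toVector
    require(!indices.contains(-1), "requiredSchema must be a subset of dataSchema")
    (file: SimpleFile) =>
      files(file.path).iterator.map { line =>
        val fields = line.split(",")
        indices.map(i => fields(i))
      }
  }
}
```

In this model, calling `buildLineReader(Seq("name", "id"))` once and applying the result to each `SimpleFile` mirrors how a single reader function is reused across files; rows are produced lazily through the iterator.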
Building Data Reader With Partition Column Values Appended — buildReaderWithPartitionValues
Method
buildReaderWithPartitionValues(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function from buildReader.
Internally, buildReaderWithPartitionValues builds a data reader with the input parameters and gives a data reader function (of a PartitionedFile to an Iterator[InternalRow]) that does the following:
- Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema
- Applies the data reader to a PartitionedFile and converts the result using the converter on the joined row with the partition column values appended.
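The two steps above can be sketched without Spark. This is a simplified model, not Spark's implementation: `SimpleFile`, `Row`, `project` and `withPartitionValues` are hypothetical names, plain `Vector[Any]` rows stand in for InternalRow, and an ordinary function stands in for the UnsafeProjection generated by GenerateUnsafeProjection.

```scala
object PartitionValuesSketch {
  // Hypothetical stand-in for InternalRow.
  type Row = Vector[Any]

  // Hypothetical stand-in for PartitionedFile: a path plus the values of
  // the partition columns for that file (e.g. taken from .../year=2024/).
  final case class SimpleFile(path: String, partitionValues: Row)

  // Stand-in for the generated projection (the "converter"): it receives the
  // joined row (data columns followed by partition columns), checks its
  // width, and returns it unchanged.
  def project(width: Int): Row => Row = { joined =>
    require(joined.length == width, s"expected $width columns, got ${joined.length}")
    joined
  }

  // Wraps a plain data reader so that every row it produces gets the file's
  // partition column values appended.
  def withPartitionValues(
      dataReader: SimpleFile => Iterator[Row],
      requiredColumns: Int,
      partitionColumns: Int): SimpleFile => Iterator[Row] = {
    // Step 1: create the converter once, for required + partition columns.
    val converter = project(requiredColumns + partitionColumns)
    // Step 2: read the file, join each row with the partition values, convert.
    (file: SimpleFile) =>
      dataReader(file).map(dataRow => converter(dataRow ++ file.partitionValues))
  }
}
```

The converter is created once, outside the returned function, so the per-file work is limited to reading rows and joining them with that file's partition values.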
Note
buildReaderWithPartitionValues is used exclusively when the FileSourceScanExec physical operator is requested for the input RDDs.