PartitionedFile — File Block in FileFormat Data Source

PartitionedFile is a part (block) of a file, in a sense similar to a Parquet block or an HDFS split.

PartitionedFile represents a chunk of a file that will be read in a single partition, together with the partition column values that are appended to each row.

Note
Partition column values are the values of the partition columns, which are encoded in the directory structure of a partitioned dataset rather than stored in the data files themselves.
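
As a quick illustration, consider a (hypothetical) dataset partitioned by a country column, sketched below. The partition value lives in the directory name and is appended to every row read from the files underneath (this assumes a spark-shell session with the spark SparkSession in scope):

// Hypothetical layout of a dataset partitioned by a country column:
//   /data/users/country=PL/part-00000.parquet
//   /data/users/country=US/part-00000.parquet
// Reading the root directory restores country as a regular column
val users = spark.read.parquet("/data/users")
users.printSchema // the schema includes the country partition column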

PartitionedFile is created exclusively when FileSourceScanExec is requested to create the input RDD for bucketed or non-bucketed reads.

PartitionedFile takes the following to be created:

  • Partition column values to be appended to each row (as an internal row)

  • Path of the file to read

  • Beginning offset (in bytes)

  • Number of bytes to read (aka length)

  • Locality information, i.e. a list of nodes (by host name) that have the data (Array[String]). Default: empty

import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.catalyst.InternalRow

// No partition column values, a fake path, 10 bytes to read from offset 0,
// and the data available on hosts host0 and host1
val partFile = PartitionedFile(InternalRow.empty, "fakePath0", 0, 10, Array("host0", "host1"))
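
As a minimal sketch of how such instances relate to FileSourceScanExec (with made-up sizes and a hypothetical path, not the actual splitting logic), a single large file can be chopped into several PartitionedFiles that cover disjoint byte ranges:

import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.catalyst.InternalRow

// Made-up numbers: a 96 MB file read in chunks of at most 32 MB
val fileSize = 96L * 1024 * 1024
val maxSplitBytes = 32L * 1024 * 1024
val splits = (0L until fileSize by maxSplitBytes).map { offset =>
  PartitionedFile(
    InternalRow.empty,              // no partition column values
    "hdfs://nn/data/huge.parquet",  // hypothetical file path
    offset,                         // beginning offset (bytes)
    math.min(maxSplitBytes, fileSize - offset)) // number of bytes to read
}
// splits.size == 3: ranges 0-32 MB, 32-64 MB, 64-96 MB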

PartitionedFile uses the following text representation (toString):

path: [filePath], range: [start]-[end], partition values: [partitionValues]

scala> :type partFile
org.apache.spark.sql.execution.datasources.PartitionedFile

scala> println(partFile)
path: fakePath0, range: 0-10, partition values: [empty row]
