PartitionedFile — File Block in FileFormat Data Source
PartitionedFile is a part (block) of a file that is in a sense similar to a Parquet block or an HDFS split.

PartitionedFile represents a chunk of a file that will be read, along with partition column values appended to each row, in a partition.
Note: Partition column values are the values of the partitioning columns and are therefore part of the directory structure, not of the partitioned files themselves (which together make up the partitioned dataset).
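To illustrate (the output path and the column name p are made up for this sketch; spark is the spark-shell session), writing a Hive-style partitioned dataset encodes the partitioning column value in directory names, which is where a PartitionedFile gets its partition column values from:

import org.apache.spark.sql.functions.col

spark.range(4)
  .withColumn("p", col("id") % 2)   // "p" becomes a partitioning column
  .write
  .partitionBy("p")
  .parquet("/tmp/partitioned-demo")

// Resulting directory layout (the values of p live in the directory names):
//   /tmp/partitioned-demo/p=0/part-....parquet
//   /tmp/partitioned-demo/p=1/part-....parquet
//
// The files under p=0/ do not store the p column. A PartitionedFile for one of
// those files carries p=0 as its partition column values, and that value is
// appended to every row read from the file.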
PartitionedFile is created exclusively when FileSourceScanExec is requested to create the input RDD for bucketed or non-bucketed reads.
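As a rough, hypothetical sketch of that splitting for a non-bucketed read (this is not the actual FileSourceScanExec code; the file length and split size are made-up values and only a single file is considered):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

// Hypothetical inputs: one 300-byte file and a 128-byte maximum split size
val filePath = "fakePath0"
val fileLength = 300L
val maxSplitBytes = 128L

// Chop the file into consecutive byte ranges, one PartitionedFile per range
val splits: Seq[PartitionedFile] =
  (0L until fileLength by maxSplitBytes).map { offset =>
    val length = math.min(maxSplitBytes, fileLength - offset)
    PartitionedFile(InternalRow.empty, filePath, offset, length)
  }
// splits cover the ranges 0-128, 128-256 and 256-300 of fakePath0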
PartitionedFile takes the following to be created (see the constructor sketch after the list):

- Partition column values to be appended to each row (as an internal row)
- Path of the file to read
- Beginning offset of the block in the file (in bytes)
- Number of bytes to read (the length of the block)
- Locality information, i.e. a list of the nodes (by their host names) that have the data (Array[String]). Default: empty
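For reference, these parameters correspond to the constructor of the PartitionedFile case class, sketched below from the Spark 2.x sources (simplified, not copied verbatim; field names and order may differ slightly across versions):

import org.apache.spark.sql.catalyst.InternalRow

case class PartitionedFile(
    partitionValues: InternalRow,   // partition column values appended to each row
    filePath: String,               // path of the file to read
    start: Long,                    // beginning offset in the file (in bytes)
    length: Long,                   // number of bytes to read
    @transient locations: Array[String] = Array.empty)  // hosts that have the data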
PartitionedFile uses the following text representation (toString):
path: [filePath], range: [start]-[end], partition values: [partitionValues]
The following spark-shell session creates a PartitionedFile and shows its text representation:

import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.catalyst.InternalRow

val partFile = PartitionedFile(InternalRow.empty, "fakePath0", 0, 10, Array("host0", "host1"))

scala> :type partFile
org.apache.spark.sql.execution.datasources.PartitionedFile

scala> println(partFile)
path: fakePath0, range: 0-10, partition values: [empty row]