FileScanRDD — Input RDD of FileSourceScanExec Physical Operator

FileScanRDD is an RDD of internal binary rows (i.e. RDD[InternalRow]) that is used as the input RDD of a FileSourceScanExec physical operator (in Whole-Stage Java Code Generation).

FileScanRDD is created exclusively when FileSourceScanExec physical operator is requested to createBucketedReadRDD or createNonBucketedReadRDD (when FileSourceScanExec operator is requested for the input RDD when WholeStageCodegenExec physical operator is executed).

val q = spark.read.text("README.md")

val sparkPlan = q.queryExecution.executedPlan
import org.apache.spark.sql.execution.FileSourceScanExec
val scan = sparkPlan.collectFirst { case exec: FileSourceScanExec => exec }.get
val inputRDD = scan.inputRDDs.head

val rdd = q.queryExecution.toRdd
scala> println(rdd.toDebugString)
(1) MapPartitionsRDD[1] at toRdd at <console>:26 []
 |  FileScanRDD[0] at toRdd at <console>:26 []

val fileScanRDD = q.queryExecution.toRdd.dependencies.head.rdd

// What FileSourceScanExec uses for the input RDD is exactly the first RDD in the lineage
assert(inputRDD == fileScanRDD)
Table 1. FileScanRDD’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

ignoreCorruptFiles

spark.sql.files.ignoreCorruptFiles

Used exclusively when FileScanRDD is requested to compute a partition

ignoreMissingFiles

spark.sql.files.ignoreMissingFiles

Used exclusively when FileScanRDD is requested to compute a partition

getPreferredLocations Method

getPreferredLocations(split: RDDPartition): Seq[String]
Note
getPreferredLocations is part of the RDD Contract to…​FIXME.

getPreferredLocations…​FIXME

getPartitions Method

getPartitions: Array[RDDPartition]
Note
getPartitions is part of the RDD Contract to…​FIXME.

getPartitions…​FIXME

Creating FileScanRDD Instance

FileScanRDD takes the following when created:

Computing Partition (in TaskContext) — compute Method

compute(split: RDDPartition, context: TaskContext): Iterator[InternalRow]
Note
compute is part of Spark Core’s RDD Contract to compute a partition (in a TaskContext).

compute creates a Scala Iterator (of Java Objects) that…​FIXME

compute then requests the input TaskContext to register a completion listener to be executed when a task completes (i.e. addTaskCompletionListener) that simply closes the iterator.

In the end, compute returns the iterator.

results matching ""

    No results matching ""