FileSourceScanExec Leaf Physical Operator

FileSourceScanExec is a DataSourceScanExec (and so indirectly a leaf physical operator) that…​FIXME

FileSourceScanExec is created when FileSourceStrategy execution planning strategy resolves LogicalRelation logical operators.

val q = spark.read.option("header", true).csv("../datasets/people.csv")
val logicalPlan = q.queryExecution.logical
scala> println(logicalPlan.numberedTreeString)
00 Relation[id#63,name#64,age#65] csv

import org.apache.spark.sql.execution.datasources.FileSourceStrategy
val sparkPlan = FileSourceStrategy(logicalPlan).head
scala> println(sparkPlan.numberedTreeString)
00 FileScan csv [id#63,name#64,age#65] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/datasets/people.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string,age:string>

import org.apache.spark.sql.execution.FileSourceScanExec
val fileScanExec = sparkPlan.asInstanceOf[FileSourceScanExec]

FileSourceScanExec supports ColumnarBatchScan.

FileSourceScanExec always gives inputRDD as the only RDD that generates internal rows (when WholeStageCodegenExec is executed).

nodeNamePrefix is File (and is used for the simple node description).

val fileScanExec: FileSourceScanExec = ... // see the example earlier
scala> fileScanExec.nodeNamePrefix
res1: String = File

scala> fileScanExec.simpleString
res2: String = FileScan csv [id#63,name#64,age#65] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/jacek/dev/oss/datasets/people.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string,age:string>
Table 1. FileSourceScanExec’s Performance Metrics
Key Name (in web UI) Description

metadataTime

metadata time (ms)

numFiles

number of files

numOutputRows

number of output rows

scanTime

scan time

spark sql FileSourceScanExec webui query details.png
Figure 1. FileSourceScanExec in web UI (Details for Query)
Caution
FIXME Why is the node name of FileSourceScanExec in the diagram above without File nodeNamePrefix?
Table 2. FileSourceScanExec’s Internal Properties (e.g. Registries, Counters and Flags)
Name Description

inputRDD

RDD of internal binary rows (i.e. InternalRow)

Used when FileSourceScanExec is requested for inputRDDs and execution.

metadata

Metadata (as a collection of key-value pairs)

Note
metadata is a part of DataSourceScanExec Contract to..FIXME.

needsUnsafeRowConversion

pushedDownFilters

supportsBatch

Tip

Enable INFO logging level for org.apache.spark.sql.execution.FileSourceScanExec logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.FileSourceScanExec=INFO

Refer to Logging.

vectorTypes Method

vectorTypes: Option[Seq[String]]
Note
vectorTypes is a part of ColumnarBatchScan Contract to..FIXME.

vectorTypes…​FIXME

Executing FileSourceScanExec — doExecute Method

doExecute(): RDD[InternalRow]
Note
doExecute is a part of SparkPlan Contract to produce the result of a structured query as an RDD of internal binary rows.

doExecute…​FIXME

Generating Java Source Code — doProduce Method

doProduce(ctx: CodegenContext): String
Note
doProduce is a part of CodegenSupport Contract to generate a Java source code for…​FIXME

doProduce…​FIXME

createBucketedReadRDD Internal Method

createBucketedReadRDD(
  bucketSpec: BucketSpec,
  readFile: (PartitionedFile) => Iterator[InternalRow],
  selectedPartitions: Seq[PartitionDirectory],
  fsRelation: HadoopFsRelation): RDD[InternalRow]

createBucketedReadRDD…​FIXME

Note
createBucketedReadRDD is used when…​FIXME

createNonBucketedReadRDD Internal Method

createNonBucketedReadRDD(
  readFile: (PartitionedFile) => Iterator[InternalRow],
  selectedPartitions: Seq[PartitionDirectory],
  fsRelation: HadoopFsRelation): RDD[InternalRow]

createNonBucketedReadRDD…​FIXME

Note
createNonBucketedReadRDD is used when…​FIXME

selectedPartitions Internal Lazy-Initialized Property

selectedPartitions: Seq[PartitionDirectory]

selectedPartitions…​FIXME

Note

selectedPartitions is used when FileSourceScanExec calculates:

Creating FileSourceScanExec Instance

FileSourceScanExec takes the following when created:

FileSourceScanExec initializes the internal registries and counters.

results matching ""

    No results matching ""