DataSourceStrategy Execution Planning Strategy

DataSourceStrategy is an execution planning strategy (of SparkPlanner) that converts LogicalRelation logical operator to RowDataSourceScanExec physical operator.

Table 1. DataSourceStrategy’s Selection Requirements (in execution order)
Logical Operator Selection Requirements

LogicalRelation with CatalystScan relation

Uses pruneFilterProjectRaw

CatalystScan does not seem to be used in Spark SQL.

LogicalRelation with PrunedFilteredScan

Note
Matches JDBCRelation exclusively (as a PrunedFilteredScan)

LogicalRelation with PrunedScan

Note
PrunedScan does not seem to be used in Spark SQL.

LogicalRelation with TableScan relation

Matches KafkaRelation exclusively (as it is TableScan)

import org.apache.spark.sql.execution.datasources.DataSourceStrategy
val strategy = DataSourceStrategy(spark.sessionState.conf)

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
val plan: LogicalPlan = ???

val sparkPlan = strategy(plan).head
Note
DataSourceStrategy uses PhysicalOperation Scala class to destructure a logical plan.

pruneFilterProject Internal Method

pruneFilterProject(
  relation: LogicalRelation,
  projects: Seq[NamedExpression],
  filterPredicates: Seq[Expression],
  scanBuilder: (Seq[Attribute], Array[Filter]) => RDD[InternalRow])

pruneFilterProject simply calls pruneFilterProjectRaw with scanBuilder ignoring the Seq[Expression] input parameter.

Note
pruneFilterProject is used when DataSourceStrategy plans a LogicalRelation with PrunedFilteredScan or PrunedScan scans.

Creating RowDataSourceScanExec (under FilterExec and ProjectExec) — pruneFilterProjectRaw Internal Method

pruneFilterProjectRaw(
  relation: LogicalRelation,
  projects: Seq[NamedExpression],
  filterPredicates: Seq[Expression],
  scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter]) => RDD[InternalRow]): SparkPlan

pruneFilterProjectRaw creates a RowDataSourceScanExec (possibly as a child of FilterExec that in turn could be a child of ProjectExec).

Note
pruneFilterProjectRaw is used when DataSourceStrategy executes (and selects RowDataSourceScanExec per LogicalRelation).

results matching ""

    No results matching ""