DataSourceScanExec Contract — Leaf Physical Operators to Scan Over BaseRelation

DataSourceScanExec is the contract of leaf physical operators that represent scans over BaseRelation.

Note	There are two DataSourceScanExec implementations: FileSourceScanExec, which scans data in a HadoopFsRelation, and RowDataSourceScanExec, which scans a generic BaseRelation.

DataSourceScanExec supports Java code generation (aka codegen).

package org.apache.spark.sql.execution

trait DataSourceScanExec extends LeafExecNode with CodegenSupport {
  // only required vals and methods that have no implementation
  // the others follow
  def metadata: Map[String, String]
  val relation: BaseRelation
  val tableIdentifier: Option[TableIdentifier]
}

Table 1. (Subset of) DataSourceScanExec Contract
Property	Description
`metadata`	Metadata (as a collection of key-value pairs) that describes the scan when requested for the simple text representation.
`relation`	BaseRelation that is used in the node name and…FIXME
`tableIdentifier`	Optional `TableIdentifier`

Note	The prefix for variable names of `DataSourceScanExec` operators in generated Java source code is `scan`.

The default node name prefix is an empty string (that is used in the simple node description).

DataSourceScanExec uses the BaseRelation and the TableIdentifier as the node name in the following format:

Scan [relation] [tableIdentifier]
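The node name format above can be illustrated with a minimal sketch in plain Scala. `DemoRelation`, `DemoTableIdentifier` and `NodeNameSketch` are hypothetical stand-ins (not Spark classes) so that the snippet runs without Spark on the classpath. The sketch assumes that a missing table identifier is rendered as an empty string.

```scala
// Hypothetical stand-ins for Spark's BaseRelation and TableIdentifier,
// defined here only so the sketch runs without Spark on the classpath.
final case class DemoRelation(name: String) {
  override def toString: String = name
}

final case class DemoTableIdentifier(table: String, database: Option[String] = None) {
  // Unquoted, dot-separated name, e.g. "db.tbl" or just "tbl"
  def unquotedString: String = database.map(_ + ".").getOrElse("") + table
}

object NodeNameSketch {
  // Builds "Scan [relation] [tableIdentifier]".
  // Assumption: a missing identifier contributes an empty string.
  def nodeName(relation: DemoRelation, tableIdentifier: Option[DemoTableIdentifier]): String =
    s"Scan $relation ${tableIdentifier.map(_.unquotedString).getOrElse("")}"
}
```

Under this assumption, a scan with no table identifier gets a node name that ends with a space, which is consistent with the sample `simpleString` output shown later in this section.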

Table 2. DataSourceScanExecs
DataSourceScanExec	Description
FileSourceScanExec	Scan over data in a HadoopFsRelation
RowDataSourceScanExec	Scan over a generic BaseRelation

Simple (Basic) Text Node Description (in Query Plan Tree) — `simpleString` Method

simpleString: String

Note	`simpleString` is part of QueryPlan Contract to give the simple text description of a `TreeNode` in a query plan tree.

simpleString creates a text representation of every key-value entry in the metadata…FIXME

Internally, simpleString sorts the metadata and concatenates the keys and the values (separated by `: `). While doing so, simpleString redacts sensitive information in every value and abbreviates it to the first 100 characters.

simpleString uses the truncatedString utility of Spark Core’s Utils to build the text.

In the end, simpleString returns a text representation that is made up of the nodeNamePrefix, the nodeName, the output (schema attributes) and the metadata and is of the following format:

[nodeNamePrefix][nodeName][[output]][metadata]
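The steps above can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not Spark's code: `SimpleStringSketch` and `abbreviate` are hypothetical names, the redaction function is passed in as a parameter, the output attributes are plain strings, and `mkString` stands in for truncatedString (so any limit on the number of rendered fields is not modelled).

```scala
object SimpleStringSketch {
  // Hypothetical helper: values of up to maxLen characters are kept as-is;
  // longer ones are cut so that the result, with a trailing "...", is maxLen long.
  def abbreviate(text: String, maxLen: Int = 100): String =
    if (text.length <= maxLen) text else text.take(maxLen - 3) + "..."

  def simpleString(
      nodeNamePrefix: String,
      nodeName: String,
      output: Seq[String],
      metadata: Map[String, String],
      redact: String => String): String = {
    // Sort the entries (effectively by key, since keys are unique),
    // redact and abbreviate every value, then join key and value with ": "
    val metadataEntries = metadata.toSeq.sorted.map { case (key, value) =>
      key + ": " + abbreviate(redact(value))
    }
    // Entries are comma-separated and preceded by a single space
    val metadataStr = metadataEntries.mkString(" ", ", ", "")
    // [nodeNamePrefix][nodeName][[output]][metadata]
    s"$nodeNamePrefix$nodeName${output.mkString("[", ",", "]")}$metadataStr"
  }
}
```

With an empty prefix, no output attributes and two metadata entries, the sketch produces text of the same shape as the sample output in this section.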

// basicDataSourceScanExec is defined right below
val scanExec = basicDataSourceScanExec
scala> println(scanExec.simpleString)
Scan $line143.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anon$1@57d94b26 [] PushedFilters: [], ReadSchema: struct<>

def basicDataSourceScanExec = {
  import org.apache.spark.sql.catalyst.expressions.AttributeReference
  val output = Seq.empty[AttributeReference]
  val requiredColumnsIndex = output.indices
  import org.apache.spark.sql.sources.Filter
  val filters, handledFilters = Set.empty[Filter]
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.UnsafeRow
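  val row: InternalRow = new UnsafeRow(0)
  // RDD is imported here, as the import inside baseRelation below is scoped to that class only
  import org.apache.spark.rdd.RDD
  val rdd: RDD[InternalRow] = sc.parallelize(row :: Nil)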

  import org.apache.spark.sql.sources.{BaseRelation, TableScan}
  val baseRelation: BaseRelation = new BaseRelation with TableScan {
    import org.apache.spark.sql.SQLContext
    val sqlContext: SQLContext = spark.sqlContext

    import org.apache.spark.sql.types.StructType
    val schema: StructType = new StructType()

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    def buildScan(): RDD[Row] = ???
  }

  val tableIdentifier = None
  import org.apache.spark.sql.execution.RowDataSourceScanExec
  RowDataSourceScanExec(
    output, requiredColumnsIndex, filters, handledFilters, rdd, baseRelation, tableIdentifier)
}

`verboseString` Method

verboseString: String

Note	`verboseString` is part of QueryPlan Contract to…FIXME.

verboseString simply redacts sensitive information in the verboseString of the parent QueryPlan.

Text Representation of All Nodes in Tree — `treeString` Method

treeString(verbose: Boolean, addSuffix: Boolean): String

Note	`treeString` is part of TreeNode Contract to…FIXME.

treeString simply redacts sensitive information in the text representation of all nodes in the query plan tree (as given by the parent TreeNode).
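The pattern shared by verboseString and treeString (take the parent's text and pass it through redaction) can be sketched in plain Scala. `DemoParentNode` and `DemoScanNode` are hypothetical classes, and the `password=` rule is an assumption made for this illustration only; it says nothing about what Spark actually redacts.

```scala
// Hypothetical parent standing in for the QueryPlan / TreeNode text representations
class DemoParentNode(description: String) {
  def verboseString: String = description
  def treeString(verbose: Boolean, addSuffix: Boolean): String = "+- " + description + "\n"
}

class DemoScanNode(description: String) extends DemoParentNode(description) {
  // Hypothetical redaction rule for this sketch only:
  // mask the non-whitespace characters that follow "password="
  private def redact(text: String): String =
    text.replaceAll("password=\\S+", "password=***")

  // Both overrides delegate to the parent and redact the result
  override def verboseString: String = redact(super.verboseString)
  override def treeString(verbose: Boolean, addSuffix: Boolean): String =
    redact(super.treeString(verbose, addSuffix))
}
```

Overriding at the scan operator keeps the parent's formatting untouched while ensuring that no text representation leaves the node unredacted.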

Redacting Sensitive Information — `redact` Internal Method

redact(text: String): String

redact…FIXME

Note	`redact` is used when `DataSourceScanExec` is requested for the simple, verbose and tree text representations.
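The internals of redact are not described here. As an illustration only, the following sketch assumes that redaction is driven by an optional regular expression and that every match is replaced with a fixed marker. `RedactSketch`, its signature and the marker text are all assumptions of this sketch.

```scala
import scala.util.matching.Regex

object RedactSketch {
  // Assumed replacement marker (an assumption of this sketch)
  val ReplacementText = "*********(redacted)"

  // With no pattern configured, or with null/empty text, the input is returned unchanged;
  // otherwise every match of the pattern is replaced with the marker.
  def redact(pattern: Option[Regex], text: String): String = pattern match {
    case Some(regex) if text != null && text.nonEmpty =>
      regex.replaceAllIn(text, ReplacementText)
    case _ => text
  }
}
```

For example, `RedactSketch.redact(Some("(?i)secret\\w*".r), "url=x secretKey")` masks the `secretKey` token and leaves the rest of the text intact.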
