DataSourceScanExec Contract — Leaf Physical Operators to Scan Over BaseRelation

DataSourceScanExec is the contract of leaf physical operators that represent scans over BaseRelation.

Note
There are two DataSourceScanExecs: FileSourceScanExec, which scans over data in a HadoopFsRelation, and RowDataSourceScanExec, which scans over a generic BaseRelation.

DataSourceScanExec supports Java code generation (aka codegen).

package org.apache.spark.sql.execution

trait DataSourceScanExec extends LeafExecNode with CodegenSupport {
  // only required vals and methods that have no implementation
  // the others follow
  def metadata: Map[String, String]
  val relation: BaseRelation
  val tableIdentifier: Option[TableIdentifier]
}
Table 1. (Subset of) DataSourceScanExec Contract

Property | Description
metadata | Metadata (as a collection of key-value pairs) that describes the scan when requested for the simple text representation.
relation | BaseRelation that is used in the node name and…FIXME
tableIdentifier | Optional TableIdentifier

Note
The prefix for variable names of DataSourceScanExec operators in generated Java source code is scan.

The default node name prefix is an empty string (that is used in the simple node description).

DataSourceScanExec uses the BaseRelation and the TableIdentifier as the node name in the following format:

Scan [relation] [tableIdentifier]
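
The node name format above can be sketched in plain Scala. This is a minimal illustration, not Spark's code: nodeName is a hypothetical free-standing function, and the relation and the table identifier are represented as plain strings (MyRelation and db.t are made-up values).

```scala
// Hypothetical stand-ins: the relation and the table identifier are plain strings,
// not Spark's BaseRelation and TableIdentifier.
def nodeName(relation: String, tableIdentifier: Option[String]): String =
  s"Scan $relation ${tableIdentifier.getOrElse("")}"

println(nodeName("MyRelation", Some("db.t")))  // Scan MyRelation db.t
println(nodeName("MyRelation", None))          // Scan MyRelation (followed by a trailing space)
```

With no table identifier the name still ends with a space, which would explain the gap between the relation and the output attributes in the simple text representation.
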
Table 2. DataSourceScanExecs

DataSourceScanExec | Description
FileSourceScanExec | Scan over data in HadoopFsRelation
RowDataSourceScanExec | Scan over a generic BaseRelation

Simple (Basic) Text Node Description (in Query Plan Tree) — simpleString Method

simpleString: String
Note
simpleString is part of QueryPlan Contract to give the simple text description of a TreeNode in a query plan tree.

simpleString creates a text representation of every key-value entry in the metadata…​FIXME

Internally, simpleString sorts the metadata and concatenates the keys and the values (separated by ": "). While doing so, simpleString redacts sensitive information in every value and abbreviates it to the first 100 characters.

simpleString uses truncatedString from Spark Core's Utils.

In the end, simpleString returns a text representation that is made up of the nodeNamePrefix, the nodeName, the output (schema attributes) and the metadata and is of the following format:

[nodeNamePrefix][nodeName][[output]][metadata]
// basicDataSourceScanExec is the helper method defined below
val scanExec = basicDataSourceScanExec
scala> println(scanExec.simpleString)
Scan $line143.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anon$1@57d94b26 [] PushedFilters: [], ReadSchema: struct<>

def basicDataSourceScanExec = {
  import org.apache.spark.sql.catalyst.expressions.AttributeReference
  val output = Seq.empty[AttributeReference]
  val requiredColumnsIndex = output.indices
  import org.apache.spark.sql.sources.Filter
  val filters, handledFilters = Set.empty[Filter]
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.UnsafeRow
  val row: InternalRow = new UnsafeRow(0)
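  // RDD has to be in scope here, not only inside the anonymous BaseRelation below
  import org.apache.spark.rdd.RDD
  val rdd: RDD[InternalRow] = sc.parallelize(row :: Nil)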

  import org.apache.spark.sql.sources.{BaseRelation, TableScan}
  val baseRelation: BaseRelation = new BaseRelation with TableScan {
    import org.apache.spark.sql.SQLContext
    val sqlContext: SQLContext = spark.sqlContext

    import org.apache.spark.sql.types.StructType
    val schema: StructType = new StructType()

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    def buildScan(): RDD[Row] = ???
  }

  val tableIdentifier = None
  import org.apache.spark.sql.execution.RowDataSourceScanExec
  RowDataSourceScanExec(
    output, requiredColumnsIndex, filters, handledFilters, rdd, baseRelation, tableIdentifier)
}
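
The steps that simpleString goes through (sort, redact, abbreviate, join) can be sketched without Spark. This is only an approximation under stated assumptions: redact and abbreviate are hypothetical stand-ins (the password pattern is made up for the example), the output attributes are plain strings, and mkString takes the place of truncatedString.

```scala
// Stand-in for redaction: masks anything that looks like password=<value>.
// The pattern is an assumption made for this sketch only.
def redact(text: String): String =
  text.replaceAll("password=[^&\\s]+", "password=***")

// Stand-in for abbreviation: keeps at most `max` characters, ending with "..." when cut.
def abbreviate(text: String, max: Int): String =
  if (text.length <= max) text else text.take(max - 3) + "..."

def simpleString(
    nodeNamePrefix: String,
    nodeName: String,
    output: Seq[String],
    metadata: Map[String, String]): String = {
  // Sort the entries by key, then render each one as "key: value",
  // with the value redacted first and abbreviated second.
  val entries = metadata.toSeq.sorted.map { case (key, value) =>
    s"$key: ${abbreviate(redact(value), 100)}"
  }
  val outputStr = output.mkString("[", ",", "]")
  val metadataStr = entries.mkString(" ", ", ", "")
  s"$nodeNamePrefix$nodeName$outputStr$metadataStr"
}

val text = simpleString(
  nodeNamePrefix = "",
  nodeName = "Scan MyRelation ",
  output = Seq("id#0", "name#1"),
  metadata = Map(
    "ReadSchema" -> "struct<id:int,name:string>",
    "PushedFilters" -> "[]",
    "Location" -> "jdbc:h2:mem:test?password=secret"))
println(text)
// Scan MyRelation [id#0,name#1] Location: jdbc:h2:mem:test?password=***, PushedFilters: [], ReadSchema: struct<id:int,name:string>
```

The entries come out in key order regardless of how the metadata map was built, and the password never reaches the final text.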

verboseString Method

verboseString: String
Note
verboseString is part of QueryPlan Contract to…​FIXME.

verboseString simply requests the parent QueryPlan for its verboseString and returns it with sensitive information redacted.

Text Representation of All Nodes in Tree — treeString Method

treeString(verbose: Boolean, addSuffix: Boolean): String
Note
treeString is part of TreeNode Contract to…​FIXME.

treeString simply requests the parent TreeNode for the text representation of all nodes (in the query plan tree) and returns it with sensitive information redacted.

Redacting Sensitive Information — redact Internal Method

redact(text: String): String

redact…​FIXME

Note
redact is used when DataSourceScanExec is requested for the simple, verbose and tree text representations.
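
The way verboseString and treeString wrap the parent's text in redact can be sketched with a small trait hierarchy. Everything here is hypothetical: PlanNode, RedactedText and DemoScan are made-up names, and the regular expression is an assumption, since the actual redaction rule is not described in this section.

```scala
// Hypothetical parent with plain, unredacted text representations.
trait PlanNode {
  def simpleString: String
  def verboseString: String = simpleString
  // addSuffix is accepted to mirror the signature but ignored in this sketch.
  def treeString(verbose: Boolean, addSuffix: Boolean): String =
    if (verbose) verboseString else simpleString
}

// Mixin that redacts whatever the parent produces; the regex is an assumption.
trait RedactedText extends PlanNode {
  def redact(text: String): String =
    text.replaceAll("secret=\\w+", "secret=(redacted)")

  override def verboseString: String = redact(super.verboseString)
  override def treeString(verbose: Boolean, addSuffix: Boolean): String =
    redact(super.treeString(verbose, addSuffix))
}

class DemoScan extends PlanNode with RedactedText {
  // Deliberately left unredacted to show what the overrides hide.
  def simpleString: String = "Scan demo url?secret=abc123"
}

val scan = new DemoScan
println(scan.verboseString)                                  // Scan demo url?secret=(redacted)
println(scan.treeString(verbose = true, addSuffix = false))  // Scan demo url?secret=(redacted)
```

Keeping the rule in a single redact method means every text representation applies the same masking.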
