Catalyst Expression — Executable Node in Catalyst Tree

Expression is a executable node (in a Catalyst tree) that can evaluate a result value given input values, i.e. can produce a JVM object per InternalRow.

Note
Expression is often called a Catalyst expression even though it is merely built using (not be part of) the Catalyst — Tree Manipulation Framework.
// evaluating an expression
// Use Literal expression to create an expression from a Scala object
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.Literal
val e: Expression = Literal("hello")

import org.apache.spark.sql.catalyst.expressions.EmptyRow
val v: Any = e.eval(EmptyRow)

// Convert to Scala's String
import org.apache.spark.unsafe.types.UTF8String
scala> val s = v.asInstanceOf[UTF8String].toString
s: String = hello

Expression can generate a Java source code that is then used in evaluation.

Expression is deterministic when always evaluates the same result for the same inputs. By default, an expression is deterministic if all the child expressions are (which for leaf expressions with no child expressions is true).

Note
A deterministic expression is like a pure function in functional programming languages.
val e = $"a".expr
scala> :type e
org.apache.spark.sql.catalyst.expressions.Expression

scala> println(e.deterministic)
true

verboseString is…​FIXME

Table 1. Specialized Expressions
Name Scala Kind Behaviour Examples

BinaryExpression

abstract class

CodegenFallback

trait

Does not support code generation and falls back to interpreted mode

ExpectsInputTypes

trait

ExtractValue

trait

Marks UnresolvedAliases to be resolved to Aliases with "pretty" SQLs when ResolveAliases is executed

LeafExpression

abstract class

Has no child expressions (and hence "terminates" the expression tree).

NamedExpression

Can later be referenced in a dataflow graph.

Nondeterministic

trait

NonSQLExpression

trait

Expression with no SQL representation

Gives the only custom sql method that is non-overridable (i.e. final).

When requested SQL representation, NonSQLExpression transforms Attributes to be PrettyAttributes to build text representation.

Predicate

trait

Result data type is always boolean

TernaryExpression

abstract class

TimeZoneAwareExpression

trait

Timezone-aware expressions

UnaryExpression

abstract class

Unevaluable

trait

Cannot be evaluated to produce a value (neither in interpreted nor code-generated expression evaluations), i.e. eval and doGenCode are not supported and simply report an UnsupportedOperationException.

Unevaluable expressions have to be resolved (replaced) to some other expressions or logical operators at analysis or optimization phases or they fail analysis.

/**
Example: Analysis failure due to an Unevaluable expression
UnresolvedFunction is an Unevaluable expression
Using Catalyst DSL to create a UnresolvedFunction
*/
import org.apache.spark.sql.catalyst.dsl.expressions._
val f = 'f.function()

import org.apache.spark.sql.catalyst.dsl.plans._
val logicalPlan = table("t1").select(f)
scala> println(logicalPlan.numberedTreeString)
00 'Project [unresolvedalias('f(), None)]
01 +- 'UnresolvedRelation `t1`

scala> spark.sessionState.analyzer.execute(logicalPlan)
org.apache.spark.sql.AnalysisException: Undefined function: 'f'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.;
  at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1198)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1198)
  at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1197)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1195)

Expression Contract

package org.apache.spark.sql.catalyst.expressions

abstract class Expression extends TreeNode[Expression] {
  // only required methods that have no implementation
  def dataType: DataType
  def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode
  def eval(input: InternalRow = EmptyRow): Any
  def nullable: Boolean
}
Table 2. (Subset of) Expression Contract
Method Description

canonicalized

checkInputDataTypes

Verifies (checks the correctness of) the input data types

childrenResolved

dataType

Data type of the result of evaluating an expression

doGenCode

Code-generated expression evaluation that generates a Java source code (that is used to evaluate the expression in a more optimized way not directly using eval).

Used when Expression is requested to genCode.

eval

Interpreted (non-code-generated) expression evaluation that evaluates an expression to a JVM object for a given internal binary row (without generating a corresponding Java code.)

Note
By default accepts EmptyRow, i.e. null.

eval is a slower "relative" of the code-generated (non-interpreted) expression evaluation.

foldable

genCode

Generates the Java source code for code-generated (non-interpreted) expression evaluation (on an input internal row in a more optimized way not directly using eval).

Similar to doGenCode but supports expression reuse (aka subexpression elimination).

genCode is a faster "relative" of the interpreted (non-code-generated) expression evaluation.

nullable

prettyName

User-facing name

references

resolved

semanticEquals

semanticHash

reduceCodeSize Internal Method

reduceCodeSize(ctx: CodegenContext, eval: ExprCode): Unit

reduceCodeSize does its work only when all of the following are met:

  1. Length of the generate code is above 1024

  2. INPUT_ROW of the input CodegenContext is defined

  3. currentVars of the input CodegenContext is not defined

Caution
FIXME When would the above not be met? What’s so special about such an expression?

reduceCodeSize sets the value of the input ExprCode to the fresh term name for the value name.

In the end, reduceCodeSize sets the code of the input ExprCode to the following:

[javaType] [newValue] = [funcFullName]([INPUT_ROW]);

The funcFullName is the fresh term name for the name of the current expression node.

Tip
Use the expression node name to search for the function that corresponds to the expression in a generated code.
Note
reduceCodeSize is used exclusively when Expression is requested to generate the Java source code for code-generated expression evaluation.

flatArguments Method

flatArguments: Iterator[Any]

flatArguments…​FIXME

Note
flatArguments is used when…​FIXME

SQL Representation — sql Method

sql: String

sql gives a SQL representation.

Internally, sql gives a text representation with prettyName followed by sql of children in the round brackets and concatenated using the comma (,).

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.Sentences
val sentences = Sentences("Hi there! Good morning.", "en", "US")

import org.apache.spark.sql.catalyst.expressions.Expression
val expr: Expression = count("*") === 5 && count(sentences) === 5
scala> expr.sql
res0: String = ((count('*') = 5) AND (count(sentences('Hi there! Good morning.', 'en', 'US')) = 5))
Note
sql is used when…​FIXME

results matching ""

    No results matching ""