// evaluating an expression
// Use Literal expression to create an expression from a Scala object
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.Literal
val e: Expression = Literal("hello")
import org.apache.spark.sql.catalyst.expressions.EmptyRow
val v: Any = e.eval(EmptyRow)
// Convert to Scala's String
import org.apache.spark.unsafe.types.UTF8String
scala> val s = v.asInstanceOf[UTF8String].toString
s: String = hello
Catalyst Expression — Executable Node in Catalyst Tree
Expression is an executable node (in a Catalyst tree) that can evaluate a result value given input values, i.e. can produce a JVM object per InternalRow.
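As a minimal sketch of "a JVM object per InternalRow", the following evaluates an expression against a hand-built row. It assumes a spark-shell session with the Catalyst classes on the classpath; `ref`, `plusOne` and `row` are names introduced only for this illustration.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Add, BoundReference, Literal}
import org.apache.spark.sql.types.IntegerType

// BoundReference reads the value at ordinal 0 of the input row
val ref = BoundReference(0, IntegerType, nullable = false)

// An expression tree: (value at ordinal 0) + 1
val plusOne = Add(ref, Literal(1))

// A one-column internal row holding the integer 41
val row = InternalRow(41)

// Interpreted evaluation gives back a plain JVM object (typed as Any)
val result: Any = plusOne.eval(row)
println(result)
```

The same expression can be evaluated against any number of rows; only the input row changes between calls.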
Note
|
Expression is often called a Catalyst expression even though it is merely built using (and is not part of) the Catalyst Tree Manipulation Framework.
|
Expression can generate Java source code that is then used in evaluation.
Expression is deterministic when it evaluates to the same result for the same input(s). An expression is deterministic if all of its child expressions are (which is always true for leaf expressions, since they have no child expressions).
Note
|
A deterministic expression is like a pure function in functional programming languages. |
val e = $"a".expr
scala> :type e
org.apache.spark.sql.catalyst.expressions.Expression
scala> println(e.deterministic)
true
Note
|
Non-deterministic expressions are not allowed in some logical operators and are excluded in some optimizations. |
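For contrast, here is a small sketch of how the deterministic flag propagates from a child to its parent. It assumes a Spark 2.x or 3.x spark-shell, where Column exposes the underlying expression through expr (as the example above does); the value names are made up for this illustration.

```scala
import org.apache.spark.sql.functions.{lit, rand}

// rand() is backed by a non-deterministic expression
val nonDet = rand().expr

// A literal always evaluates to the same value
val det = lit(1).expr

// One non-deterministic child makes the whole tree non-deterministic
val mixed = (lit(1) + rand()).expr

println(nonDet.deterministic) // false
println(det.deterministic)    // true
println(mixed.deterministic)  // false
```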
verboseString
is…FIXME
| Name | Scala Kind | Behaviour | Examples |
|---|---|---|---|
| | abstract class | | |
| | trait | Does not support code generation and falls back to interpreted mode | |
| | trait | | |
| | trait | Marks | |
| | abstract class | Has no child expressions (and hence "terminates" the expression tree). | |
| | | Can later be referenced in a dataflow graph. | |
| | trait | | |
| | trait | Expression with no SQL representation. Gives the only custom sql method that is non-overridable (i.e. When requested SQL representation, | |
| | trait | | |
| | abstract class | | |
| | trait | Timezone-aware expressions | |
| | abstract class | | |
| | trait | Cannot be evaluated to produce a value (neither in interpreted nor code-generated expression evaluations), i.e. eval and doGenCode are not supported and simply report an | |
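The "cannot be evaluated" behaviour in the last row can be observed directly. The sketch below assumes that row describes Spark's Unevaluable trait (the name is not shown in the table above) and uses UnresolvedAttribute as one expression that mixes it in. The exception type differs between Spark versions, so the example only checks that evaluation fails.

```scala
import scala.util.Try
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{EmptyRow, Unevaluable}

// An attribute that has not been resolved against any schema yet
val attr = UnresolvedAttribute("a")

// Check that it is marked as unevaluable
println(attr.isInstanceOf[Unevaluable])

// Interpreted evaluation is not supported, so the call is expected to throw
val attempt = Try(attr.eval(EmptyRow))
println(attempt.isFailure)
```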
Expression Contract
package org.apache.spark.sql.catalyst.expressions
abstract class Expression extends TreeNode[Expression] {
// only required methods that have no implementation
def dataType: DataType
def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode
def eval(input: InternalRow = EmptyRow): Any
def nullable: Boolean
}
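To make the contract concrete, here is a hedged sketch of a tiny custom expression. FortyTwo is a hypothetical name, not part of Spark. It extends LeafExpression (so there are no children to manage) and mixes in CodegenFallback so that doGenCode does not have to be written by hand; only dataType, nullable and eval are defined.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{EmptyRow, LeafExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{DataType, IntegerType}

// Hypothetical leaf expression that ignores its input row and returns 42
case class FortyTwo() extends LeafExpression with CodegenFallback {
  // Data type of the result of evaluating the expression
  override def dataType: DataType = IntegerType

  // The result is never null
  override def nullable: Boolean = false

  // Interpreted evaluation
  override def eval(input: InternalRow): Any = 42
}

val answer = FortyTwo().eval(EmptyRow)
println(answer)
```

Relying on CodegenFallback keeps the sketch short at the cost of interpreted-only evaluation; a production expression would more likely implement doGenCode itself.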
| Method | Description |
|---|---|
| | Verifies (checks the correctness of) the input data types |
| | Data type of the result of evaluating an expression |
| | Code-generated expression evaluation that generates Java source code (that is used to evaluate the expression in a more optimized way not directly using eval). Used when |
| | Interpreted (non-code-generated) expression evaluation that evaluates an expression to a JVM object for a given internal binary row (without generating corresponding Java code). |
| | Generates the Java source code for code-generated (non-interpreted) expression evaluation (on an input internal row in a more optimized way not directly using eval). Similar to doGenCode but supports expression reuse (aka subexpression elimination). |
| | User-facing name |
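The code-generated path can be explored from spark-shell. The sketch below assumes the public entry point is Expression.genCode and that CodegenContext can be created with its no-argument constructor. The exact generated text varies between Spark versions, so no output is shown.

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext

// A fresh context that collects generated functions and mutable state
val ctx = new CodegenContext

// Ask the expression for its generated Java code
val exprCode = Literal(1).genCode(ctx)

// The Java term (or literal) that holds the result
println(exprCode.value)

// The Java term (or literal) that tells whether the result is null
println(exprCode.isNull)

// The Java statements that compute the result (may be empty for a literal)
println(exprCode.code)
```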
reduceCodeSize Internal Method
reduceCodeSize(ctx: CodegenContext, eval: ExprCode): Unit
reduceCodeSize does its work only when all of the following are met:
- Length of the generated code is above 1024
- INPUT_ROW of the input CodegenContext is defined
- currentVars of the input CodegenContext is not defined
Caution
|
FIXME When would the above not be met? What’s so special about such an expression? |
reduceCodeSize sets the value of the input ExprCode to the fresh term name for the value name.
In the end, reduceCodeSize sets the code of the input ExprCode to the following:
[javaType] [newValue] = [funcFullName]([INPUT_ROW]);
The funcFullName is the fresh term name for the name of the current expression node.
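The conditions and the final rewrite described above can be modelled without Spark. The sketch below is not Spark's implementation: Ctx and Code are hypothetical stand-ins for CodegenContext and ExprCode, and the fresh names are passed in as plain parameters instead of being generated.

```scala
// Hypothetical stand-ins for CodegenContext and ExprCode (not the real classes)
final case class Ctx(inputRow: Option[String], currentVars: Option[Seq[String]])
final case class Code(code: String, value: String)

// All three conditions from the list above must hold
def shouldReduce(ctx: Ctx, ev: Code): Boolean =
  ev.code.length > 1024 && ctx.inputRow.isDefined && ctx.currentVars.isEmpty

// Replaces the long code with a single call to a separate function,
// following the template: [javaType] [newValue] = [funcFullName]([INPUT_ROW]);
def reduced(ctx: Ctx, ev: Code, javaType: String, newValue: String, funcFullName: String): Code =
  if (shouldReduce(ctx, ev))
    Code(s"$javaType $newValue = $funcFullName(${ctx.inputRow.get});", newValue)
  else ev

val longCode = Code("x" * 2000, "value_0")
val rowCtx   = Ctx(inputRow = Some("i"), currentVars = None)

println(reduced(rowCtx, longCode, "int", "value_1", "add_0").code)
// int value_1 = add_0(i);
```

When any condition fails (short code, no input row, or currentVars defined), the model returns the input unchanged.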
Tip
|
Use the expression node name to search for the function that corresponds to the expression in the generated code. |
Note
|
reduceCodeSize is used exclusively when Expression is requested to generate the Java source code for code-generated expression evaluation.
|
flatArguments Method
flatArguments: Iterator[Any]
flatArguments
…FIXME
Note
|
flatArguments is used when…FIXME
|
SQL Representation — sql
Method
sql: String
sql gives a SQL representation.
Internally, sql gives a text representation with prettyName followed by the sql of the children in round brackets, concatenated using a comma (,).
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.Sentences
val sentences = Sentences("Hi there! Good morning.", "en", "US")
import org.apache.spark.sql.catalyst.expressions.Expression
val expr: Expression = count("*") === 5 && count(sentences) === 5
scala> expr.sql
res0: String = ((count('*') = 5) AND (count(sentences('Hi there! Good morning.', 'en', 'US')) = 5))
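The default composition can be re-created by hand. In the sketch below, defaultSql is a hypothetical helper written for this illustration, and the comma-plus-space separator is an assumption based on the output above. Upper is used on the assumption that it does not override sql, in which case both printed lines should match.

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, Upper}

// Hypothetical helper: prettyName followed by the children's SQL in round brackets
def defaultSql(e: Expression): String =
  s"${e.prettyName}(${e.children.map(_.sql).mkString(", ")})"

val upper = Upper(Literal("hello"))

// Hand-built representation
println(defaultSql(upper))

// Representation reported by the expression itself
println(upper.sql)
```

Expressions such as the comparison and AND operators in the example above print in infix form, so they evidently provide their own sql instead of this default.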
Note
|
sql is used when…FIXME
|