val q = sql("""
SELECT customer, year, SUM(sales)
FROM VALUES ("abc", 2017, 30) AS t1 (customer, year, sales)
GROUP BY customer, year
GROUPING SETS ((customer), (year))
""")
scala> println(q.queryExecution.logical.numberedTreeString)
00 'GroupingSets [ArrayBuffer('customer), ArrayBuffer('year)], ['customer, 'year], ['customer, 'year, unresolvedalias('SUM('sales), None)]
01 +- 'SubqueryAlias t1
02 +- 'UnresolvedInlineTable [customer, year, sales], [List(abc, 2017, 30)]
GroupingSets Unary Logical Operator
GroupingSets
is a unary logical operator that represents SQL’s GROUPING SETS variant of GROUP BY
clause.
GroupingSets
operator is resolved to an Aggregate
logical operator at analysis phase.
scala> println(q.queryExecution.analyzed.numberedTreeString)
00 Aggregate [customer#8, year#9, spark_grouping_id#5], [customer#8, year#9, sum(cast(sales#2 as bigint)) AS sum(sales)#4L]
01 +- Expand [List(customer#0, year#1, sales#2, customer#6, null, 1), List(customer#0, year#1, sales#2, null, year#7, 2)], [customer#0, year#1, sales#2, customer#8, year#9, spark_grouping_id#5]
02 +- Project [customer#0, year#1, sales#2, customer#0 AS customer#6, year#1 AS year#7]
03 +- SubqueryAlias t1
04 +- LocalRelation [customer#0, year#1, sales#2]
Note
|
GroupingSets can only be created using SQL.
|
Note
|
GroupingSets is not supported on Structured Streaming’s streaming Datasets.
|
GroupingSets
is never resolved (as it can only be converted to an Aggregate
logical operator).
The output schema of a GroupingSets
are exactly the attributes of aggregate named expressions.
Analysis Phase
GroupingSets
operator is resolved at analysis phase in the following logical evaluation rules:
-
ResolveAliases for unresolved aliases in aggregate named expressions
val spark: SparkSession = ...
// using q from the example above
val plan = q.queryExecution.logical
scala> println(plan.numberedTreeString)
00 'GroupingSets [ArrayBuffer('customer), ArrayBuffer('year)], ['customer, 'year], ['customer, 'year, unresolvedalias('SUM('sales), None)]
01 +- 'SubqueryAlias t1
02 +- 'UnresolvedInlineTable [customer, year, sales], [List(abc, 2017, 30)]
// Note unresolvedalias for SUM expression
// Note UnresolvedInlineTable and SubqueryAlias
// FIXME Show the evaluation rules to get rid of the unresolvable parts
Creating GroupingSets Instance
GroupingSets
takes the following when created:
-
Expressions from
GROUPING SETS
clause -
Grouping expressions from
GROUP BY
clause -
Child logical plan
-
Aggregate named expressions