Dataset API — Untyped Transformations

Untyped transformations are the part of the Dataset API that transforms a Dataset into a DataFrame, a Column, a RelationalGroupedDataset, a DataFrameNaFunctions or a DataFrameStatFunctions (and hence untyped).

Note
Untyped transformations are the methods of the Dataset Scala class that belong to the untypedrel group, i.e. that are annotated with @group untypedrel.

Table 1. Dataset API’s Untyped Transformations
Transformation Description

agg

agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame

apply

Selects a column based on the column name (i.e. maps a Dataset onto a Column)

apply(colName: String): Column

col

Selects a column based on the column name (i.e. maps a Dataset onto a Column)

col(colName: String): Column

colRegex

Selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column)

colRegex(colName: String): Column

crossJoin

crossJoin(right: Dataset[_]): DataFrame

cube

cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset

drop

drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame

groupBy

groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset

join

join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

na

na: DataFrameNaFunctions

rollup

rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset

select

select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame

selectExpr

selectExpr(exprs: String*): DataFrame

stat

stat: DataFrameStatFunctions

withColumn

withColumn(colName: String, col: Column): DataFrame

withColumnRenamed

withColumnRenamed(existingName: String, newName: String): DataFrame

agg Untyped Transformation

agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame

agg…​FIXME
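
As a minimal sketch of the three signatures above: agg aggregates over the entire Dataset, i.e. without grouping. The local SparkSession, the sample data and the column names are assumptions made up for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max}

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("agg-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq(("a", 1), ("a", 3), ("b", 5)).toDF("key", "value")

// (column name -> aggregate function name) pairs
val byPairs = df.agg("value" -> "max", "value" -> "avg")

// Column expressions
val byColumns = df.agg(max($"value"), avg($"value"))

// Map of column name to aggregate function name
val byMap = df.agg(Map("value" -> "max"))

// Each variant produces a single-row DataFrame
byColumns.show()
```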

apply Untyped Transformation

apply(colName: String): Column

apply selects a column based on the column name (i.e. maps a Dataset onto a Column).
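
Since apply is a regular Scala apply method, it can also be called with the parenthesis shorthand. The following is a small sketch, assuming a local SparkSession and a Dataset created with spark.range (both chosen for illustration only).

```scala
import org.apache.spark.sql.{Column, SparkSession}

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("apply-demo").getOrCreate()

val ds = spark.range(3) // a Dataset with a single `id` column

val viaApply: Column = ds.apply("id") // explicit call
val viaSugar: Column = ds("id")       // the same call using Scala's apply shorthand

// The resulting Column can be used in any Column-based operator
ds.select((viaSugar + 1).as("next")).show()
```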

col Untyped Transformation

col(colName: String): Column

col selects a column based on the column name (i.e. maps a Dataset onto a Column).

Internally, col branches off per the input column name.

If the column name is * (a star), col simply creates a Column with a ResolvedStar expression (with the schema output attributes of the analyzed logical plan of the QueryExecution).

Otherwise, col uses the colRegex untyped transformation when the spark.sql.parser.quotedRegexColumnNames configuration property is enabled.

When the column name is not * and the spark.sql.parser.quotedRegexColumnNames configuration property is disabled, col creates a Column with the column name resolved (as a NamedExpression).
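
The three branches can be exercised with a short sketch like the one below. The local SparkSession, the sample data and the column names are assumptions for illustration; the property is set at runtime through spark.conf and reset afterwards.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("col-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Branch 1: the star selects all the output columns
val everything = df.select(df.col("*"))

// Branch 3 (property disabled, the default): a plain column name is resolved
val justName = df.select(df.col("name"))

// Branch 2: with the property enabled, a backtick-quoted name is treated as a regex
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
val byRegex = df.select(df.col("`i.*`"))
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "false")

println(everything.columns.mkString(", "))
println(byRegex.columns.mkString(", "))
```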

colRegex Untyped Transformation

colRegex(colName: String): Column

colRegex selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column).

Note
colRegex is used in col when the spark.sql.parser.quotedRegexColumnNames configuration property is enabled (and the column name is not *).

Internally, colRegex matches the input column name against the following patterns (in this order):

  1. For column names in quotes (backticks) without a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with no table)

  2. For column names in quotes (backticks) with a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with a table specified)

  3. For other column names, colRegex (behaves like col and) creates a Column with the column name resolved (as a NamedExpression)
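
A minimal sketch of the first and the last case follows. The local SparkSession, the sample data and the column names are made up for illustration, and the qualified variant is left out on purpose.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("colRegex-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq((1, 2, "x")).toDF("col1", "col2", "name")

// Case 1: a backtick-quoted name is a regex matched against the column names
val byRegex = df.select(df.colRegex("`col.*`"))

// Case 3: a name without backticks is resolved like a regular column name
val plain = df.select(df.colRegex("name"))

println(byRegex.columns.mkString(", "))
println(plain.columns.mkString(", "))
```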

crossJoin Untyped Transformation

crossJoin(right: Dataset[_]): DataFrame

crossJoin…​FIXME

cube Untyped Transformation

cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset

cube…​FIXME

Dropping One or More Columns — drop Untyped Transformation

drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame

drop…​FIXME
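
As an illustration of the three signatures above, the sketch below drops columns by name and by Column reference. The local SparkSession, the sample data and the column names are assumptions. To my knowledge, dropping a column name that is not in the schema is a no-op.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("drop-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq((1, "a", true)).toDF("id", "name", "flag")

val withoutFlag = df.drop("flag")            // a single column name
val onlyId      = df.drop("name", "flag")    // multiple column names
val withoutId   = df.drop(df("id"))          // a Column reference
val unchanged   = df.drop("no_such_column")  // unknown names are ignored

println(withoutFlag.columns.mkString(", "))
```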

groupBy Untyped Transformation

groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset

groupBy…​FIXME
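
The following sketch shows one possible use of both signatures: groupBy gives a RelationalGroupedDataset, which becomes a DataFrame again once an aggregation is applied. The local SparkSession, the sample data and the column names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("groupBy-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

// Grouping by column name
val totalsByName = df.groupBy("key").agg(sum("value").as("total"))

// Grouping by Column expression
val totalsByColumn = df.groupBy($"key").agg(sum($"value").as("total"))

totalsByName.show()
```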

join Untyped Transformation

join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

join…​FIXME
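
As a hedged illustration of some of the signatures above: joining on column names keeps a single copy of the join column, while joining on a Column expression keeps the columns of both sides. The local SparkSession, the sample data and the column names are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x"), (3, "y")).toDF("id", "r")

// Inner equi-join on a column name
val inner = left.join(right, "id")

// Equi-join on column names with an explicit join type
val leftOuter = left.join(right, Seq("id"), "left_outer")

// Join on an arbitrary Column expression
val byExpr = left.join(right, left("id") === right("id"), "inner")

inner.show()
leftOuter.show()
```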

na Untyped Transformation

na: DataFrameNaFunctions

na simply creates a DataFrameNaFunctions to work with missing data.
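
A short sketch of working with missing data through na follows. The local SparkSession, the sample data and the replacement values are assumptions made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("na-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data with missing values (None becomes null)
val rows: Seq[(Option[Int], Option[String])] =
  Seq((Some(1), Some("a")), (None, Some("b")), (Some(3), None))
val df = rows.toDF("id", "name")

// Drop the rows that contain any null
val complete = df.na.drop()

// Replace nulls per column
val filled = df.na.fill(0L, Seq("id")).na.fill("unknown", Seq("name"))

complete.show()
filled.show()
```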

rollup Untyped Transformation

rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset

rollup…​FIXME
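
As an illustration only, the sketch below computes subtotals with rollup: for the columns k1 and k2 the result holds one row per (k1, k2) pair, one per k1 value, and one grand total. The local SparkSession, the sample data and the column names are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("rollup-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)).toDF("k1", "k2", "v")

// Sums per (k1, k2), per k1, and over the whole Dataset
val rolled = df.rollup("k1", "k2").sum("v")

// The grand total is the row where both grouping columns are null
val grandTotal = rolled.filter($"k1".isNull && $"k2".isNull)

rolled.show()
```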

select Untyped Transformation

select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame

select…​FIXME

Projecting Columns using SQL Statements — selectExpr Untyped Transformation

selectExpr(exprs: String*): DataFrame

selectExpr is like select, but accepts SQL expressions.

val ds = spark.range(5)

scala> ds.selectExpr("rand() as random").show
16/04/14 23:16:06 INFO HiveSqlParser: Parsing command: rand() as random
+-------------------+
|             random|
+-------------------+
|  0.887675894185651|
|0.36766085091074086|
| 0.2700020856675186|
| 0.1489033635529543|
| 0.5862990791950973|
+-------------------+

Internally, selectExpr executes select with every expression in exprs mapped to a Column (using SparkSqlParser.parseExpression).

scala> ds.select(expr("rand() as random")).show
+------------------+
|            random|
+------------------+
|0.5514319279894851|
|0.2876221510433741|
|0.4599999092045741|
|0.5708558868374893|
|0.6223314406247136|
+------------------+

stat Untyped Transformation

stat: DataFrameStatFunctions

stat simply creates a DataFrameStatFunctions to work with statistical functions.
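
A minimal sketch of one statistical function available through stat, the Pearson correlation of two columns. The local SparkSession, the sample data and the column names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("stat-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data: `doubled` is a linear function of `id`
val df = spark.range(0, 10).toDF("id").withColumn("doubled", $"id" * 2)

// Correlation of two perfectly linearly related columns should be very close to 1.0
val correlation: Double = df.stat.corr("id", "doubled")

println(correlation)
```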

withColumn Untyped Transformation

withColumn(colName: String, col: Column): DataFrame

withColumn…​FIXME
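
As a hedged illustration of the signature above: withColumn adds a column, or replaces an existing column that has the same name. The local SparkSession, the sample data and the column names are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("withColumn-demo").getOrCreate()
import spark.implicits._

val df = spark.range(3).toDF("id") // ids 0, 1 and 2

// Adds a new column
val added = df.withColumn("squared", $"id" * $"id")

// Reusing the name replaces the column instead of adding another one
val replaced = added.withColumn("squared", $"squared" + 1)

added.show()
replaced.show()
```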

withColumnRenamed Untyped Transformation

withColumnRenamed(existingName: String, newName: String): DataFrame

withColumnRenamed…​FIXME
