Standard Functions — functions Object
org.apache.spark.sql.functions object defines built-in standard functions to work with (values produced by) columns.
You can access the standard functions using the following import statement in your Scala application:
import org.apache.spark.sql.functions._
Name | Description
---|---
**Aggregate functions** |
first | Returns the first value in a group. Returns the first non-null value when the ignoreNulls flag is on
grouping | Indicates whether a given column is aggregated or not
grouping_id | Computes the level of grouping
**Collection functions** |
explode | Creates a new row for each element in the given array or map column. If the array/map is null or empty, null is produced
from_json | Parses a column with a JSON string into a StructType or ArrayType of StructType elements with the specified schema
reverse | Returns a reversed string or an array with reverse order of elements (New in 2.4.0)
size | Returns the size of the given array or map. Returns -1 if null
**Math functions** |
bin | Converts the value of a long column to binary format
**Regular functions (non-aggregate functions)** |
coalesce | Gives the first non-null value among the given columns (or null)
col, column | Creating Columns
monotonically_increasing_id | Returns monotonically increasing 64-bit integers that are guaranteed to be unique, but not consecutive
**String functions** |
split | Splits a string column using a regular expression pattern
upper | Converts a string column to upper case
**UDF functions** |
udf | Creating UDFs
callUDF | Executes a UDF by name with a variable-length list of columns
**Window functions** |
cume_dist | Computes the cumulative distribution of records across window partitions
dense_rank | Computes the rank of records per window partition
ntile | Computes the ntile group
percent_rank | Computes the rank of records per window partition
rank | Computes the rank of records per window partition
row_number | Computes the sequential numbering per window partition
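As a quick taste of the window functions listed above, here is a minimal sketch (assuming a spark-shell session with implicits in scope); the sales dataset, the byCategory window specification, and the column names are made up for illustration, and the row order of the output may differ across runs:
import org.apache.spark.sql.expressions.Window

// a made-up dataset, just for illustration
val sales = Seq(("books", 10), ("books", 30), ("toys", 20), ("toys", 20)).toDF("category", "amount")
// a window specification: per-category partitions ordered by amount, highest first
val byCategory = Window.partitionBy($"category").orderBy($"amount".desc)

scala> sales.withColumn("rank", rank() over byCategory).show
+--------+------+----+
|category|amount|rank|
+--------+------+----+
|   books|    30|   1|
|   books|    10|   2|
|    toys|    20|   1|
|    toys|    20|   1|
+--------+------+----+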
Tip
|
The page gives only a brief overview of the many functions available in the functions object, so you should read the official documentation of the functions object.
|
Executing UDF by Name and Variable-Length Column List — callUDF
Function
callUDF(udfName: String, cols: Column*): Column
callUDF executes a UDF by udfName with a variable-length list of columns.
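A minimal sketch (assuming a spark-shell session): register a UDF under a name first, then execute it by that name. The upperUDF name and the sample data are made up for illustration:
// register a UDF under a name...
spark.udf.register("upperUDF", (s: String) => s.toUpperCase)

val df = Seq((0, "hello"), (1, "world")).toDF("id", "text")

// ...and execute it by name with a variable-length list of columns
scala> df.withColumn("upper", callUDF("upperUDF", $"text")).show
+---+-----+-----+
| id| text|upper|
+---+-----+-----+
|  0|hello|HELLO|
|  1|world|WORLD|
+---+-----+-----+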
Defining UDFs — udf
Function
udf(f: FunctionN[...]): UserDefinedFunction
The udf family of functions lets you create user-defined functions (UDFs) from a Scala function. It accepts a function f of 0 to 10 arguments, with the input and output types automatically inferred (from the signature of the function f).
import org.apache.spark.sql.functions._
val _length: String => Int = _.length
val _lengthUDF = udf(_length)
// create a DataFrame with a single numeric column
val df = sc.parallelize(0 to 3).toDF("num")
// apply the UDF to the "num" column (the ints are cast to strings, hence length 1)
scala> df.withColumn("len", _lengthUDF($"num")).show
+---+---+
|num|len|
+---+---+
| 0| 1|
| 1| 1|
| 2| 1|
| 3| 1|
+---+---+
Since Spark 2.0.0, there is another variant of udf
function:
udf(f: AnyRef, dataType: DataType): UserDefinedFunction
udf(f: AnyRef, dataType: DataType) allows you to use a Scala closure as the function argument (f) and explicitly declare the output data type (dataType).
// given the dataframe above
import org.apache.spark.sql.types.IntegerType
val byTwo = udf((n: Int) => n * 2, IntegerType)
scala> df.withColumn("doubled", byTwo($"num")).show
+---+-------+
|num|doubled|
+---+-------+
|  0|      0|
|  1|      2|
|  2|      4|
|  3|      6|
+---+-------+
split
Function
split(str: Column, pattern: String): Column
split splits the str column around matches of the pattern regular expression. It returns a new Column.
Note
|
The split function uses the java.lang.String.split(String regex, int limit) method internally.
|
val df = Seq((0, "hello|world"), (1, "witaj|swiecie")).toDF("num", "input")
val withSplit = df.withColumn("split", split($"input", "[|]"))
scala> withSplit.show
+---+-------------+----------------+
|num| input| split|
+---+-------------+----------------+
| 0| hello|world| [hello, world]|
| 1|witaj|swiecie|[witaj, swiecie]|
+---+-------------+----------------+
Note
|
.$|()[{^?*+\ are RegEx’s meta characters and are considered special.
|
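If the delimiter is itself one of those meta characters, escape it rather than relying on a character class; a minimal sketch reusing the DataFrame above (Pattern.quote is the standard JDK helper for this):
import java.util.regex.Pattern

// Pattern.quote("|") and "\\|" are equivalent alternatives to the "[|]" character class
val withSplitQuoted = df.withColumn("split", split($"input", Pattern.quote("|")))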
upper
Function
upper(e: Column): Column
upper converts a string column to all upper case. It returns a new Column.
Note
|
The following example uses two functions that accept a Column and return another to showcase how to chain them.
|
val df = Seq((0,1,"hello"), (2,3,"world"), (2,4, "ala")).toDF("id", "val", "name")
val withUpperReversed = df.withColumn("upper", reverse(upper($"name")))
scala> withUpperReversed.show
+---+---+-----+-----+
| id|val| name|upper|
+---+---+-----+-----+
| 0| 1|hello|OLLEH|
| 2| 3|world|DLROW|
| 2| 4| ala| ALA|
+---+---+-----+-----+
Converting Long to Binary Format (in String Representation) — bin
Function
bin(e: Column): Column
bin(columnName: String): Column (1)
1. Calls the first bin with columnName as a Column
bin
converts the long value in a column to its binary format (i.e. as an unsigned integer in base 2) with no extra leading 0s.
scala> spark.range(5).withColumn("binary", bin('id)).show
+---+------+
| id|binary|
+---+------+
| 0| 0|
| 1| 1|
| 2| 10|
| 3| 11|
| 4| 100|
+---+------+
val withBin = spark.range(5).withColumn("binary", bin('id))
scala> withBin.printSchema
root
|-- id: long (nullable = false)
|-- binary: string (nullable = false)
Internally, bin creates a Column with a Bin unary expression.
scala> withBin.queryExecution.logical
res2: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [*, bin('id) AS binary#14]
+- Range (0, 5, step=1, splits=Some(8))
Note
|
Bin unary expression uses java.lang.Long.toBinaryString for the conversion.
|