// Use ExpressionEncoder for simplicity
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
val stringEncoder = ExpressionEncoder[String]
// toRow serializes the value to an InternalRow (Spark 2.x API)
val row = stringEncoder.toRow("hello world")
// The encoder produces an UnsafeRow, so narrow the type to reach UnsafeRow-only methods
val unsafeRow = row match { case ur: UnsafeRow => ur }
scala> unsafeRow.getBytes
res0: Array[Byte] = Array(0, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 16, 0, 0, 0, 104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 0, 0, 0, 0, 0)
scala> unsafeRow.getUTF8String(0)
res1: org.apache.spark.unsafe.types.UTF8String = hello world
UnsafeRow — Mutable Raw-Memory Unsafe Binary Row Format
UnsafeRow is a concrete InternalRow that represents a mutable internal raw-memory (and hence unsafe) binary row format.
In other words, UnsafeRow is an InternalRow that is backed by raw memory instead of Java objects.
UnsafeRow knows its size in bytes.
scala> println(unsafeRow.getSizeInBytes)
32
UnsafeRow supports Java’s Externalizable and Kryo’s KryoSerializable serialization/deserialization protocols.
The fields of a data row are placed using field offsets.
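UnsafeRow considers a data type mutable if it is one of the following mutable field types (as of Spark 2.x): NullType, BooleanType, ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DateType, TimestampType.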
UnsafeRow is composed of three regions:
- Null Bit Set Bitmap Region (1 bit/field) for tracking null values
- Fixed-Length 8-Byte Values Region
- Variable-Length Data Section
This layout keeps every row 8-byte word-aligned, so the size of a row is always a multiple of 8 bytes.
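The size arithmetic behind the three regions can be sketched in plain Scala, without Spark on the classpath. This is only an illustration: the object and helper names (UnsafeRowLayoutSketch, roundUpToWord, rowSizeInBytes, offsetAndSize) are hypothetical, and the decoding step assumes a little-endian platform, as the getBytes output of the "hello world" row suggests.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical helpers that model the UnsafeRow layout; not part of Spark.
object UnsafeRowLayoutSketch {
  // Round a byte count up to the next multiple of 8 (one 64-bit word).
  def roundUpToWord(numBytes: Int): Int = ((numBytes + 7) / 8) * 8

  // Null bit set: 1 bit per field, stored in whole 8-byte words.
  def bitSetWidthInBytes(numFields: Int): Int = ((numFields + 63) / 64) * 8

  // Null bit set + one 8-byte slot per field + word-padded variable-length values.
  def rowSizeInBytes(numFields: Int, variableLengths: Seq[Int]): Int =
    bitSetWidthInBytes(numFields) + 8 * numFields + variableLengths.map(roundUpToWord).sum

  // For a variable-length field, the 8-byte slot packs the data offset
  // (relative to the row start) in the upper 32 bits and the size in the lower 32 bits.
  def offsetAndSize(slot: Long): (Int, Int) = ((slot >>> 32).toInt, slot.toInt)
}

import UnsafeRowLayoutSketch._

// One string field holding "hello world" (11 bytes of UTF-8).
val helloSize = rowSizeInBytes(numFields = 1, Seq("hello world".getBytes("UTF-8").length))

// Bytes 8 to 15 of the getBytes output, read as a little-endian long.
val slot = ByteBuffer.wrap(Array[Byte](11, 0, 0, 0, 16, 0, 0, 0))
  .order(ByteOrder.LITTLE_ENDIAN)
  .getLong
val (offset, size) = offsetAndSize(slot)

println(s"size=$helloSize offset=$offset length=$size")
// prints: size=32 offset=16 length=11
```

The result agrees with getSizeInBytes for the example row: 8 bytes of null bit set, one 8-byte slot, and 11 bytes of string data padded to 16.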
Equality comparison and hashing of rows can be performed on the raw bytes: if two rows are identical, so are their bit-wise representations. No type-specific interpretation is required.
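As a rough illustration of byte-level equality and hashing, the sketch below works on plain byte arrays rather than on UnsafeRow itself. The helpers rowsEqual and rowHash are hypothetical, and scala.util.hashing.MurmurHash3 is used as a stand-in: Spark relies on its own internal Murmur3 implementation, so the hash values here are not expected to match UnsafeRow.hashCode.

```scala
import java.util.Arrays
import scala.util.hashing.MurmurHash3

// The 32 bytes of the "hello world" row: null bit set, offset-and-size slot,
// then the string data padded to a multiple of 8 bytes.
val rowBytes: Array[Byte] =
  Array[Byte](0, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 16, 0, 0, 0) ++
    "hello world".getBytes("UTF-8") ++
    Array.fill[Byte](5)(0)

val sameRowBytes = rowBytes.clone()                  // identical content, different array
val otherRowBytes = rowBytes.updated(16, 'H'.toByte) // "Hello world" instead

// Hypothetical helpers: compare and hash rows without interpreting any field.
def rowsEqual(a: Array[Byte], b: Array[Byte]): Boolean = Arrays.equals(a, b)
def rowHash(a: Array[Byte]): Int = MurmurHash3.bytesHash(a, 42)

println(rowsEqual(rowBytes, sameRowBytes))           // true
println(rowHash(rowBytes) == rowHash(sameRowBytes))  // true
println(rowsEqual(rowBytes, otherRowBytes))          // false
```

Neither helper needs to know that the row holds a string; that is the point of the byte-level comparison.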
isMutable Static Predicate
static boolean isMutable(DataType dt)
isMutable returns true when the input DataType is one of the mutable field types or a DecimalType.
Otherwise, isMutable returns false.
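A quick way to try the predicate is to call it with a few data types. The sketch assumes the Spark Catalyst classes are on the classpath (for example, inside spark-shell); the value names are made up for illustration.

```scala
// Assumes spark-catalyst is available, e.g. when run in spark-shell.
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types.{DecimalType, IntegerType, StringType}

// A fixed-length type lives in its 8-byte slot and can be updated in place.
val intIsMutable = UnsafeRow.isMutable(IntegerType)

// DecimalType is accepted explicitly, regardless of precision.
val decimalIsMutable = UnsafeRow.isMutable(DecimalType(38, 18))

// A string lives in the variable-length section and is not considered mutable.
val stringIsMutable = UnsafeRow.isMutable(StringType)

println(s"int=$intIsMutable decimal=$decimalIsMutable string=$stringIsMutable")
// prints: int=true decimal=true string=false
```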
Kryo’s KryoSerializable SerDe Protocol
Tip: Read up on KryoSerializable.
Java’s Externalizable SerDe Protocol
Tip: Read up on java.io.Externalizable.
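Because UnsafeRow implements Externalizable, it should survive a round trip through standard Java object streams. The following is a minimal sketch, assuming the Spark Catalyst classes are on the classpath and a little-endian platform; the row is rebuilt from the 32 bytes of the "hello world" example so that the snippet does not depend on a particular encoder API.

```scala
// Assumes spark-catalyst is available, e.g. when run in spark-shell.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// The 32 bytes of the "hello world" row (little-endian layout assumed).
val bytes: Array[Byte] =
  Array[Byte](0, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 16, 0, 0, 0) ++
    "hello world".getBytes("UTF-8") ++
    Array.fill[Byte](5)(0)

// A single-field row that points at the byte array above.
val original = new UnsafeRow(1)
original.pointTo(bytes, bytes.length)

// Serialize with Java serialization (goes through writeExternal).
val buffer = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buffer)
out.writeObject(original)
out.close()

// Deserialize (goes through readExternal).
val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
val restored = in.readObject().asInstanceOf[UnsafeRow]
in.close()

println(restored.getUTF8String(0)) // hello world
println(restored == original)      // true
```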