// Use ExpressionEncoder for simplicity
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
val stringEncoder = ExpressionEncoder[String]
val row = stringEncoder.toRow("hello world")
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
val unsafeRow = row match { case ur: UnsafeRow => ur }
scala> unsafeRow.getBytes
res0: Array[Byte] = Array(0, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 16, 0, 0, 0, 104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 0, 0, 0, 0, 0)
scala> unsafeRow.getUTF8String(0)
res1: org.apache.spark.unsafe.types.UTF8String = hello world
UnsafeRow — Mutable Raw-Memory Unsafe Binary Row Format
UnsafeRow
is a concrete InternalRow that represents a mutable internal raw-memory (and hence unsafe) binary row format.
In other words, UnsafeRow
is an InternalRow
that is backed by raw memory instead of Java objects.
UnsafeRow
knows its size in bytes.
scala> println(unsafeRow.getSizeInBytes)
32
UnsafeRow
supports Java’s Externalizable and Kryo’s KryoSerializable serialization/deserialization protocols.
The fields of a data row are placed using field offsets.
UnsafeRow
considers a data type mutable if it is one of the following:
UnsafeRow
is composed of three regions:
-
Null Bit Set Bitmap Region (1 bit/field) for tracking null values
-
Fixed-Length 8-Byte Values Region
-
Variable-Length Data Section
That gives the property of rows being always 8-byte word aligned and so their size is always a multiple of 8 bytes.
Equality comparision and hashing of rows can be performed on raw bytes since if two rows are identical so should be their bit-wise representation. No type-specific interpretation is required.
isMutable
Static Predicate
static boolean isMutable(DataType dt)
isMutable
is enabled (true
) when the input DataType is among the mutable field types or a DecimalType.
Otherwise, isMutable
is disabled (false
).
Note
|
|
Kryo’s KryoSerializable SerDe Protocol
Tip
|
Read up on KryoSerializable. |
Java’s Externalizable SerDe Protocol
Tip
|
Read up on java.io.Externalizable. |