UnsafeRow — Mutable Raw-Memory Unsafe Binary Row Format

UnsafeRow is a concrete InternalRow that represents a mutable internal raw-memory (and hence unsafe) binary row format.

In other words, UnsafeRow is an InternalRow that is backed by raw memory instead of Java objects.

// Use ExpressionEncoder for simplicity
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
val stringEncoder = ExpressionEncoder[String]
// toRow is the Spark 2.x encoder API; it returns an InternalRow
val row = stringEncoder.toRow("hello world")

import org.apache.spark.sql.catalyst.expressions.UnsafeRow
// Narrow the InternalRow to UnsafeRow to reach the raw-memory methods
val unsafeRow = row match { case ur: UnsafeRow => ur }

scala> unsafeRow.getBytes
res0: Array[Byte] = Array(0, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 16, 0, 0, 0, 104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 0, 0, 0, 0, 0)

scala> unsafeRow.getUTF8String(0)
res1: org.apache.spark.unsafe.types.UTF8String = hello world

UnsafeRow knows its size in bytes.

scala> println(unsafeRow.getSizeInBytes)
32

UnsafeRow supports Java’s Externalizable and Kryo’s KryoSerializable serialization/deserialization protocols.

The fields of a data row are placed using field offsets.
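
As a rough illustration of field offsets, the following sketch decodes the 32-byte array shown earlier by hand, without Spark. It assumes a little-endian byte order (the example output suggests the row was produced on a little-endian machine) and that the 8-byte slot of a variable-length field packs the data offset in the upper 32 bits and the length in the lower 32 bits. The `RowDecoder` object and its method names are hypothetical and not part of Spark.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of Spark: decodes a row's byte array by hand.
object RowDecoder {
  // The bytes printed by unsafeRow.getBytes in the example above
  val exampleBytes: Array[Byte] = Array[Int](
    0, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 16, 0, 0, 0,
    104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 0, 0, 0, 0, 0).map(_.toByte)

  // Width of the null bitmap: one 8-byte word per 64 fields.
  def bitSetWidthInBytes(numFields: Int): Int = ((numFields + 63) / 64) * 8

  // Offset of the 8-byte slot of field `ordinal`, relative to the start of the row.
  def fieldOffset(numFields: Int, ordinal: Int): Int =
    bitSetWidthInBytes(numFields) + ordinal * 8

  // Reads a variable-length UTF-8 field (assumes a little-endian layout).
  def readString(bytes: Array[Byte], numFields: Int, ordinal: Int): String = {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val offsetAndSize = buf.getLong(fieldOffset(numFields, ordinal))
    val offset = (offsetAndSize >>> 32).toInt // upper 32 bits: offset from the row start
    val size = offsetAndSize.toInt            // lower 32 bits: length in bytes
    new String(bytes, offset, size, StandardCharsets.UTF_8)
  }
}
```

Under these assumptions, `RowDecoder.readString(RowDecoder.exampleBytes, numFields = 1, ordinal = 0)` gives back `hello world`: the slot at offset 8 holds length 11 and data offset 16.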

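UnsafeRow considers a data type mutable if it is one of the following:

NullType
BooleanType
ByteType
ShortType
IntegerType
LongType
FloatType
DoubleType
DateType
TimestampType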

UnsafeRow is composed of three regions:

Null Bit Set Bitmap Region (1 bit/field) for tracking null values
Fixed-Length 8-Byte Values Region
Variable-Length Data Section

As a result, rows are always 8-byte word aligned, and their size is always a multiple of 8 bytes.
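
The arithmetic behind the three regions can be sketched as follows. This is a minimal sketch, not Spark code: the `RowLayout` object is hypothetical, and it assumes that the null bitmap occupies whole 8-byte words and that each variable-length value is padded to an 8-byte boundary on its own.

```scala
// Hypothetical helper, not part of Spark: estimates a row's size from its layout.
object RowLayout {
  // Rounds a byte count up to the nearest multiple of 8.
  def roundUpToWord(numBytes: Int): Int = (numBytes + 7) & ~7

  // Null bitmap: 1 bit per field, stored in whole 8-byte words.
  def bitSetWidthInBytes(numFields: Int): Int = ((numFields + 63) / 64) * 8

  // Total size: bitmap + one 8-byte slot per field + padded variable-length payloads.
  def sizeInBytes(numFields: Int, variableLengths: Seq[Int]): Int =
    bitSetWidthInBytes(numFields) + 8 * numFields + variableLengths.map(roundUpToWord).sum
}
```

For the single-field "hello world" row above, `RowLayout.sizeInBytes(1, Seq(11))` is 8 + 8 + 16 = 32, which matches the value reported by `getSizeInBytes`.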

Equality comparison and hashing of rows can be performed on raw bytes: two identical rows have identical bit-wise representations, so no type-specific interpretation is required.
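
A minimal sketch of byte-wise equality and hashing, assuming a row is simply a wrapper around a byte array. The `RawRow` class is hypothetical, and `java.util.Arrays.equals` and Scala's `MurmurHash3.bytesHash` are stand-ins chosen for illustration, not necessarily the routines UnsafeRow itself uses.

```scala
import java.util.Arrays
import scala.util.hashing.MurmurHash3

// Hypothetical stand-in for a row backed by a byte array; not Spark's UnsafeRow.
final class RawRow(val bytes: Array[Byte]) {
  // Rows are equal when their backing bytes are equal, whatever the field types are.
  override def equals(other: Any): Boolean = other match {
    case that: RawRow => Arrays.equals(bytes, that.bytes)
    case _            => false
  }

  // Hash the raw bytes directly, with no per-field interpretation.
  override def hashCode: Int = MurmurHash3.bytesHash(bytes)
}
```

Two `RawRow` instances built from equal byte arrays compare equal and share a hash code, so they can be used as hash map keys without decoding any field.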

`isMutable` Static Predicate

static boolean isMutable(DataType dt)

isMutable is enabled (true) when the input DataType is among the mutable field types or a DecimalType.

Otherwise, isMutable is disabled (false).

Note	`isMutable` is used when:

`UnsafeFixedWidthAggregationMap` is requested to supportsAggregationBufferSchema
`SortBasedAggregationIterator` is requested for newBuffer

Kryo’s KryoSerializable SerDe Protocol

Tip	Read up on KryoSerializable.

Serializing JVM Object — KryoSerializable’s `write` Method

void write(Kryo kryo, Output out)

Deserializing Kryo-Managed Object — KryoSerializable’s `read` Method

void read(Kryo kryo, Input in)

Java’s Externalizable SerDe Protocol

Tip	Read up on java.io.Externalizable.

Serializing JVM Object — Externalizable’s `writeExternal` Method

void writeExternal(ObjectOutput out)
throws IOException

Deserializing Java-Externalized Object — Externalizable’s `readExternal` Method

void readExternal(ObjectInput in)
throws IOException, ClassNotFoundException
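
The Externalizable protocol can be sketched for a byte-array-backed row as follows. This is only an illustration under stated assumptions: `ExternalizableRow` and `RowSerDe` are hypothetical names, and the wire format chosen here (size, number of fields, then the raw bytes) is an assumption rather than a statement of what UnsafeRow writes.

```scala
import java.io._

// Hypothetical byte-array-backed row; a sketch of the protocol, not Spark's UnsafeRow.
class ExternalizableRow(var numFields: Int, var bytes: Array[Byte]) extends Externalizable {
  // Externalizable requires a public no-arg constructor for deserialization.
  def this() = this(0, Array.emptyByteArray)

  override def writeExternal(out: ObjectOutput): Unit = {
    out.writeInt(bytes.length) // size first, so the reader can allocate the array
    out.writeInt(numFields)
    out.write(bytes)
  }

  override def readExternal(in: ObjectInput): Unit = {
    val size = in.readInt()
    numFields = in.readInt()
    bytes = new Array[Byte](size)
    in.readFully(bytes)
  }
}

object RowSerDe {
  // Serializes a row with Java serialization and reads it back.
  def roundTrip(row: ExternalizableRow): ExternalizableRow = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(row)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[ExternalizableRow] finally in.close()
  }
}
```

Writing the size before the payload lets `readExternal` allocate the backing array once and fill it with a single `readFully` call.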

`pointTo` Method

void pointTo(Object baseObject, long baseOffset, int sizeInBytes)

pointTo…FIXME

Note	`pointTo` is used when…FIXME
