OffsetSeqLog — HDFSMetadataLog for Write-Ahead Log (WAL)

OffsetSeqLog is an HDFSMetadataLog with OffsetSeq metadata.

OffsetSeqLog is created exclusively for the write-ahead log (WAL) of offsets of StreamExecution.

OffsetSeqLog uses OffsetSeq as its metadata. An OffsetSeq holds an ordered collection of zero or more offsets and optional metadata (as OffsetSeqMetadata, which keeps track of the event-time watermark as set up by a Spark developer and as found in the records).

OffsetSeqLog (like the parent HDFSMetadataLog) takes the following to be created:

  • SparkSession

  • Path of the metadata log directory

Serializing Metadata — serialize Method

serialize(
  offsetSeq: OffsetSeq,
  out: OutputStream): Unit
Note
serialize is part of the HDFSMetadataLog Contract to write metadata in a serialized format.

serialize first writes out the version prefixed with v on a single line (e.g. v1), followed by the optional metadata in JSON format.

Note
In Spark 2.2, the version is 1 and the charset is UTF-8.

serialize then writes out the offsets in JSON format, one per line.

Note
A streaming source with no offset to write in offsetSeq is marked with - (a dash) in the log.
$ ls -tr [checkpoint-directory]/offsets
0 1 2 3 4 5 6

$ cat [checkpoint-directory]/offsets/6
v1
{"batchWatermarkMs":0,"batchTimestampMs":1502872590006,"conf":{"spark.sql.shuffle.partitions":"200","spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider"}}
51
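
The layout described above (version line, metadata line, then one offset per line) can be sketched without any Spark dependency. This is a minimal illustration, not Spark's actual code: OffsetSeqSketch and OffsetSeqLogSketch are hypothetical names, offsets and metadata are assumed to be already rendered as JSON strings, and writing an empty line when the metadata is absent is an assumption of this sketch.

```scala
import java.io.{ByteArrayOutputStream, OutputStream}
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical stand-in for OffsetSeq: offsets and metadata are kept as
// pre-rendered JSON strings so the sketch needs no Spark classes.
final case class OffsetSeqSketch(
    offsets: Seq[Option[String]],
    metadata: Option[String] = None)

object OffsetSeqLogSketch {
  val Version = 1      // the version noted above for Spark 2.2
  val VoidOffset = "-" // marker for a source with no offset to write

  def serialize(offsetSeq: OffsetSeqSketch, out: OutputStream): Unit = {
    val lines =
      Seq(
        s"v$Version",                      // line 1: version prefixed with v
        offsetSeq.metadata.getOrElse("")   // line 2: metadata JSON (assumed empty if absent)
      ) ++ offsetSeq.offsets.map(_.getOrElse(VoidOffset)) // one offset per line
    out.write(lines.mkString("\n").getBytes(UTF_8))
  }
}

object SerializeDemo {
  def main(args: Array[String]): Unit = {
    val out = new ByteArrayOutputStream()
    // Two sources: the first at offset 51, the second with nothing to write.
    val offsetSeq = OffsetSeqSketch(
      offsets = Seq(Some("51"), None),
      metadata = Some("""{"batchWatermarkMs":0}"""))
    OffsetSeqLogSketch.serialize(offsetSeq, out)
    println(new String(out.toByteArray, UTF_8))
  }
}
```

Running SerializeDemo prints four lines in the same shape as the offsets/6 file shown above, with a trailing - for the second (hypothetical) source.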

Deserializing Metadata — deserialize Method

deserialize(in: InputStream): OffsetSeq
Note
deserialize is part of the HDFSMetadataLog Contract to deserialize metadata (from an InputStream).

deserialize…​FIXME
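
As a rough sketch only, and assuming deserialize simply reads back the layout that serialize writes (version line, metadata line, then one offset per line with - for a missing offset), a reader could look like the following. ParsedOffsetSeq and OffsetSeqLogReadSketch are hypothetical names, and the values are kept as raw JSON strings rather than Spark types.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets.UTF_8
import scala.io.Source
import scala.util.Try

// Hypothetical stand-in for OffsetSeq (raw JSON strings instead of Spark types).
final case class ParsedOffsetSeq(
    offsets: Seq[Option[String]],
    metadata: Option[String])

object OffsetSeqLogReadSketch {
  val SupportedVersion = 1 // assumed: only v1 files are accepted

  def deserialize(in: InputStream): ParsedOffsetSeq = {
    val lines = Source.fromInputStream(in, UTF_8.name()).getLines()
    if (!lines.hasNext) throw new IllegalStateException("Incomplete log file")

    // Line 1: the version prefixed with v (e.g. v1).
    val versionLine = lines.next().trim
    val version = Try(versionLine.stripPrefix("v").toInt).toOption
    if (!versionLine.startsWith("v") || !version.contains(SupportedVersion))
      throw new IllegalStateException(s"Unsupported log version: $versionLine")

    // Line 2: the optional metadata; an empty or missing line means none.
    val metadata =
      if (lines.hasNext) Some(lines.next().trim).filter(_.nonEmpty) else None

    // Remaining lines: one offset each, with - standing for "no offset".
    val offsets = lines.map {
      case "-"  => None
      case json => Some(json)
    }.toList

    ParsedOffsetSeq(offsets, metadata)
  }
}

object DeserializeDemo {
  def main(args: Array[String]): Unit = {
    val file = "v1\n{\"batchWatermarkMs\":0}\n51\n-"
    val parsed = OffsetSeqLogReadSketch.deserialize(
      new ByteArrayInputStream(file.getBytes(UTF_8)))
    println(parsed)
  }
}
```

The sketch fails fast on an empty file or an unexpected version line, which is one possible design choice, not a statement about what Spark does.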
