HDFSMetadataLog — Hadoop DFS-based Metadata Storage

HDFSMetadataLog is a concrete metadata storage (of type T) that uses Hadoop DFS for fault-tolerance and reliability.

HDFSMetadataLog uses the given path as the metadata directory with metadata logs. The path is immediately converted to a Hadoop Path for file management.

HDFSMetadataLog uses Json4s with the Jackson binding for metadata serialization and deserialization (to and from JSON format).

HDFSMetadataLog is further customized by the extensions.

Table 1. HDFSMetadataLogs (Direct Extensions Only)
HDFSMetadataLog Description

Anonymous

HDFSMetadataLog of KafkaSourceOffsets for KafkaSource

Anonymous

HDFSMetadataLog of LongOffsets for RateStreamMicroBatchReader

CommitLog

Offset commit log of streaming query execution engines

CompactibleFileStreamLog

Compactible metadata logs (that compact logs at regular interval)

KafkaSourceInitialOffsetWriter

HDFSMetadataLog of KafkaSourceOffsets for KafkaSource

OffsetSeqLog

Write-Ahead Log (WAL) of stream execution engines

Creating HDFSMetadataLog Instance

HDFSMetadataLog takes the following to be created:

  • SparkSession

  • Path of the metadata log directory

While being created HDFSMetadataLog creates the path unless exists already.

Serializing Metadata (Writing Metadata in Serialized Format) — serialize Method

serialize(
  metadata: T,
  out: OutputStream): Unit

serialize simply writes the log data (serialized using Json4s (with Jackson binding) library).

Note
serialize is used exclusively when HDFSMetadataLog is requested to write metadata of a streaming batch to a file (metadata log) (when storing metadata of a streaming batch).

Deserializing Metadata (Reading Metadata from Serialized Format) — deserialize Method

deserialize(in: InputStream): T

deserialize deserializes a metadata (of type T) from a given InputStream.

Note
deserialize is used exclusively when HDFSMetadataLog is requested to retrieve metadata of a batch.

Retrieving Metadata Of Streaming Batch — get Method

get(batchId: Long): Option[T]
Note
get is part of the MetadataLog Contract to get metadata of a batch.

get…​FIXME

Retrieving Metadata of Range of Batches — get Method

get(
  startId: Option[Long],
  endId: Option[Long]): Array[(Long, T)]
Note
get is part of the MetadataLog Contract to get metadata of range of batches.

get…​FIXME

Persisting Metadata of Streaming Micro-Batch — add Method

add(
  batchId: Long,
  metadata: T): Boolean
Note
add is part of the MetadataLog Contract to persist metadata of a streaming batch.

add return true when the metadata of the streaming batch was not available and persisted successfully. Otherwise, add returns false.

Internally, add looks up metadata of the given streaming batch (batchId) and returns false when found.

Otherwise, when not found, add creates a metadata log file for the given batchId and writes metadata to the file. add returns true if successful.

Latest Committed Batch Id with Metadata (When Available) — getLatest Method

getLatest(): Option[(Long, T)]
Note
getLatest is a part of MetadataLog Contract to retrieve the recently-committed batch id and the corresponding metadata if available in the metadata storage.

getLatest requests the internal FileManager for the files in metadata directory that match batch file filter.

getLatest takes the batch ids (the batch files correspond to) and sorts the ids in reverse order.

getLatest gives the first batch id with the metadata which could be found in the metadata storage.

Note
It is possible that the batch id could be in the metadata storage, but not available for retrieval.

Removing Expired Metadata (Purging) — purge Method

purge(thresholdBatchId: Long): Unit
Note
purge is part of the MetadataLog Contract to…​FIXME.

purge…​FIXME

Creating Batch Metadata File — batchIdToPath Method

batchIdToPath(batchId: Long): Path

batchIdToPath simply creates a Hadoop Path for the file called by the specified batchId under the metadata directory.

Note

batchIdToPath is used when:

isBatchFile Method

isBatchFile(path: Path): Boolean

isBatchFile…​FIXME

Note
isBatchFile is used exclusively when HDFSMetadataLog is requested for the PathFilter of batch files.

pathToBatchId Method

pathToBatchId(path: Path): Long

pathToBatchId…​FIXME

Note

pathToBatchId is used when:

verifyBatchIds Object Method

verifyBatchIds(
  batchIds: Seq[Long],
  startId: Option[Long],
  endId: Option[Long]): Unit

verifyBatchIds…​FIXME

Note

verifyBatchIds is used when:

  • FileStreamSourceLog is requested to get

  • HDFSMetadataLog is requested to get

Retrieving Version (From Text Line) — parseVersion Internal Method

parseVersion(
  text: String,
  maxSupportedVersion: Int): Int

parseVersion…​FIXME

Note

parseVersion is used when:

purgeAfter Method

purgeAfter(thresholdBatchId: Long): Unit

purgeAfter…​FIXME

Note
purgeAfter seems to be used exclusively in tests.

Writing Batch Metadata to File (Metadata Log) — writeBatchToFile Internal Method

writeBatchToFile(
  metadata: T,
  path: Path): Unit

writeBatchToFile requests the CheckpointFileManager to createAtomic (for the specified path and the overwriteIfPossible flag disabled).

writeBatchToFile then serializes the metadata (to the CancellableFSDataOutputStream output stream) and closes the stream.

In case of an exception, writeBatchToFile simply requests the CancellableFSDataOutputStream output stream to cancel (so that the output file is not generated) and re-throws the exception.

Note
writeBatchToFile is used exclusively when HDFSMetadataLog is requested to store (persist) metadata of a streaming batch.

Retrieving Ordered Batch Metadata Files — getOrderedBatchFiles Method

getOrderedBatchFiles(): Array[FileStatus]

getOrderedBatchFiles…​FIXME

Note
getOrderedBatchFiles does not seem to be used at all.

Internal Properties

Name Description

batchFilesFilter

Hadoop’s PathFilter of batch files (with names being long numbers)

Used when:

fileManager

Used when…​FIXME

results matching ""

    No results matching ""