CompactibleFileStreamLog Contract — Compactible Metadata Logs

CompactibleFileStreamLog is the extension of the HDFSMetadataLog contract for compactible metadata logs that compact their logs (using compactLogs) every compact interval.

CompactibleFileStreamLog uses the spark.sql.streaming.minBatchesToRetain configuration property (default: 100) for deleteExpiredLog.

CompactibleFileStreamLog uses the .compact suffix for batchIdToPath, getBatchIdFromFileName, and the compactInterval.

Table 1. CompactibleFileStreamLog Contract (Abstract Methods Only)
Method Description

compactLogs

compactLogs(logs: Seq[T]): Seq[T]

Used when CompactibleFileStreamLog is requested to compact and allFiles

defaultCompactInterval

defaultCompactInterval: Int

Used exclusively when CompactibleFileStreamLog is requested for the compactInterval

fileCleanupDelayMs

fileCleanupDelayMs: Long

Used exclusively when CompactibleFileStreamLog is requested to deleteExpiredLog

isDeletingExpiredLog

isDeletingExpiredLog: Boolean

Used exclusively when CompactibleFileStreamLog is requested to store (add) metadata of a streaming batch
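The four abstract members in Table 1 can be pictured with a minimal sketch. CompactibleLogSketch and DistinctStringLog are hypothetical names used for illustration only, and the values (10 batches, 10 minutes) are arbitrary. The real class also extends HDFSMetadataLog and takes constructor arguments, which this sketch leaves out.

```scala
// Hypothetical, simplified stand-in for the abstract members listed in Table 1.
abstract class CompactibleLogSketch[T] {
  def compactLogs(logs: Seq[T]): Seq[T]
  def defaultCompactInterval: Int
  def fileCleanupDelayMs: Long
  def isDeletingExpiredLog: Boolean
}

// Illustrative implementation: compaction keeps only distinct entries.
// All values below are arbitrary and chosen for the example.
object DistinctStringLog extends CompactibleLogSketch[String] {
  override def compactLogs(logs: Seq[String]): Seq[String] = logs.distinct
  override val defaultCompactInterval: Int = 10
  override val fileCleanupDelayMs: Long = 10L * 60 * 1000 // 10 minutes
  override val isDeletingExpiredLog: Boolean = true
}
```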

Table 2. CompactibleFileStreamLogs
CompactibleFileStreamLog Description

FileStreamSinkLog

FileStreamSourceLog

CompactibleFileStreamLog (of FileEntry metadata) of FileStreamSource

Creating CompactibleFileStreamLog Instance

CompactibleFileStreamLog takes the following to be created:

  • Metadata version

  • SparkSession

  • Path of the metadata log directory

Note
CompactibleFileStreamLog is a Scala abstract class and cannot be created directly. It is created indirectly for the concrete CompactibleFileStreamLogs.

batchIdToPath Method

batchIdToPath(batchId: Long): Path
Note
batchIdToPath is part of the HDFSMetadataLog Contract to…​FIXME.

batchIdToPath…​FIXME

pathToBatchId Method

pathToBatchId(path: Path): Long
Note
pathToBatchId is part of the HDFSMetadataLog Contract to…​FIXME.

pathToBatchId…​FIXME

isBatchFile Method

isBatchFile(path: Path): Boolean
Note
isBatchFile is part of the HDFSMetadataLog Contract to…​FIXME.

isBatchFile…​FIXME

Serializing Metadata (Writing Metadata in Serialized Format) — serialize Method

serialize(
  logData: Array[T],
  out: OutputStream): Unit
Note
serialize is part of the HDFSMetadataLog Contract to serialize metadata (write metadata in serialized format).

serialize first writes the version header (v followed by the metadataLogVersion) to the given output stream (in UTF-8).

serialize then writes the log data, serialized using the Json4s library (with the Jackson binding). Entries are separated by new lines.
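The resulting file layout can be sketched as follows. This is not the Spark implementation: SerializeSketch is a hypothetical object, the toJson parameter stands in for the Json4s serialization step, and the sketch assumes the version header sits on a line of its own before the entries.

```scala
import java.io.{ByteArrayOutputStream, OutputStream}
import java.nio.charset.StandardCharsets.UTF_8

object SerializeSketch {
  // toJson is a hypothetical stand-in for the Json4s serialization of one entry.
  def serialize[T](
      metadataLogVersion: Int,
      logData: Array[T],
      out: OutputStream)(toJson: T => String): Unit = {
    // Version header first, e.g. "v1"
    out.write(("v" + metadataLogVersion).getBytes(UTF_8))
    // Then every entry on a line of its own
    logData.foreach { entry =>
      out.write('\n')
      out.write(toJson(entry).getBytes(UTF_8))
    }
  }

  // Convenience helper (for the example only) that returns the written bytes as text.
  def serializeToString[T](
      metadataLogVersion: Int,
      logData: Array[T])(toJson: T => String): String = {
    val out = new ByteArrayOutputStream()
    serialize(metadataLogVersion, logData, out)(toJson)
    new String(out.toByteArray, UTF_8)
  }
}
```

With this layout a reader can split the content on new lines, check the first line against the expected version, and parse the remaining lines one entry at a time.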

Deserializing Metadata — deserialize Method

deserialize(in: InputStream): Array[T]
Note
deserialize is part of the HDFSMetadataLog Contract to…​FIXME.

deserialize…​FIXME

Storing Metadata Of Streaming Batch — add Method

add(
  batchId: Long,
  logs: Array[T]): Boolean
Note
add is part of the HDFSMetadataLog Contract to store metadata for a batch.

add…​FIXME

allFiles Method

allFiles(): Array[T]

allFiles…​FIXME

Note

allFiles is used when:

compact Internal Method

compact(
  batchId: Long,
  logs: Array[T]): Boolean

compact first finds the valid batches before the compaction batch (using getValidBatchesBeforeCompactionBatch with the given batch ID and the compact interval).

compact…​FIXME

In the end, compact compacts the logs (using compactLogs) and requests the parent HDFSMetadataLog to persist metadata of a streaming batch (to a metadata log file).

Note
compact is used exclusively when CompactibleFileStreamLog is requested to persist metadata of a streaming batch.

getValidBatchesBeforeCompactionBatch Object Method

getValidBatchesBeforeCompactionBatch(
  compactionBatchId: Long,
  compactInterval: Int): Seq[Long]

getValidBatchesBeforeCompactionBatch…​FIXME

Note
getValidBatchesBeforeCompactionBatch is used exclusively when CompactibleFileStreamLog is requested to compact.

isCompactionBatch Object Method

isCompactionBatch(batchId: Long, compactInterval: Int): Boolean

isCompactionBatch…​FIXME

Note

isCompactionBatch is used when:

getBatchIdFromFileName Object Method

getBatchIdFromFileName(fileName: String): Long

getBatchIdFromFileName simply removes the .compact suffix from the given fileName and converts the remaining part to a number.
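A minimal sketch of that behaviour, assuming the suffix is exactly .compact as stated earlier. BatchFileNames and CompactFileSuffix are illustrative names, not necessarily the ones used in Spark.

```scala
object BatchFileNames {
  // The suffix that marks a compacted metadata log file.
  val CompactFileSuffix = ".compact"

  // Drops the suffix when present and parses the rest as a batch ID.
  // Throws NumberFormatException if the remaining part is not a number.
  def getBatchIdFromFileName(fileName: String): Long =
    fileName.stripSuffix(CompactFileSuffix).toLong
}
```

Both a regular batch file name (such as 10) and a compacted one (such as 9.compact) map to a plain batch ID this way.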

Note
getBatchIdFromFileName is used when CompactibleFileStreamLog is requested to pathToBatchId, isBatchFile, and deleteExpiredLog.

deleteExpiredLog Internal Method

deleteExpiredLog(
  currentBatchId: Long): Unit

deleteExpiredLog does nothing and simply returns when the incremented current batch ID (currentBatchId + 1) is less than the sum of the compact interval and minBatchesToRetain.
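The early-return condition can be written down as a small predicate. ExpiredLogGuard and skipsCleanup are hypothetical names for this sketch. It only models the guard described above, not the cleanup that follows it.

```scala
object ExpiredLogGuard {
  // True when deleteExpiredLog would return early without deleting anything.
  def skipsCleanup(
      currentBatchId: Long,
      compactInterval: Int,
      minBatchesToRetain: Int): Boolean =
    currentBatchId + 1 < compactInterval.toLong + minBatchesToRetain
}
```

For example, with a compact interval of 10 and minBatchesToRetain of 100, batch 108 is still skipped (109 is less than 110), while batch 109 is the first one that gets past the guard.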

deleteExpiredLog…​FIXME

Note
deleteExpiredLog is used exclusively when CompactibleFileStreamLog is requested to store metadata of a streaming batch.

Internal Properties

Name Description

compactInterval

Compact interval
