FileStreamSink — Streaming Sink for Parquet Format

FileStreamSink is the streaming sink that writes out the results of a streaming query to parquet files.

import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
// in: a streaming Dataset/DataFrame, e.g. created with spark.readStream
val out = in.
  writeStream.
  format("parquet").
  option("path", "parquet-output-dir").
  option("checkpointLocation", "checkpoint-dir").
  trigger(Trigger.ProcessingTime(10.seconds)).
  outputMode(OutputMode.Append).
  start

FileStreamSink is created exclusively when DataSource is requested to create a streaming sink.

FileStreamSink supports Append output mode only.

FileStreamSink uses the spark.sql.streaming.fileSink.log.deletion configuration property (as isDeletingExpiredLog) to control whether expired entries of the metadata log should be deleted.
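If you want to keep every entry of the file sink metadata log, the property can be turned off when building the session (a usage sketch; only the property name comes from this page):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("file-sink-demo")
  // keep expired entries of the file sink metadata log (deletion is on by default)
  .config("spark.sql.streaming.fileSink.log.deletion", "false")
  .getOrCreate()
```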

The textual representation of FileStreamSink is FileSink[path].

Table 1. FileStreamSink’s Internal Properties (e.g. Registries, Counters and Flags)

| Name | Description |
|------|-------------|
| basePath | Hadoop Path of the output directory (the path option). Used to create logPath. |
| logPath | Hadoop Path of the metadata log directory, i.e. the _spark_metadata subdirectory of basePath. Used to create fileLog. |
| fileLog | FileStreamSinkLog over logPath. Used when addBatch looks up (and records) committed batches. |
| hadoopConf | Hadoop Configuration of the current session. Used when addBatch writes a batch of data out. |

addBatch Method

addBatch(batchId: Long, data: DataFrame): Unit
Note: addBatch is part of the Sink Contract to "add" a batch of data to the sink.

addBatch first requests fileLog for the id of the latest committed batch. If batchId is not greater than that id, addBatch skips the batch (it has already been committed). Otherwise, addBatch writes data out to basePath and records the written files in fileLog.
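The flow above can be sketched as follows (a paraphrase of the Spark source, not the exact code; signatures of FileFormatWriter.write and ManifestFileCommitProtocol differ across Spark versions):

```scala
// Simplified sketch of FileStreamSink.addBatch (not the exact Spark source)
def addBatch(batchId: Long, data: DataFrame): Unit = {
  // fileLog remembers the ids of batches that were already written out
  val latestBatchId = fileLog.getLatest().map(_._1).getOrElse(-1L)
  if (batchId <= latestBatchId) {
    logInfo(s"Skipping already committed batch $batchId")
  } else {
    // ManifestFileCommitProtocol records every written file in fileLog,
    // so a batch becomes visible only once its manifest is committed
    val committer = new ManifestFileCommitProtocol(batchId.toString, path)
    committer.setupManifestOptions(fileLog, batchId)
    FileFormatWriter.write(...)  // writes data out to basePath using fileFormat
  }
}
```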

Creating FileStreamSink Instance

FileStreamSink takes the following when created:

  • SparkSession

  • Path of the output directory (under which the metadata directory is created)

  • FileFormat

  • Names of the partition columns

  • Configuration options

FileStreamSink initializes the internal registries and counters.
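Going by the Spark source, the internal properties in Table 1 are initialized roughly as follows (a sketch, not the exact code):

```scala
// Sketch of how FileStreamSink initializes its internal properties
val basePath   = new Path(path)                                  // the sink's output directory
val logPath    = new Path(basePath, FileStreamSink.metadataDir)  // the "_spark_metadata" subdirectory
val fileLog    = new FileStreamSinkLog(
  FileStreamSinkLog.VERSION, sparkSession, logPath.toUri.toString)
val hadoopConf = sparkSession.sessionState.newHadoopConf()       // Hadoop Configuration of the session
```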
