import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val sq = in.
writeStream.
format("parquet").
option("path", "parquet-output-dir").
option("checkpointLocation", "checkpoint-dir").
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Append).
start
FileStreamSink — Streaming Sink for File-Based Data Sources
FileStreamSink is a concrete streaming sink that writes out the results of a streaming query to files (of the specified FileFormat) in the root path.
FileStreamSink is created exclusively when DataSource is requested to create a streaming sink for a file-based data source (i.e. FileFormat).
|
Tip
|
Read up on FileFormat in The Internals of Spark SQL book. |
FileStreamSink supports Append output mode only.
FileStreamSink uses spark.sql.streaming.fileSink.log.deletion (as isDeletingExpiredLog)
The textual representation of FileStreamSink is FileSink[path]
FileStreamSink uses _spark_metadata directory for…FIXME
|
Tip
|
Enable Add the following line to
Refer to Logging. |
Creating FileStreamSink Instance
FileStreamSink takes the following to be created:
FileStreamSink initializes the internal properties.
"Adding" Batch of Data to Sink — addBatch Method
addBatch(
batchId: Long,
data: DataFrame): Unit
|
Note
|
addBatch is a part of Sink Contract to "add" a batch of data to the sink.
|
addBatch…FIXME
Creating BasicWriteJobStatsTracker — basicWriteJobStatsTracker Internal Method
basicWriteJobStatsTracker: BasicWriteJobStatsTracker
basicWriteJobStatsTracker simply creates a BasicWriteJobStatsTracker with the basic metrics:
-
number of written files
-
bytes of written output
-
number of output rows
-
number of dynamic partitions
|
Tip
|
Read up on BasicWriteJobStatsTracker in The Internals of Spark SQL book. |
|
Note
|
basicWriteJobStatsTracker is used exclusively when FileStreamSink is requested to addBatch.
|
hasMetadata Object Method
hasMetadata(
path: Seq[String],
hadoopConf: Configuration): Boolean
hasMetadata…FIXME
|
Note
|
|
Internal Properties
| Name | Description |
|---|---|
|
|
|
Metadata log path (Hadoop’s Path for the base path and the _spark_metadata) Used exclusively to create the FileStreamSinkLog |
|
FileStreamSinkLog (for the version 1 and the metadata log path) Used exclusively when |
|
Hadoop’s Configuration Used when…FIXME |