DataSourceWriter Contract

DataSourceWriter is the abstraction of data source writers in Data Source API V2 that can commit or abort a writing Spark job, create a DataWriterFactory to be shared among writing Spark tasks, and optionally handle commit messages and use a commit coordinator for writing Spark tasks.

Note
The terms Spark job and Spark task here refer to the low-level Spark jobs and tasks (that you can monitor in the web UI, for example).

DataSourceWriter is used to create the logical WriteToDataSourceV2 and physical WriteToDataSourceV2Exec operators.

DataSourceWriter is created when:

  • DataSourceV2Relation logical operator is requested to create one

  • WriteSupport data source is requested to create one

Table 1. DataSourceWriter Contract
Method Description

abort

void abort(WriterCommitMessage[] messages)

Aborts a writing Spark job

Used exclusively when WriteToDataSourceV2Exec physical operator is requested to execute (and an exception was reported while writing)

commit

void commit(WriterCommitMessage[] messages)

Commits a writing Spark job

Used exclusively when WriteToDataSourceV2Exec physical operator is requested to execute (and all writing tasks completed successfully)

createWriterFactory

DataWriterFactory<InternalRow> createWriterFactory()

Creates a DataWriterFactory

Used when:

  • WriteToDataSourceV2Exec physical operator is requested to execute

  • Spark Structured Streaming’s WriteToContinuousDataSourceExec physical operator is requested to execute

  • Spark Structured Streaming’s MicroBatchWriter is requested to create a DataWriterFactory

onDataWriterCommit

void onDataWriterCommit(WriterCommitMessage message)

Handles the WriterCommitMessage commit message of a single successful writing Spark task

Does nothing by default

Used exclusively when WriteToDataSourceV2Exec physical operator is requested to execute (and runs a Spark job with partition writing tasks)

useCommitCoordinator

boolean useCommitCoordinator()

Controls whether to use a Spark Core OutputCommitCoordinator (true) or not (false) for data writing (to make sure that at most one task per partition commits)

Default: true

Used exclusively when WriteToDataSourceV2Exec physical operator is requested to execute
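The contract above can be sketched as a custom implementation. The following Scala sketch is illustrative only: MyDataSourceWriter, MyDataWriterFactory and the FileCommitted commit message are hypothetical names, the bodies are stubs, and the method signatures follow the Spark 2.4 Data Source API V2 (org.apache.spark.sql.sources.v2.writer package).

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.writer.{
  DataSourceWriter, DataWriter, DataWriterFactory, WriterCommitMessage}

// Hypothetical commit message carrying the staging path a task wrote to
case class FileCommitted(path: String) extends WriterCommitMessage

// Hypothetical factory; it is serialized and sent to executors, which use it
// to create one DataWriter per partition writing task
class MyDataWriterFactory extends DataWriterFactory[InternalRow] {
  override def createDataWriter(
      partitionId: Int,
      taskId: Long,
      epochId: Long): DataWriter[InternalRow] = ??? // left as a stub
}

class MyDataSourceWriter extends DataSourceWriter {

  // Creates the factory shared among writing Spark tasks
  override def createWriterFactory(): DataWriterFactory[InternalRow] =
    new MyDataWriterFactory()

  // Driver side: called once per successfully-committed writing Spark task
  override def onDataWriterCommit(message: WriterCommitMessage): Unit =
    message match {
      case FileCommitted(path) => println(s"Task committed $path")
      case _                   => // the default implementation does nothing
    }

  // Driver side: all tasks succeeded, so publish their output atomically,
  // e.g. move staged files to the final location
  override def commit(messages: Array[WriterCommitMessage]): Unit = ()

  // Driver side: the job failed, so roll back any task output,
  // e.g. delete the staging directory
  override def abort(messages: Array[WriterCommitMessage]): Unit = ()

  // Keep the default (true) so the OutputCommitCoordinator ensures
  // at most one task per partition commits
  override def useCommitCoordinator(): Boolean = true
}
```

Note the division of labour this sketch highlights: the DataWriterFactory and DataWriters run on executors, while abort, commit, onDataWriterCommit and useCommitCoordinator are all invoked on the driver by WriteToDataSourceV2Exec.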

Table 2. DataSourceWriters (Direct Implementations and Extensions)
DataSourceWriter Description

MicroBatchWriter

Used in Spark Structured Streaming only for Micro-Batch Stream Processing

StreamWriter

Used in Spark Structured Streaming only (to support epochs)
