DataWritingSparkTask Partition Processing Function
DataWritingSparkTask is the partition processing function that WriteToDataSourceV2Exec physical operator uses to schedule a Spark job when requested to execute.
Note: The DataWritingSparkTask partition processing function is executed on executors.
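To make the wiring concrete, the following is a simplified sketch (assuming Spark 2.4-era internals) of how WriteToDataSourceV2Exec passes run as the partition function of a Spark job; rdd, writeTask, useCommitCoordinator and messages abbreviate the operator's local state and are not spelled out here.

// A sketch of how WriteToDataSourceV2Exec schedules the write job
// (rdd, writeTask, useCommitCoordinator and messages stand for the
// operator's local state, abbreviated for illustration)
sparkContext.runJob(
  rdd,
  (context: TaskContext, iter: Iterator[InternalRow]) =>
    DataWritingSparkTask.run(writeTask, context, iter, useCommitCoordinator),
  rdd.partitions.indices,
  (index, message: WriterCommitMessage) =>
    messages(index) = message // collect WriterCommitMessages on the driver
)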
Tip: Enable INFO logging level for org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask=INFO

Refer to Logging.
Running Partition Processing Function — run Method
run(
  writeTask: DataWriterFactory[InternalRow],
  context: TaskContext,
  iter: Iterator[InternalRow],
  useCommitCoordinator: Boolean): WriterCommitMessage
run requests the given TaskContext for the IDs of the stage, the stage attempt, the partition and the task attempt, as well as how many times the task may have been attempted (0 for the first attempt).
run also requests the given TaskContext for the epoch ID (the value of the streaming.sql.batchId local property), defaulting to 0 when the property is not set.
run requests the given DataWriterFactory to create a DataWriter (with the partition, task and epoch IDs).
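Put together, the setup steps above boil down to roughly the following (a sketch assuming Spark 2.4-era TaskContext and DataWriterFactory APIs):

// What run derives from the given TaskContext (sketch)
val stageId      = context.stageId()
val stageAttempt = context.stageAttemptNumber()
val partId       = context.partitionId()
val taskId       = context.taskAttemptId()
val attemptId    = context.attemptNumber()

// Epoch ID from the streaming.sql.batchId local property (0 if not set)
val epochId = Option(context.getLocalProperty("streaming.sql.batchId"))
  .map(_.toLong)
  .getOrElse(0L)

// Create the DataWriter for this partition, task and epoch
val dataWriter = writeTask.createDataWriter(partId, taskId, epochId)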
For every row in the partition (in the given Iterator[InternalRow]), run requests the DataWriter to write the row.
Once all the rows have been written successfully, run requests the DataWriter to commit the write task (with or without first requesting the OutputCommitCoordinator for authorization), which gives the final WriterCommitMessage.
In the end, run prints out the following INFO message to the logs:
Committed partition [partId] (task [taskId], attempt [attemptId], stage [stageId].[stageAttempt])
In case of any errors, run prints out the following ERROR message to the logs:
Aborting commit for partition [partId] (task [taskId], attempt [attemptId], stage [stageId].[stageAttempt])
run then requests the DataWriter to abort the write task.
In the end, run prints out the following ERROR message to the logs:
Aborted commit for partition [partId] (task [taskId], attempt [attemptId], stage [stageId].[stageAttempt])
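Taken together, the happy path and the error path follow roughly this shape. This is a simplified sketch: the actual code relies on Spark's internal Utils.tryWithSafeFinallyAndFailureCallbacks helper, and commitTask below is a stand-in for the commit logic described in the two sections that follow.

// Simplified control flow of run (sketch); commitTask is a stand-in
// for the commit logic in the sections below
Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
  // Write every row of the partition
  while (iter.hasNext) {
    dataWriter.write(iter.next())
  }
  val msg = commitTask() // with or without the OutputCommitCoordinator
  logInfo(s"Committed partition $partId (task $taskId, " +
    s"attempt $attemptId, stage $stageId.$stageAttempt)")
  msg
})(catchBlock = {
  // On any error, abort the write task
  logError(s"Aborting commit for partition $partId (task $taskId, " +
    s"attempt $attemptId, stage $stageId.$stageAttempt)")
  dataWriter.abort()
  logError(s"Aborted commit for partition $partId (task $taskId, " +
    s"attempt $attemptId, stage $stageId.$stageAttempt)")
})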
Note: run is used exclusively when WriteToDataSourceV2Exec physical operator is requested to execute (and schedules a Spark job).
useCommitCoordinator Flag Enabled
With the given useCommitCoordinator flag enabled (the default for most DataSourceWriters), run requests the SparkEnv for the OutputCommitCoordinator, which is then asked whether to commit the write task output or not (canCommit).
Tip: Read up on OutputCommitCoordinator in the Mastering Apache Spark gitbook.
If authorized, run prints out the following INFO message to the logs:
Commit authorized for partition [partId] (task [taskId], attempt [attemptId], stage [stageId].[stageAttempt])
In the end, run requests the DataWriter to commit the write task.
If not authorized, run prints out the following INFO message to the logs and throws a CommitDeniedException.
Commit denied for partition [partId] (task [taskId], attempt [attemptId], stage [stageId].[stageAttempt])
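In code form, the coordinator-guarded commit looks roughly like the following sketch (method signatures per Spark 2.4; CommitDeniedException lives in org.apache.spark.executor):

// Coordinator-guarded commit (sketch)
val coordinator = SparkEnv.get.outputCommitCoordinator
val commitAuthorized = coordinator.canCommit(stageId, stageAttempt, partId, attemptId)
if (commitAuthorized) {
  logInfo(s"Commit authorized for partition $partId (task $taskId, " +
    s"attempt $attemptId, stage $stageId.$stageAttempt)")
  dataWriter.commit()
} else {
  val message = s"Commit denied for partition $partId (task $taskId, " +
    s"attempt $attemptId, stage $stageId.$stageAttempt)"
  logInfo(message)
  // Throwing CommitDeniedException triggers the abort path of run
  throw new CommitDeniedException(message, stageId, partId, attemptId)
}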
useCommitCoordinator Flag Disabled
With the given useCommitCoordinator flag disabled, run prints out the following INFO message to the logs:
Writer for partition [partId] is committing.
In the end, run requests the DataWriter to commit the write task.
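As a sketch, this branch reduces to:

// No commit coordinator: commit unconditionally (sketch)
logInfo(s"Writer for partition ${context.partitionId()} is committing.")
dataWriter.commit()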