DataWritingSparkTask Partition Processing Function
DataWritingSparkTask is the partition processing function that the WriteToDataSourceV2Exec physical operator uses to schedule a Spark job when requested to execute.
Note
|
The DataWritingSparkTask partition processing function is executed on executors.
|
Tip
|
Enable INFO logging level for org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask logger to see what happens inside.
Add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask=INFO
Refer to Logging.
|
Running Partition Processing Function — run Method
run(
  writeTask: DataWriterFactory[InternalRow],
  context: TaskContext,
  iter: Iterator[InternalRow],
  useCommitCoordinator: Boolean): WriterCommitMessage
run requests the given TaskContext for the IDs of the stage, the stage attempt, the partition, the task attempt, and how many times the task may have been attempted (default 0).
run also requests the given TaskContext for the epoch ID (that is, the streaming.sql.batchId local property) or defaults to 0.
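A minimal Scala sketch of those lookups (the val names are illustrative placeholders, not names from the Spark source; the TaskContext accessors are the public ones):
val stageId = context.stageId()
val stageAttempt = context.stageAttemptNumber()
val partId = context.partitionId()
val taskId = context.taskAttemptId()
val attemptId = context.attemptNumber()
// the epoch ID comes from the streaming.sql.batchId local property, or 0 if unset
val epochId = Option(context.getLocalProperty("streaming.sql.batchId")).map(_.toLong).getOrElse(0L)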
run requests the given DataWriterFactory to create a DataWriter (with the partition, task and epoch IDs).
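With the IDs above in hand, the factory call is a one-liner (a sketch against the Spark 2.4 DataSourceV2 API, reusing the placeholder names from the previous sketch):
val dataWriter = writeTask.createDataWriter(partId, taskId, epochId)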
For every row in the partition (in the given Iterator[InternalRow]), run requests the DataWriter to write the row.
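As a sketch, the write loop is simply:
while (iter.hasNext) {
  dataWriter.write(iter.next())
}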
Once all the rows have been written successfully, run requests the DataWriter to commit the write task (with or without requesting the OutputCommitCoordinator for authorization), which gives the final WriterCommitMessage.
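In code form (the coordinator interaction is detailed in the sections below):
val msg: WriterCommitMessage = dataWriter.commit()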
In the end, run prints out the following INFO message to the logs:
Committed partition [partId] (task [taskId], attempt [attemptId]stage [stageId].[stageAttempt])
In case of any errors, run prints out the following ERROR message to the logs:
Aborting commit for partition [partId] (task [taskId], attempt [attemptId]stage [stageId].[stageAttempt])
run then requests the DataWriter to abort the write task.
In the end, run prints out the following ERROR message to the logs:
Aborted commit for partition [partId] (task [taskId], attempt [attemptId]stage [stageId].[stageAttempt])
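Putting the happy and error paths together, a simplified sketch (the actual implementation wraps this in Spark-internal utilities; a plain try/catch conveys the same flow, with logInfo/logError standing in for the task's logging and the placeholder names carried over from above):
try {
  // write every row in the partition
  while (iter.hasNext) {
    dataWriter.write(iter.next())
  }
  // committing yields the final WriterCommitMessage (see the flag sections below)
  dataWriter.commit()
} catch {
  case t: Throwable =>
    logError(s"Aborting commit for partition $partId (task $taskId, attempt $attemptId" +
      s"stage $stageId.$stageAttempt)")
    dataWriter.abort()
    logError(s"Aborted commit for partition $partId (task $taskId, attempt $attemptId" +
      s"stage $stageId.$stageAttempt)")
    throw t
}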
Note
|
run is used exclusively when the WriteToDataSourceV2Exec physical operator is requested to execute (and schedules a Spark job).
|
useCommitCoordinator Flag Enabled
With the given useCommitCoordinator flag enabled (the default for most DataSourceWriters), run requests the SparkEnv for the OutputCommitCoordinator, which is then asked whether to commit the write task output or not (canCommit).
Tip
|
Read up on OutputCommitCoordinator in Mastering Apache Spark.
|
If authorized, run prints out the following INFO message to the logs:
Commit authorized for partition [partId] (task [taskId], attempt [attemptId]stage [stageId].[stageAttempt])
In the end, run requests the DataWriter to commit the write task.
If not authorized, run prints out the following INFO message to the logs and throws a CommitDeniedException:
Commit denied for partition [partId] (task [taskId], attempt [attemptId]stage [stageId].[stageAttempt])
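A sketch of the coordinator-gated commit (OutputCommitCoordinator and CommitDeniedException are Spark-internal, i.e. private[spark], so this illustrates run's own logic rather than an API available to user code; placeholder names as above):
val coordinator = SparkEnv.get.outputCommitCoordinator
val commitAuthorized = coordinator.canCommit(stageId, stageAttempt, partId, attemptId)
val msg = if (commitAuthorized) {
  logInfo(s"Commit authorized for partition $partId (task $taskId, attempt $attemptId" +
    s"stage $stageId.$stageAttempt)")
  dataWriter.commit()
} else {
  val message = s"Commit denied for partition $partId (task $taskId, attempt $attemptId" +
    s"stage $stageId.$stageAttempt)"
  logInfo(message)
  throw new CommitDeniedException(message, stageId, partId, attemptId)
}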
useCommitCoordinator Flag Disabled
With the given useCommitCoordinator flag disabled, run prints out the following INFO message to the logs:
Writer for partition [partId] is committing.
In the end, run requests the DataWriter to commit the write task.
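For completeness, the flag-disabled path in sketch form:
logInfo(s"Writer for partition ${context.partitionId()} is committing.")
val msg = dataWriter.commit()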