withWatermark(eventTime: String, delayThreshold: String): Dataset[T]
withWatermark Operator — Event-Time Watermark
withWatermark specifies the eventTime column for event time watermark and delayThreshold for event lateness.
eventTime specifies the column to use for watermark and can be either part of Dataset from the source or custom-generated using current_time or current_timestamp functions.
|
Note
|
Watermark tracks a point in time before which it is assumed no more late events are supposed to arrive (and if they have, the late events are considered really late and simply dropped). |
|
Note
|
Spark Structured Streaming uses watermark for the following:
|
The current watermark is computed by looking at the maximum eventTime seen across all of the partitions in a query minus a user-specified delayThreshold. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at least delayThreshold behind the actual event time.
|
Note
|
In some cases Spark may still process records that arrive more than delayThreshold late.
|