DataSource — Pluggable Data Provider Framework

Creating DataSource Instance

DataSource takes the following arguments when created:

  • SparkSession

  • className, i.e. the fully-qualified class name or an alias of the data source

  • Paths (default: Nil, i.e. an empty collection)

  • Optional user-defined schema (default: None)

  • Names of the partition columns (default: empty)

  • Optional BucketSpec (default: None)

  • Configuration options (default: empty)

  • Optional CatalogTable (default: None)

DataSource initializes the internal properties.
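
As a rough illustration of the defaults listed above, the following sketch models the arguments as a plain Scala case class. It is not the Spark class itself: DataSourceArgs, Schema, BucketSpecStub and CatalogTableStub are names invented for this example so that it runs without Spark on the classpath, the parameter names are chosen for readability, and SparkSession is left out for the same reason.

```scala
object DataSourceArgsSketch {
  // Hypothetical stand-ins for the Spark types named in the list above.
  type Schema = Seq[(String, String)] // (column name, type name)
  final case class BucketSpecStub(numBuckets: Int, bucketColumns: Seq[String])
  final case class CatalogTableStub(name: String)

  // Only className is required here; every other argument
  // carries the default given in the list above.
  final case class DataSourceArgs(
      className: String,
      paths: Seq[String] = Nil,
      userSpecifiedSchema: Option[Schema] = None,
      partitionColumns: Seq[String] = Seq.empty,
      bucketSpec: Option[BucketSpecStub] = None,
      options: Map[String, String] = Map.empty,
      catalogTable: Option[CatalogTableStub] = None)

  // Named arguments override individual defaults.
  val rate: DataSourceArgs =
    DataSourceArgs(className = "rate", options = Map("rowsPerSecond" -> "5"))
}
```

With named arguments, a caller only spells out what differs from the defaults, which is why most arguments in the list are optional.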

Generating Metadata of Streaming Source (Data Source API V1) — sourceSchema Internal Method

sourceSchema(): SourceInfo

sourceSchema creates a new instance of the data source class and branches off per its type: StreamSourceProvider, FileFormat, or any other type.

Note
sourceSchema is used exclusively when DataSource is requested for the SourceInfo.

StreamSourceProvider

For a StreamSourceProvider, sourceSchema requests the StreamSourceProvider for the name and schema (of the streaming source).

In the end, sourceSchema returns the name and the schema as part of SourceInfo (with partition columns unspecified).

FileFormat

For a FileFormat, sourceSchema…​FIXME

Other Types

For any other data source type, sourceSchema simply throws an UnsupportedOperationException:

Data source [className] does not support streamed reading
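
The three branches described above can be sketched with a pattern match. This is a simplified model, not Spark code: the Provider hierarchy, the Schema alias and this SourceInfo case class are stand-ins defined only for the example, and the FileFormat branch is left unimplemented because the text does not describe it.

```scala
object SourceSchemaSketch {
  // Hypothetical, simplified stand-ins; not the Spark classes of the same names.
  type Schema = Seq[String] // column names only
  final case class SourceInfo(name: String, schema: Schema, partitionColumns: Seq[String])

  sealed trait Provider
  final case class StreamSourceProvider(name: String, schema: Schema) extends Provider
  case object FileFormat extends Provider
  final case class Other(description: String) extends Provider

  def sourceSchema(className: String, provider: Provider): SourceInfo = provider match {
    case StreamSourceProvider(name, schema) =>
      // The provider supplies the name and the schema;
      // partition columns stay unspecified.
      SourceInfo(name, schema, Nil)
    case FileFormat =>
      // Not described in the text, so deliberately left unimplemented.
      ???
    case Other(_) =>
      throw new UnsupportedOperationException(
        s"Data source $className does not support streamed reading")
  }
}
```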

Creating Streaming Source (Micro-Batch Stream Processing / Data Source API V1) — createSource Method

createSource(
  metadataPath: String): Source

createSource creates a new instance of the data source class and branches off per its type: StreamSourceProvider, FileFormat, or any other type.

Note
createSource is used exclusively when MicroBatchExecution is requested to initialize the analyzed logical plan.

StreamSourceProvider

For a StreamSourceProvider, createSource requests the StreamSourceProvider to create a source.

FileFormat

For a FileFormat, createSource creates a new FileStreamSource.

createSource throws an IllegalArgumentException when the path option is not specified for a FileFormat data source:

'path' is not specified

Other Types

For any other data source type, createSource simply throws an UnsupportedOperationException:

Data source [className] does not support streamed reading
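
The path requirement for FileFormat data sources can be illustrated in isolation. requirePath below is a hypothetical helper written for this example, and the case-insensitive key lookup is an assumption of the sketch rather than something the text states.

```scala
object PathOptionSketch {
  // Hypothetical helper: returns the value of the path option or fails
  // with the message quoted above. Comparing keys case-insensitively
  // is an assumption made for this sketch.
  def requirePath(options: Map[String, String]): String =
    options
      .collectFirst { case (key, value) if key.equalsIgnoreCase("path") => value }
      .getOrElse(throw new IllegalArgumentException("'path' is not specified"))
}
```

Failing early in this way means a missing path surfaces when the source is created, not when the first batch is read.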

Creating Streaming Sink — createSink Method

createSink(
  outputMode: OutputMode): Sink

createSink creates a streaming sink for StreamSinkProvider or FileFormat data sources.

Internally, createSink creates a new instance of the providingClass and branches off per its type (StreamSinkProvider, FileFormat, or any other type).

createSink throws an IllegalArgumentException when the path option is not specified for a FileFormat data source:

'path' is not specified

createSink throws an AnalysisException when the given OutputMode is different from Append for a FileFormat data source:

Data source [className] does not support [outputMode] output mode

createSink throws an UnsupportedOperationException for unsupported data source formats:

Data source [className] does not support streamed writing

Note
createSink is used exclusively when DataStreamWriter is requested to start a streaming query.
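
The checks that createSink performs can be put together in one simplified model. Everything here is a stand-in defined for the example: the OutputMode and Provider hierarchies, AnalysisExceptionStub (in place of Spark's AnalysisException), and a sink reduced to a descriptive string. The order of the two FileFormat checks is also an assumption of this sketch.

```scala
object CreateSinkSketch {
  // Hypothetical stand-ins; the real OutputMode, Sink and AnalysisException live in Spark.
  sealed trait OutputMode
  case object Append extends OutputMode
  case object Complete extends OutputMode
  case object Update extends OutputMode

  final class AnalysisExceptionStub(message: String) extends Exception(message)

  sealed trait Provider
  case object StreamSinkProvider extends Provider
  case object FileFormat extends Provider
  case object Other extends Provider

  // A sink is reduced to a descriptive string to keep the example small.
  def createSink(
      className: String,
      provider: Provider,
      options: Map[String, String],
      outputMode: OutputMode): String = provider match {
    case StreamSinkProvider =>
      s"sink created by $className"
    case FileFormat =>
      // A file sink needs a path and accepts the Append output mode only.
      val path = options.getOrElse(
        "path", throw new IllegalArgumentException("'path' is not specified"))
      if (outputMode != Append) {
        throw new AnalysisExceptionStub(
          s"Data source $className does not support $outputMode output mode")
      }
      s"file sink at $path"
    case Other =>
      throw new UnsupportedOperationException(
        s"Data source $className does not support streamed writing")
  }
}
```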

Internal Properties

Name Description

providingClass

java.lang.Class for the className (that can be a fully-qualified class name or an alias of the data source)

sourceInfo

sourceInfo: SourceInfo

Metadata of a Source with the alias (short name), the schema, and optional partitioning columns

sourceInfo is a lazy value and so initialized once (the very first time) when accessed.
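
The initialize-once behavior of a Scala lazy value can be seen in a small standalone example. Holder and its lookups counter are hypothetical and exist only to make the single evaluation observable; the string merely stands in for the source metadata.

```scala
object LazySourceInfoSketch {
  // Hypothetical holder that counts how often the initializer runs.
  final class Holder {
    var lookups: Int = 0

    // Evaluated on first access only; later accesses reuse the stored value.
    lazy val sourceInfo: String = {
      lookups += 1
      "resolved source metadata"
    }
  }
}
```

Creating a Holder does not run the initializer, and reading sourceInfo repeatedly runs it a single time.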

Used when:
