Data Source API V2

Data Source API V2 (DataSource API V2 or DataSource V2) is a new API for data sources in Spark SQL with the following abstractions (contracts):

Note	The work on Data Source API V2 was tracked under SPARK-15689 Data source API v2 that was fixed in Apache Spark 2.3.0.

Note	Data Source API V2 is already heavily used in Spark Structured Streaming.

Query Planning and Execution

Data Source API V2 relies on the DataSourceV2Strategy execution planning strategy for query planning.

Data Reading

Data Source API V2 uses DataSourceV2Relation logical operator to represent data reading (aka data scan).

DataSourceV2Relation is planned (translated) to a ProjectExec with a DataSourceV2ScanExec physical operator (possibly under the FilterExec operator) when DataSourceV2Strategy execution planning strategy is requested to plan a logical plan.

At execution, DataSourceV2ScanExec physical operator creates a DataSourceRDD (or a ContinuousReader for Spark Structured Streaming).

DataSourceRDD uses InputPartitions for partitions, preferred locations, and computing partitions.

Data Writing

Data Source API V2 uses WriteToDataSourceV2 and AppendData logical operators to represent data writing (over a DataSourceV2Relation logical operator). As of Spark SQL 2.4.0, WriteToDataSourceV2 operator was deprecated for the more specific AppendData operator (compare "data writing" to "data append" which is certainly more specific).

Note	One of the differences between `WriteToDataSourceV2` and `AppendData` logical operators is that the former (`WriteToDataSourceV2`) uses DataSourceWriter directly while the latter (`AppendData`) uses DataSourceV2Relation to get the DataSourceWriter from.

WriteToDataSourceV2 and AppendData (with DataSourceV2Relation) logical operators are planned as (translated to) a WriteToDataSourceV2Exec physical operator.

At execution, WriteToDataSourceV2Exec physical operator…FIXME

Filter Pushdown Performance Optimization

Data Source API V2 supports filter pushdown performance optimization for DataSourceReaders with SupportsPushDownFilters (that is applied when DataSourceV2Strategy execution planning strategy is requested to plan a DataSourceV2Relation logical operator).

(From Parquet Filter Pushdown in Apache Drill’s documentation) Filter pushdown is a performance optimization that prunes extraneous data while reading from a data source to reduce the amount of data to scan and read for queries with supported filter expressions. Pruning data reduces the I/O, CPU, and network overhead to optimize query performance.

Tip	Enable INFO logging level for the DataSourceV2Strategy logger to be told what the pushed filters are.

Data Source API V2