Dataset API vs SQL

Spark SQL supports two "modes" to write structured queries: Dataset API and SQL.

SQL mode is used to express structured queries as SQL statements, using the SparkSession.sql operator, the expr standard function, or the spark-sql command-line tool.
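
The following is a minimal sketch of SQL mode, assuming a SparkSession in scope as `spark` (as in spark-shell) and a small hypothetical `people` dataset registered as a temporary view.

```scala
import org.apache.spark.sql.functions.expr
import spark.implicits._

val people = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")
people.createOrReplaceTempView("people")

// SQL statements through the SparkSession.sql operator
spark.sql("SELECT name FROM people WHERE id = 1").show()

// A SQL expression embedded in a Dataset API query through the expr standard function
people.filter(expr("id = 1")).show()

// The same query in the spark-sql command-line tool would be:
//   spark-sql> SELECT name FROM people WHERE id = 1;
```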

Some structured queries are much easier to express using the Dataset API, while some are only possible in SQL. In other words, you may find mixing the Dataset API and SQL modes challenging yet rewarding.

What is important, and one of the reasons why Spark SQL has been so successful, is that there is no performance difference between the modes. Whichever mode you use to write your structured queries, they all end up as a tree of Catalyst relational data structures. And, yes, you could consider writing structured queries using Catalyst directly, but that could quickly become unwieldy to maintain: it is fairly low-level and therefore possibly too dependent on a specific Spark SQL version, and finding Spark SQL developers comfortable with it is hard.

The takeaway is that SQL queries in Spark SQL are translated to Catalyst logical commands.
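
You can see this for yourself with a sketch like the following, which assumes `spark` and the `people` temporary view from the previous sketch and compares the optimized logical plans of the same query written in both modes.

```scala
import org.apache.spark.sql.functions.col

val viaSql = spark.sql("SELECT name FROM people WHERE id = 1")
val viaApi = spark.table("people").where(col("id") === 1).select("name")

// Both structured queries produce structurally identical optimized logical plans
println(viaSql.queryExecution.optimizedPlan.numberedTreeString)
println(viaApi.queryExecution.optimizedPlan.numberedTreeString)
```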

This section describes the Spark SQL features that differ between the Dataset API and SQL mode when developing Spark applications.

  1. RuntimeReplaceable expressions are only available in SQL mode, by means of SQL functions like nvl, nvl2, ifnull, nullif, etc. (see the sketch after this list)

  2. Column.isin (Dataset API) vs the SQL IN predicate with a subquery (and the In predicate expression)
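
The following is a minimal sketch of a RuntimeReplaceable expression (nvl), assuming `spark` and a small hypothetical `nums` dataset. In the Spark versions this book covers, nvl has no counterpart among the Dataset API standard functions, so it is reachable only through SQL text, i.e. SparkSession.sql or the expr standard function.

```scala
import org.apache.spark.sql.functions.expr
import spark.implicits._

val nums = Seq((1, Some(10)), (2, None), (3, Some(30))).toDF("id", "n")
nums.createOrReplaceTempView("nums")

// nvl in a SQL statement
spark.sql("SELECT id, nvl(n, 0) AS n FROM nums").show()

// nvl as a SQL expression inside a Dataset API query
nums.select(expr("nvl(n, 0)").as("n")).show()
```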

Demo: Structured Query in SQL Mode vs Dataset API
// FIXME: Example of a structured query that is only possible in SQL mode
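
One candidate, given as a hedged sketch (assuming `spark` and two hypothetical temporary views), is an IN predicate with a subquery: Column.isin accepts only a sequence of literal values, not another structured query, so this query is expressible only in SQL text.

```scala
import spark.implicits._

Seq((1, "Alice"), (2, "Bob"), (3, "Carol")).toDF("id", "name")
  .createOrReplaceTempView("people")
Seq(1, 3).toDF("id").createOrReplaceTempView("vips")

// IN with a subquery has no direct Dataset API counterpart
spark.sql("""
  SELECT name
  FROM people
  WHERE id IN (SELECT id FROM vips)
""").show()
```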
