The Internals of Spark SQL
GenericStrategy
Executing Planning Strategy — apply Method

Caution: FIXME
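Until this section is filled in, the following minimal sketch may help illustrate the general shape of the contract. In the Spark source, GenericStrategy declares an abstract apply method that takes a logical plan and returns a sequence of candidate physical plans, together with a protected planLater method that produces a placeholder for a child to be planned later. Everything prefixed with Toy below is a hypothetical stand-in invented for this illustration, not a Spark class, and the sketch deliberately avoids any Spark dependency.

```scala
// Hypothetical stand-ins for Catalyst's logical plan nodes (not Spark classes).
sealed trait ToyLogicalPlan
case class ToyScan(table: String) extends ToyLogicalPlan
case class ToyFilter(condition: String, child: ToyLogicalPlan) extends ToyLogicalPlan

// Hypothetical stand-ins for physical plan nodes (not Spark classes).
sealed trait ToyPhysicalPlan
case class ToyScanExec(table: String) extends ToyPhysicalPlan
case class ToyFilterExec(condition: String, child: ToyPhysicalPlan) extends ToyPhysicalPlan
// Placeholder for a child plan that some later planning step would replace.
case class ToyPlanLater(plan: ToyLogicalPlan) extends ToyPhysicalPlan

// Simplified shape of the contract: one logical plan in, zero or more
// candidate physical plans out. An empty result means "not applicable".
abstract class ToyGenericStrategy[P] {
  protected def planLater(plan: ToyLogicalPlan): P
  def apply(plan: ToyLogicalPlan): Seq[P]
}

// A toy strategy that only knows how to plan ToyFilter nodes.
object ToyFilterStrategy extends ToyGenericStrategy[ToyPhysicalPlan] {
  protected def planLater(plan: ToyLogicalPlan): ToyPhysicalPlan = ToyPlanLater(plan)

  def apply(plan: ToyLogicalPlan): Seq[ToyPhysicalPlan] = plan match {
    // Plan the filter itself and defer planning of its child.
    case ToyFilter(condition, child) => Seq(ToyFilterExec(condition, planLater(child)))
    // Any other operator is left for other strategies to handle.
    case _ => Nil
  }
}
```

With these toy definitions, ToyFilterStrategy(ToyFilter("id > 1", ToyScan("t"))) yields a single ToyFilterExec whose child is a ToyPlanLater placeholder, while ToyFilterStrategy(ToyScan("t")) yields an empty sequence.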