The Internals of Spark SQL
Introduction
Spark SQL — Structured Data Processing with Relational Queries on Massive Scale
Datasets vs DataFrames vs RDDs
Dataset API vs SQL
Hive Integration / Hive Data Source
Hive Data Source
Demo: Connecting Spark SQL to Hive Metastore (with Remote Metastore Server)
Demo: Hive Partitioned Parquet Table and Partition Pruning
Configuration Properties
Hive Metastore
DataSinks Strategy
HiveFileFormat
HiveClient
HiveClientImpl
HiveUtils
IsolatedClientLoader
HiveTableRelation
CreateHiveTableAsSelectCommand
SaveAsHiveFile
InsertIntoHiveDirCommand
InsertIntoHiveTable
HiveTableScans
HiveTableScanExec
TableReader
HadoopTableReader
HiveSessionStateBuilder
HiveExternalCatalog
HiveSessionCatalog
HiveMetastoreCatalog
RelationConversions
ResolveHiveSerdeTable
DetermineTableStats
HiveAnalysis
Notable Features
Hive Integration
Vectorized Parquet Decoding (Reader)
Dynamic Partition Inserts
Bucketing
Whole-Stage Java Code Generation (Whole-Stage CodeGen)
CodegenContext
CodeGenerator
GenerateColumnAccessor
GenerateOrdering
GeneratePredicate
GenerateSafeProjection
BytesToBytesMap Append-Only Hash Map
Vectorized Query Execution (Batch Decoding)
ColumnarBatch — ColumnVectors as Row-Wise Table
Data Source API V2
Subqueries
Hint Framework
Adaptive Query Execution
ExchangeCoordinator
Subexpression Elimination For Code-Generated Expression Evaluation (Common Expression Reuse)
EquivalentExpressions
Cost-Based Optimization (CBO)
CatalogStatistics — Table Statistics in Metastore (External Catalog)
ColumnStat — Column Statistics
EstimationUtils
CommandUtils — Utilities for Table Statistics
Catalyst DSL — Implicit Conversions for Catalyst Data Structures
Spark SQL CLI — spark-sql
Developing Spark SQL Applications
Fundamentals of Spark SQL Application Development
SparkSession — The Entry Point to Spark SQL
Builder — Building SparkSession using Fluent API
implicits Object — Implicit Conversions
SparkSessionExtensions
Dataset — Structured Query with Data Encoder
DataFrame — Dataset of Rows with RowEncoder
Row
DataSource API — Managing Datasets in External Data Sources
DataFrameReader — Loading Data From External Data Sources
DataFrameWriter — Saving Data To External Data Sources
Dataset API — Dataset Operators
Typed Transformations
Untyped Transformations
Basic Actions
Actions
DataFrameNaFunctions — Working With Missing Data
DataFrameStatFunctions — Working With Statistic Functions
Column
Column API — Column Operators
TypedColumn
Basic Aggregation — Typed and Untyped Grouping Operators
RelationalGroupedDataset — Untyped Row-based Grouping
KeyValueGroupedDataset — Typed Grouping
Dataset Join Operators
Broadcast Joins (aka Map-Side Joins)
Window Aggregation
WindowSpec — Window Specification
Window Utility Object — Defining Window Specification
Standard Functions — functions Object
Aggregate Functions
Collection Functions
Date and Time Functions
Regular Functions (Non-Aggregate Functions)
Window Aggregation Functions
User-Defined Functions (UDFs)
UDFs are Blackbox — Don’t Use Them Unless You’ve Got No Choice
UserDefinedFunction
Schema — Structure of Data
StructType
StructField — Single Field in StructType
Data Types
Multi-Dimensional Aggregation
Dataset Caching and Persistence
User-Friendly Names Of Cached Queries in web UI’s Storage Tab
Dataset Checkpointing
UserDefinedAggregateFunction — Contract for User-Defined Untyped Aggregate Functions (UDAFs)
Aggregator — Contract for User-Defined Typed Aggregate Functions (UDAFs)
Configuration Properties
SparkSession Registries
Catalog — Metastore Management Interface
CatalogImpl
ExecutionListenerManager — Management Interface of QueryExecutionListeners
ExperimentalMethods
ExternalCatalog Contract — External Catalog (Metastore) of Permanent Relational Entities
InMemoryCatalog
FunctionRegistry — Contract for Function Registries (Catalogs)
GlobalTempViewManager — Management Interface of Global Temporary Views
SessionCatalog — Session-Scoped Catalog of Relational Entities
CatalogTable — Table Specification (Native Table Metadata)
CatalogStorageFormat — Storage Specification of Table or Partition
CatalogTablePartition — Partition Specification of Table
BucketSpec — Bucketing Specification of Table
SessionState
BaseSessionStateBuilder — Generic Builder of SessionState
SessionStateBuilder
SharedState — State Shared Across SparkSessions
CacheManager — In-Memory Cache for Tables and Views
CachedRDDBuilder
RuntimeConfig — Management Interface of Runtime Configuration
SQLConf
StaticSQLConf
CatalystConf
UDFRegistration — Session-Scoped FunctionRegistry
File-Based Data Sources
FileFormat
OrcFileFormat
ParquetFileFormat
TextBasedFileFormat
CSVFileFormat
JsonFileFormat
TextFileFormat
JsonDataSource
HadoopFsRelation
FileIndex
CatalogFileIndex
InMemoryFileIndex
PartitioningAwareFileIndex
PrunedInMemoryFileIndex
FileFormatWriter
SQLHadoopMapReduceCommitProtocol
PartitionedFile
FileScanRDD
ParquetReadSupport
RecordReaderIterator
Kafka Data Source
Kafka Data Source
Kafka Data Source Options
KafkaSourceProvider
KafkaRelation
KafkaSourceRDD
KafkaSourceRDDOffsetRange
KafkaSourceRDDPartition
ConsumerStrategy Contract — Kafka Consumer Providers
KafkaOffsetReader
KafkaOffsetRangeLimit
KafkaDataConsumer Contract
InternalKafkaConsumer
KafkaWriter Helper Object — Writing Structured Queries to Kafka
KafkaWriteTask
JsonUtils Helper Object
Avro Data Source
Avro Data Source
AvroFileFormat — FileFormat For Avro-Encoded Files
AvroOptions — Avro Data Source Options
CatalystDataToAvro Unary Expression
AvroDataToCatalyst Unary Expression
JDBC Data Source
JDBC Data Source
JDBCOptions
JdbcRelationProvider
JDBCRelation
JDBCRDD
JdbcDialect
JdbcUtils Helper Object
Console Data Source
Console Data Source
ConsoleSinkProvider
Extending Spark SQL / Data Source API V2
DataSourceV2
ReadSupport Contract
WriteSupport Contract
DataSourceReader
SupportsPushDownFilters
SupportsPushDownRequiredColumns
SupportsReportPartitioning
SupportsReportStatistics
SupportsScanColumnarBatch
DataSourceWriter
SessionConfigSupport
InputPartition
InputPartitionReader
DataWriter
DataWriterFactory
InternalRowDataWriterFactory
DataSourceV2StringFormat
DataSourceRDD
DataSourceRDDPartition
DataWritingSparkTask Partition Processing Function
DataSourceV2Utils Helper Object
Extending Spark SQL / Data Source API V1
DataSource
Custom Data Source Formats
Data Source Providers / Relation Providers
CreatableRelationProvider
DataSourceRegister
RelationProvider
SchemaRelationProvider
Data Source Relations / Extension Contracts
BaseRelation
FileRelation
CatalystScan
InsertableRelation
PrunedFilteredScan
PrunedScan
TableScan
Others
Data Source Filter Predicate (For Filter Pushdown)
Structured Query Execution
QueryExecution
UnsupportedOperationChecker
Analyzer
CheckAnalysis
SparkOptimizer
Catalyst Optimizer
SparkPlanner
SparkStrategy
SparkStrategies
LogicalPlanStats
Statistics
HintInfo
LogicalPlanVisitor
SizeInBytesOnlyStatsPlanVisitor
BasicStatsPlanVisitor
AggregateEstimation
FilterEstimation
JoinEstimation
ProjectEstimation
Partitioning
HashPartitioning
Distribution
AllTuples
BroadcastDistribution
ClusteredDistribution
HashClusteredDistribution
OrderedDistribution
UnspecifiedDistribution
Catalyst Expressions
Catalyst Expression — Executable Node in Catalyst Tree
AggregateExpression
AggregateFunction Contract — Aggregate Function Expressions
AggregateWindowFunction Contract — Declarative Window Aggregate Function Expressions
AttributeReference
Alias
Attribute
BoundReference
CallMethodViaReflection
Coalesce
CodegenFallback
CollectionGenerator
ComplexTypedAggregateExpression
CreateArray
CreateNamedStruct
CreateNamedStructLike Contract
CreateNamedStructUnsafe
CumeDist
DeclarativeAggregate Contract — Unevaluable Aggregate Function Expressions
ExecSubqueryExpression
Exists
ExpectsInputTypes Contract
ExplodeBase Contract
First
Generator
GetArrayStructFields
GetArrayItem
GetMapValue
GetStructField
ImperativeAggregate
In
Inline
InSet
InSubquery
JsonToStructs
JsonTuple
ListQuery
Literal
MonotonicallyIncreasingID
Murmur3Hash
NamedExpression Contract
Nondeterministic Contract
OffsetWindowFunction Contract — Unevaluable Window Function Expressions
ParseToDate
ParseToTimestamp
PlanExpression
PrettyAttribute
RankLike Contract
ResolvedStar
RowNumberLike Contract
RuntimeReplaceable Contract
ScalarSubquery SubqueryExpression
ScalarSubquery ExecSubqueryExpression
ScalaUDF
ScalaUDAF
SimpleTypedAggregateExpression
SizeBasedWindowFunction Contract — Declarative Window Aggregate Functions with Window Size
SortOrder
Stack
Star
StaticInvoke
SubqueryExpression
TimeWindow
TypedAggregateExpression
TypedImperativeAggregate
UnaryExpression Contract
UnixTimestamp
UnresolvedAttribute
UnresolvedFunction
UnresolvedGenerator
UnresolvedOrdinal
UnresolvedRegex
UnresolvedStar
UnresolvedWindowExpression
WindowExpression
WindowFunction Contract — Window Function Expressions With WindowFrame
WindowSpecDefinition
Logical Operators
Base Logical Operators (Contracts)
LogicalPlan Contract — Logical Operator with Children and Expressions / Logical Query Plan
Command Contract — Eagerly-Executed Logical Operator
RunnableCommand Contract — Generic Logical Command with Side Effects
DataWritingCommand Contract — Logical Commands That Write Query Data
Concrete Logical Operators
Aggregate
AlterViewAsCommand
AnalysisBarrier
AnalyzeColumnCommand
AnalyzePartitionCommand
AnalyzeTableCommand
AppendData
ClearCacheCommand
CreateDataSourceTableAsSelectCommand
CreateDataSourceTableCommand
CreateTable
CreateTableCommand
CreateTempViewUsing
CreateViewCommand
DataSourceV2Relation
DescribeColumnCommand
DescribeTableCommand
DeserializeToObject
DropTableCommand
Except
Expand
ExplainCommand
ExternalRDD
Filter
Generate
GroupingSets
Hint
InMemoryRelation
InsertIntoDataSourceCommand
InsertIntoDataSourceDirCommand
InsertIntoDir
InsertIntoHadoopFsRelationCommand
InsertIntoTable
Intersect
Join
LeafNode
LocalRelation
LogicalRDD
LogicalRelation
OneRowRelation
Pivot
Project
Range
Repartition and RepartitionByExpression
ResolvedHint
SaveIntoDataSourceCommand
ShowCreateTableCommand
ShowTablesCommand
Sort
SubqueryAlias
TruncateTableCommand
TypedFilter
Union
UnresolvedCatalogRelation
UnresolvedHint
UnresolvedInlineTable
UnresolvedRelation
UnresolvedTableValuedFunction
Window
WithWindowDefinition
WriteToDataSourceV2
View
Physical Operators
SparkPlan Contract — Physical Operators in Physical Query Plan of Structured Query
CodegenSupport Contract — Physical Operators with Java Code Generation
DataSourceScanExec Contract — Leaf Physical Operators to Scan Over BaseRelation
ColumnarBatchScan Contract — Physical Operators With Vectorized Reader
ObjectConsumerExec Contract — Unary Physical Operators with Child Physical Operator with One-Attribute Output Schema
BaseLimitExec Contract
Exchange Contract
Projection Contract — Functions to Produce InternalRow for InternalRow
UnsafeProjection — Generic Function to Project InternalRows to UnsafeRows
GenerateUnsafeProjection
GenerateMutableProjection
InterpretedProjection
CodeGeneratorWithInterpretedFallback
SQLMetric — SQL Execution Metric of Physical Operator
Concrete Physical Operators
BroadcastExchangeExec
BroadcastHashJoinExec
BroadcastNestedLoopJoinExec
CartesianProductExec
CoalesceExec
CoGroupExec
DataSourceV2ScanExec
DataWritingCommandExec
DebugExec
DeserializeToObjectExec
ExecutedCommandExec
ExpandExec
ExternalRDDScanExec
FileSourceScanExec
FilterExec
GenerateExec
HashAggregateExec
InMemoryTableScanExec
LocalTableScanExec
MapElementsExec
ObjectHashAggregateExec
ObjectProducerExec
ProjectExec
RangeExec
RDDScanExec
ReusedExchangeExec
RowDataSourceScanExec
SampleExec
ShuffleExchangeExec
ShuffledHashJoinExec
SerializeFromObjectExec
SortAggregateExec
SortMergeJoinExec
SortExec
SubqueryExec
InputAdapter
WindowExec
AggregateProcessor
WindowFunctionFrame
WholeStageCodegenExec
WriteToDataSourceV2Exec
Logical Analysis Rules (Check, Evaluation, Conversion and Resolution)
AliasViewChild
CleanupAliases
DataSourceAnalysis
ExtractWindowExpressions
FindDataSourceTable
HandleNullInputsForUDF
InConversion
LookupFunctions
PreprocessTableCreation
PreWriteCheck
ResolveAliases
ResolveBroadcastHints
ResolveCoalesceHints
ResolveCreateNamedStruct
ResolveFunctions
ResolveInlineTables
ResolveMissingReferences
ResolveOrdinalInOrderByAndGroupBy
ResolveOutputRelation
ResolveReferences
ResolveRelations
ResolveSQLOnFile
ResolveSubquery
ResolveWindowFrame
ResolveWindowOrder
TimeWindowing
UpdateOuterReferences
WindowFrameCoercion
WidenSetOperationTypes
WindowsSubstitution
Base Logical Optimizations (Optimizer)
CollapseWindow
ColumnPruning
CombineTypedFilters
CombineUnions
ComputeCurrentTime
ConstantFolding
CostBasedJoinReorder
DecimalAggregates
EliminateSerialization
EliminateSubqueryAliases
EliminateView
GetCurrentDatabase
InferFiltersFromConstraints
LimitPushDown
NullPropagation
OptimizeIn
OptimizeSubqueries
PropagateEmptyRelation
PullupCorrelatedPredicates
PushDownPredicate
PushPredicateThroughJoin
ReorderJoin
ReplaceExpressions
ReplaceExceptWithAntiJoin
ReplaceExceptWithFilter
RewriteCorrelatedScalarSubquery
RewriteExceptAll
RewritePredicateSubquery
SimplifyCasts
Extended Logical Optimizations (SparkOptimizer)
ExtractPythonUDFFromAggregate
OptimizeMetadataOnlyQuery
PruneFileSourcePartitions
PushDownOperatorsToDataSource
Execution Planning Strategies
Aggregation
BasicOperators
DataSourceStrategy
DataSourceV2Strategy
FileSourceStrategy
InMemoryScans
JoinSelection
SpecialLimits
Physical Query Optimizations
CollapseCodegenStages
EnsureRequirements
ExtractPythonUDFs
PlanSubqueries
ReuseExchange
ReuseSubquery
Encoders
Encoder — Internal Row Converter
Encoders Factory Object
ExpressionEncoder — Expression-Based Encoder
RowEncoder — Encoder for DataFrames
LocalDateTimeEncoder — Custom ExpressionEncoder for java.time.LocalDateTime
Vectorized Parquet Decoding
VectorizedParquetRecordReader
VectorizedColumnReader
SpecificParquetRecordReaderBase
ColumnVector Contract — In-Memory Columnar Data
WritableColumnVector Contract
OnHeapColumnVector
OffHeapColumnVector
RDDs
ShuffledRowRDD
Monitoring
SQL Tab — Monitoring Structured Queries in web UI
SQLListener Spark Listener
QueryExecutionListener
SQLAppStatusListener Spark Listener
SQLAppStatusPlugin
SQLAppStatusStore
WriteTaskStats
BasicWriteTaskStats
WriteTaskStatsTracker
BasicWriteTaskStatsTracker
WriteJobStatsTracker
BasicWriteJobStatsTracker
Logging
Performance Tuning and Debugging
Spark SQL’s Performance Tuning Tips and Tricks (aka Case Studies)
Number of Partitions for groupBy Aggregation
Debugging Query Execution
Catalyst — Tree Manipulation Framework
Catalyst — Tree Manipulation Framework
TreeNode — Node in Catalyst Tree
QueryPlan — Structured Query Plan
RuleExecutor Contract — Tree Transformation Rule Executor
Catalyst Rule — Named Transformation of TreeNodes
QueryPlanner — Converting Logical Plan to Physical Trees
GenericStrategy
Tungsten Execution Backend
Tungsten Execution Backend (Project Tungsten)
InternalRow — Abstract Binary Row Format
UnsafeRow — Mutable Raw-Memory Unsafe Binary Row Format
AggregationIterator — Generic Iterator of UnsafeRows for Aggregate Physical Operators
ObjectAggregationIterator
SortBasedAggregationIterator
TungstenAggregationIterator — Iterator of UnsafeRows for HashAggregateExec Physical Operator
CatalystSerde
ExternalAppendOnlyUnsafeRowArray — Append-Only Array for UnsafeRows (with Disk Spill Threshold)
UnsafeFixedWidthAggregationMap
SQL Support
SQL Parsing Framework
AbstractSqlParser
AstBuilder
CatalystSqlParser
ParserInterface
SparkSqlAstBuilder
SparkSqlParser
Spark Thrift Server
Thrift JDBC/ODBC Server — Spark Thrift Server (STS)
SparkSQLEnv
Varia / Uncategorized
SQLExecution Helper Object
RDDConversions Helper Object
CatalystTypeConverters Helper Object
StatFunctions Helper Object
SubExprUtils Helper Object
PredicateHelper Scala Trait
DDLUtils Helper Object
SchemaUtils Helper Object
AggUtils Helper Object
ScalaReflection
CreateStruct Function Builder
MultiInstanceRelation
TypeCoercion Object
TypeCoercionRule
ExtractEquiJoinKeys
PhysicalAggregation Operators
PhysicalOperation
HashJoin
HashedRelation
LongHashedRelation
UnsafeHashedRelation
KnownSizeEstimation
SizeEstimator
BroadcastMode
HashedRelationBroadcastMode
IdentityBroadcastMode
PartitioningUtils
HadoopFileLinesReader
CatalogUtils Helper Object
ExternalCatalogUtils
BufferedRowIterator
CompressionCodecs
(obsolete) SQLContext
SupportsReportStatistics
SupportsReportStatistics is…FIXME