// Demo the different cases when `HadoopFsRelation` is created
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
// Example 1: spark.table for DataSource tables (provider != hive)
import org.apache.spark.sql.catalyst.TableIdentifier
val t1ID = TableIdentifier(tableName = "t1")
spark.sessionState.catalog.dropTable(name = t1ID, ignoreIfNotExists = true, purge = true)
spark.range(5).write.saveAsTable("t1")
val metadata = spark.sessionState.catalog.getTableMetadata(t1ID)
scala> println(metadata.provider.get)
parquet
assert(metadata.provider.get != "hive")
val q = spark.table("t1")
// Avoid dealing with UnresolvedRelations and SubqueryAliases
// Hence going straight for optimizedPlan
val plan1 = q.queryExecution.optimizedPlan
scala> println(plan1.numberedTreeString)
00 Relation[id#7L] parquet
val LogicalRelation(rel1, _, _, _) = plan1.asInstanceOf[LogicalRelation]
val hadoopFsRel = rel1.asInstanceOf[HadoopFsRelation]
// Example 2: spark.read with format as a `FileFormat`
val q = spark.read.text("README.md")
val plan2 = q.queryExecution.logical
scala> println(plan2.numberedTreeString)
00 Relation[value#2] text
val LogicalRelation(relation, _, _, _) = plan2.asInstanceOf[LogicalRelation]
val hadoopFsRel = relation.asInstanceOf[HadoopFsRelation]
// Example 3: Bucketing specified
val tableName = "bucketed_4_id"
spark
.range(100000000)
.write
.bucketBy(4, "id")
.sortBy("id")
.mode("overwrite")
.saveAsTable(tableName)
val q = spark.table(tableName)
// Avoid dealing with UnresolvedRelations and SubqueryAliases
// Hence going straight for optimizedPlan
val plan3 = q.queryExecution.optimizedPlan
scala> println(plan3.numberedTreeString)
00 Relation[id#52L] parquet
val LogicalRelation(rel3, _, _, _) = plan3.asInstanceOf[LogicalRelation]
val hadoopFsRel = rel3.asInstanceOf[HadoopFsRelation]
val bucketSpec = hadoopFsRel.bucketSpec.get
// Exercise 4: spark.table for Hive tables (provider == hive)
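// A possible sketch for the exercise, assuming a Hive-enabled SparkSession (enableHiveSupport)
// and the default spark.sql.hive.convertMetastoreParquet=true. For a parquet-backed Hive table,
// the RelationConversions rule replaces the HiveTableRelation with a LogicalRelation over a
// HadoopFsRelation. The table name h1 is just an illustrative choice.
spark.sql("CREATE TABLE IF NOT EXISTS h1 (id BIGINT) STORED AS PARQUET")
val hivePlan = spark.table("h1").queryExecution.optimizedPlan
val LogicalRelation(hiveRel, _, _, _) = hivePlan.asInstanceOf[LogicalRelation]
val hadoopFsRelForHive = hiveRel.asInstanceOf[HadoopFsRelation]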
HadoopFsRelation — Relation Of File-Based Data Sources
HadoopFsRelation is a BaseRelation and FileRelation.

HadoopFsRelation is created when:
- DataSource is requested to resolve a relation for file-based data sources (see the sketch after this list)
- HiveMetastoreCatalog is requested to convert a HiveTableRelation to a LogicalRelation over a HadoopFsRelation (for the RelationConversions post-hoc logical evaluation rule for parquet or native and hive ORC formats)
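The first case can be reproduced directly with DataSource, the internal class behind spark.read. The following is a minimal sketch only: DataSource is internal API and its constructor may differ slightly across Spark versions, and README.md is simply reused from the demo above.

import org.apache.spark.sql.execution.datasources.{DataSource, HadoopFsRelation}
val ds = DataSource(
  sparkSession = spark,
  className = "text",        // resolves to a FileFormat (TextFileFormat)
  paths = Seq("README.md"))
// For file-based data sources, resolveRelation gives a HadoopFsRelation
val fileBasedRel = ds.resolveRelation().asInstanceOf[HadoopFsRelation]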
The optional bucketing specification is defined exclusively for non-streaming file-based data sources and is used for the following:

- Output partitioning scheme and output data ordering of the corresponding FileSourceScanExec physical operator (as the sketch after this list demonstrates)
- DataSourceAnalysis post-hoc logical resolution rule (when executed on an InsertIntoTable logical operator over a LogicalRelation with a HadoopFsRelation)
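For the first bullet, the bucketed table from Example 3 in the demo above makes the effect visible: with bucketing enabled (the default, per spark.sql.sources.bucketing.enabled), FileSourceScanExec reports a hash-based output partitioning over the bucketing column. A minimal sketch:

import org.apache.spark.sql.execution.FileSourceScanExec
val scan = spark.table("bucketed_4_id")
  .queryExecution
  .executedPlan
  .collectFirst { case s: FileSourceScanExec => s }
  .get
// Expect a HashPartitioning over the bucketing column (id) with 4 partitions
println(scan.outputPartitioning)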
Creating HadoopFsRelation Instance
HadoopFsRelation takes the following to be created (all accessible afterwards, as the snippet below shows):

- FileIndex (for sizeInBytes and inputFiles)
- Partition schema
- Data schema
- Bucketing specification (optional)
- FileFormat
- Options
- SparkSession

HadoopFsRelation initializes the internal properties.
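With any of the hadoopFsRel values from the demo above, these constituents can be inspected directly as case-class fields. A quick sketch using the relation from Example 3:

hadoopFsRel.location         // FileIndex
hadoopFsRel.partitionSchema  // partition schema (empty for the demo tables)
hadoopFsRel.dataSchema       // data schema
hadoopFsRel.bucketSpec       // Option[BucketSpec]
hadoopFsRel.fileFormat       // FileFormat, e.g. ParquetFileFormat
hadoopFsRel.options          // Map[String, String]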
Files to Scan — inputFiles Method

inputFiles: Array[String]

Note: inputFiles is part of the FileRelation Contract for the list of files to read for scanning this relation.

inputFiles simply requests the FileIndex for the inputFiles.
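Continuing the demo, inputFiles lists the files backing the relation; the same listing is also available through the public Dataset API:

hadoopFsRel.inputFiles.foreach(println)
// The same files through the public Dataset API
spark.table("bucketed_4_id").inputFiles.foreach(println)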
sizeInBytes Method

sizeInBytes: Long

Note: sizeInBytes is part of the BaseRelation Contract for the estimated size of a relation (in bytes).

sizeInBytes…FIXME
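A quick way to see the estimate at work, continuing the demo. The exact formula is version-dependent, but the estimate is derived from the FileIndex and feeds the statistics of the enclosing LogicalRelation:

println(hadoopFsRel.sizeInBytes)
// The FileIndex estimate it is derived from
println(hadoopFsRel.location.sizeInBytes)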
Human-Friendly Textual Representation — toString Method

toString: String

Note: toString is part of the java.lang.Object contract for the string representation of the object.

toString is the following text based on the FileFormat:

- shortName for DataSourceRegister data sources
- HadoopFiles otherwise
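For the relations built in the demo, the file formats (ParquetFileFormat, TextFileFormat) are DataSourceRegister implementations, so toString is their short name:

// hadoopFsRel from Example 3 is backed by ParquetFileFormat (a DataSourceRegister)
println(hadoopFsRel)   // parquet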