HadoopTableReader

HadoopTableReader is a TableReader that creates a HadoopRDD for scanning partitioned or unpartitioned tables stored in Hadoop.

HadoopTableReader is used by the HiveTableScanExec physical operator when it is requested to execute.

Creating HadoopTableReader Instance

HadoopTableReader takes the following to be created:

Attributes
Partition Keys (Seq[Attribute])
Hive TableDesc
SparkSession
Hadoop Configuration

HadoopTableReader initializes the internal properties.

`makeRDDForTable` Method

makeRDDForTable(
  hiveTable: HiveTable): RDD[InternalRow]

Note	`makeRDDForTable` is part of the TableReader contract to…FIXME.

makeRDDForTable simply calls the private makeRDDForTable with…FIXME

`makeRDDForTable` Method

makeRDDForTable(
  hiveTable: HiveTable,
  deserializerClass: Class[_ <: Deserializer],
  filterOpt: Option[PathFilter]): RDD[InternalRow]

makeRDDForTable…FIXME

Note	`makeRDDForTable` is used when…FIXME

`makeRDDForPartitionedTable` Method

makeRDDForPartitionedTable(
  partitions: Seq[HivePartition]): RDD[InternalRow]

Note	`makeRDDForPartitionedTable` is part of the TableReader contract to…FIXME.

makeRDDForPartitionedTable simply calls the private makeRDDForPartitionedTable with…FIXME

`makeRDDForPartitionedTable` Method

makeRDDForPartitionedTable(
  partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]],
  filterOpt: Option[PathFilter]): RDD[InternalRow]

makeRDDForPartitionedTable…FIXME

Note	`makeRDDForPartitionedTable` is used when…FIXME

Creating HadoopRDD — `createHadoopRdd` Internal Method

createHadoopRdd(
  tableDesc: TableDesc,
  path: String,
  inputFormatClass: Class[InputFormat[Writable, Writable]]): RDD[Writable]

createHadoopRdd applies initializeLocalJobConfFunc to the input path and tableDesc, which leaves a function that still expects a JobConf.

createHadoopRdd creates a HadoopRDD (with the broadcast Hadoop Configuration, the input inputFormatClass, and the minimum number of partitions) and maps over it to keep only the values.

Note	`createHadoopRdd` adds a `HadoopRDD` and a `MapPartitionsRDD` to an RDD lineage.

Note	`createHadoopRdd` is used when `HadoopTableReader` is requested to makeRDDForTable and makeRDDForPartitionedTable.
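The two-step lineage described above can be reproduced outside of HadoopTableReader with Spark Core's public API. The following is a minimal sketch, not the code of createHadoopRdd itself: it assumes a local SparkContext, uses `SparkContext.hadoopRDD` with `TextInputFormat` instead of a Hive input format, and reads a throwaway temporary file. The object and method names (`HadoopRddLineageSketch`, `lineageOfValuesRdd`) are made up for illustration.

```scala
import java.nio.file.Files

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}

object HadoopRddLineageSketch {

  // Builds a HadoopRDD over a small temporary text file, keeps only the values,
  // and returns the textual lineage of the resulting RDD.
  def lineageOfValuesRdd(sc: SparkContext): String = {
    // Assumption: a throwaway local file, so that input splits can be computed
    val input = Files.createTempFile("hadoop-rdd-sketch", ".txt")
    Files.write(input, "first\nsecond\n".getBytes("UTF-8"))

    // The input path is registered on the JobConf before the RDD is created
    val jobConf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, new Path(input.toUri))

    // Step 1: a HadoopRDD of (key, value) pairs
    val keyValues = sc.hadoopRDD(
      jobConf,
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      minPartitions = 0)

    // Step 2: a MapPartitionsRDD that drops the keys
    val values = keyValues.map(_._2)
    values.toDebugString
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[1]").setAppName("hadoop-rdd-lineage-sketch")
    val sc = new SparkContext(conf)
    try println(lineageOfValuesRdd(sc))
    finally sc.stop()
  }
}
```

The printed lineage should list a `MapPartitionsRDD` on top of a `HadoopRDD`, which is the same shape the note above describes.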

`initializeLocalJobConfFunc` Utility

initializeLocalJobConfFunc(
  path: String,
  tableDesc: TableDesc)(
    jobConf: JobConf): Unit

initializeLocalJobConfFunc…FIXME

Note	`initializeLocalJobConfFunc` is used when `HadoopTableReader` is requested to create a HadoopRDD.
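The signature above has two parameter lists, so supplying only the first one (path and tableDesc) yields a function that still expects a JobConf. The sketch below illustrates just that shape in plain Scala. `TableDesc` and `JobConf` here are simplified stand-ins, not the Hive and Hadoop classes, and the body of the function is a placeholder with made-up keys: what the real initializeLocalJobConfFunc writes to the JobConf is not covered here.

```scala
import scala.collection.mutable

object CurriedInitializerSketch {

  // Stand-ins (assumptions) for Hive's TableDesc and Hadoop's JobConf
  final case class TableDesc(tableName: String)
  final class JobConf {
    val settings: mutable.Map[String, String] = mutable.Map.empty
  }

  // Same two-parameter-list shape as the signature above.
  // The body is a placeholder only; the keys are hypothetical.
  def initializeLocalJobConfFunc(
      path: String,
      tableDesc: TableDesc)(
      jobConf: JobConf): Unit = {
    jobConf.settings("sketch.input.path") = path
    jobConf.settings("sketch.table.name") = tableDesc.tableName
  }

  def main(args: Array[String]): Unit = {
    // Supplying only the first parameter list leaves a JobConf => Unit function
    val initialize: JobConf => Unit =
      initializeLocalJobConfFunc("/warehouse/t1", TableDesc("t1"))

    // The function can be applied later, once a JobConf is available
    val jobConf = new JobConf
    initialize(jobConf)
    println(jobConf.settings)
  }
}
```

Deferring the JobConf argument this way lets the path and table description be fixed up front while the configuration object is supplied at a later point.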

Internal Properties

_broadcastedHadoopConf

Hadoop Configuration broadcast to executors

_minSplitsPerRDD

Minimum number of partitions for a HadoopRDD:

0 for local mode
Otherwise, the greater of Hadoop’s mapreduce.job.maps (default: 1) and Spark Core’s default minimum number of partitions for Hadoop RDDs (not higher than 2)
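The rule for _minSplitsPerRDD can be written down as a small pure function. This is a sketch under stated assumptions, not the property's actual definition: the helper name `minSplitsPerRDD` and its parameters are hypothetical, and Spark Core's default minimum number of partitions is modelled as the default parallelism capped at 2.

```scala
object MinSplitsSketch {

  // Hypothetical helper that mirrors the rule described above.
  //   isLocal            whether Spark runs in local mode
  //   mapreduceJobMaps   value of Hadoop's mapreduce.job.maps (default: 1)
  //   defaultParallelism Spark's default parallelism (assumed input)
  def minSplitsPerRDD(
      isLocal: Boolean,
      mapreduceJobMaps: Int = 1,
      defaultParallelism: Int): Int = {
    // Default minimum number of partitions, never higher than 2
    val defaultMinPartitions = math.min(defaultParallelism, 2)
    if (isLocal) 0
    else math.max(mapreduceJobMaps, defaultMinPartitions)
  }

  def main(args: Array[String]): Unit = {
    println(minSplitsPerRDD(isLocal = true, defaultParallelism = 8))
    println(minSplitsPerRDD(isLocal = false, defaultParallelism = 8))
    println(minSplitsPerRDD(isLocal = false, mapreduceJobMaps = 10, defaultParallelism = 8))
  }
}
```

With these inputs the three calls give 0, 2 and 10: local mode always yields 0, and outside local mode mapreduce.job.maps only matters once it exceeds the capped default.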
