PartitioningAwareFileIndex

PartitioningAwareFileIndex is an extension of the FileIndex contract for indices that are aware of partitioned tables.

Table 1. PartitioningAwareFileIndex Contract (Abstract Methods Only)
Method Description

leafDirToChildrenFiles

leafDirToChildrenFiles: Map[Path, Array[FileStatus]]

Used when PartitioningAwareFileIndex is requested to listFiles, allFiles, and inferPartitioning

leafFiles

leafFiles: LinkedHashMap[Path, FileStatus]

Used when PartitioningAwareFileIndex is requested for allFiles and basePaths

partitionSpec

partitionSpec(): PartitionSpec

Partition specification (partition columns, their directories as Hadoop Paths and partition values)

Used when PartitioningAwareFileIndex is requested for the partitionSchema, to listFiles, and for allFiles
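
The relationship between leafFiles and leafDirToChildrenFiles can be illustrated with a minimal plain-Scala sketch. It assumes that an implementation derives the directory map by grouping the leaf files on their parent directory; the contract itself does not mandate this. FileEntry is a hypothetical stand-in for Hadoop's FileStatus, and plain strings stand in for Hadoop Paths.

```scala
import scala.collection.mutable.LinkedHashMap

// Hypothetical stand-in for Hadoop's FileStatus (path and length only)
final case class FileEntry(path: String, length: Long) {
  // Parent directory: everything before the last '/'
  def parent: String = path.substring(0, path.lastIndexOf('/'))
}

// leafFiles: every leaf file keyed by its path, in listing order
val leafFiles: LinkedHashMap[String, FileEntry] = LinkedHashMap(
  Seq(
    FileEntry("/data/part=a/f1.parquet", 10L),
    FileEntry("/data/part=a/f2.parquet", 20L),
    FileEntry("/data/part=b/f3.parquet", 30L)
  ).map(f => f.path -> f): _*
)

// leafDirToChildrenFiles: the same files grouped by their parent directory
// (assumption: grouping on the parent is how the map is derived)
val leafDirToChildrenFiles: Map[String, Array[FileEntry]] =
  leafFiles.values.toArray.groupBy(_.parent)
```

In this sketch each key of leafDirToChildrenFiles is a partition directory such as /data/part=a, which is the shape that partition inference works from.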

Table 2. PartitioningAwareFileIndexes (Direct Implementations and Extensions Only)
PartitioningAwareFileIndex Description

InMemoryFileIndex

MetadataLogFileIndex

Part of Spark Structured Streaming

Creating PartitioningAwareFileIndex Instance

PartitioningAwareFileIndex takes the following to be created:

  • SparkSession

  • Options for partition discovery

  • Optional user-defined schema

  • FileStatusCache (default: NoopCache)

PartitioningAwareFileIndex initializes the internal properties.

Note
PartitioningAwareFileIndex is an abstract class and cannot be created directly. It is created indirectly when one of the concrete PartitioningAwareFileIndexes is instantiated.
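
As an illustration, a concrete index can be created through InMemoryFileIndex. The following is a sketch only: it assumes Spark is on the classpath (for example in spark-shell), and it relies on the InMemoryFileIndex constructor, which is an internal API whose parameter list differs between Spark versions. The root paths argument belongs to InMemoryFileIndex itself, not to the list above. The temporary directory, the application name and the column names id and part are made up for the demo.

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("file-index-demo") // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Write a small dataset partitioned by the (made-up) column "part"
val dir = Files.createTempDirectory("pafi-demo").toString
Seq((1, "a"), (2, "b")).toDF("id", "part")
  .write.mode("overwrite").partitionBy("part").parquet(dir)

// SparkSession, root paths, options for partition discovery,
// no user-defined schema; FileStatusCache is left at its default
val index = new InMemoryFileIndex(
  spark,
  Seq(new Path(dir)),
  Map.empty[String, String],
  None)

println(index.partitionSchema)   // partition columns inferred from the directory names
println(index.inputFiles.length) // number of data files found
println(index.sizeInBytes)       // total size of those files
```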

listFiles Method

listFiles(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): Seq[PartitionDirectory]

Note
listFiles is part of the FileIndex contract.

listFiles…​FIXME

partitionSchema Method

partitionSchema: StructType

Note
partitionSchema is part of the FileIndex contract.

partitionSchema simply returns the partition columns (as a StructType) of the partition specification.

inputFiles Method

inputFiles: Array[String]

Note
inputFiles is part of the FileIndex contract.

inputFiles simply returns the location of all the files.

sizeInBytes Method

sizeInBytes: Long

Note
sizeInBytes is part of the FileIndex contract.

sizeInBytes simply sums up the length (in bytes) of all the files.
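
Both inputFiles and sizeInBytes are simple projections over the result of allFiles. A minimal plain-Scala sketch of that idea follows; FileEntry is a hypothetical stand-in for Hadoop's FileStatus, and the sample paths and lengths are invented.

```scala
// Hypothetical stand-in for Hadoop's FileStatus (location and length only)
final case class FileEntry(path: String, length: Long)

// Stand-in for the result of allFiles()
val allFiles: Seq[FileEntry] = Seq(
  FileEntry("file:/data/part=a/f1.parquet", 10L),
  FileEntry("file:/data/part=b/f2.parquet", 32L)
)

// inputFiles: the location of every file
val inputFiles: Array[String] = allFiles.map(_.path).toArray

// sizeInBytes: the sum of the file lengths
val sizeInBytes: Long = allFiles.map(_.length).sum
```

Because both values are computed from allFiles, they always describe the same set of files.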

allFiles Method

allFiles(): Seq[FileStatus]

allFiles…​FIXME

Note

allFiles is used when:

inferPartitioning Method

inferPartitioning(): PartitionSpec

inferPartitioning…​FIXME

Note
inferPartitioning is used when InMemoryFileIndex and Spark Structured Streaming’s MetadataLogFileIndex are requested for the partitionSpec.

basePaths Internal Method

basePaths: Set[Path]

basePaths…​FIXME

Note
basePaths is used when PartitioningAwareFileIndex is requested to inferPartitioning.

Internal Properties

Name Description

hadoopConf

Hadoop Configuration
