CacheManager — In-Memory Cache for Tables and Views

CacheManager is an in-memory cache (registry) for structured queries (by their logical plans).

CacheManager is shared across SparkSessions through SharedState.

val spark: SparkSession = ...
spark.sharedState.cacheManager
A Spark developer can use CacheManager to cache Datasets using cache or persist operators.
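The following is a minimal sketch of how the cache and persist operators interact with CacheManager. It assumes a Spark 2.x-era API in which spark.sharedState and its cacheManager are accessible from user code; the local master, the application name and the CacheOperatorsDemo object are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheOperatorsDemo {
  // A local session for experimentation only (illustrative settings).
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("CacheOperatorsDemo")
    .getOrCreate()

  // The CacheManager shared across SparkSessions through SharedState.
  val cacheManager = spark.sharedState.cacheManager

  val ids = spark.range(5)

  // Nothing has been cached for this query yet.
  val cachedBefore: Boolean = cacheManager.lookupCachedData(ids).isDefined

  // persist registers the query with CacheManager; cache() is the
  // variant that uses the default storage level.
  ids.persist(StorageLevel.MEMORY_ONLY)
  val cachedAfter: Boolean = cacheManager.lookupCachedData(ids).isDefined

  // unpersist removes the entry again.
  ids.unpersist()
  val cachedAtEnd: Boolean = cacheManager.lookupCachedData(ids).isDefined
}
```

The registry entry is created when persist is called, while the data itself is only materialized once an action runs on the Dataset.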

CacheManager uses the cachedData internal registry to manage cached structured queries and their InMemoryRelation cached representation.

CacheManager can be empty.

CacheManager uses the CachedData data structure to manage cached structured queries. Each CachedData pairs the LogicalPlan of a structured query with the corresponding InMemoryRelation leaf logical operator.


Enable ALL logging level for org.apache.spark.sql.execution.CacheManager logger to see what happens inside.

Add the following line to conf/

Refer to Logging.

Cached Structured Queries — cachedData Internal Registry

cachedData: LinkedList[CachedData]

cachedData is a collection of CachedData.

A new CachedData is added when CacheManager is requested to cacheQuery or recacheByCondition.

A CachedData is removed when CacheManager is requested to uncacheQuery or recacheByCondition.

All CachedData are removed (cleared) when CacheManager is requested to clearCache.

lookupCachedData Method

lookupCachedData(query: Dataset[_]): Option[CachedData]
lookupCachedData(plan: LogicalPlan): Option[CachedData]
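
As a rough illustration of the lookup, the sketch below models the registry with plain Scala stand-ins. ToyPlan, ToyRelation, ToyCachedData and LookupSketch are hypothetical names, a ListBuffer replaces the LinkedList, and plain equality stands in for whatever plan comparison Spark performs. It is a simplified model, not Spark's implementation.

```scala
import scala.collection.mutable.ListBuffer

object LookupSketch {
  // Hypothetical stand-ins for LogicalPlan, InMemoryRelation and CachedData.
  final case class ToyPlan(description: String)
  final case class ToyRelation(cachedPlan: ToyPlan)
  final case class ToyCachedData(plan: ToyPlan, cachedRepresentation: ToyRelation)

  // Stand-in for the cachedData internal registry.
  val cachedData: ListBuffer[ToyCachedData] = ListBuffer.empty

  // Returns the first entry registered for an equal plan, if there is one.
  def lookupCachedData(plan: ToyPlan): Option[ToyCachedData] =
    cachedData.find(_.plan == plan)
}
```

Returning an Option mirrors the signatures above: a caller can tell "not cached" apart from a cached entry without any null checks.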



lookupCachedData is used when:

Un-caching Dataset — uncacheQuery Method

uncacheQuery(
  query: Dataset[_],
  cascade: Boolean,
  blocking: Boolean = true): Unit
uncacheQuery(
  spark: SparkSession,
  plan: LogicalPlan,
  cascade: Boolean,
  blocking: Boolean): Unit



uncacheQuery is used when:

isEmpty Method

isEmpty: Boolean

isEmpty simply says whether there are any CachedData entries in the cachedData internal registry.

Caching Dataset — cacheQuery Method

cacheQuery(
  query: Dataset[_],
  tableName: Option[String] = None,
  storageLevel: StorageLevel = MEMORY_AND_DISK): Unit

cacheQuery adds the analyzed logical plan of the input Dataset to the cachedData internal registry of cached queries.

Internally, cacheQuery requests the Dataset for the analyzed logical plan and creates an InMemoryRelation for it, passing along the input storageLevel and tableName.

cacheQuery then creates a CachedData (for the analyzed query plan and the InMemoryRelation) and adds it to the cachedData internal registry.

If the input query has already been cached, cacheQuery prints the following WARN message to the logs and does nothing else:

Asked to cache already cached data.
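
The control flow described above can be sketched with plain Scala stand-ins. CacheQuerySketch and the Toy types are hypothetical, the storage level is modelled as a string, and the warnings buffer replaces the real logger so the behaviour is easy to inspect. This is a simplified model under those assumptions, not Spark's code.

```scala
import scala.collection.mutable.ListBuffer

object CacheQuerySketch {
  // Hypothetical stand-ins for LogicalPlan, InMemoryRelation and CachedData.
  final case class ToyPlan(description: String)
  final case class ToyRelation(cachedPlan: ToyPlan, storageLevel: String)
  final case class ToyCachedData(plan: ToyPlan, cachedRepresentation: ToyRelation)

  // Stand-in for the cachedData internal registry.
  val cachedData: ListBuffer[ToyCachedData] = ListBuffer.empty

  // Collected WARN messages (a stand-in for the logger).
  val warnings: ListBuffer[String] = ListBuffer.empty

  def cacheQuery(plan: ToyPlan, storageLevel: String = "MEMORY_AND_DISK"): Unit =
    if (cachedData.exists(_.plan == plan)) {
      // Already cached: warn and leave the registry untouched.
      warnings += "Asked to cache already cached data."
    } else {
      // Not cached yet: build the cached representation and register it.
      cachedData += ToyCachedData(plan, ToyRelation(plan, storageLevel))
    }
}
```

Calling cacheQuery twice with the same plan leaves a single entry in the registry and records one warning.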

cacheQuery is used when:

Removing All Cached Logical Plans — clearCache Method

clearCache(): Unit

clearCache takes every CachedData from the cachedData internal registry, requests it for the InMemoryRelation to access the CachedRDDBuilder, and requests the CachedRDDBuilder to clearCache.

In the end, clearCache removes all CachedData entries from the cachedData internal registry.
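
The two steps of clearCache can be sketched as follows. ClearCacheSketch, ToyBuilder, ToyRelation and ToyCachedData are hypothetical stand-ins, and the builder only records that it was asked to clear itself. It is a simplified model, not Spark's implementation.

```scala
import scala.collection.mutable.ListBuffer

object ClearCacheSketch {
  // Hypothetical stand-in for CachedRDDBuilder: remembers whether it was cleared.
  final class ToyBuilder {
    var cleared: Boolean = false
    def clearCache(): Unit = { cleared = true }
  }

  // Hypothetical stand-ins for InMemoryRelation and CachedData.
  final case class ToyRelation(cacheBuilder: ToyBuilder)
  final case class ToyCachedData(planDescription: String, cachedRepresentation: ToyRelation)

  // Stand-in for the cachedData internal registry.
  val cachedData: ListBuffer[ToyCachedData] = ListBuffer.empty

  def clearCache(): Unit = {
    // Step 1: ask every cached representation's builder to drop its data.
    cachedData.foreach(_.cachedRepresentation.cacheBuilder.clearCache())
    // Step 2: remove all entries from the registry.
    cachedData.clear()
  }
}
```

Clearing the builders before emptying the registry matters in this model: once the entries are gone there is no remaining handle through which to release the cached data.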

clearCache is used exclusively when CatalogImpl is requested to clear the cache.

Re-Caching Structured Query — recacheByCondition Internal Method

recacheByCondition(spark: SparkSession, condition: LogicalPlan => Boolean): Unit


recacheByCondition is used when CacheManager is requested to uncache a structured query, recacheByPlan, and recacheByPath.

recacheByPlan Method

recacheByPlan(spark: SparkSession, plan: LogicalPlan): Unit


recacheByPlan is used exclusively when InsertIntoDataSourceCommand logical command is executed.

recacheByPath Method

recacheByPath(spark: SparkSession, resourcePath: String): Unit


recacheByPath is used exclusively when CatalogImpl is requested to refreshByPath.

Replacing Segments of Logical Query Plan With Cached Data — useCachedData Method

useCachedData(plan: LogicalPlan): LogicalPlan


useCachedData is used exclusively when QueryExecution is requested for a cached logical query plan.
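
To make the idea of replacing plan segments concrete, here is a minimal sketch over a toy plan tree. UseCachedDataSketch, Scan, Filter, Join and CachedRelation are hypothetical stand-ins (CachedRelation plays the role of InMemoryRelation), and a Set of cached plans replaces the cachedData registry. It illustrates a top-down replacement under those assumptions and is not Spark's implementation.

```scala
object UseCachedDataSketch {
  // A toy logical plan tree.
  sealed trait ToyPlan
  final case class Scan(table: String) extends ToyPlan
  final case class Filter(condition: String, child: ToyPlan) extends ToyPlan
  final case class Join(left: ToyPlan, right: ToyPlan) extends ToyPlan

  // Stand-in for InMemoryRelation: a leaf that remembers the plan it caches.
  final case class CachedRelation(cachedPlan: ToyPlan) extends ToyPlan

  // Walks the plan top-down and swaps every cached fragment for its cached leaf.
  def useCachedData(plan: ToyPlan, cached: Set[ToyPlan]): ToyPlan =
    if (cached.contains(plan)) CachedRelation(plan)
    else plan match {
      case Filter(condition, child) => Filter(condition, useCachedData(child, cached))
      case Join(left, right)        => Join(useCachedData(left, cached), useCachedData(right, cached))
      case leaf                     => leaf // Scan or CachedRelation: nothing to replace
    }
}
```

Because the check happens before descending into children, the largest matching fragment is replaced first and its subtree is not inspected again.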

lookupAndRefresh Internal Method

lookupAndRefresh(
  plan: LogicalPlan,
  fs: FileSystem,
  qualifiedPath: Path): Boolean


lookupAndRefresh is used exclusively when CacheManager is requested to recacheByPath.
