RDD

RDD is a description of a distributed computation over dataset of records of type T.

RDD is identified by a unique identifier (aka RDD ID) that is unique among all RDDs in a SparkContext.

id: Int

RDD has a storage level that…​FIXME

storageLevel: StorageLevel

The storage level of an RDD is StorageLevel.NONE by default which is…​FIXME

Getting Or Computing RDD Partition — getOrCompute Method

getOrCompute(partition: Partition, context: TaskContext): Iterator[T]

getOrCompute creates a RDDBlockId for the RDD id and the partition index.

getOrCompute requests the BlockManager to getOrElseUpdate for the block ID (with the storage level and the makeIterator function).

Note
getOrCompute uses SparkEnv to access the current BlockManager.

getOrCompute records whether…​FIXME (readCachedBlock)

getOrCompute branches off per the response from the BlockManager and whether the internal readCachedBlock flag is now on or still off. In either case, getOrCompute creates an InterruptibleIterator.

Note
InterruptibleIterator simply delegates to a wrapped internal Iterator, but allows for task killing functionality.

For a BlockResult available and readCachedBlock flag on, getOrCompute…​FIXME

For a BlockResult available and readCachedBlock flag off, getOrCompute…​FIXME

Note
The BlockResult could be found in a local block manager or fetched from a remote block manager. It may also have been stored (persisted) just now. In either case, the BlockResult is available (and BlockManager.getOrElseUpdate gives a Left value with the BlockResult).

For Right(iter) (regardless of the value of readCachedBlock flag since…​FIXME), getOrCompute…​FIXME

Note
BlockManager.getOrElseUpdate gives a Right(iter) value to indicate an error with a block.
Note
getOrCompute is used on Spark executors.
Note
getOrCompute is a private[spark] method that is exclusively used when iterating over partition when a RDD is cached.

Computing Partition (in TaskContext) — compute Method

compute(split: Partition, context: TaskContext): Iterator[T]

The abstract compute method computes the input split partition in the TaskContext to produce a collection of values (of type T).

compute is implemented by any type of RDD in Spark and is called every time the records are requested unless RDD is cached or checkpointed (and the records can be read from an external storage, but this time closer to the compute node).

When an RDD is cached, for specified storage levels (i.e. all but NONE) CacheManager is requested to get or compute partitions.

Note
compute method runs on the driver.

results matching ""

    No results matching ""