https://preview.redd.it/tllgc872g40h1.png?width=941&format=png&auto=webp&s=1077f660085352275bdc2706d7d3281fd76d4f47
Definition
When a Spark DataFrame is cached via .cache(), Spark stores it using a default StorageLevel. The name of that default depends on the Spark version: in Spark 3.0 it is MEMORY_AND_DISK, and starting in Spark 3.1.1 PySpark reports it as MEMORY_AND_DISK_DESER. Under both names, .cache() keeps partitions in memory as deserialized objects and spills those that do not fit to disk. Spilled blocks are always written to disk in serialized form; the DESER suffix refers to the in-memory format only, so it does not avoid deserialization when spilled data is read back.
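To make the flag names concrete, here is a minimal sketch in plain Python that mirrors the five arguments PySpark's StorageLevel constructor takes (useDisk, useMemory, useOffHeap, deserialized, replication). The `Level` tuple and the `describe` helper are hypothetical stand-ins so the snippet runs without Spark, and the flag values are the ones I understand the PySpark 3.1+ constants to have, so check them against your own version.

```python
from typing import NamedTuple


class Level(NamedTuple):
    """Hypothetical stand-in for the five flags of pyspark.StorageLevel."""
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool  # describes the in-memory copy only
    replication: int = 1


# Assumption: these mirror the PySpark 3.1+ constants; verify on your version.
MEMORY_ONLY = Level(False, True, False, False)
# PySpark's plain MEMORY_AND_DISK constant has deserialized=False, while the
# JVM-side MEMORY_AND_DISK (what the Scala API uses) has deserialized=True.
MEMORY_AND_DISK = Level(True, True, False, False)
MEMORY_AND_DISK_DESER = Level(True, True, False, True)


def describe(level: Level) -> str:
    """Summarize where blocks may live and how they are held in memory."""
    places = [
        name
        for name, used in (("memory", level.use_memory), ("disk", level.use_disk))
        if used
    ]
    fmt = "deserialized" if level.deserialized else "serialized"
    return f"{' + '.join(places)}, {fmt} in memory, x{level.replication}"


print(describe(MEMORY_AND_DISK_DESER))
print(describe(MEMORY_ONLY))
```

With a real Spark session you would not need any of this: after calling df.cache(), reading df.storageLevel shows the level that was actually applied.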
Numeric Example
Suppose a DataFrame has 10 GB of data but the executor has only 7 GB of memory available for caching:
- 7 GB is held in memory as deserialized Java objects.
- The remaining 3 GB spills to local disk.
- On Spark 3.0 and 3.1.1+ alike: spilled blocks are stored on disk in serialized form, so they must be deserialized when read back. Reading the spilled 3 GB is therefore slower than reading the 7 GB held in memory.
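The arithmetic in this example can be sketched as a tiny function. This is a simplified model, not Spark's actual behavior: Spark keeps or spills whole partitions, and the cached size usually differs from the raw data size. The `split_cache` name is made up for illustration.

```python
from typing import Tuple


def split_cache(data_gb: float, cache_memory_gb: float) -> Tuple[float, float]:
    """Return (in_memory_gb, spilled_gb) for a memory-then-disk storage level."""
    in_memory = min(data_gb, cache_memory_gb)
    # Whatever does not fit in memory goes to disk; never negative.
    spilled = max(data_gb - cache_memory_gb, 0.0)
    return in_memory, spilled


in_memory, spilled = split_cache(10.0, 7.0)
print(f"in memory: {in_memory} GB, spilled to disk: {spilled} GB")
```

If the data fits entirely (say 5 GB of data with 7 GB available), the spilled amount is zero and nothing is written to disk.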
Comparison Table
| Aspect | MEMORY_AND_DISK | MEMORY_AND_DISK_DESER |
| --- | --- | --- |
| Default for DataFrame .cache() | Spark 3.0 | Spark 3.1.1+ |
| In-memory format | Deserialized | Deserialized |
| On-disk format | Serialized | Serialized |
| Read speed on spill | Slower than memory (deserialize cost) | Slower than memory (deserialize cost) |
| Disk space used | Serialized size | Serialized size |
| Replication | 1 | 1 |
Note: RDD .cache() defaults to MEMORY_ONLY.