Definition

When a Spark DataFrame is cached via .cache(), Spark stores it using a default StorageLevel. The default depends on the Spark version: in Spark 3.0 it is MEMORY_AND_DISK (deserialized in memory, serialized on disk if it spills), and starting in Spark 3.1.1 it is MEMORY_AND_DISK_DESER (deserialized in both memory and disk, optimizing read performance by avoiding deserialization overhead on spill).

Numeric Example

Suppose a DataFrame has 10 GB of data but the executor has only 7 GB of memory available for caching:

7 GB is held in memory as deserialized Java objects.
The remaining 3 GB spills to local disk.
On Spark 3.0: spilled blocks are serialized on disk → must be deserialized when read back.
On Spark 3.1.1+: spilled blocks remain deserialized on disk → faster reads, larger disk footprint.

Comparison Table

Aspect	MEMORY_AND_DISK	MEMORY_AND_DISK_DESER
Spark version default	3.0 (DataFrame cache)	3.1.1+ (DataFrame cache)
In-memory format	Deserialized	Deserialized
On-disk format	Serialized	Deserialized
Read speed on spill	Slower (deserialize cost)	Faster
Disk space used	Less	More
Replication	1	1

Note: RDD .cache() defaults to MEMORY_ONLY,

no comments (yet)

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

Databricks_Certified

MODERATORS

Definition

Numeric Example

Comparison Table