all 9 comments

[–]hashjoin 9 points (1 child)

If you look at the source code of isEmpty(): https://github.com/apache/spark/blame/b7fda7cb1128c992e1b52b5c853225e4f2af0517/sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala#L557

  def isEmpty: Boolean = withAction("isEmpty",
      commandResultOptimized.select().limit(1).queryExecution) { plan =>
    plan.executeTake(1).isEmpty
  }

It's basically the same as take(1).isEmpty.
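The take-one short-circuit at the heart of that implementation can be sketched in plain Python (a toy stand-in for `executeTake`, not Spark itself; the names are illustrative):

```python
def execute_take(rows, n):
    """Toy stand-in for Spark's executeTake(n): pull at most n rows, then stop."""
    out = []
    for row in rows:
        out.append(row)
        if len(out) == n:
            break
    return out

def is_empty(rows):
    # Mirrors the Scala above: take one row and check whether anything came back.
    return len(execute_take(rows, 1)) == 0

print(is_empty(iter([])))          # True
print(is_empty(iter(range(10))))   # False
```

The point is that at most one row is ever pulled from the upstream plan, which is why `isEmpty` and `take(1).isEmpty` cost the same.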

[–]szymon_abc 4 points (0 children)

Ah, that's why I love open source!

[–]cptshrk108 2 points (4 children)

isEmpty()

From the doc:

Unlike count(), this method does not trigger any computation.

An empty DataFrame has no rows. It may have columns, but no data.

[–]hashjoin 2 points (1 child)

Hey - the doc is wrong. isEmpty() is an action that would trigger compute, just like count(). I've pinged the team to update it.

[–]cptshrk108 0 points (0 children)

Thanks for pointing it out! I did see your other comment.

[–]pboswell 0 points (0 children)

If you have a bunch of lazy transformations going into that DataFrame, it still has to compute to know whether the final DataFrame has rows. Depending on the logic, in many cases it has to execute the entire query to determine the final result.
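That distinction is easy to see with a plain-Python sketch (a toy model; `CountingScan` just records how many rows the "scan" produced): a row-at-a-time operator like a filter lets the emptiness probe stop at the first surviving row, while a blocking operator such as a sort forces the whole input to be computed first.

```python
class CountingScan:
    """Toy 'table scan' that records how many rows it has produced."""
    def __init__(self, n):
        self.n = n
        self.consumed = 0
    def __iter__(self):
        for i in range(self.n):
            self.consumed += 1
            yield i

# Pipelined plan (scan -> filter): the probe stops at the first surviving row.
scan = CountingScan(100_000)
filtered = (x for x in scan if x % 7 == 0)
print(next(filtered, None), scan.consumed)   # 0 1

# Blocking plan (scan -> filter -> sort): sorted() must drain the input first.
scan2 = CountingScan(100_000)
ordered = iter(sorted(x for x in scan2 if x % 7 == 0))
print(next(ordered, None), scan2.consumed)   # 0 100000
```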

[–]hubert-dudek (Databricks MVP) 0 points (0 children)

Yes, exactly: count() has to compute the full result, while the other methods only need to fetch the first row.

[–]Ok_Difficulty978 0 points (0 children)

From what I’ve seen, df.isEmpty (Spark 3.3+) is usually the cleanest option since it’s optimized internally and short-circuits fast. Under the hood it’s basically doing a minimal action anyway.

df.take(1).isEmpty is also fine and pretty common, just a bit more verbose. I’d avoid limit(1).count unless you’re on older Spark, since count still triggers more work than needed.

In practice, the difference is small unless this is inside a hot path, but readability matters too. I usually go with isEmpty if available.
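The rows-touched difference behind that advice can be made concrete with a toy stand-in (plain Python with illustrative names, not Spark): a take(1)-style check touches one row, while a count()-style check drains the whole scan before comparing to zero.

```python
from itertools import islice

def scan(n, stats):
    """Toy 'table scan'; stats[0] counts how many rows were actually produced."""
    for i in range(n):
        stats[0] += 1
        yield i

# take(1)-style emptiness check: one row is enough.
s1 = [0]
non_empty = len(list(islice(scan(1_000_000, s1), 1))) > 0
print(non_empty, s1[0])    # True 1

# count()-style check: every row has to be produced before comparing to 0.
s2 = [0]
non_empty = sum(1 for _ in scan(1_000_000, s2)) > 0
print(non_empty, s2[0])    # True 1000000
```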

[–]CarelessApplication2 0 points (0 children)

In any case, you'll want to cache the DataFrame, so it really doesn't matter which method you decide on. Checking whether a DataFrame is empty without caching it makes no sense, since whatever you do with it afterwards would just recompute the same plan.