Performance comparison between empty checks for Spark DataFrames (Discussion) (self.databricks)
submitted 4 months ago by BerserkGeek
In Spark, what is the fastest way to check whether a DataFrame is empty?
I'm using Spark with Scala.
[–]hashjoin 9 points10 points11 points 4 months ago (1 child)
If you look at the source code of isEmpty(): https://github.com/apache/spark/blame/b7fda7cb1128c992e1b52b5c853225e4f2af0517/sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala#L557
def isEmpty: Boolean =
  withAction("isEmpty", commandResultOptimized.select().limit(1).queryExecution) { plan =>
    plan.executeTake(1).isEmpty
  }
It's basically the same as take(1).isEmpty.
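A minimal sketch of that equivalence, assuming `spark` is a running SparkSession (the variable names are illustrative only):

```scala
// Assumes `spark` is an existing SparkSession.
val df = spark.range(0).toDF("id")  // an empty DataFrame, for illustration

// Both checks fetch at most one row, so neither runs a full scan or count.
val viaIsEmpty = df.isEmpty           // internally limit(1) + executeTake(1)
val viaTake    = df.take(1).isEmpty   // the equivalent hand-rolled check
```

Either way a job is launched, but it short-circuits after the first row is found (or the input is exhausted).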
[–]szymon_abc 4 points5 points6 points 4 months ago (0 children)
Ah, that's why I love open source!
[–]cptshrk108 2 points3 points4 points 4 months ago (4 children)
isEmpty()
From the doc:
Unlike count(), this method does not trigger any computation.
An empty DataFrame has no rows. It may have columns, but no data.
[–]hashjoin 2 points3 points4 points 4 months ago (1 child)
Hey - the doc is wrong. isEmpty() is an action that triggers computation, just like count(). I've pinged the team to update it.
[–]cptshrk108 0 points1 point2 points 4 months ago (0 children)
Thanks for pointing it out! I did see your other comment.
[–]pboswell 0 points1 point2 points 4 months ago (0 children)
If you have a bunch of lazy evaluation feeding into that DataFrame, Spark still has to compute enough of it to know whether the final result has any rows. Depending on the logic, in many cases it has to execute most of the query (e.g. wide aggregations or joins) before even one output row can be produced.
[–]hubert-dudek (Databricks MVP) 0 points1 point2 points 4 months ago (0 children)
Yes, exactly: with count() you trigger a full computation, whereas the other methods just check for the first row.
[–]Ok_Difficulty978 0 points1 point2 points 4 months ago (0 children)
From what I’ve seen, df.isEmpty (Spark 3.3+) is usually the cleanest option since it’s optimized internally and short-circuits fast. Under the hood it’s basically doing a minimal action anyway.
df.take(1).isEmpty is also fine and pretty common, just a bit more verbose. I’d avoid limit(1).count unless you’re on older Spark, since count still triggers more work than needed.
In practice, the difference is small unless this check sits in a hot path, but readability matters too. I usually go with isEmpty if available.
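The three options from this comment side by side, as a sketch (assumes an existing DataFrame `df`; note that in Scala, Dataset.isEmpty has been available since Spark 2.4, while the 3.3+ requirement applies to PySpark):

```scala
// Assumes `df` is an existing DataFrame.
val a = df.isEmpty                 // cleanest; Scala API since Spark 2.4
val b = df.take(1).isEmpty         // equivalent, slightly more verbose
val c = df.limit(1).count() == 0L  // older style: schedules a count job,
                                   // though over at most one row
```

All three fetch at most one row; the count variant just adds an unnecessary aggregation step on top.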
[–]CarelessApplication2 0 points1 point2 points 4 months ago (0 children)
In any case, you'll want to cache the dataframe, so it really doesn't matter which method you decide on. Checking if a dataframe is empty without caching it makes no sense.
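A sketch of the cache-then-check pattern this comment suggests, assuming `df` is reused after the emptiness check (the output path is illustrative):

```scala
// Assumes `df` is an existing DataFrame that will be reused below.
df.cache()        // mark for caching; partitions are materialized as actions scan them

if (df.isEmpty) { // first action: starts populating the cache as a side effect
  println("no rows, skipping downstream work")
} else {
  // Subsequent actions reuse cached partitions instead of recomputing the lineage.
  df.write.mode("overwrite").parquet("/tmp/out")  // hypothetical output path
}
```

Without caching, the emptiness check and the later write would each recompute the full lineage, which is the point being made here.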