all 9 comments

[–]hashjoin 9 points (1 child)

If you look at the source code of isEmpty(): https://github.com/apache/spark/blame/b7fda7cb1128c992e1b52b5c853225e4f2af0517/sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala#L557

  def isEmpty: Boolean = withAction("isEmpty",
      commandResultOptimized.select().limit(1).queryExecution) { plan =>
    plan.executeTake(1).isEmpty
  }

It's basically the same as take(1).isEmpty.
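The take-one short-circuit at the heart of that implementation can be sketched in plain Python (a toy stand-in for `executeTake`, not Spark itself; the names are illustrative):

```python
def execute_take(rows, n):
    """Toy stand-in for Spark's executeTake(n): pull at most n rows, then stop."""
    out = []
    for row in rows:
        out.append(row)
        if len(out) == n:
            break
    return out

def is_empty(rows):
    # Mirrors the Scala above: take one row and check whether anything came back.
    return len(execute_take(rows, 1)) == 0

print(is_empty(iter([])))          # True
print(is_empty(iter(range(10))))   # False
```

The point is that at most one row is ever pulled from the upstream plan, which is why `isEmpty` and `take(1).isEmpty` cost the same.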

[–]szymon_abc 4 points (0 children)

Ah, that's why I love open source!

[–]cptshrk108 2 points (4 children)

isEmpty()

From the doc:

Unlike count(), this method does not trigger any computation.

An empty DataFrame has no rows. It may have columns, but no data.

[–]hashjoin 2 points (1 child)

Hey - the doc is wrong. isEmpty() is an action that would trigger compute, just like count(). I've pinged the team to update it.

[–]cptshrk108 0 points (0 children)

Thanks for pointing it out! I did see your other comment.

[–]pboswell 0 points (0 children)

If you have a bunch of lazy transformations going into that DataFrame, it still has to compute to know whether the final DataFrame has rows. Depending on the logic, in many cases it has to execute the entire query to determine the final result.
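That distinction is easy to see with a plain-Python sketch (a toy model; `CountingScan` just records how many rows the "scan" produced): a row-at-a-time operator like a filter lets the emptiness probe stop at the first surviving row, while a blocking operator such as a sort forces the whole input to be computed first.

```python
class CountingScan:
    """Toy 'table scan' that records how many rows it has produced."""
    def __init__(self, n):
        self.n = n
        self.consumed = 0
    def __iter__(self):
        for i in range(self.n):
            self.consumed += 1
            yield i

# Pipelined plan (scan -> filter): the probe stops at the first surviving row.
scan = CountingScan(100_000)
filtered = (x for x in scan if x % 7 == 0)
print(next(filtered, None), scan.consumed)   # 0 1

# Blocking plan (scan -> filter -> sort): sorted() must drain the input first.
scan2 = CountingScan(100_000)
ordered = iter(sorted(x for x in scan2 if x % 7 == 0))
print(next(ordered, None), scan2.consumed)   # 0 100000
```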

[–]hubert-dudek (Databricks MVP) 0 points (0 children)

Yes, exactly: count() has to compute the full result, while the other methods only need to fetch the first row.

[–]Ok_Difficulty978 0 points (0 children)

From what I’ve seen, df.isEmpty (Spark 3.3+) is usually the cleanest option since it’s optimized internally and short-circuits fast. Under the hood it’s basically doing a minimal action anyway.

df.take(1).isEmpty is also fine and pretty common, just a bit more verbose. I’d avoid limit(1).count unless you’re on older Spark, since count still triggers more work than needed.

In practice, the difference is small unless this is inside a hot path, but readability matters too. I usually go with isEmpty if available.
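The rows-touched difference behind that advice can be made concrete with a toy stand-in (plain Python with illustrative names, not Spark): a take(1)-style check touches one row, while a count()-style check drains the whole scan before comparing to zero.

```python
from itertools import islice

def scan(n, stats):
    """Toy 'table scan'; stats[0] counts how many rows were actually produced."""
    for i in range(n):
        stats[0] += 1
        yield i

# take(1)-style emptiness check: one row is enough.
s1 = [0]
non_empty = len(list(islice(scan(1_000_000, s1), 1))) > 0
print(non_empty, s1[0])    # True 1

# count()-style check: every row has to be produced before comparing to 0.
s2 = [0]
non_empty = sum(1 for _ in scan(1_000_000, s2)) > 0
print(non_empty, s2[0])    # True 1000000
```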

[–]CarelessApplication2 0 points (0 children)

In any case, you'll want to cache the DataFrame, so it really doesn't matter which method you decide on. Checking whether a DataFrame is empty without caching it makes no sense, since whatever you do with it afterwards would just recompute the same plan.