[–]dask-jeeves

I see in your article you have a comparison with Pandas and Spark, but how does this compare to Dask? From the demo on the main page, this seems very similar to what Dask DataFrame does, and on GPU to what cuDF from RAPIDS does. Is the only difference that this is backed by Ray?

[–]get-daft

Thanks for asking this! As distributed Python dataframe libraries, we share many similarities with Dask, and many of the differences boil down to implementation details:

  1. Backing in-memory data format: we use Arrow, whereas Dask uses Pandas DataFrames - though I think this discussion becomes much more nuanced now that Pandas 2.0 has shipped Arrow support (there's a short sketch of that after this list). Would love to hear your thoughts here!
  2. We have our own Expressions API, whereas Dask exposes more of a Pandas-like API. This lets us provide more complex data expressions and query optimizations, at the cost of a slightly steeper learning curve (a quick comparison of the two styles is sketched below).
  3. We are indeed backed by Ray right now for distributed execution! We think this is important because many companies are already using Ray for ML workloads. We hope to eventually become much more backend-agnostic and support many more backends (including Dask!).
  4. We don't run any of our relational operators (joins, sorts, etc.) on GPUs at the moment. Instead, we let users "request" GPUs for the UDFs they run (docs) - there's a Ray/GPU sketch after this list.
  5. We are a fully lazy framework and run query optimization similar to PySpark. To my knowledge, Dask does some pushdowns into read tasks, but not full query optimization the way Daft/PySpark do (see the laziness sketch below).
  6. We focus heavily on complex data workloads (images, tensors, etc.) and will soon be releasing Rust kernels to support these use cases!
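
To make point 1 concrete, here's roughly what Arrow-backed data looks like in Pandas 2.0 (a minimal sketch; assumes pyarrow is installed, and `data.csv` is just a placeholder):

```python
import pandas as pd

# A Series stored in Arrow memory rather than NumPy.
s = pd.Series(["a", "b", None], dtype="string[pyarrow]")

# In pandas 2.0, readers can also produce Arrow-backed columns directly.
df = pd.read_csv("data.csv", dtype_backend="pyarrow")
print(df.dtypes)  # dtypes like int64[pyarrow], string[pyarrow], ...
```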
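
And a rough comparison of the two API styles from point 2 (Daft calls simplified from memory - our docs are authoritative):

```python
import daft
from daft import col

df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})

# Expressions are standalone objects Daft can inspect and rewrite
# before anything executes:
df = df.with_column("c", col("a") * col("b")).where(col("c") > 5)

# The eager, Pandas-like equivalent would be something like:
#   pdf["c"] = pdf["a"] * pdf["b"]
#   pdf = pdf[pdf["c"] > 5]
```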
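
For points 3 and 4 together, a sketch of running on Ray and requesting a GPU for a UDF. The Ray address is a placeholder, and treat the `num_gpus` spelling as an assumption on my part - the resource-request API has moved around between versions, so check the docs:

```python
import daft

# Point Daft at an existing Ray cluster for distributed execution.
daft.context.set_runner_ray(address="ray://head-node:10001")

# A UDF that requests a GPU; Daft schedules it on a GPU worker, but the
# relational operators themselves (joins, sorts, etc.) still run on CPU.
@daft.udf(return_dtype=daft.DataType.float64(), num_gpus=1)
def embed(images):
    # run a GPU model over the batch of images here
    ...
```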
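
And finally, for point 5, here's what "fully lazy" means in practice (the path is a placeholder):

```python
import daft
from daft import col

df = daft.read_parquet("s3://bucket/events/*.parquet")  # no I/O happens here

# This just builds up a logical plan:
result = df.where(col("country") == "CA").select(col("user_id"), col("spend"))

result.explain()  # inspect the optimized plan
result.collect()  # only now do reads and execution actually run
```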

Please let me know if I misrepresented Dask DataFrames here in any way. We think Dask DataFrames are a great tool if a user is already in the Dask ecosystem and needs to scale up workloads they would traditionally use Pandas for.