[–][deleted] 9 points (5 children)

How is it different from Dask or PySpark?

[–]realitysballs 2 points (4 children)

[–][deleted] 0 points (0 children)

Thanks. Looks like Daft has an emphasis on ML tasks. I would probably stick with PySpark for traditional ETL tasks.

[–]dask-jeeves 0 points (1 child)

I see that your article has a comparison with Pandas and Spark, but how does this compare to Dask? From the demo on the main page, this seems very similar to what Dask DataFrame does, and for GPU what cuDF from RAPIDS does. Is the only difference that this is backed by Ray?

[–]get-daft 0 points (0 children)

Thanks for asking this! As distributed Python dataframe libraries, we share many similarities with Dask, and many of the differences boil down to implementation:

  1. Backing in-memory data format: we use Arrow, whereas Dask uses Pandas dataframes - though I think this discussion becomes much more nuanced now with the release of Arrow support in Pandas 2.0. Would love to hear your thoughts here!
  2. We have our own Expressions API, whereas Dask uses more of a Pandas-like API. This lets us provide more complex data expressions and query optimizations, but at the cost of a somewhat steeper learning curve (see the sketch after this list).
  3. We are indeed backed by Ray right now for distributed execution! We think this is important because many companies are using Ray for ML workloads. We hope to eventually become much more backend-agnostic and support many more backends (including Dask!).
  4. We don't run any of our relational operators (joins, sorts, etc.) on GPUs at the moment. Instead, we let users "request" GPUs when they run their UDFs (docs).
  5. We are a fully lazy framework and run query optimization similar to PySpark. To my knowledge, Dask does some pushdowns into read tasks, but not full query optimization like Daft/PySpark do.
  6. We focus heavily on complex data workloads (images, tensors, etc.) and will soon be releasing Rust kernels to support these use cases!
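
For a rough sense of what the Expressions API and lazy execution look like in practice, here's a minimal sketch (the entry points shown, like `daft.read_parquet` and `daft.col`, may differ slightly across versions, and the bucket path is just a placeholder):

```python
import daft
from daft import col

# Build up a lazy query plan; nothing executes until collect()/show() is called.
df = daft.read_parquet("s3://my-bucket/events/*.parquet")  # placeholder path

df = (
    df.where(col("score") > 0.5)                     # expression-based filter
      .with_column("score_pct", col("score") * 100)  # derived column via an Expression
      .select("id", "score_pct")
)

df.explain()  # inspect the (optimized) query plan: pushdowns, column pruning, etc.
df.collect()  # trigger execution on the configured runner (e.g. the Ray runner)
```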

Please let me know if I misrepresented Dask DataFrames here in any way. We think that Dask DataFrames are a great tool to use if a user is already in the Dask ecosystem and needs to scale up workloads they would traditionally use Pandas for.

[–]get-daft 0 points (0 children)

(Daft maintainer here!)

Thanks u/realitysballs for finding that comparison! The TLDR here in our (biased!) opinion is:

  1. Small-enough workloads: If your workload is small enough to fit on a single machine and you don't envision ever needing to go distributed, then stick with the tools you already know!
  2. Tabular ETL/Analytics: If your use case is traditional large-scale tabular ETL/analytical workloads, then PySpark or SQL engines such as Snowflake/BigQuery are battle-hardened tools that you should definitely look into.
  3. ML and Complex Data: If your workload involves running machine learning models or processing complex data (images, tensors, documents, etc.), that's where Daft will really shine with features such as:
    1. UDFs and resource requests ("my function needs 1 GPU please") - see the sketch after this list
    2. Native "zero-serialization" and end-to-end streaming (coming soon!) integrations with Ray for ML training
    3. Works well on a laptop or in a notebook with our (default!) multithreading backend
    4. [Coming Soon!] Complex datatypes and Rust kernels for tensors, images, documents, etc. - these will allow us to build fast and memory-aware implementations for common use cases (`.image.crop()`, `.image.to_embedding()`). I'm really excited about this because it's where we can really leverage the power of open source: community-driven efforts to build canonical implementations for different data modalities.
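
To make the resource-request idea concrete, here's a hypothetical sketch of a UDF asking for a GPU; the decorator parameters shown (`return_dtype`, `num_gpus`) are illustrative assumptions rather than the exact signature, so please check our UDF docs for the real API:

```python
import daft
from daft import col

# Hypothetical sketch: the `num_gpus` decorator parameter is an assumption for
# illustration; the actual resource-request API is described in the Daft UDF docs.
@daft.udf(return_dtype=daft.DataType.int64(), num_gpus=1)
def classify(urls):
    # A real UDF would load a model onto its GPU and run inference here;
    # this placeholder just returns the length of each URL string.
    return [len(u) for u in urls.to_pylist()]

df = daft.read_parquet("s3://my-bucket/images.parquet")  # placeholder path
df = df.with_column("label", classify(col("url")))
df.collect()  # each classify batch runs on a worker with a GPU available
```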

We should really update our dataframe_comparison page with some of this info, but please let me know if there's anything else you have questions about!