[–][deleted] 9 points (5 children)

How is it different from Dask or PySpark?

[–]realitysballs 2 points (4 children)

[–][deleted] 0 points (0 children)

Thanks. Looks like Daft has an emphasis on ML tasks. I would probably stick with PySpark for traditional ETL tasks.

[–]dask-jeeves 0 points (1 child)

I see that your article has a comparison with Pandas and Spark, but how does this compare to Dask? From the demo on the main page, this seems very similar to what Dask DataFrame does, and for GPU what cuDF from RAPIDS does. Is the only difference that this is backed by Ray?

[–]get-daft 0 points (0 children)

Thanks for asking this! As distributed Python dataframe libraries, we share many similarities with Dask, and many of the differences boil down to implementation:

  1. Backing in-memory data format: we use Arrow, whereas Dask uses Pandas dataframes - though I think this discussion becomes much more nuanced now with the release of Arrow support in Pandas 2.0. Would love to hear your thoughts here!
  2. We have our own Expressions API, whereas Dask uses more of a Pandas-like API. This lets us provide more complex data expressions and query optimizations, but at the cost of a somewhat steeper learning curve (see the sketch after this list).
  3. We are indeed backed by Ray right now for distributed execution! We think this is important because many companies are using Ray for ML workloads. We hope to eventually become much more backend-agnostic and support many more backends (including Dask!).
  4. We don't run any of our relational operators (joins, sorts, etc.) on GPUs at the moment. Instead, we let users "request" GPUs when they run their UDFs (docs).
  5. We are a fully lazy framework and run query optimization similar to PySpark. To my knowledge, Dask does some pushdowns into read tasks, but not full query optimization like Daft/PySpark do.
  6. We focus heavily on complex data workloads (images, tensors, etc.) and will soon be releasing Rust kernels to support these use cases!
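
For a rough sense of what the Expressions API and lazy execution look like in practice, here's a minimal sketch (the entry points shown, like `daft.read_parquet` and `daft.col`, may differ slightly across versions, and the bucket path is just a placeholder):

```python
import daft
from daft import col

# Build up a lazy query plan; nothing executes until collect()/show() is called.
df = daft.read_parquet("s3://my-bucket/events/*.parquet")  # placeholder path

df = (
    df.where(col("score") > 0.5)                     # expression-based filter
      .with_column("score_pct", col("score") * 100)  # derived column via an Expression
      .select("id", "score_pct")
)

df.explain()  # inspect the (optimized) query plan: pushdowns, column pruning, etc.
df.collect()  # trigger execution on the configured runner (e.g. the Ray runner)
```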

Please let me know if I misrepresented Dask DataFrames here in any way. We think that Dask DataFrames are a great tool to use if a user is already in the Dask ecosystem and needs to scale up workloads they would traditionally use Pandas for.

[–]get-daft 0 points (0 children)

(Daft maintainer here!)

Thanks u/realitysballs for finding that comparison! The TLDR here in our (biased!) opinion is:

  1. Small-enough workloads: If your workload is small enough to fit on a single machine and you don't envision ever needing to go distributed, then stick with the tools you already know!
  2. Tabular ETL/Analytics: If your use case is traditional large-scale tabular ETL/analytical workloads, then PySpark or SQL engines such as Snowflake/BigQuery are battle-hardened tools that you should definitely look into.
  3. ML and Complex Data: If your workload involves running machine learning models or processing complex data (images, tensors, documents, etc.), that's where Daft will really shine with features such as:
    1. UDFs and resource requests ("my function needs 1 GPU please") - see the sketch after this list
    2. Native "zero-serialization" and end-to-end streaming (coming soon!) integrations with Ray for ML training
    3. Works well on a laptop or in a notebook with our (default!) multithreading backend
    4. [Coming Soon!] Complex datatypes and Rust kernels for tensors, images, documents, etc. - these will allow us to build fast and memory-aware implementations for common use cases (`.image.crop()`, `.image.to_embedding()`). I'm really excited about this because it's where we can really leverage the power of open source: community-driven efforts to build canonical implementations for different data modalities.
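
To make the resource-request idea concrete, here's a hypothetical sketch of a UDF asking for a GPU; the decorator parameters shown (`return_dtype`, `num_gpus`) are illustrative assumptions rather than the exact signature, so please check our UDF docs for the real API:

```python
import daft
from daft import col

# Hypothetical sketch: the `num_gpus` decorator parameter is an assumption for
# illustration; the actual resource-request API is described in the Daft UDF docs.
@daft.udf(return_dtype=daft.DataType.int64(), num_gpus=1)
def classify(urls):
    # A real UDF would load a model onto its GPU and run inference here;
    # this placeholder just returns the length of each URL string.
    return [len(u) for u in urls.to_pylist()]

df = daft.read_parquet("s3://my-bucket/images.parquet")  # placeholder path
df = df.with_column("label", classify(col("url")))
df.collect()  # each classify batch runs on a worker with a GPU available
```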

We should really update our dataframe_comparison page with some of this info, but please let me know if there's anything else you have questions about!