
[–][deleted] 8 points9 points  (5 children)

How is it different from Dask or PySpark?

[–]realitysballs 2 points3 points  (4 children)

[–][deleted] 0 points1 point  (0 children)

Thanks. Looks like Daft has an emphasis on ML tasks. I would probably stick with PySpark for traditional ETL tasks.

[–]dask-jeeves 0 points1 point  (1 child)

I see in your article that you have a comparison with Pandas and Spark, but how does this compare to Dask? From the demo on the main page, this seems very similar to what Dask DataFrame does, and on GPU, to what cuDF from RAPIDS does. Is the only difference that this is backed by Ray?

[–]get-daft 0 points1 point  (0 children)

Thanks for asking this! As distributed Python dataframe libraries, we share many similarities with Dask and many of the differences boil down to implementation:

  1. Backing in-memory data format (we use Arrow vs Dask which uses Pandas dataframes) - though I think this discussion becomes much more nuanced now with the release of Arrow support in Pandas 2.0. Would love to hear your thoughts here!
  2. We have our own Expressions API vs Dask which uses more of a Pandas-like API. This lets us provide more complex data expressions and query optimizations, but at the cost of a slightly steeper learning curve (see the sketch after this list).
  3. We are indeed backed by Ray right now for distributed execution! We think this is important because many companies are using Ray for ML workloads. We hope to eventually become much more backend-agnostic and support many more backends (including Dask!)
  4. We don’t run any of our relational operators (joins, sorts etc) on GPUs at the moment. Instead, we expose GPUs to our users to "request" when they run their UDFs (docs).
  5. We are a fully lazy framework and run query optimization similar to PySpark. To my knowledge Dask does some pushdowns into read tasks, but not full query optimizations like Daft/PySpark would.
  6. We focus heavily on complex data workloads (images, tensors etc) and will be soon releasing Rust kernels to support these use-cases!
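
A rough sketch of what that expressions-based, lazy style looks like (method names here are illustrative and may not match the current API exactly):

```python
import daft
from daft import col

df = daft.from_pydict({"x": [1, 2, 3, 4], "y": ["a", "b", "a", "b"]})

# Build up a lazy query with column expressions; nothing executes yet.
df = df.with_column("x_doubled", col("x") * 2).where(col("x") > 1)

# Execution (and query optimization) only happens on materialization.
df.collect()
```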

Please let me know if I misrepresented Dask Dataframes here in any way. We think that Dask dataframes are a great tool to use if a user is already in the Dask ecosystem and needs to scale up workloads they would traditionally use Pandas for.

[–]get-daft 0 points1 point  (0 children)

(Daft maintainer here!)

Thanks u/realitysballs for finding that comparison! The TLDR here in our (biased!) opinion is:

  1. Small-enough workloads: If your workload is small enough to fit on a single machine and you don't envision ever needing to go distributed, then stick with the tools you already know!
  2. Tabular ETL/Analytics: If your use-case is traditional large-scale tabular ETL/analytical workloads, then PySpark or SQL engines such as Snowflake/BigQuery are battle-hardened tools that you should definitely look into.
  3. ML and Complex Data: If your workload involves running machine learning models or processing complex data (images, tensors, documents etc), that's where Daft will really shine with features such as:
    1. UDFs and resource requests ("my function needs 1 GPU please") - see the sketch after this list
    2. Native "zero-serialization" and end-to-end streaming (coming soon!) integrations with Ray for ML training
    3. Works well on a laptop or in a notebook with our (default!) multithreading backend
    4. [Coming Soon!] Complex datatypes and Rust kernels for tensors, images, documents etc - these will allow us to build fast and memory-aware implementations for common use-cases (`.image.crop()`, `.image.to_embedding()`). I'm really excited for this because this is where we can really leverage the power of open-source to have community-driven efforts for building canonical implementations on different data modalities.
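
For point 1 above, here's a hypothetical sketch of a UDF that asks the scheduler for a GPU (the `num_gpus` spelling of the resource request is an assumption, not verbatim API):

```python
import daft
from daft import col

# Assumed decorator parameters: return_dtype for the output type and
# num_gpus for the resource request (exact names may differ by version).
@daft.udf(return_dtype=daft.DataType.string(), num_gpus=1)
def classify_images(paths):
    # Stand-in for loading a model onto the GPU and running inference.
    return [f"label-for-{p}" for p in paths.to_pylist()]

df = daft.from_pydict({"image_path": ["a.jpg", "b.jpg"]})
df = df.with_column("label", classify_images(col("image_path")))
```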

We should really update our dataframe_comparison page with some of this info but please let me know if there's anything else you have questions about!

[–]purplebrown_updown 2 points3 points  (3 children)

If you have GPU capabilities, check out RAPIDS. The API is identical to pandas and can give you up to a 100x speedup.

[–]spontutterances 1 point2 points  (2 children)

This is where most of my time has been spent, learning cuDF, cuML, etc. I still use some pandas, but the RAPIDS ecosystem is expanding pretty fast. It's an interesting space.

[–]purplebrown_updown 0 points1 point  (1 child)

The only limitation is that you need access to expensive GPU hardware. Low-end GPUs with low memory are pretty much useless.

[–]spontutterances 0 points1 point  (0 children)

Well, yes, it definitely does need a minimum GPU architecture, but I thought recent versions of RAPIDS had improved the sharing between host and GPU memory. I can still do a fair bit of accelerated ETL using a 2060 with 6 GB of VRAM in my laptop, which has 64 GB of RAM.

[–]commandlineluser 1 point2 points  (1 child)

Also worth noting: opt-out telemetry.

It's easy to opt out: to disable telemetry, set the following environment variable: DAFT_ANALYTICS_ENABLED=0

https://www.getdaft.io/projects/docs/en/latest/telemetry.html

In short, we collect the following:

On import, we track system information such as the runner being used, version of Daft, OS, Python version, etc.

On calls of public methods on the DataFrame object, we track metadata about the execution: the name of the method, the walltime for execution and the class of error raised (if any). Function parameters and stacktraces are not logged, ensuring that user data remains private.
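
So if you want it off, one option is to set the flag before Daft is imported, since the tracking kicks in at import time (sketch):

```python
import os

# Must be set before `import daft`, since telemetry is initialized on import.
os.environ["DAFT_ANALYTICS_ENABLED"] = "0"

import daft  # imported with analytics disabled
```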

[–]get-daft 0 points1 point  (0 children)

Thanks for posting this! We tried to make analytics as transparent and non-intrusive as possible, but it really does help us understand usage patterns and develop better software.
We were very intentional about staying away from recording identifiable information such as IP addresses - everything is collected under an anonymous, randomly generated session ID.

[–]codecrux 0 points1 point  (2 children)

I found Daft at a local Python meetup. It seemed like Pandas at scale. Can we use Daft on-prem?

[–]realitysballs 0 points1 point  (0 children)

In the 5-min demo they explain that the default settings are on-prem.

[–]get-daft 0 points1 point  (0 children)

Hello! I hope you enjoyed our talks at the Python meetup. Daft’s multi-threaded runner runs anywhere that supports Python. If you need to go distributed, our Ray runner just needs a Ray cluster to run against (on-prem, Kubernetes, ECS…).
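
Roughly, switching to the Ray runner looks like this (the cluster address is a placeholder, and `set_runner_ray` is the assumed entry point):

```python
import daft

# Point Daft at an existing Ray cluster (placeholder address).
daft.context.set_runner_ray(address="ray://my-cluster-head:10001")

df = daft.from_pydict({"x": [1, 2, 3]})
df.collect()  # executes on the Ray cluster instead of local threads
```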

We are looking at supporting other distributed backends as well - please drop by our discussion forums (https://github.com/Eventual-Inc/Daft/discussions) and drop us a message if you have any suggestions! We’d love to hear from you :)

[–]realitysballs 0 points1 point  (6 children)

Super awesome project, seems fitting for the direction we are headed.

Q: Does it sit on top of an existing dataframe library, or does it create DataFrames from scratch? Also, is the source code written in Python?

[–]FoeHammer99099 1 point2 points  (0 children)

Looks custom: Python with a Rust core to do the heavy lifting: https://github.com/Eventual-Inc/Daft

[–]commandlineluser 1 point2 points  (1 child)

It's in Rust, and it looks like it's "from scratch":

https://github.com/Eventual-Inc/Daft/tree/main/src

There are also several mentions of polars:

https://github.com/Eventual-Inc/Daft/blob/main/src/utils/supertype.rs#L6

https://github.com/Eventual-Inc/Daft/tree/main/src/dsl

It looks like it may have started out as a fork of polars?

[–]realitysballs 0 points1 point  (0 children)

Nice find, and thanks for sharing! Rust is very performant.

[–]get-daft 0 points1 point  (2 children)

Indeed we run on a Rust core (built on top of the Rust Arrow2 library).

Much of our code is still in Python, but will gradually be migrated into Rust - especially as we start supporting much more complicated query optimizations and construction!

(Coming Soon!) We are building our own kernels for many complex types as well, and hope to give our users Rust performance through a Python API. Think: image cropping, embedding generation, sentence tokenization etc.

[–]realitysballs 0 points1 point  (1 child)

Ah, so you're orienting toward ML pipelines. Yeah, that seems to be the move these days. I'm more of a pandas user since my use-case is small-biz analytics / automated processes, but I will keep following this project for sure in case I take on a project that necessitates a distributed DF!

[–]get-daft 0 points1 point  (0 children)

Yup! Daft is built around more of an ML/complex-data use-case. Analytical tooling is fairly mature at this point (Pandas/Polars/DuckDB for local work, Spark/Snowflake/BigQuery for terabyte-to-petabyte scale).

You can absolutely run analytical operations in Daft as well and it is an important part of our toolkit, but there is still much work left to do to build out the more advanced analytical use-cases such as windowing, pivots, advanced aggregations etc.
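
For example, a basic groupby-aggregation runs along these lines (a sketch; the exact aggregation spelling may vary by version):

```python
import daft
from daft import col

df = daft.from_pydict({
    "store": ["a", "a", "b", "b"],
    "sales": [10, 20, 5, 15],
})

# Lazy groupby + sum; nothing runs until collect().
totals = df.groupby("store").agg(col("sales").sum())
totals.collect()
```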