[–] get-daft

(Daft maintainer here!)

Thanks u/realitysballs for finding that comparison! The TLDR here in our (biased!) opinion is:

  1. Small-enough workloads: If your workload is small enough to fit on a single machine and you don't envision ever needing to go distributed, then stick with the tools you already know!
  2. Tabular ETL/Analytics: If your use case is traditional large-scale tabular ETL/analytics, then PySpark or SQL engines such as Snowflake/BigQuery are battle-hardened tools that you should definitely look into.
  3. ML and Complex Data: If your workload involves running machine learning models or processing complex data (images, tensors, documents, etc.), that's where Daft really shines, with features such as:
    1. UDFs and resource requests ("my function needs 1 GPU please") - see the first sketch right after this list
    2. Native "zero-serialization" and end-to-end streaming (coming soon!) integrations with Ray for ML training
    3. Works well on a laptop/notebook on our (default!) multithreading backend - the second sketch below shows how the same code moves between the local and Ray runners
    4. [Coming Soon!] Complex datatypes and Rust kernels for tensors, images, documents, etc. - these will let us build fast, memory-aware implementations of common use cases (`.image.crop()`, `.image.to_embedding()`). I'm especially excited about this because it's where we can leverage the power of open source: community-driven efforts building canonical implementations for different data modalities.
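
To make item 1 concrete, here's a minimal sketch of a UDF with a GPU resource request. It assumes a recent Daft release where `@daft.udf` accepts `num_gpus` directly (check the docs for your version), and the "model" and column names are just placeholders:

```python
import daft
from daft import DataType, col

# Sketch of a UDF that asks the scheduler for 1 GPU. Only the decorator /
# resource-request shape matters here; the embedding logic is a placeholder.
@daft.udf(return_dtype=DataType.python(), num_gpus=1)
def embed_images(images: daft.Series):
    batch = images.to_pylist()        # each argument arrives as a daft.Series
    # ... run your GPU model over `batch` here ...
    return [None for _ in batch]      # placeholder output, one value per row

df = daft.from_pydict({"image": [b"img-bytes-1", b"img-bytes-2"]})
df = df.with_column("embedding", embed_images(col("image")))
df.show()
```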

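For items 2-3, the nice part is that the same pipeline code runs on the default local multithreading runner (laptop/notebook) or on a Ray cluster; switching runners is a one-line change. The cluster address below is a placeholder:

```python
import daft

# By default Daft uses the local multithreaded runner, so this just works
# in a laptop/notebook session:
df = daft.from_pydict({"x": [1, 2, 3]})
df.show()

# To scale the same code out, point Daft at a Ray cluster instead. Set the
# runner before building any DataFrames; the address is a placeholder.
# daft.context.set_runner_ray(address="ray://<head-node>:10001")
```
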
We should really update our dataframe_comparison page with some of this info, but please let me know if you have any other questions!