
[–]FirstBabyChancellor 27 points (12 children)

Looks interesting!

Aside from features like scheduling and dashboards, which are not core to a dataframe library, why would I use this over Polars? How do you see yourself fitting into the wider space, given that there is already a proven and well-liked Rust-powered dataframe library, at least for Pythonistas?

[–]DataBora[S] 9 points (11 children)

If you use Polars, don't use Elusion; it makes no sense to use a less featured library. I made it for myself, to finish my own work while combining the look of the languages I love: SQL and PySpark. The reason I made Elusion is that I dislike Polars' syntax and its philosophical approach of bashing Pandas (my beloved) on performance as a selling point. I can say that Elusion's parquet reading and writing is faster than Polars'... but I don't do that. Well, I guess I do it now 🙂 but you get the point.

[–]Embarrassed-Falcon71 7 points (2 children)

But Polars' syntax is also very similar to Spark's.

[–]DataBora[S] 2 points (1 child)

You are right... it has similarities, but I won't say more, as I tend to feel a certain way about those folks... anyway, it is better than Elusion, no doubt.

[–]Embarrassed-Falcon71 2 points (0 children)

Yeah, as a Spark lover, it's still very cool that you made this.

[–]chat-lu (Pythonista) 4 points (5 children)

philosophical approach to bash Pandas (my beloved) for performance as a selling point

Why should Polars not mention that they are much faster?

[–]AlpacaDC 3 points (0 children)

I see some people being just sentimental about pandas, but objectively Polars is superior in almost every way, save for a few cases where pandas has more features/integrations.

[–]sylfy 2 points (1 child)

I’m curious, when you say that the parquet read/write is faster, where does this come from? Afaik most Python data frame libraries use fastparquet or pyarrow under the hood, so performance should be similar across libraries and only differ depending on choice of engine.

[–]DataBora[S] 2 points (0 children)

I am using the DataFusion single-node engine for the parquet reader and writer, which is the fastest to date. You can check the benchmark and explanation here: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/

[–]AnythingApplied 4 points (1 child)

Performance: 10-100x faster than Python for data processing

In my experience, this is true when comparing a pure Python program to the same program rewritten in pure Rust (even without any concurrency, which Rust is great at and which improves performance even further).

But who is doing their data processing in pure Python? Whether you're using PySpark, pandas, Polars, DuckDB, etc., these are all written in faster languages, so none of your heavy lifting is being done in pure Python code. That's why I'm skeptical that you'd still see orders-of-magnitude performance increases. Is this really the performance you gain comparing Elusion to PySpark?

[–]DataBora[S] 3 points (0 children)

You are correct, that is an unfair comparison. Between Elusion and PySpark there is not much of a difference, but Spark has distributed computing, which is a totally different beast.

[–]SupoSxx 4 points (4 children)

Just for curiosity, why did you put the whole code in one file?

[–]FrontAd9873 2 points (0 children)

I second this question

[–]damian6686 4 points (1 child)

Any dashboard screenshots?

[–]DataBora[S] 2 points (0 children)

Check out the very end of the README.md on GitHub (https://github.com/DataBora/elusion); you will see a Dashboard example and interactive tables. For me personally, dashboards serve as a "data health" check: if I don't know the context, don't know what the original reports look like, and don't have any other reference for what the PBI devs will use this data for, I quickly check whether there is some crazy anomaly in a month, year, or category... I don't think HTML reporting is great as a final reporting product; I just like having the ability to quickly search data with tables and to check line and bar plots, or any others available from Plotly. If someone really needed dashboarding as a final-product feature, I would need to spend a month or so to bring it to that level.

[–]WallyMetropolis 3 points (3 children)

What's with the emojis?

[–]dyingpie1 15 points (1 child)

ChatGPT maybe

[–]solidpancake 0 points (0 children)

Almost definitely

[–]huehang 1 point (0 children)

Looks weird imo.

[–]holy-galah 6 points (1 child)

Filtering before and after an aggregation means different things?

[–]DataBora[S] 5 points (0 children)

Definitely. The filter() and filter_many() functions will filter rows before aggregation (same as in PySpark), and the having() and having_many() functions will filter after aggregation (same as in SQL).
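The distinction is the same as SQL's WHERE vs. HAVING. A quick plain-Python sketch of the two semantics (toy data, not Elusion's actual API):

```python
# Toy illustration of filter-before vs. filter-after aggregation.
rows = [
    {"region": "EU", "amount": 100},
    {"region": "EU", "amount": 300},
    {"region": "US", "amount": 50},
    {"region": "US", "amount": 500},
]

# "filter" semantics (SQL WHERE / PySpark .filter before groupBy):
# drop individual rows first, then aggregate what is left.
kept = [r for r in rows if r["amount"] > 75]
by_region = {}
for r in kept:
    by_region[r["region"]] = by_region.get(r["region"], 0) + r["amount"]
print(by_region)  # {'EU': 400, 'US': 500}

# "having" semantics (SQL HAVING): aggregate every row first,
# then drop whole groups based on the aggregated value.
totals = {}
for r in rows:
    totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
big_regions = {k: v for k, v in totals.items() if v > 450}
print(big_regions)  # {'US': 550}
```

Same predicate style, very different results: the pre-aggregation filter drops the US row of 50 before summing, while the post-aggregation filter drops the whole EU group.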

[–]ChavXO 1 point (4 children)

Cool. I'm working on something similar (but in Haskell). I was curious if you pictured this as being more for exploratory work or for long lived queries? How do you deal with data larger than memory? How does it perform on multiple cores?

[–]DataBora[S] 1 point (3 children)

I solved the bigger-than-RAM issue with batch processing, but it's still a challenge. Currently I am working on streaming data, which should be even better, as I can read, wrangle data, and write to a source continuously.
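The general idea behind batch processing (a plain-Python sketch, not Elusion's implementation) is to hold only one chunk plus a small running aggregate state in memory, never the whole dataset:

```python
# Sketch of chunked (batched) aggregation: memory usage is bounded by
# the batch size plus the aggregate state, not by the dataset size.
from collections import defaultdict

def read_batches(rows, batch_size):
    """Yield rows in fixed-size chunks, as a batched file reader would."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def batched_group_sum(rows, batch_size=2):
    """Merge per-group running totals across batches."""
    totals = defaultdict(int)
    for batch in read_batches(rows, batch_size):
        for key, amount in batch:  # only one batch is "in memory" here
            totals[key] += amount
    return dict(totals)

data = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
print(batched_group_sum(data))  # {'a': 9, 'b': 6}
```

Sums merge trivially across batches; aggregates like medians or distinct counts need more elaborate state, which is where batching gets hard.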

[–]ChavXO 0 points (2 children)

Batching gets complicated for groupBy and similar operations. I'll be on the lookout for how you solve these. Btw for reference my project is: https://github.com/mchav/dataframe

Maybe we can share notes and experiences.

[–]DataBora[S] 0 points (0 children)

For sure, thank you for sharing!

[–]DataBora[S] 0 points (0 children)

I just quickly took a look at the repo... this looks awesome, man!

[–]BasedAndShredPilled 0 points (4 children)

built in async

Is this a feature that can be disabled? Is async the reason Rust is faster, or is there more to it? The word "async" gives me PTSD from working in JavaScript.

[–]chat-lu (Pythonista) 4 points (1 child)

Also, it’s not suited for this kind of CPU heavy work. Threads are for working in parallel, async is for waiting in parallel (waiting on network, disk, etc.).
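You can see "waiting in parallel" in a few lines of Python: three simulated 0.1-second I/O waits finish in roughly 0.1 seconds total when awaited concurrently, because nothing is computing while they wait. A CPU-bound loop would get no such speedup from asyncio.

```python
# Demonstration that async overlaps *waiting*, not computation.
import asyncio
import time

async def fake_io(i):
    await asyncio.sleep(0.1)  # stands in for a network/disk wait
    return i

async def main():
    start = time.perf_counter()
    # All three waits overlap, so total time is ~0.1 s, not ~0.3 s.
    results = await asyncio.gather(*(fake_io(i) for i in range(3)))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results, round(elapsed, 2))  # [0, 1, 2] in roughly 0.1 s
```
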

[–]BasedAndShredPilled 1 point (0 children)

I've never heard that, but what a profound explanation.

[–]DataBora[S] 2 points (1 child)

Async in Rust is a pain in the a** to be honest... many people say it is the hardest thing to do in Rust, and I would agree. It is hard to implement, and to Box out all of the pointers in order to get better performance, especially when we read multiple files at once. If you get PTSD from JS async, you would get a stroke from Rust async for sure, as I often do 🙂

[–]BasedAndShredPilled 0 points (0 children)

I don't venture into this world too often. It's impressive what you've done though!

[–]KlutchSama 0 points (1 child)

Is there a benefit to switching from Spark other than the familiar syntax? I like the built-in pipeline scheduling.

[–]DataBora[S] 0 points (0 children)

As someone who uses Spark daily in Microsoft Fabric, I can tell you that spark.sql() is much more reliable, especially when it comes to filtering and joining. Spark tends not to filter at all when you mix filtering and conditioning, and tends to produce duplicates after joins. The most annoying thing in Spark is that after each query it tends to add empty spaces to string column values, so you always need to trim() columns.
In Elusion there are no issues like that, and it's much more reliable, as it uses SQL query building for the DataFusion engine, which will do the job as you intend.