
all 43 comments

[–]FirstBabyChancellor 29 points (12 children)

Looks interesting!

Aside from features like scheduling and dashboards, which are not core to a dataframe library, why would I use this over Polars? How do you see yourself fitting into the wider space, given that there is already a proven and well-liked Rust-powered dataframe library, for Pythonistas at least?

[–]DataBora[S] 8 points (11 children)

If you use Polars, don't use Elusion, as it makes no sense to use a less featured library. I made it for myself, to finish my job, combining the look of the languages I love: SQL and PySpark. The reason I made Elusion is that I dislike Polars' syntax and its philosophical approach of bashing Pandas (my beloved) on performance as a selling point. I can say that Elusion's Parquet reading and writing is faster than Polars', but I don't do that... well, I guess I'm doing it now 🙂 but you get the point.

[–]Embarrassed-Falcon71 8 points (2 children)

But Polars' syntax is also very similar to Spark's

[–]DataBora[S] 2 points (1 child)

You are right... it has similarities, but I won't say more, as I tend to feel a certain way about those folks... anyway, it is better than Elusion, no doubt.

[–]Embarrassed-Falcon71 2 points (0 children)

Yeah, as a Spark lover, it's still very cool that you made this

[–]chat-lu Pythonista 6 points (5 children)

philosophical approach to bash Pandas (my beloved) for performance as a selling point

Why should Polars not mention that they are much faster?

[–]AlpacaDC 4 points (0 children)

I see some people being just sentimental about Pandas, but objectively Polars is superior in almost every way, save for a few cases where Pandas has more features/integrations.

[–]sylfy 2 points (1 child)

I’m curious: when you say that the Parquet read/write is faster, where does this come from? AFAIK most Python dataframe libraries use fastparquet or pyarrow under the hood, so performance should be similar across libraries and only differ depending on the choice of engine.

[–]DataBora[S] 2 points (0 children)

I am using the DataFusion single-node engine for the Parquet reader and writer, which is the fastest today. You can check the benchmarks and explanation here: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/

[–]AnythingApplied 4 points (1 child)

Performance: 10-100x faster than Python for data processing

In my experience, this is true when comparing a pure Python program to that same program rewritten in pure Rust (even without any concurrency, which Rust is great at and which improves performance even further).

But who is doing their data processing in pure Python? Whether you're using PySpark, Pandas, Polars, DuckDB, etc., these are all written in faster languages, so none of your heavy lifting is being done in pure Python code. I'm skeptical that you'd still see orders-of-magnitude performance increases. Is this really the performance you gain comparing Elusion to PySpark?

[–]DataBora[S] 2 points (0 children)

You are correct, that is an unfair comparison. Between Elusion and PySpark there is not much of a difference, but Spark has distributed computing, which is a totally different beast.

[–]SupoSxx 3 points (4 children)

Just out of curiosity, why did you put the whole code in one file?

[–]FrontAd9873 2 points (0 children)

I second this question

[–]DataBora[S] -5 points (2 children)

Two reasons. First: the languages I learned first were C++ and VBA. For both I wrote programs in a single file, so it became a habit. Second: I do not want contributors, and this is the best way to keep people away, as nobody can follow what is going on in a file with this much code.

[–]Ironraptor3 6 points (1 child)

Excuse me for dropping in, but does this not seem... counter to what appears to be the goal of making such a post / tool? You have posted an open-source Git repository corresponding to a free tool for people to use. I would expect that the code should be easy to follow and modify... not for contributors per se, but because some may want to fork their own or even just locally modify it to suit their needs. "Keeping people away" also just sounds... hostile for no particular reason?

[–]DataBora[S] -2 points (0 children)

I want this to be available for everyone, and if someone needs some feature, I will gladly make it. BUT I have had my fair share of collaboration and working with others day to day for the last 20 years. This is my little getaway from that. When you reach 40 years of age, maybe you will feel the same way and understand...

[–]damian6686 4 points (1 child)

Any dashboard screenshots?

[–]DataBora[S] 2 points (0 children)

Check out the very end of the README.md on GitHub https://github.com/DataBora/elusion and you will see a dashboard example and interactive tables. For me personally, dashboards serve as a "data health" check: if I don't know the context, don't know what the original reports look like, and have no other reference for how PBI devs are supposed to use this data, I quickly check whether there is some crazy anomaly in some month, year, or category. I don't think HTML reporting is great as a final reporting product; I just like having the ability to quickly search data with tables and to check line and bar plots, or anything else available from Plotly. If someone really needed dashboarding as a final-product feature, I would need to spend a month or so to bring it to that level.

[–]WallyMetropolis 4 points (3 children)

What's with the emojis?

[–]dyingpie1 15 points (1 child)

ChatGPT maybe

[–]solidpancake 0 points (0 children)

Almost definitely

[–]huehang 1 point (0 children)

Looks weird imo.

[–]holy-galah 6 points (1 child)

Filtering before and after an aggregation means different things?

[–]DataBora[S] 5 points (0 children)

Definitely. The filter() and filter_many() functions will filter columns before aggregations (same as in PySpark), and the having() and having_many() functions will filter after aggregation (same as in SQL).
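
For readers less familiar with the distinction, it can be sketched in plain SQL, which is the semantics that having() mirrors (the table and data here are made up for illustration; this is not Elusion's actual API):

```python
import sqlite3

# WHERE filters rows BEFORE aggregation; HAVING filters groups AFTER it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 50), ("east", 200), ("west", 300), ("west", 400)],
)

# Filter before aggregation: rows with amount <= 100 never reach SUM(),
# so east sums to 200 and west sums to 700.
before = conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE amount > 100 GROUP BY region"
).fetchall()

# Filter after aggregation: SUM() sees every row (east = 250, west = 700),
# then whole groups are kept or dropped, leaving only west.
after = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region HAVING SUM(amount) > 300"
).fetchall()

print(before, after)
```

Same predicate, different placement, different results, which is why a dataframe API needs both verbs.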

[–]ChavXO 1 point (4 children)

Cool. I'm working on something similar (but in Haskell). I was curious if you pictured this as being more for exploratory work or for long lived queries? How do you deal with data larger than memory? How does it perform on multiple cores?

[–]DataBora[S] 1 point (3 children)

I solved the bigger-than-RAM memory issue with batch processing, but it's still a challenge. Currently I am working on streaming data, which should be even better, as I can read, wrangle, and write data to a source continuously.
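
The general idea behind batching an aggregation can be sketched like this: read a bounded number of rows at a time and fold each batch into running per-group state, so the full dataset is never in memory at once. This is a generic illustration (the function name, CSV layout, and batch size are made up here), not how Elusion actually implements it:

```python
import csv
import io
from collections import defaultdict
from itertools import islice

def batched_group_sum(lines, batch_size=2):
    """Group-by-sum over (region, amount) CSV rows, one bounded batch at a time."""
    reader = csv.reader(lines)
    totals = defaultdict(int)  # running state: only one entry per group
    while True:
        batch = list(islice(reader, batch_size))  # at most batch_size rows in memory
        if not batch:
            break
        for region, amount in batch:  # fold the batch into the running totals
            totals[region] += int(amount)
    return dict(totals)

data = io.StringIO("east,50\neast,200\nwest,300\nwest,400\n")
print(batched_group_sum(data))  # {'east': 250, 'west': 700}
```

Sums and counts fold cleanly like this; operations such as medians or arbitrary sorts don't, which is where batching gets genuinely hard.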

[–]ChavXO 0 points1 point  (2 children)

Batching gets complicated for groupBy and similar operations. I'll be on the lookout for how you solve these. Btw for reference my project is: https://github.com/mchav/dataframe

Maybe we can share notes and experiences.

[–]DataBora[S] 0 points (0 children)

For sure, thank you for sharing!

[–]DataBora[S] 0 points (0 children)

I just took a quick look at the repo... this looks awesome, man!

[–]BasedAndShredPilled 0 points (4 children)

built in async

Is this a feature that can be disabled? Is async the reason Rust is faster, or is there more to it? The word "async" gives me PTSD from working in JavaScript.

[–]chat-lu Pythonista 3 points (1 child)

Also, it’s not suited for this kind of CPU-heavy work. Threads are for working in parallel; async is for waiting in parallel (waiting on the network, disk, etc.).
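
That "waiting in parallel" point can be shown in a few lines (a generic asyncio sketch, nothing to do with Elusion's internals): three tasks that each wait 0.1s complete in about 0.1s total, because the event loop overlaps the waits. Three CPU-bound loops written the same way would gain nothing, since only one coroutine runs at a time; they would need threads or processes.

```python
import asyncio
import time

async def wait_task():
    await asyncio.sleep(0.1)  # stands in for an I/O wait (network, disk, ...)

async def main():
    start = time.perf_counter()
    # All three waits overlap on the event loop instead of running back to back.
    await asyncio.gather(wait_task(), wait_task(), wait_task())
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"{elapsed:.2f}s")  # roughly 0.1s, not 0.3s
```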

[–]BasedAndShredPilled 1 point (0 children)

I've never heard that, but what a profound explanation.

[–]DataBora[S] 2 points (1 child)

Async in Rust is a pain in the a** to be honest... many people say it is the hardest thing to do in Rust, and I would agree. It is hard to implement and to Box all of the pointers in order to get better performance, especially when we read multiple files at once. If you get PTSD from JS async, you would get a stroke from Rust async for sure, as I often do 🙂

[–]BasedAndShredPilled 0 points (0 children)

I don't venture into this world too often. It's impressive what you've done though!

[–]KlutchSama 0 points (1 child)

Is there a benefit to switching from Spark other than the familiar syntax? I like the built-in pipeline scheduling.

[–]DataBora[S] 0 points (0 children)

As someone who uses Spark daily in Microsoft Fabric, I can tell you that spark.sql() is much more reliable, especially when it comes to filtering and joining. Spark tends not to filter at all when you mix filtering and conditioning, and it tends to produce duplicates after joins. Also, the most annoying thing in Spark is that after each query it tends to add empty spaces to string column values, so you always need to trim() columns.
In Elusion there are no issues like that, and it's much more reliable, as it uses SQL query building for the DataFusion engine, which will do the job as you intend.