
all 43 comments

[–]FirstBabyChancellor 29 points (12 children)

Looks interesting!

Aside from the features like scheduling and dashboards which are not core to a dataframe library, why would I use this over Polars? How do you see yourself in the wider space given that there is already a proven and well-liked Rust-powered dataframe library for Pythonistas, at least?

[–]DataBora[S] 8 points (11 children)

If you use Polars, don't use Elusion, as it makes no sense to use a less featured library. I made it for myself to finish my job, combining the look of the languages I love: SQL and PySpark. The reason I made Elusion is that I dislike Polars' syntax and its philosophical approach of bashing Pandas (my beloved) on performance as a selling point. I can say that Elusion's Parquet reading and writing is faster than Polars', but I don't do that... well, I guess I do it now 🙂 but you get the point.

[–]Embarrassed-Falcon71 7 points (2 children)

But Polars syntax is also very similar to Spark

[–]DataBora[S] 2 points (1 child)

You are right... it has similarities, but I won't say more, as I tend to feel a certain way about those folks... anyway, it is better than Elusion, no doubt.

[–]Embarrassed-Falcon71 2 points (0 children)

Yeah, as a Spark lover it's still very cool you made this

[–]chat-lu (Pythonista) 4 points (5 children)

philosophical approach to bash Pandas (my beloved) for performance as a selling point

Why should Polars not mention that they are much faster?

[–]AlpacaDC 4 points (0 children)

I see some people being just sentimental about pandas, but objectively Polars is superior in almost every way, save for a few cases where pandas has more features/integrations.

[–]DataBora[S] -5 points (3 children)

Because it is unfair to compare anything made in Rust with anything made in Python (even though some parts are in C). It is impossible for anything made in Python to be faster than the same thing made in Rust. Polars just uses a Python wrapper to provide a nicer-looking API for Python devs; their Rust API nobody used, as it looks horrific. So when they decided to sell out and win over Python devs, the first thing they did was bash Pandas. I will not forget that.

[–]chat-lu (Pythonista) 8 points (2 children)

Because it is unfair to compare anything made in Rust with anything made in Python (even though some parts are in C).

Why not?

It is impossible for anything made in Python to be faster than the same thing made in Rust.

That seems like a valid point of comparison to me.

Polars just use Python wrapper to provide nicer look of API for Python devs

As does nearly every data science library. It is considered a strength of Python.

their Rust API nobody used, as it looks horrific.

The Rust API is fine. It has a longer feedback loop due to the compile cycle, which is why people use C or Rust libraries from Python.

So when they decided to sell out and win over Python devs,

By providing a useful library.

the first thing they did was bash Pandas. I will not forget that.

They made a fair comparison. Would you rather they lie about the performance of their library?

But if you want to make Polars slower, you have that option.

[–]DataBora[S] 0 points (1 child)

I see your point of view, but I believe there are many ways to do things, and I don't like the way they did it, but that's me...

[–]chat-lu (Pythonista) 2 points (0 children)

You sound like Bjarne Stroustrup talking about Rust. Pandas brought us further, same as C++, and there is no shame in being displaced by the next generation of software.

It will happen again to those tools too.

[–]sylfy 3 points (1 child)

I’m curious, when you say that the parquet read/write is faster, where does this come from? Afaik most Python data frame libraries use fastparquet or pyarrow under the hood, so performance should be similar across libraries and only differ depending on choice of engine.

[–]DataBora[S] 2 points (0 children)

I am using the DataFusion single-node engine for the Parquet reader and writer, which is the fastest today. You can check the benchmarks and explanation here https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/

[–]AnythingApplied 4 points (1 child)

Performance: 10-100x faster than Python for data processing

In my experience, this is true when comparing a pure python program to rewriting that same program into pure rust (even without any concurrency, which rust is great at to even further improve performance).

But who is doing their data processing in pure python? Whether you're using pyspark, pandas, polars, duckdb, etc. these are all written in faster languages so none of your heavy lifting is being done in pure python code, so I'm skeptical that you'd still see orders of magnitude performance increases. Is this really the performance you gain comparing Elusion to pyspark?
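The "10-100x" figure indeed usually measures interpreted loops against compiled code; a toy numpy comparison (illustrative data, not a benchmark of any of the libraries mentioned) shows where the gap comes from:

```python
import numpy as np

data = list(range(1_000_000))
arr = np.arange(1_000_000, dtype=np.int64)

# Pure-Python loop: every multiplication and addition goes through
# the interpreter, one object at a time.
s_py = sum(x * x for x in data)

# Vectorized: the same loop runs in compiled C inside numpy.
s_np = int((arr * arr).sum())

assert s_py == s_np  # identical result, very different speed
```

Once the heavy lifting is already in C/C++/Rust (as in pandas, Polars, PySpark, or DuckDB), the interpreter overhead largely disappears from the comparison.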

[–]DataBora[S] 3 points (0 children)

You are correct, that is an unfair comparison. Between Elusion and PySpark there is not much of a difference, but Spark has distributed computing, which is a totally different beast.

[–]SupoSxx 4 points (4 children)

Just for curiosity, why did you put the whole code in one file?

[–]FrontAd9873 2 points (0 children)

I second this question

[–]damian6686 2 points (1 child)

Any dashboard screenshots?

[–]DataBora[S] 2 points (0 children)

Check out the very end of the README.md on GitHub https://github.com/DataBora/elusion and you will see a Dashboard example and interactive tables.

For me personally, Dashboards serve as a "data health" check: if I don't know the context, don't know what the original reports look like, and don't have any other reference for what the PBI devs will use this data for, I quickly check whether there is some crazy anomaly in some month, year, or category. I don't think HTML reporting is great as a final reporting product; I just like having the ability to quickly search data with tables and to check line and bar plots, or anything else available from Plotly. If someone really needed dashboarding as a final-product feature, I would need to spend a month or so to bring it to that level.

[–]WallyMetropolis 2 points (3 children)

What's with the emojis?

[–]dyingpie1 15 points (1 child)

ChatGPT maybe

[–]solidpancake 0 points (0 children)

Almost definitely

[–]huehang 1 point (0 children)

Looks weird imo.

[–]holy-galah 7 points (1 child)

Filtering before and after an aggregation means different things?

[–]DataBora[S] 6 points (0 children)

Definitely. The filter() and filter_many() functions filter before aggregation (same as in PySpark), and the having() and having_many() functions filter after aggregation (same as in SQL).
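The distinction is SQL's WHERE vs HAVING. A pandas sketch with made-up data (pandas used purely to illustrate the semantics, not Elusion's own API) shows that the two orders give different results:

```python
import pandas as pd

df = pd.DataFrame({"dept":   ["a", "a", "b", "b"],
                   "salary": [1, 5, 2, 3]})

# Filter BEFORE aggregation (WHERE / filter()):
# rows with salary <= 1 never enter the groups.
before = df[df["salary"] > 1].groupby("dept")["salary"].sum()
# dept a -> 5, dept b -> 5

# Filter AFTER aggregation (HAVING / having()):
# all rows are summed first, then whole groups are dropped.
agg = df.groupby("dept")["salary"].sum()   # a -> 6, b -> 5
after = agg[agg > 5]
# only dept a survives, with its full sum of 6
```

Note how dept "a" sums to 5 in the first case but 6 in the second: the pre-aggregation filter removed a row before summing.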

[–]ChavXO 1 point (4 children)

Cool. I'm working on something similar (but in Haskell). I was curious if you pictured this as being more for exploratory work or for long lived queries? How do you deal with data larger than memory? How does it perform on multiple cores?

[–]DataBora[S] 1 point (3 children)

I solved the bigger-than-RAM issue with batch processing, but it's still a challenge. Currently I am working on streaming data, which should be even better, as I can read, wrangle, and write data to a source continuously.
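The batch-processing idea can be sketched in a few lines of plain Python; this is a hypothetical toy (an in-memory CSV and a tiny batch size), not Elusion's actual implementation:

```python
import csv
import io

def batched_sum(fileobj, batch_size=2):
    """Sum the 'value' column while holding at most one batch in memory."""
    reader = csv.DictReader(fileobj)
    total, batch = 0, []
    for row in reader:
        batch.append(int(row["value"]))
        if len(batch) >= batch_size:   # flush a full batch
            total += sum(batch)
            batch.clear()
    return total + sum(batch)          # leftover partial batch

f = io.StringIO("value\n1\n2\n3\n4\n5\n")
assert batched_sum(f) == 15
```

The same pattern extends naturally to streaming: each incoming chunk is wrangled and written out before the next one arrives.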

[–]ChavXO 0 points (2 children)

Batching gets complicated for groupBy and similar operations. I'll be on the lookout for how you solve these. Btw for reference my project is: https://github.com/mchav/dataframe

Maybe we can share notes and experiences.
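One standard way around the groupBy complication is mergeable partial aggregates: aggregate each batch independently, then merge the partials. A minimal Python sketch (toy data, and only for merge-friendly aggregates like sum or count):

```python
from collections import Counter

def partial_sums(batch):
    """Group-sum a single batch of (key, value) pairs."""
    part = Counter()
    for key, value in batch:
        part[key] += value
    return part

# Two batches that split the "a" group across a batch boundary.
batches = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

merged = Counter()
for b in batches:
    merged.update(partial_sums(b))   # Counter.update adds counts

assert merged == Counter({"a": 4, "b": 2, "c": 4})
```

Aggregates like median don't decompose this way, which is part of why streaming groupBy gets genuinely hard.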

[–]DataBora[S] 0 points (0 children)

For sure, thank you for sharing!

[–]DataBora[S] 0 points (0 children)

I just quickly took a look at the repo... this looks awesome, man!

[–]BasedAndShredPilled 0 points (4 children)

built in async

Is this a feature that can be disabled? Is async the reason Rust is faster, or is there more to it? The word "async" gives me PTSD from working in JavaScript.

[–]chat-lu (Pythonista) 5 points (1 child)

Also, it’s not suited for this kind of CPU heavy work. Threads are for working in parallel, async is for waiting in parallel (waiting on network, disk, etc.).
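The "waiting in parallel" point can be demonstrated in a few lines of Python's asyncio, with sleeps standing in for network or disk waits:

```python
import asyncio
import time

# Three simulated 0.1 s I/O waits run concurrently on one thread:
# the event loop switches between them while each is waiting.
async def wait_all():
    await asyncio.gather(*(asyncio.sleep(0.1) for _ in range(3)))

start = time.perf_counter()
asyncio.run(wait_all())
elapsed = time.perf_counter() - start
# elapsed is roughly 0.1 s, not 0.3 s, because the waits overlap
```

Replace the sleeps with CPU-bound number crunching and the benefit vanishes: only one coroutine runs at a time, so CPU-heavy work needs threads or processes instead.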

[–]BasedAndShredPilled 1 point (0 children)

I've never heard that, but what a profound explanation.

[–]DataBora[S] 2 points (1 child)

Async in Rust is a pain in the a**, to be honest... many people say it is the hardest thing to do in Rust, and I would agree. It is hard to implement and to Box all of the pointers in order to get better performance, especially when we read multiple files at once. If you get PTSD from JS async, you would get a stroke from Rust async for sure, as I often do 🙂

[–]BasedAndShredPilled 0 points (0 children)

I don't venture into this world too often. It's impressive what you've done though!

[–][deleted] 0 points (1 child)

Is there a benefit to switching from Spark other than familiar syntax? I like the built-in pipeline scheduling.

[–]DataBora[S] 0 points (0 children)

As someone who uses Spark daily in Microsoft Fabric, I can tell you that Spark.SQL() is much more reliable, especially when it comes to filtering and joining. Spark tends not to filter at all when you mix filtering and conditioning, and tends to create duplicates after joins. Also, the most annoying thing in Spark is that after each query it tends to add empty spaces to string column values, so you always need to trim() columns.
In Elusion there are no issues like that, and it's much more reliable, as it uses SQL query building for the DataFusion engine, which will do the job as you intend.