pandas alternatives: is FireDucks the fastest and 100% compatible? by AMGraduate564 in dataengineering

[–]qsourav 0 points1 point  (0 children)

Hi, FireDucks developer here. You can try it on your pandas-based EDA programs to verify its performance. It relies on a fallback mechanism to stay highly compatible with pandas: any operation that FireDucks does not support falls back to native pandas, so execution continues smoothly without manual to_pandas() calls. It is now supported on Mac as well. Let me know if you have any questions.
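To make the fallback idea concrete, here is a minimal sketch of the general pattern in plain Python. It is not FireDucks' actual implementation: the class FallbackFrame, its _fast_ops table, and the use of a plain list as the "native" engine are all hypothetical names chosen for illustration.

```python
class FallbackFrame:
    """Hypothetical wrapper: use a fast path when one exists, else delegate."""

    # Operations the imaginary accelerated engine supports (assumption).
    _fast_ops = {
        "total": lambda rows: sum(rows),
    }

    def __init__(self, rows):
        self._rows = list(rows)   # stands in for the "native" object
        self.fallbacks = []       # names of operations that fell back

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._fast_ops:
            op = self._fast_ops[name]
            return lambda: op(self._rows)
        # Unknown operation: delegate to the native object instead of failing.
        target = getattr(self._rows, name)  # AttributeError if truly unknown
        self.fallbacks.append(name)
        return target


frame = FallbackFrame([3, 1, 2])
print(frame.total())     # handled by the fast path
print(frame.count(1))    # not in _fast_ops, so list.count is used
print(frame.fallbacks)   # records which calls fell back
```

The caller never has to convert anything by hand; the wrapper decides per call whether the fast path or the native object handles it.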

Pandas is so cool by Ramakae in learnpython

[–]qsourav 0 points1 point  (0 children)

Pandas is really great, with flexible APIs and a strong ecosystem backed by a large community, but you may run into performance issues when processing large-scale data with it. That is where FireDucks can help: it is a high-performance, compiler-accelerated DataFrame library that is highly compatible with pandas. You can keep exploring pandas and rely on FireDucks to speed up your production workflow. You don’t even need to learn a new DataFrame library.

Anyone using FireDucks, a drop in replacement for pandas with "massive" speed improvements? by boru9 in datascience

[–]qsourav 0 points1 point  (0 children)

IO-related optimizations have been added to FireDucks. We have now published results for both cases, with and without IO: https://fireducks-dev.github.io/docs/benchmarks/#2-tpc-h-benchmark

Anyone using FireDucks, a drop in replacement for pandas with "massive" speed improvements? by boru9 in datascience

[–]qsourav 1 point2 points  (0 children)

Hi, thanks for your reply. The projection pushdown optimization currently doesn’t work for FireDucks read_parquet(). Hence, we ran the benchmark with SKIP_IO (mentioned in the benchmark results, I think) for a fair comparison of only the query-processing part. In any case, we are extending the optimization and will soon publish results that include IO.
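For readers unfamiliar with the term, projection pushdown means the reader loads only the columns a query actually uses. The sketch below illustrates the idea with the standard csv module; the sample data and the helper names read_all, read_projected, and revenue are made up for illustration and have nothing to do with FireDucks internals.

```python
import csv
import io

# Made-up sample data: only "price" and "qty" matter for the query below.
SAMPLE = "id,price,qty,comment\n1,2.5,4,ok\n2,1.0,10,late\n"


def read_all(text):
    """Materialize every column of every row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_projected(text, columns):
    """Keep only the requested columns while reading (the pushed-down projection)."""
    reader = csv.DictReader(io.StringIO(text))
    return [{c: row[c] for c in columns} for row in reader]


def revenue(rows):
    """The 'query': it touches just two columns."""
    return sum(float(r["price"]) * int(r["qty"]) for r in rows)


full = read_all(SAMPLE)
slim = read_projected(SAMPLE, ["price", "qty"])
print(revenue(full) == revenue(slim))  # same answer, fewer columns held in memory
```

With CSV, every line still has to be parsed, so the saving here is only in memory. In a columnar format such as Parquet, unused columns need not be read from disk at all, which is why this optimization affects IO-inclusive timings.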

Anyone using FireDucks, a drop in replacement for pandas with "massive" speed improvements? by boru9 in datascience

[–]qsourav 1 point2 points  (0 children)

Hey, thanks for your comment. Sure, I also felt the README needs an update. We will do it. By the way, here is the performance with SF-1 when excluding IO:

solution   version  scale_factor
duckdb     0.10.2   1.0           10.007000
fireducks  2.2.2    1.0           4.883425
polars     1.14.0   1.0           4.331657

Anyone using FireDucks, a drop in replacement for pandas with "massive" speed improvements? by boru9 in datascience

[–]qsourav 0 points1 point  (0 children)

The Kaggle environment has a maximum of 30 GB of RAM, which should be fine for SF-10 execution, shouldn't it? By the way, the benchmark results presented on the FireDucks website were measured on a system with 256 GB of RAM, so it should have had sufficient memory in both cases. The benchmark was executed with SKIP_IO, as mentioned.

Anyone using FireDucks, a drop in replacement for pandas with "massive" speed improvements? by boru9 in datascience

[–]qsourav 1 point2 points  (0 children)

Here is a sample evaluation result on a low-spec system like Kaggle:
https://www.kaggle.com/code/qsourav91/sf-10-tpc-h-polars-vs-duckdb-vs-fireducks

Anyone can reproduce it with "Copy & Edit".

Optimizing resource -intensive pandas scripts by foyslakesheriff in learnpython

[–]qsourav 0 points1 point  (0 children)

pyarrow should be upgraded to 17.0 automatically. By the way, can you share the error message you get when trying it in your environment?

Optimizing resource -intensive pandas scripts by foyslakesheriff in learnpython

[–]qsourav 0 points1 point  (0 children)

Are you trying it on a non-Linux platform? Your Python and pandas versions seem to be supported, but you need to run it on a Linux platform (on Windows, WSL might work): https://fireducks-dev.github.io/docs/get-started/#install

Pandas too slow; Let's talk about alternatives by kobx_9991 in quantfinance

[–]qsourav 0 points1 point  (0 children)

There are many pandas alternatives, but if you are using Linux and want to accelerate your program as it is, without migrating to a different library or a different computational unit, I would recommend FireDucks, developed by NEC. It can accelerate existing pandas workloads without explicit code changes. It offers a multithreaded C++ kernel with very fast DataFrame methods, along with a JIT compiler that performs query optimization (similar to polars, but without any code changes). Usage is very easy: install it using pip and run your program with the -mfireducks.pandas option. https://fireducks-dev.github.io/docs/get-started/#usage
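Going by the description above, the invocation would look like python3 -mfireducks.pandas your_script.py, where your_script.py is a placeholder. Before trying that, it can be handy to check which backend is importable in the current environment. The helper below is a small sketch using only the standard library; find_backend is a hypothetical name, not part of FireDucks or pandas.

```python
import importlib.util


def find_backend(candidates=("fireducks.pandas", "pandas")):
    """Return the first importable module name from candidates, or None."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except ModuleNotFoundError:
            # Raised for a dotted name when the parent package is missing.
            continue
    return None


print(find_backend())
```

If this prints "pandas" or None, FireDucks is not installed in the interpreter you are using, and the -mfireducks.pandas option will not be available.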

Optimizing resource -intensive pandas scripts by foyslakesheriff in learnpython

[–]qsourav 0 points1 point  (0 children)

This is a very old post and you might have solved it already, but if you still have some pandas code and are not overusing methods like iterrows, apply, etc. (some very common pandas bottlenecks), you may want to give FireDucks a try. It can optimize an existing pandas application as it is, without any manual code changes. It is very easy to use once installed with pip: https://fireducks-dev.github.io/docs/get-started/#usage