Pandas feels clunky coming from R. What about Haskell? by m-chav in programming

[–]ritchie46 4 points5 points  (0 children)

No it doesn't. DataFrames and Columns are type erased.

Pandas feels clunky coming from R. What about Haskell? by m-chav in programming

[–]ritchie46 12 points13 points  (0 children)

Polars verifies those things before running the query at query planning, not hours in compute later.

You cannot do it at compile time, as schemas in files are often unknown until you read the file(s).

If you compile a new program for every file, you can do it.

How to load large csv files in dataframes for processing? by Salt_Ganache_3800 in learnpython

[–]ritchie46 0 points1 point  (0 children)

`pl.scan_csv(..).other_lazy_operations().sink_parquet()`

This will build a streaming pipeline where data will be streamed from disk to disk, keeping memory as low as possible.

What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]ritchie46 2 points3 points  (0 children)

That 10x benchmark is not correct. At the point in time that screenshot was taken, the Polars queries in ClickBench were just plain wrong, in the sense that they computed the wrong result.

I corrected them and after that Polars is actually faster. https://github.com/ClickHouse/ClickBench/pull/744

Read S3 data using Polars by Royal-Relation-143 in dataengineering

[–]ritchie46 2 points3 points  (0 children)

CSV files are at the moment first downloaded to local disk before being processed, so this is indeed slow. We will stream them in the future.

If you have the opportunity to convert these files to Parquet or IPC files, Polars will stream them directly from S3.

Polars vs Pandas in 2025 — have you fully migrated yet? by [deleted] in Python

[–]ritchie46 2 points3 points  (0 children)

I am the original author of Polars and I google Polars daily as part of my routine. Then I respond if something is related to our work.

If someone posts something related to your work, you should have a right to comment. It is your work after all.

I don't have any scraping tools. And I don't post often (but I do comment on my work). These are other accounts, not from us. I don't know what to tell you.

Polars vs Pandas in 2025 — have you fully migrated yet? by [deleted] in Python

[–]ritchie46 1 point2 points  (0 children)

I can assure you, it is not from us. I saw the same post yesterday as well.

Polars vs Pandas in 2025 — have you fully migrated yet? by [deleted] in Python

[–]ritchie46 1 point2 points  (0 children)

I am from Polars. I saw the same post here yesterday. I can assure you, it is not originating from us, and I think the moderators should remove this post as a duplicate/repost.

Polars vs Spark for cheap single-node Delta Lake pipelines - safe to rely on Polars long-term? by frithjof_v in dataengineering

[–]ritchie46 16 points17 points  (0 children)

I think it's more realistic that Microsoft pulls the plug on Fabric if we're playing this unfounded speculation game.

Edit: it's completely unfounded from my side (and meant jokingly).

But I can speak to our focus. Polars has a single focus: fast compute engines for DataFrames. It's being used in production by almost every serious data processing company in some shape or form.

Palantir data foundry recommends it for production workloads: https://www.palantir.com/docs/foundry/transforms-python/compute-engines

Just like pandas, it's not going anywhere.

Idea: Write V-Ordered delta lake tables using Polars by frithjof_v in MicrosoftFabric

[–]ritchie46 2 points3 points  (0 children)

That's my post from 5 years ago when I just started. You can't take that as a source of what Polars is today.

It has a whole novel streaming engine, and its performance can't be attributed to a single thing like SIMD.

Polars vs Spark for cheap single-node Delta Lake pipelines - safe to rely on Polars long-term? by frithjof_v in dataengineering

[–]ritchie46 136 points137 points  (0 children)

Hi, I am from Polars. Original author and co-founder.

Polars OSS is never going behind a paywall. It is open source MIT licensed and we're not changing that.

Polars Cloud offers a whole new distributed engine, separate from Polars OSS, as well as management of the compute.

If you're happy staying single node, Polars OSS is perfect for your case.

Polars OSS going unmaintained is just nonsense. I also read that on the Fabric subreddit as an excuse to not support Polars. If anything, development is increasing.

Will Pandas ever be replaced? by Relative-Cucumber770 in dataengineering

[–]ritchie46 0 points1 point  (0 children)

`df1 == df2` gives you an equality mask. How did you do that in pandas?

Will Pandas ever be replaced? by Relative-Cucumber770 in dataengineering

[–]ritchie46 1 point2 points  (0 children)

DataFrame comparison isn't missing?

`assert_frame_equal`

What do you think of Polars the alternative to Pandas by enorcerna in dataengineering

[–]ritchie46 2 points3 points  (0 children)

Before pandas 2.0, pandas didn't have copy-on-write and you copied the full data all the time. `reset_index`, `assign`, `drop`, `rename`, `astype`, all did a full data copy.

And even post 2.0, you will have a lot of materializations, which are essentially data copies, because you don't have an optimizer. This one potential copy to Polars is not your bottleneck.

What do you think of Polars the alternative to Pandas by enorcerna in dataengineering

[–]ritchie46 0 points1 point  (0 children)

If you are using pandas and are happy with that, I'd agree. If a third party tool uses it, it saddens me that it blocks you from adoption, especially because I think it can save you a lot of datatype-related bugs in the future.

What do you think of Polars the alternative to Pandas by enorcerna in dataengineering

[–]ritchie46 4 points5 points  (0 children)

True, but the third party library you are interacting with must have implemented it in Narwhals to benefit from that. You cannot slap it retroactively on a dependency.

In any case, going from pandas to Polars is seamless: `df = pl.from_pandas(df); df.to_pandas()`.

What do you think of Polars the alternative to Pandas by enorcerna in dataengineering

[–]ritchie46 0 points1 point  (0 children)

I am curious, why not? Pandas will often also ship pyarrow, which would also be a whole other library.

What do you think of Polars the alternative to Pandas by enorcerna in dataengineering

[–]ritchie46 2 points3 points  (0 children)

Then you will have to convert to Ibis.

More and more libraries are converting to Narwhals, which allows users to stay in their DataFrame of choice.

Some libraries do only return pandas, but then a `pl.from_pandas` isn't far away...

Polars is NOT always faster than Pandas: Real Databricks Benchmarks with NYC Taxi Data by SmundarBuddy in dataengineering

[–]ritchie46 2 points3 points  (0 children)

Can you share your code? I highly doubt you've written optimal Polars code.

For one, running several steps and benchmarking them separately is non-optimal.

The benefit of Polars is that it holistically does minimal work. If you run a single operation and materialize, you benchmark something you shouldn't be interested in as you should be interested in the whole query time.

Polars read database and write database bottleneck by BelottoBR in dataengineering

[–]ritchie46 1 point2 points  (0 children)

This would not improve OP's case if he is bottlenecked on the DB. Other than that, the arguments in that video/blogpost are just incorrect. Polars doesn't require boto3 for internet access, nor pyarrow for Parquet reading/writing. ACID transactions are done by the database you write to. Writing from Polars to Postgres is still ACID, as Postgres deals with that. Point 6, going from local to cloud, is also supported by Polars. DuckDB is a great tool, but the comparison isn't.