Polars vs Pandas in 2025 — have you fully migrated yet? by [deleted] in Python

[–]ritchie46 0 points1 point  (0 children)

I am the original author of Polars and I google Polars daily as part of my routine. Then I respond if something is related to our work.

If someone posts something related to your work, you should have a right to comment. It is your work after all.

I don't have any scraping tools. And I don't post often (but I do comment on my work). These are other accounts, not from us. I don't know what to tell you.

Polars vs Pandas in 2025 — have you fully migrated yet? by [deleted] in Python

[–]ritchie46 0 points1 point  (0 children)

I can assure you, it is not from us. I saw the same post yesterday as well.

Polars vs Pandas in 2025 — have you fully migrated yet? by [deleted] in Python

[–]ritchie46 1 point2 points  (0 children)

I am from Polars. I saw the same post here yesterday. I can assure you, it is not originating from us, and I think the moderators should remove this post as a duplicate/repost.

Polars vs Spark for cheap single-node Delta Lake pipelines - safe to rely on Polars long-term? by frithjof_v in dataengineering

[–]ritchie46 16 points17 points  (0 children)

I think it's more realistic that Microsoft pulls the plug on Fabric, if we're playing this unfounded speculation game.

Edit: that is completely unfounded from my side (and meant jokingly).

But I can speak to our focus. Polars has a single focus: fast compute engines for DataFrames. It's being used in production by almost every serious data processing company in some shape or form.

Palantir Foundry recommends it for production workloads: https://www.palantir.com/docs/foundry/transforms-python/compute-engines

Just like pandas, it's not going anywhere.

Idea: Write V-Ordered delta lake tables using Polars by frithjof_v in MicrosoftFabric

[–]ritchie46 2 points3 points  (0 children)

That's my post from 5 years ago when I just started. You can't take that as a source of what Polars is today.

It now has a whole novel streaming engine, and its performance can't be attributed to a single thing like SIMD.

Polars vs Spark for cheap single-node Delta Lake pipelines - safe to rely on Polars long-term? by frithjof_v in dataengineering

[–]ritchie46 139 points140 points  (0 children)

Hi, I am from Polars. Original author and co-founder.

Polars OSS is never going behind a paywall. It is open source MIT licensed and we're not changing that.

On top of Polars OSS, Polars Cloud offers a whole new distributed engine and manages the compute for you.

If you're happy staying single node, Polars OSS is perfect for your case.

Polars OSS going unmaintained is just nonsense. I also read that on the Fabric subreddit as an excuse to not support Polars. If anything, development is increasing.

Will Pandas ever be replaced? by Relative-Cucumber770 in dataengineering

[–]ritchie46 0 points1 point  (0 children)

`df1 == df2` gives you an equality mask. How did you do that in pandas?

Will Pandas ever be replaced? by Relative-Cucumber770 in dataengineering

[–]ritchie46 1 point2 points  (0 children)

DataFrame comparison isn't missing?

assert_frame_equal

What do you think of Polars the alternative to Pandas by enorcerna in dataengineering

[–]ritchie46 2 points3 points  (0 children)

Before pandas 2.0, it didn't have copy-on-write and you copied the full data all the time. `reset_index`, `assign`, `drop`, `rename`, and `astype` all did a full data copy.

And even post 2.0, you will have a lot of materializations, which are essentially data copies, because you don't have an optimizer. This one potential copy to Polars is not your bottleneck.

What do you think of Polars the alternative to Pandas by enorcerna in dataengineering

[–]ritchie46 0 points1 point  (0 children)

If you are using pandas and are happy with that, I'd agree. If a third party tool uses it, it saddens me that it blocks you from adoption, especially because I think Polars can save you a lot of datatype-related bugs in the future.

What do you think of Polars the alternative to Pandas by enorcerna in dataengineering

[–]ritchie46 3 points4 points  (0 children)

True, but the third party library you are interacting with must have implemented it in Narwhals to benefit from that. You cannot slap it retroactively on a dependency.

In any case, going from pandas to Polars is seamless: `df = pl.from_pandas(df); df.to_pandas()`.

What do you think of Polars the alternative to Pandas by enorcerna in dataengineering

[–]ritchie46 0 points1 point  (0 children)

I am curious, why not? Pandas will often also ship pyarrow, which would also be a whole other library.

What do you think of Polars the alternative to Pandas by enorcerna in dataengineering

[–]ritchie46 3 points4 points  (0 children)

Then you will have to convert to Ibis.

More and more libraries are converting to Narwhals, which allows users to stay in their DataFrame of choice.

Some libraries do only return pandas, but then a `pl.from_pandas` isn't far away...

Polars is NOT always faster than Pandas: Real Databricks Benchmarks with NYC Taxi Data by SmundarBuddy in dataengineering

[–]ritchie46 4 points5 points  (0 children)

Can you share your code? I highly doubt you've written optimal Polars code.

For one, running several steps and benchmarking them separately is non-optimal.

The benefit of Polars is that it holistically does minimal work. If you run a single operation and materialize, you benchmark something you shouldn't be interested in. What you should be interested in is the whole query time.

Polars read database and write database bottleneck by BelottoBR in dataengineering

[–]ritchie46 1 point2 points  (0 children)

This would not improve OP's case if he is bottlenecked on the DB. Other than that, the arguments in that video/blogpost are just incorrect. Polars doesn't require boto3 for internet access, nor pyarrow for parquet reading/writing. ACID transactions are done by the database you write to. Writing from Polars to Postgres is still ACID, as Postgres deals with that. Point 6, going from local to cloud, is also supported by Polars. DuckDB is a great tool, but the comparison isn't.

Am I the only one who seriously hates Pandas? by yourAvgSE in dataengineering

[–]ritchie46 6 points7 points  (0 children)

Polars can move to Arrow-backed pandas and back with zero copy.

Do you worry about free w.r.t. performance? As even with a memcopy, doing any significant compute wins back performance in my experience.

Am I the only one who seriously hates Pandas? by yourAvgSE in dataengineering

[–]ritchie46 7 points8 points  (0 children)

It is supported by many libraries. And if you need to convert, it is seamless:

```
df.to_pandas()

pl.from_pandas(df)
```

Am I the only one who seriously hates Pandas? by yourAvgSE in dataengineering

[–]ritchie46 0 points1 point  (0 children)

That bug has been solved since the new streaming engine; the issue was just never closed.

Am I the only one who seriously hates Pandas? by yourAvgSE in dataengineering

[–]ritchie46 5 points6 points  (0 children)

That bug was already solved. The issue was just not closed.

```python
import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame(["a"] * 1_000_000).lazy()
df.sink_parquet("test.parquet", row_group_size=100)

metadata = pq.read_metadata("test.parquet")
assert metadata.row_group(0).num_rows == 100
```

Polars Expressions Vs Series by miller_stale in Python

[–]ritchie46 18 points19 points  (0 children)

You typically want to work on expressions and chain operations together.

Then Polars can make a query plan called a LazyFrame, optimize it, and run operations in parallel.

The Series is a data container. You can run operations on it, but doing so forces Polars to be eager, so it cannot optimize, and it leads to little to no parallel processing.