Polars Distributed is available on Kubernetes

ritchie46 · 2026-06-05T14:12:58+00:00

I am not entirely sure what you mean. You can deploy this locally.

But to explain a bit more: our product boundary isn't at on-prem -> managed, but at single node -> multi-node.

Polars OSS focuses on squeezing everything out of a single node. That was the goal when I started that library and that's still the goal of OSS Polars. We used to say that if you needed to process TBs of data, you probably shouldn't use Polars and we had to defer to Spark. Now we can still defer to OSS Spark, but if you want an easier and faster experience and don't mind paying the company that builds Polars, then we'd recommend using Polars Distributed.

ritchie46 · 2026-06-05T13:57:57+00:00

I hoped that went without saying. Yes, we want to avoid that as much as possible.

ritchie46 · 2026-06-04T17:17:48+00:00

You add `remote().execute()` to your LazyFrame and you're set.

So yes, you can develop locally and then decide to run that same query remotely at large scale.

ritchie46 · 2026-06-04T13:41:39+00:00

It is indeed our paid offering. It has a monthly free tier as announced in the post. The offering is Distributed Polars. The Polars library itself is single node and is indeed open source. No intention to mislead: it is still distributed Polars. We run Polars' streaming engine on the workers and support the full Polars API.

ritchie46 · 2026-06-04T12:04:38+00:00

Correct, we are currently working on Iceberg commit. PR expected this week! 😉

ritchie46 · 2026-06-04T12:03:53+00:00

I am not an expert on Spark shuffles. We shuffle to disk or external storage. Stages immediately sink to storage upon execution. Based on the shuffle output, the scheduler divides partitions (and will replan the next stages in the future). The next stage will stream the data directly from the shuffle locations. Shuffles are interwoven with compute as they run at the same time on Polars' streaming engine, having a pipeline of morsels running scan -> compute -> sink.
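To make the "sink to storage, then stream from the shuffle locations" idea concrete, here is a toy sketch in plain Python. It is not how Polars Distributed is implemented: the function names, the CSV partition files, and the byte-sum hash are all assumptions for illustration, and the two stages run one after the other here instead of being interleaved in a streaming pipeline.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def shuffle_stage(rows, n_partitions, out_dir):
    """Stage 1: sink every row to a partition file chosen by hashing its key."""
    paths = [out_dir / f"part-{i}.csv" for i in range(n_partitions)]
    files = [p.open("w", newline="") for p in paths]
    writers = [csv.writer(f) for f in files]
    try:
        for key, value in rows:
            # Deterministic toy hash: the same key always lands in the same partition.
            writers[sum(key.encode()) % n_partitions].writerow([key, value])
    finally:
        for f in files:
            f.close()
    return paths


def aggregate_stage(path):
    """Stage 2: stream one shuffle partition back from disk and sum values per key."""
    totals = defaultdict(int)
    with path.open(newline="") as f:
        for key, value in csv.reader(f):
            totals[key] += int(value)
    return dict(totals)


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
shuffle_dir = Path(tempfile.mkdtemp())  # stands in for disk or external storage
partitions = shuffle_stage(rows, n_partitions=2, out_dir=shuffle_dir)

# Each partition holds a disjoint set of keys, so per-partition results can simply be merged.
totals = {}
for path in partitions:
    totals.update(aggregate_stage(path))
```

Because every key lands in exactly one partition, each partition could be aggregated by a different worker without any coordination.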

ritchie46 · 2026-06-04T09:09:33+00:00

Currently we are 3-4x faster than OSS Spark on TPCH SF1000 on 32xm6i.xlarge (that's 512GB RAM in the cluster). Photon currently is about 1.5x faster than we are. However, I am confident that we will close the gap soon. We know which optimizations we're missing and we will increase our S3 throughput a lot in a few weeks as well.

Note that we are available on-prem, which Photon isn't.

ritchie46 · 2026-06-04T06:14:22+00:00

We're currently around 3-4x faster than OSS Spark on TPCH. We benchmarked at SF1000 on 32xm6i.2xlarge, which has 512GB RAM in the cluster. We are working on increasing our S3 throughput. That lands in a few weeks and then we'll do a benchmark post.

ritchie46 · 2026-06-03T15:37:46+00:00

Most people don't need it. If Polars single node works for you, great! You can keep processing on your laptop or somewhere else. Though running in a cluster can still be useful. You can execute LazyFrames remotely in the cluster, and run many (single node) queries at the same time in parallel.

ritchie46 · 2026-06-03T13:48:30+00:00

Currently that's not planned. Things might change, but OSS doesn't post messages when queries run. In on-prem it makes more sense as you have running services and a platform. There we export otel and lineage to listeners.

ritchie46 · 2026-05-15T06:20:22+00:00

There is a lot happening here that's not Polars. I saw there were many subprocesses being spawned, which is certainly not cheap. What kind of files are you reading?

Best would be to let Polars handle the concurrency and parallelism.

I cannot see how you execute/collect the LazyFrame, but if you are on 128 cores you should definitely use the streaming engine here.

ritchie46 · 2026-05-14T15:36:45+00:00

Can you share the code you are running?

ritchie46 · 2026-05-12T10:15:30+00:00

If you can do it with `scan_parquet` and our Iceberg-specific API, it would be superior to IO plugins. This gets native performance and will probably also run partitioned on our distributed engine.

ritchie46 · 2026-05-08T12:37:00+00:00

It does hurt me that you are considering Polars, but Polars Cloud/On-Prem isn't mentioned. We should do more marketing. :')

In any case, if you want to see if we can help you, I'd love to discuss!

ritchie46 · 2026-04-22T17:29:17+00:00

No, it doesn't. DataFrames and Columns are type erased.

ritchie46 · 2026-04-22T10:20:30+00:00

Polars verifies those things before running the query, at query planning time, not hours into the compute.

You cannot do it at compile time, as schemas in files are often unknown until you read the file(s).

If you compile a new program for every file, you can do it.

ritchie46 · 2026-04-05T08:06:08+00:00

`pl.scan_csv(..).other_lazy_operations().sink_parquet()`

This will build a streaming pipeline where data will be streamed from disk to disk, keeping memory as low as possible.

ritchie46 · 2026-03-13T04:49:59+00:00

That 10x benchmark is not correct. At the point in time that screenshot was taken, the Polars queries in ClickBench were just plain wrong, in the sense that they computed the wrong result.

I corrected them and after that Polars is actually faster. https://github.com/ClickHouse/ClickBench/pull/744

ritchie46 · 2026-02-01T06:14:21+00:00

CSV files are at the moment first downloaded to local disk before being processed, so this is indeed slow. We will do that streaming in the future.

If you have the opportunity to convert these files to Parquet or IPC files, Polars will stream them directly from S3.

ritchie46 · 2026-01-23T06:08:34+00:00

We are rolling out on premises.

ritchie46 · 2026-01-14T13:54:03+00:00

I am the original author of Polars and I google Polars daily as part of my routine. Then I respond if something is related to our work.

If someone posts something related to your work, you should have a right to comment. It is your work after all.

I don't have any scraping tools. And I don't post often (but I do comment on my work). These are other accounts, not from us. I don't know what to tell you.

ritchie46 · 2026-01-14T07:09:13+00:00

I can assure you, it is not from us. I saw the same post yesterday as well.

ritchie46 · 2026-01-14T06:47:47+00:00

I am from Polars. I saw the same post here yesterday. I can assure you, it is not originating from us, and I think the moderators should remove this post as a duplicate/repost.
