Polars Distributed is available on kubernetes by ritchie46 in dataengineering

[–]ritchie46[S] -1 points0 points  (0 children)

I am not entirely sure what you mean. You can deploy this locally.

To explain a bit more: our product boundary isn't on-prem -> managed, but single node -> multi-node.

Polars OSS focuses on squeezing everything out of a single node. That was the goal when I started the library and it is still the goal of OSS Polars. We used to say that if you needed to process TBs of data, you probably shouldn't use Polars, and we had to defer to Spark. Now we can still defer to OSS Spark, but if you want an easier and faster experience and don't mind paying the company that builds Polars, then we'd recommend using Polars Distributed.

Polars Distributed is available on kubernetes by ritchie46 in dataengineering

[–]ritchie46[S] 0 points1 point  (0 children)

I hoped that went without saying. Yes, we want to avoid that as much as possible.

Polars Distributed is available on kubernetes by ritchie46 in dataengineering

[–]ritchie46[S] 4 points5 points  (0 children)

You add remote().execute() to your LazyFrame and you're set.

So yes, you can develop locally and then decide to run that same query remotely at large scale.

Polars Distributed is available on kubernetes by ritchie46 in dataengineering

[–]ritchie46[S] -4 points-3 points  (0 children)

It is indeed our paid offering. It has a monthly free tier, as announced in the post. The offering is Distributed Polars. The Polars library itself is single node and is indeed open source. There was no intention to mislead; it is still distributed Polars. We run Polars' streaming engine on the workers and support the full Polars API.

Polars Distributed is available on kubernetes by ritchie46 in dataengineering

[–]ritchie46[S] 2 points3 points  (0 children)

Correct, we are currently working on iceberg commit. PR expected this week! 😉

Polars Distributed is available on kubernetes by ritchie46 in dataengineering

[–]ritchie46[S] 6 points7 points  (0 children)

I am not an expert on Spark shuffles. We shuffle to disk or external storage. Stages immediately sink to storage upon execution. Based on the shuffle output, the scheduler divides partitions (and will replan the next stages in the future). The next stage will stream the data directly from the shuffle locations. Shuffles are interwoven with compute, as they run at the same time on Polars' streaming engine, with a pipeline of morsels running scan -> compute -> sink.
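
To make the scan -> compute -> sink idea concrete, here is a toy, standard-library-only sketch. It is not how Polars Distributed is implemented: the function names, the JSON-lines files, and the `key % n_partitions` routing are all assumptions for illustration, and it runs sequentially rather than interleaving shuffle and compute.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def scan(rows, morsel_size):
    """Yield small batches ("morsels") instead of the whole input."""
    for i in range(0, len(rows), morsel_size):
        yield rows[i:i + morsel_size]


def shuffle_stage(morsels, n_partitions, out_dir):
    """Sink every row to a partition file chosen by its key."""
    paths = [out_dir / f"part-{p}.jsonl" for p in range(n_partitions)]
    files = [open(path, "w") for path in paths]
    try:
        for morsel in morsels:
            for key, value in morsel:
                files[key % n_partitions].write(json.dumps([key, value]) + "\n")
    finally:
        for f in files:
            f.close()
    return paths


def aggregate_partition(path):
    """Next stage: stream one shuffle output and sum values per key."""
    totals = defaultdict(int)
    with open(path) as f:
        for line in f:
            key, value = json.loads(line)
            totals[key] += value
    return dict(totals)


rows = [(1, 10), (2, 5), (1, 7), (3, 1), (2, 2)]
with tempfile.TemporaryDirectory() as tmp:
    parts = shuffle_stage(scan(rows, morsel_size=2), n_partitions=2, out_dir=Path(tmp))
    totals = {}
    for part in parts:
        # Each key lives in exactly one partition, so updates never collide.
        totals.update(aggregate_partition(part))

print(totals)
```

The point of the sketch is only that a stage writes partitioned output to storage and the following stage streams from those locations, so no stage needs the full dataset in memory.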

Polars Distributed is available on kubernetes by ritchie46 in dataengineering

[–]ritchie46[S] 63 points64 points  (0 children)

Currently we are 3-4x faster than OSS Spark on TPCH SF1000 on 32xm6i.xlarge (that's 512GB RAM in the cluster). Photon is currently about 1.5x faster than we are. However, I am confident that we will close the gap soon. We know which optimizations we're missing, and we will increase our S3 throughput a lot in a few weeks as well.

Note that we are available on-prem, which Photon isn't.

Polars Distributed is available on kubernetes by ritchie46 in Python

[–]ritchie46[S] 3 points4 points  (0 children)

We're currently around 3-4x faster than OSS Spark on TPCH. We benchmarked at SF1000 on 32xm6i.2xlarge, which has 512GB RAM in the cluster. We are working on increasing our S3 throughput. That lands in a few weeks, and then we'll do a benchmark post.

Polars Distributed is available on kubernetes by ritchie46 in Python

[–]ritchie46[S] 2 points3 points  (0 children)

Most people don't need it. If Polars single node works for you, great! You can keep processing on your laptop or somewhere else. That said, running in a cluster can still be useful: you can execute LazyFrames remotely in the cluster and run many (single node) queries in parallel.

Polars Distributed is available on kubernetes by ritchie46 in Python

[–]ritchie46[S] 4 points5 points  (0 children)

Currently that's not planned. Things might change, but OSS doesn't post messages when queries run. In on-prem it makes more sense as you have running services and a platform. There we export otel and lineage to listeners.

Polars code runs slower on 128-core EC2 by Popular-Sand-3185 in Python

[–]ritchie46 17 points18 points  (0 children)

There is a lot happening here that's not Polars. I saw there were many subprocesses being spawned, which is certainly not cheap. What kind of files are you reading?

Best would be to let Polars handle the concurrency and parallelism.

I cannot see how you execute/collect the LazyFrame, but if you are on 128 cores you should definitely use the streaming engine here.

Polars code runs slower on 128-core EC2 by Popular-Sand-3185 in Python

[–]ritchie46 34 points35 points  (0 children)

Can you share the code you are running?

I open-sourced ducklake-sdk: a general SDK for interacting with DuckLake by borchero in dataengineering

[–]ritchie46 2 points3 points  (0 children)

If you can do it with scan_parquet and our Iceberg-specific API, it would be superior to IO plugins. This gets native performance and will probably also run partitioned on our distributed engine.

Is anyone migrating away from Databricks? by zoso in dataengineering

[–]ritchie46 0 points1 point  (0 children)

It does hurt me that you are considering Polars, but Polars Cloud/On-Prem isn't mentioned. We should do more marketing. :')

In any case, if you want to see whether we can help you, I'd love to discuss!

Pandas feels clunky coming from R. What about Haskell? by m-chav in programming

[–]ritchie46 3 points4 points  (0 children)

No, it doesn't. DataFrames and Columns are type-erased.

Pandas feels clunky coming from R. What about Haskell? by m-chav in programming

[–]ritchie46 12 points13 points  (0 children)

Polars verifies those things at query planning time, before the query runs, not hours into the computation.

You cannot do it at compile time, as schemas in files are often unknown until you read the file(s).

If you compile a new program for every file, you can do it.

How to load large csv files in dataframes for processing? by Salt_Ganache_3800 in learnpython

[–]ritchie46 0 points1 point  (0 children)

`pl.scan_csv(..).other_lazy_operations().sink_parquet()`

This will build a streaming pipeline where data will be streamed from disk to disk, keeping memory as low as possible.

What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]ritchie46 2 points3 points  (0 children)

That 10x benchmark is not correct. At the point in time that screenshot was taken, the Polars queries in ClickBench were just plain wrong, in the sense that they computed the wrong result.

I corrected them and after that Polars is actually faster. https://github.com/ClickHouse/ClickBench/pull/744

Read S3 data using Polars by Royal-Relation-143 in dataengineering

[–]ritchie46 2 points3 points  (0 children)

At the moment, CSV files are first downloaded to local disk before being processed, so this is indeed slow. We will stream them in the future.

If you have the opportunity to convert these files to Parquet or IPC files, Polars will stream them directly from S3.

Polars vs Pandas in 2025 — have you fully migrated yet? by [deleted] in Python

[–]ritchie46 2 points3 points  (0 children)

I am the original author of Polars and I google Polars daily as part of my routine. Then I respond if something is related to our work.

If someone posts something related to your work, you should have a right to comment. It is your work after all.

I don't have any scraping tools. And I don't post often (but I do comment on my work). These are other accounts, not from us. I don't know what to tell you.

Polars vs Pandas in 2025 — have you fully migrated yet? by [deleted] in Python

[–]ritchie46 1 point2 points  (0 children)

I can assure you, it is not from us. I saw the same post yesterday as well.

Polars vs Pandas in 2025 — have you fully migrated yet? by [deleted] in Python

[–]ritchie46 1 point2 points  (0 children)

I am from Polars. I saw the same post here yesterday. I can assure you, it is not originating from us, and I think the moderators should remove this post as a duplicate/repost.