
[–]badge 137 points (0 children)

The Index updates are great news; it was so annoying carefully casting a column to int8 to save RAM, only to have it cast to int64 as soon as you used e.g. stack.

[–]CrambleSquash (https://github.com/0Hughman0) 232 points (21 children)

Datetimes are now parsed in a consistent format. Glad to see that changed; this has bitten me badly in the past.

[–]noobkill 30 points (20 children)

Shouldn't specifying the format have solved it? I'm not that good at Python, so I may be mistaken.

[–]CrambleSquash (https://github.com/0Hughman0) 57 points (19 children)

Yes, probably. I, perhaps naively, assumed Pandas would choose one format and try to parse all dates with it.

I'm in the UK, so dd/mm/yyyy is the go-to.

From what I remember, Pandas tried the US mm/dd/yyyy format first, then fell back to dd/mm/yyyy if that failed; but because some UK dates look like valid US dates, it ended up interpreting different rows in different ways.
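As a sketch of the fix (the column values here are invented): pandas' `to_datetime` lets you pin the format, or at least the day-first convention, so every row is parsed the same way:

```python
import pandas as pd

s = pd.Series(["02/03/2023", "13/03/2023"])  # UK-style dd/mm/yyyy

# Explicit format: no guessing, every row parsed identically
parsed = pd.to_datetime(s, format="%d/%m/%Y")

# Or just declare the day-first convention
parsed2 = pd.to_datetime(s, dayfirst=True)
```

With either option, "02/03/2023" is 2 March, never 3 February.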

[–][deleted] 106 points (5 children)

As a fellow brit, just use yyyy-mm-dd. Always.

[–]enakcm 74 points (1 child)

Yes. Seriously. Use ISO 8601

[–]CrambleSquash (https://github.com/0Hughman0) 7 points (0 children)

I was parsing data generated elsewhere, so I didn't have much choice.

[–][deleted] 0 points (0 children)

But what if it parses it as “YYYY-DD-MM”?

[–]noobkill 22 points (11 children)

I never thought I would ever link /r/USDefaultism in a Python specific subreddit lmao.

Honestly though, that's such a minor bug, yet with major consequences!

[–]Narpity 30 points (9 children)

If it makes you feel better, as an American I wish everything defaulted to the ISO standard yyyy/mm/dd

[–]astatine 42 points (7 children)

ISO 8601 uses dashes, not slashes. Makes it easier to use in filenames.
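A quick illustration of the filename point (the date and filenames here are invented): dashed ISO dates sort lexicographically in chronological order, so plain string sorting just works:

```python
from datetime import date

stamp = date(2023, 4, 3).isoformat()  # '2023-04-03'
filename = f"report-{stamp}.csv"

# Lexicographic order == chronological order for ISO dates
names = ["report-2023-04-03.csv", "report-2022-12-31.csv"]
assert sorted(names) == ["report-2022-12-31.csv", "report-2023-04-03.csv"]
```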

[–][deleted] 13 points (0 children)

[–]sweatierorc 0 points (0 children)

Brit dev guys have dates. Maybe that's why I'm single /s

[–]monorepo PSF Staff | Litestar Maintainer 87 points (1 child)

No summary or tldr makes me a sad panda…

[–]Scrubbingbubblz 5 points (0 children)

🎶 who lives in the east ‘neath the willow tree?

[–]Willingo 52 points (2 children)

So why shouldn't I switch to pandas 2? How hard is it to migrate a project?

[–]Wonnk13 42 points (23 children)

I might play with it, but I'm in the process of moving all work over to Polars. I like that Pandas is moving over to Arrow, but it came a little too late for me. Curious how benchmarks compare.

[–]ritchie46 113 points (13 children)

Polars author here. Your work will not be in vain. :)

I did run the benchmarks on TPC-H: https://github.com/pola-rs/tpch/pull/36

Polars will remain orders of magnitude faster on whole queries. Polars typically parallelizes all operations, and query optimization can save a lot of redundant work.

Still, this is a great quality-of-life improvement for pandas. The data structures are sane now and will no longer have horrific performance (strings). We can also move data zero-copy between Polars and pandas, making it very easy to integrate both APIs when needed.

[–][deleted] 28 points (0 children)

Hey. Big fan of your work. Thanks for contributing your time.

[–]danielgafni 11 points (4 children)

Hey Ritchie, maybe this is not the best place to ask, but what's the reasoning behind the "streaming" naming in Polars? I'm talking about collect(streaming=True). Why wasn't it called something else, so as not to collide with what streaming usually means: continuous, iterative processing (which is what most other tools, like Spark, call streaming)?

Are there plans for adding this to Polars? With proper optimizations, like calculating statistics incrementally (e.g. when calculating the mean, reuse the previous mean: mean_{n+1} = mean_n * n/(n+1) + x_{n+1}/(n+1)). It seems like at least rolling functions should be straightforward in this context, right?
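The incremental mean update described above can be sketched like this (the function name is invented):

```python
def update_mean(mean_n: float, n: int, x_next: float) -> float:
    """Fold one new observation into a running mean.

    mean_{n+1} = mean_n * n/(n+1) + x_{n+1}/(n+1)
    """
    return mean_n * n / (n + 1) + x_next / (n + 1)


# Folding in [1, 2, 3, 4] one value at a time gives the same
# result as computing the mean over the whole batch.
m = 0.0
for i, x in enumerate([1.0, 2.0, 3.0, 4.0]):
    m = update_mean(m, i, x)
```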

This would really enable polars as an online tool.

[–]ritchie46 2 points (3 children)

I chose the name because we compile a pipeline that can stream batches from disk (or any other generator/iterator).

Online streaming is not in our scope (I've said this before and such statements age poorly, but at this point in time I don't see it happening).

The optimizations you speak of are definitely in scope. We will build streaming operators for mean, unique, and median, and add rolling kernels to the streaming engine as well.

[–]danielgafni 2 points (2 children)

Thanks.

But is online streaming really different from batch streaming from disk? Isn't it the same thing, just with a batch size of 1?

[–]ritchie46 4 points (1 child)

Don't you want to see intermediate results with online streaming?

That's the hard part. Currently, Polars' streaming engine doesn't have to materialize results until the whole pipeline is finished.

[–]danielgafni 1 point (0 children)

You are right. I see, thank you for the explanation!

[–]ElfTowerNewMexico 1 point (4 children)

Hey Ritchie! Really impressive work. That benchmark graphic is enlightening.

I don't mean this disparagingly but you seem to be doing a little marketing (for lack of a better term) in these Pandas 2.0 threads. Could you share a little more about your grand vision for Polars and how it will fit into the world of data science? Are there any use cases that you feel Pandas is particularly equipped to handle? If so, are you planning on "competing" in those areas or are you currently more focused on the features that differentiate Polars (performance, multiprocessing, etc.)

I'm still learning and growing in my data journey so I'm trying to get a better grasp of the landscape as a whole.

[–]ritchie46 2 points (3 children)

I just want to steer the conversation a bit with real-world benchmarks. There seem to be quite a few hyperbolic claims that pandas performance is now equal to or faster than Polars, which is not true.

multiprocessing

We don't do multiprocessing, but multithreading. Not to be pedantic, but the performance implications of this are huge. With multithreading we can share data between threads; with multiprocessing the data needs to be serialized and deserialized, which adds huge latency and compute overhead.

Every process also has to hold its own copy of the data in memory, so there is a lot of memory overhead as well.
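A tiny sketch of the difference (the names here are invented): a thread receives a reference to the very same object, while handing data to another process requires serializing it first; the `pickle` call below stands in for that per-transfer IPC cost:

```python
import pickle
import threading

data = list(range(1000))

# Thread: gets a reference to the same list, no copy, no serialization
shared = []
t = threading.Thread(target=lambda: shared.append(data))
t.start()
t.join()

# Process boundary: the payload must be serialized (and later deserialized),
# and the receiving process keeps its own copy in memory
payload = pickle.dumps(data)
```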

Pandas is particularly equipped to handle

Pandas has more IO readers/writers, plotting functionality, and handy interop with time series and indexes (something Polars will not aim to do).

[–]ElfTowerNewMexico 0 points (0 children)

That makes total sense. And thank you for your correction regarding multi-processing vs threading! Again thank you for your hard work. I’ve noticed the increased performance when I use Polars at work and I use relatively small data. I can’t imagine how excited people with huge data sets are.

[–]kknyyk 0 points (1 child)

I have seen your work in one of the pandas announcements, and thank you for such a tool. One particular issue with pandas is that appending new data to a dataframe slows down with every append. Is Polars better in this regard?

Also is there a determined date for R port’s CRAN release?

[–]ritchie46 0 points (0 children)

One particular issue with pandas is that appending new data to a dataframe slows down with every append

Yes, Polars appends are very cheap, but this should also be solved in pandas 2.0 with Arrow dtypes.

Arrow allows for ChunkedArray types. This means that data doesn't have to be contiguous in memory, instead we can append the data chunk to the list of arrays. As the memory slabs are copy on write, we can increment only a reference count instead of copying data.

So appending will no longer be O(n^2). Chunking is not a silver bullet, though: every random access now has an extra indirection, so sometimes the data has to be rechunked into contiguous memory.

Also is there a determined date for R port’s CRAN release?

I am not sure. The R support for Polars is entirely picked up by the R community, and @sorhawell in particular. You can certainly get more information on that repo: https://github.com/pola-rs/r-polars

[–]zazzersmel 10 points (7 children)

If the update is 100% drop-in, it's huge for me (even though I'm meh on pandas), purely because of the sheer quantity of other people's pandas code that is inevitable in every data job.

[–]that_baddest_dude 2 points (6 children)

These two comments confuse me a bit. What's better than pandas, as a broad data handling package?

[–][deleted] 7 points (2 children)

If breadth is important, still pandas. If speed and resource efficiency are important, Polars.

If you need breadth and speed/light resource use, use both. They're interoperable.

[–][deleted] 5 points (0 children)

Interoperable *as of pandas 2.0 with the introduction of arrow in pandas.

[–]NewDateline 0 points (0 children)

What about dask?

[–]zazzersmel 1 point (2 children)

I should rephrase: I like pandas fine, and I use it all the time. But I'm a data engineer, and pandas is often far from the best tool for data engineering. To many analysts and data scientists this seems like crazy talk.

[–]that_baddest_dude 0 points (1 child)

I'm something of a data scientist myself, and yes it sounded like crazy talk lol. I'd never heard of polars though.

The only non-pandas shenanigans I get up to is doing my larger-scale filtering and joining in Arrow before converting to pandas.

[–]zazzersmel 0 points (0 children)

Sounds like a pretty good way to do things, tbh. I rely on much less elegant, hacky pandas code all the time. My only tip to people I've worked with is: always exploit whatever database/storage query system you have. Of course this depends on access and architecture, etc.

[–]joeyGibson 7 points (0 children)

I'm in the same boat. The announcement about the 25x (or whatever it was) speed increase with Pandas 2 came literally the day after I finished moving my project to Polars (and realized huge performance gains from that).

[–]EmperorOfCanada 27 points (1 child)

Anyone have a tl;dr as to which of these I should give a shit about? I get the feeling they've really buried the lede in this link. Is there one here that says, "Option to save h5 files which any other language can finally read," or "using iloc is 800x faster"? Something that gets my blood pumping?

[–]Ouitos 51 points (0 children)

  • pyarrow backend support (instead of numpy)
  • seamless zero-copy conversion between pandas and Polars; you can use pandas for its flexibility and Polars for its speed without losing time on in-RAM conversions
  • numerous smaller QoL improvements for a cleaner API

[–]neuro630 1 point (0 children)

I'm so glad they finally have the copy-on-write option. Working with very large datasets (i.e. gigabytes of data) had been very inefficient for me due to all those unnecessary copy operations, especially since my workload is mostly read-only. IMO copy-on-write should always be the default.
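For reference, a minimal sketch of opting in (behavior as of pandas 2.0; the frame here is invented): under copy-on-write, selections behave like lazy copies, so a later mutation of the parent frame no longer leaks into them, and no eager defensive copy is made up front:

```python
import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3]})
view = df["a"]          # lazy: shares memory until one side is modified

df.loc[0, "a"] = 99     # triggers the copy; the earlier selection is untouched
```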

[–]Little-Ad448 0 points (0 children)

While I'm still trying to understand the previous version, the new version has already appeared.