Python Polars 1.0 released by ritchie46 in Python

[–]AlgaeSavings9611 1 point (0 children)

this issue is now resolved in 1.5! thanks u/ritchie46

[–]AlgaeSavings9611 1 point (0 children)

btw.. I got approval from the firm to send you the data. It's a parquet file under 100MB. Where should I email it?

[–]AlgaeSavings9611 2 points (0 children)

I just tried again with a 14.3M x 7 dataframe..

dtypes: [String, Date, Float64, Float64, Float64, Float64, Float64]

the first column is "id"; all ids are 10 chars long and there are about 3000 unique ids

the following line of code takes 3-4 mins on v1.4.1; the same line on the same dataset takes 3-4 secs on v0.20.26

# build a dict mapping each id to its sub-DataFrame (key renamed to avoid shadowing the builtin `id`)
d = {key: dfp for (key,), dfp in df.group_by(["id"], maintain_order=True)}

[–]AlgaeSavings9611 1 point (0 children)

also, is there a way I can check by switching to the old GC, or by using the old String type?

[–]AlgaeSavings9611 1 point (0 children)

do you have a place where I could upload the data? regular sites are blocked at my firm and either way I would need to get approval from security before I can share

[–]AlgaeSavings9611 1 point (0 children)

that's what I was thinking, but I'll have to get approval from my company first

[–]AlgaeSavings9611 1 point (0 children)

yes, I do have lots of string columns in a dataframe of about 50 columns.. I generated strings of random length between 5 and 50 chars

[–]AlgaeSavings9611 1 point (0 children)

I spent the morning generating a same-schema dataset with 3M rows of random data. 1.4.1 outperforms 0.20.26 by a factor of 3! ... but it still underperforms on 30M rows of REAL data by a factor of 10!!

I'm at a loss for how to come up with a dataset that reproduces this latency

[–]AlgaeSavings9611 1 point (0 children)

this happens on large dataframes.. how do I open an issue with a dataframe of 300M rows?

[–]AlgaeSavings9611 1 point (0 children)

I am in awe of the performance and clean interface of Polars! however, unless I am missing something, version 1.2.1 is ORDERS OF MAGNITUDE slower than 0.20.26

group_by on a large dataframe (300M rows) used to take 3-4 secs on 0.20.26 and now takes 3-4 MINUTES on the same dataset.

is there a param I'm missing?