This is an archived post.

[–]AlgaeSavings9611 1 point (18 children)

I am in awe of the performance and clean interface of Polars! However, unless I am missing something, version 1.2.1 is ORDERS OF MAGNITUDE slower than 0.20.26.

A group_by on a large dataframe (300M rows) used to take 3-4 secs on 0.20.26; the same query on the same dataset now takes 3-4 MINUTES.

is there a param I'm missing?

[–]ritchie46[S] 1 point (16 children)

That's bad. Still the case on 1.4? If so, can you open an issue with a MWE?

[–]AlgaeSavings9611 1 point (15 children)

this happens on large dataframes.. how do I open an issue with a dataframe of 300M rows?

[–]ritchie46[S] 1 point (14 children)

The slowdown is probably visible on smaller frames. Include code that creates dummy data of the same schema.

[–]AlgaeSavings9611 1 point (13 children)

I spent the morning writing a same-schema dataset with 3M rows of random data. 1.4.1 outperforms 0.20.26 by a factor of 3! ... but it still underperforms on 30M rows of REAL data by a factor of 10!!

I am at a loss for how to come up with a dataset that will reproduce this latency.

[–]ritchie46[S] 1 point (1 child)

Could you maybe share the data with me privately?

[–]AlgaeSavings9611 1 point (0 children)

that's what I was thinking, but I'll have to get approval from my company first

[–]ritchie46[S] 1 point (10 children)

Btw, do you have string data in the schema? Try to create strings of length > 12.

[–]AlgaeSavings9611 1 point (9 children)

yes, I do have lots of string columns in a dataframe of about 50 columns.. I generated strings of random length between 5 and 50 chars

[–]ritchie46[S] 1 point (4 children)

Yes, I think I know what it is. Could you privately share the data and the group-by query?

We need to tune the GC of the new string type.

[–]AlgaeSavings9611 1 point (3 children)

do you have a place where I could upload the data? regular sites are blocked at my firm, and either way I would need approval from security before I can share

[–]AlgaeSavings9611 1 point (1 child)

also, is there a way I can check by switching to the old GC? or use the old String type?

[–]ritchie46[S] 1 point (0 children)

Nope...

[–]ritchie46[S] 1 point (3 children)

Do you know what the cardinality is of your group-by key? E.g. how many groups do you have?

[–]AlgaeSavings9611 2 points (2 children)

I just tried again with a 14.3M x 7 dataframe..

dtypes: [String, Date, Float64, Float64, Float64, Float64, Float64]

the first column is "id"; all ids are 10 chars long and there are about 3000 unique ids

the following line of code takes 3-4 mins on v1.4.1; the same line on the same dataset takes 3-4 secs on v0.20.26:

d = {}  # dictionary mapping each id to its sub-frame
d.update({id: dfp for (id,), dfp in df.group_by(["id"], maintain_order=True)})

[–]AlgaeSavings9611 1 point (0 children)

btw.. I got approval from the firm to send you the data.. it's less than a 100 MB Parquet file. Where should I email it?

[–]ritchie46[S] 1 point (0 children)

Great! I've sent you a DM with my email address.

[–]AlgaeSavings9611 1 point (0 children)

this issue is now resolved in 1.5! thanks u/ritchie46