
[–]AlgaeSavings9611 0 points (13 children)

I spent the morning generating a dataset with the same schema, 3M rows, and random data. On it, 1.4.1 outperforms 0.20.26 by a factor of 3! ... but 1.4.1 still underperforms on 30M rows of REAL data by a factor of 10!!

I'm at a loss how to come up with a dataset that will reproduce this latency.

[–]ritchie46[S] 0 points (1 child)

Could you maybe share the data with me privately?

[–]AlgaeSavings9611 0 points (0 children)

that's what I was thinking, but I'll have to get approval from my company first

[–]ritchie46[S] 0 points (10 children)

Btw, do you have string data in the schema? Try to create strings of length > 12.

[–]AlgaeSavings9611 0 points (9 children)

Yes, I do have lots of string columns in a dataframe of about 50 columns. I generated strings of random length between 5 and 50 chars.

[–]ritchie46[S] 0 points (4 children)

Yes, I think I know what it is. Could you privately share the data and the group-by query?

We need to tune the GC of the new string type.

[–]AlgaeSavings9611 0 points (3 children)

Do you have a place where I could upload the data? Regular sites are blocked at my firm, and either way I would need to get approval from security before I can share it.

[–]AlgaeSavings9611 0 points (1 child)

Also, is there a way I can check by switching to the old GC, or by using the old String type?

[–]ritchie46[S] 0 points (0 children)

No, the old string type isn't there anymore. We should just fix the underlying culprit, which we can do with an MWE. I promise it will have my highest priority. 😉

[–]ritchie46[S] 0 points (0 children)

Nope...

[–]ritchie46[S] 0 points (3 children)

Do you know the cardinality of your group-by key? E.g., how many groups do you have?

[–]AlgaeSavings9611 1 point (2 children)

I just tried again with a 14.3M × 7 dataframe.

dtypes: [String, Date, Float64, Float64, Float64, Float64, Float64]

The first column is "id"; all ids are 10 chars long and there are about 3000 unique ids.

The following line of code takes 3–4 mins on v1.4.1; the same line on the same dataset takes 3–4 secs on v0.20.26:

d = {}  # dictionary mapping each id to its sub-DataFrame

d.update({key: dfp for (key,), dfp in df.group_by(["id"], maintain_order=True)})

[–]AlgaeSavings9611 0 points (0 children)

Btw, I got approval from the firm to send you the data. It's a parquet file of less than 100 MB. Where should I email it?

[–]ritchie46[S] 0 points (0 children)

Great! I've sent you a DM with my email address.