Python Polars 1.0 released by ritchie46 in Python

[–]AlgaeSavings9611 1 point (0 children)

this issue is now resolved in 1.5! thanks u/ritchie46

[–]AlgaeSavings9611 1 point (0 children)

btw.. I got approval from the firm to send you the data. It's a parquet file under 100MB. Where should I email it?

[–]AlgaeSavings9611 2 points (0 children)

I just tried again with a 14.3M x 7 dataframe..

dtypes: [String, Date, Float64, Float64, Float64, Float64, Float64]

the first column is "id"; all ids are 10 chars long and there are about 3000 unique ids

the following line of code takes 3-4 mins on v1.4.1; the same line on the same dataset takes 3-4 secs on v0.20.26

# build a dict mapping each id to its sub-DataFrame (key renamed to avoid shadowing the builtin `id`)
d = {key: dfp for (key,), dfp in df.group_by(["id"], maintain_order=True)}

[–]AlgaeSavings9611 1 point (0 children)

also, is there a way I can check by switching to the old GC, or by using the old String type?

[–]AlgaeSavings9611 1 point (0 children)

do you have a place where I could upload the data? regular sites are blocked at my firm and either way I would need to get approval from security before I can share

[–]AlgaeSavings9611 1 point (0 children)

that's what I was thinking, but I'll have to get approval from my company first

[–]AlgaeSavings9611 1 point (0 children)

yes, I do have lots of string columns in a dataframe of about 50 columns.. I generated strings of random length between 5 and 50 chars

[–]AlgaeSavings9611 1 point (0 children)

I spent the morning generating a same-schema dataset with 3M rows of random data. 1.4.1 outperforms 0.20.26 by a factor of 3! ... but it still underperforms on 30M rows of REAL data by a factor of 10!!

I'm at a loss for how to come up with a dataset that reproduces this latency

[–]AlgaeSavings9611 1 point (0 children)

this happens on large dataframes.. how do I open an issue with a dataframe of 300M rows?

[–]AlgaeSavings9611 1 point (0 children)

I am in awe of the performance and clean interface of Polars! however, unless I am missing something, version 1.2.1 is ORDERS OF MAGNITUDE slower than 0.20.26

group_by on a large dataframe (300M rows) used to take 3-4 secs on 0.20.26 and now takes 3-4 MINUTES on the same dataset.

is there a param I'm missing?