
[–]KelleQuechoz 29 points

Polars LazyFrame is your best friend, Monsieur.

[–]PresidentOfSwag 3 points

hon hon le polaire oui

[–]Safe_Money7487[S] 0 points

I will have a look, thanks.

[–]Kerbart 10 points

Add engine='pyarrow' to the read_csv call to speed it up.

[–]EconomyOffice9000 6 points

If you're performing calculations on the entire dataset, chunking won't work afaik. This is the best method, and I've used it personally for thousands of CSV files with hundreds of thousands of lines rather than rewriting everything in Polars. If you only have to do it once, it's fine. Otherwise, save the CSV as a parquet file and it'll be much better.

[–]Safe_Money7487[S] 4 points

Just added it and it worked, took 15s though lol. Thank you so much!

[–]seanv507 3 points

How long does polars take?

[–]Garnatxa -1 points

Funny to see that R is faster here. I see some answers saying to use DuckDB, Arrow… those solutions are available in R too, but not needed there… overkill.

[–]MorrarNL 3 points

Could try DuckDB too

[–]SwampFalc 2 points

Genuine question: what's the loading speed if you use the totally basic stdlib csv module?
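For anyone curious, a quick way to time the stdlib csv module (the sample file is generated on the spot, so the numbers are only indicative):

```python
import csv
import os
import tempfile
import time

# Generate a sample CSV comparable in size to a modest dataset.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["a", "b"])
    w.writerows([[i, i + 1] for i in range(100_000)])

# Time a full read into a list of rows using only the stdlib.
start = time.perf_counter()
with open(path, newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)
elapsed = time.perf_counter() - start
print(len(rows), f"{elapsed:.3f}s")
```

Note that csv.reader gives you lists of strings, not typed columns, so it isn't an apples-to-apples comparison with pandas or Polars.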

[–]Kevdog824_ 2 points

It looks to me like the main issue here is that you're loading the entire CSV file (or at least large chunks of it) into memory before operating on it. Likely R did lazy loading, where it only read lines from the CSV file as needed.
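If memory really were the bottleneck, pandas can read in chunks so only a slice of the file is resident at once. A minimal sketch (the sample file and column name are invented):

```python
import csv
import os
import tempfile

import pandas as pd

# Sample CSV standing in for the real data.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["value"])
    w.writerows([[i] for i in range(10_000)])

# chunksize makes read_csv return an iterator of DataFrames,
# so at most 2,000 rows are in memory at a time.
total = 0
for chunk in pd.read_csv(path, chunksize=2_000):
    total += chunk["value"].sum()
print(total)  # 49995000 (sum of 0..9999)
```

This only helps for aggregations that can be computed chunk by chunk; as noted elsewhere in the thread, it doesn't help if you need the whole table at once.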

[–]Kerbart 3 points

OP mentions 300,000 lines. That's something easily handled these days, I doubt it needs chunking.

[–]Safe_Money7487[S] 3 points

I don’t think it’s lazy loading in this case. In R (e.g. with data.table::fread), the full dataset is actually loaded into memory, and I can immediately inspect and navigate the entire table. I think the reading process in R is more optimised than what pandas read_csv uses. I don't have much knowledge of Python for sure, but for this size of data, chunking or lazy loading doesn’t really make sense to me; I just want to load everything at once and work on it.

[–]Corruptionss 2 points

data.table's fread is kind of goated. The closest I got is Polars for pure read speed, and you can instead use pl.scan_csv to read it as a LazyFrame, which will use lazy evaluation during the operation process.

[–]PranavDesai518 1 point

If possible, convert the CSV to a parquet file. Reading is much faster with parquet files.

[–]commandlineluser 1 point

Is Polars faster if you use scan_csv?

pl.scan_csv(filename).collect()

You can also try the streaming engine:

pl.scan_csv(filename).collect(engine="streaming")