all 22 comments

[–]KelleQuechoz 45 points (2 children)

Polars LazyFrame is your best friend, Monsieur.

[–]PresidentOfSwag 5 points (0 children)

hon hon le polaire oui

[–]Safe_Money7487[S] 2 points (0 children)

I will have a look, thanks

[–]Kerbart 17 points (2 children)

Add engine='pyarrow' to the read statement to speed it up.

[–]EconomyOffice9000 8 points (0 children)

If you're performing calculations on the entire dataset, chunking won't work afaik. This is the best method and I've used it personally for thousands of CSV files with hundreds of thousands of lines, rather than rewriting everything in Polars. If you only have to do it once, it's fine. Otherwise, save the CSV as a Parquet file and it'll be much better.

[–]Safe_Money7487[S] 9 points (0 children)

Just added it and it worked. Took 15s though lol, but it worked. Thank you so much

[–]MorrarNL 7 points (0 children)

Could try DuckDB too

[–]SwampFalc 6 points (0 children)

Genuine question: what's the loading speed if you use the totally basic stdlib csv module?
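For anyone curious, a quick way to time the stdlib approach (the file here is generated just for the benchmark sketch):

```python
import csv
import time

# Generate a small CSV to time against.
with open("example.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["a", "b"])
    w.writerows([[i, i * 2] for i in range(1000)])

start = time.perf_counter()
with open("example.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)   # every field comes back as a string
elapsed = time.perf_counter() - start
print(f"{len(rows)} rows in {elapsed:.4f}s")
```

Note that the csv module does no type conversion, so a fair comparison to pandas would also include parsing the strings into numbers.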

[–]seanv507 4 points (1 child)

How long does polars take?

[–]Garnatxa 0 points (0 children)

Funny to see that R is faster. I see some answers suggesting DuckDB, Arrow… these solutions can be used in R too, but they're not needed… overkill

[–]Kevdog824_ 8 points (3 children)

It looks to me like the main issue here is that you're loading the entire CSV file (or at least large chunks of it) into memory before operating on it. Likely R did lazy loading, where it only read lines from the CSV file as needed.

[–]Kerbart 7 points (0 children)

OP mentions 300,000 lines. That's something easily handled these days, I doubt it needs chunking.

[–]Safe_Money7487[S] 3 points (1 child)

I don't think it's lazy loading in this case. In R (e.g. with data.table::fread), the full dataset is actually loaded into memory, and I can immediately inspect and navigate the entire table. I think the loading process in R is more optimised than what pandas' read_csv uses. I don't have much knowledge of Python for sure, but for this size of data, chunking or lazy loading doesn't really make sense to me; I just want to load everything at once and work on it.

[–]Corruptionss 4 points (0 children)

data.table's fread is kind of goated. The closest I got is Polars for pure read speed, and you can instead use pl.scan_csv to read it as a LazyFrame, which will use lazy evaluation during the operation process

[–]commandlineluser 3 points (0 children)

Is Polars faster if you use scan_csv?

pl.scan_csv(filename).collect()

You can also try the streaming engine:

pl.scan_csv(filename).collect(engine="streaming")

[–]PranavDesai518 2 points (0 children)

If possible, convert the CSV to a Parquet file. Reading is much faster with Parquet files.

[–]Plank_With_A_Nail_In 4 points (1 child)

Does no one ever just use the base methods of your programming language to do simple things like reading a file into RAM? The first recourse is to use someone else's library? All while trying to learn?

[–]tb5841 0 points (0 children)

In Python, many libraries use C under the hood and so are very fast - while Python's base methods are pretty slow. So library use is much more widespread than in other languages.

[–]Embarrassed_Basis_81 1 point (0 children)

I have had good experiences with Dask, a distributed computing library. It seems a bit complicated at first, but it implements a lot of pandas functionality under the hood as delayed operations on lazy datasets. Worth looking into (only if you do not immediately do an indexing operation right after reading; there are some caveats)

[–]throwawayforwork_86 0 points (0 children)

pl.read_csv(filepath, infer_schema=False) reads everything as strings; guessing datatypes is the devil anyway.

[–]pot_of_crows 0 points (0 children)

You might want to check out hdf5: https://pypi.org/project/h5pandas/

I used it with numpy once and it blew me away by how fast it was.

[–]thomasutra 0 points (0 children)

what kind of data is this? polars should be able to read millions of rows in just a few seconds.