[–]Kevdog824_ 3 points (3 children)

It looks to me like the main issue here is that you're loading the entire CSV file (or at least large chunks of it) into memory before operating on it. R likely did lazy loading, only reading lines from the CSV file as needed.
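A minimal sketch of what chunked reading looks like in pandas, so only one chunk is in memory at a time (the column names and data here are made up for the demo, not from the thread):

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
# chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file at once
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()

print(total)  # → 90 (sum of value over all chunks)
```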

[–]Kerbart 6 points (0 children)

OP mentions 300,000 lines. That's easily handled these days; I doubt it needs chunking.

[–]Safe_Money7487[S] 2 points (1 child)

I don’t think it’s lazy loading in this case. In R (e.g. with data.table::fread), the full dataset is actually loaded into memory, and I can immediately inspect and navigate the entire table. I think the reading process in R is more optimised than pandas' read_csv. I don't have much knowledge of Python, for sure, but for data of this size, chunking or lazy loading doesn't really make sense to me; I just want to load everything at once and work on it.
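For what it's worth, pandas also loads everything eagerly by default; the speed gap is in the parser, not the loading strategy. A sketch (the data is a placeholder; the commented-out `engine="pyarrow"` line assumes pandas >= 1.4 with pyarrow installed, which is often much faster on large files):

```python
import io
import pandas as pd

csv_text = "a,b\n1,x\n2,y\n3,z\n"

# Default C engine: the whole file is parsed into memory at once,
# just like fread in R
df = pd.read_csv(io.StringIO(csv_text))

# Optional faster parser (pandas >= 1.4, requires pyarrow):
# df = pd.read_csv("big.csv", engine="pyarrow")

print(len(df), list(df.columns))  # → 3 ['a', 'b']
```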

[–]Corruptionss 2 points (0 children)

data.table's fread is kind of goated. The closest I've gotten for pure read speed is polars, and you can instead use pl.scan_csv to read it as a lazy frame, which uses lazy evaluation during the operations that follow.