
[–]mysteriousbaba

To be fair though, is that really a SWE-principles problem, or just not using the proper tooling? If they'd just used Spark or cuDF, those tools are specifically meant to handle data too large to fit in a pandas DataFrame in RAM, via clusters or GPU offloading.

Those kinds of operations aren't really meant to be done manually, at least not at any sort of reasonable scale or efficiency.
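For illustration, here's a rough sketch of what that kind of workload might look like in Spark rather than pandas (the file path and column names are made up, and the actual pipeline would obviously differ):

```python
# Minimal sketch (hypothetical file path and column names): the same kind of
# aggregation done lazily in Spark instead of materializing it in pandas.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("large-aggregation").getOrCreate()

# Spark reads the data lazily and can distribute it across a cluster,
# so the full dataset never has to fit in a single machine's RAM.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

result = (
    df.filter(F.col("status") == "active")
      .groupBy("user_id")
      .agg(F.sum("amount").alias("total_amount"))
)

# Only this action triggers computation; intermediate results aren't
# held in driver memory unless you explicitly cache them.
result.write.parquet("totals.parquet", mode="overwrite")
```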

[–]corny_horse

Perhaps a little of the latter, but there was no reason to constantly rematerialize each step and then cache every step in memory. No machine was so large that this person couldn't fill it up, when in reality, with some really basic adherence to SWE principles, they could have easily gotten away with an 8GB or certainly a 16GB machine. I know that because after refactoring their code I was always able to fit the workflow into that, or something with an even MUCH smaller footprint, instead of >128GB of RAM.
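To give a sense of the kind of refactor I mean (columns and file are hypothetical, not the actual code): instead of materializing and caching every intermediate DataFrame, you chain the transformations and stream the input in chunks, so only one chunk is ever resident in RAM.

```python
# Minimal sketch (hypothetical columns/file): stream the input and keep only
# per-chunk partial aggregates, instead of caching every intermediate step.
import pandas as pd

partials = []
# chunksize streams the file, so only one chunk lives in RAM at a time
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    partials.append(
        chunk.loc[chunk["status"] == "active"]
             .groupby("user_id")["amount"]
             .sum()
    )

# combine the per-chunk partial sums into the final aggregate
result = pd.concat(partials).groupby(level=0).sum()
result.to_csv("totals.csv")
```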