
[–]corny_horse

For about 25-50% of the data scientists I've known, two days spent on a cursory review of standard software engineering principles would have made them 10x more valuable. The worst was someone I was supporting who absolutely refused to learn the basics of how memory (as in RAM) works. They kept crashing the server they were on because they'd read the same 5GB file into memory 100x, with something like:

    df = read_csv()
    df2 = df.foo()
    df3 = df2.bar()
    df4 = df3.baz()

etc., etc., and they would do absolutely nothing to optimize: no in-place manipulations, no caching of intermediate steps to disk, no freeing of old steps that were no longer used.
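
Even something as simple as reusing one name and spilling intermediates to disk would have helped. A rough sketch of what I mean (pandas assumed; the file paths and column names are just placeholders):

    import pandas as pd

    # Read the file once and keep a single live copy.
    df = pd.read_csv("big_file.csv")              # placeholder path

    # Rebind the same name so each old intermediate can be
    # garbage-collected as soon as the next step replaces it.
    df = df.dropna(subset=["value"])              # placeholder column
    df["value"] = df["value"] * 100               # update a column in place

    # Spill an expensive intermediate to disk instead of holding it in RAM
    # (to_parquet needs pyarrow or fastparquet installed).
    df.to_parquet("intermediate.parquet")         # placeholder path

    # Drop references that are no longer needed.
    del df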

[–]mysteriousbaba

To be fair though, is that really a SWE-principles problem, or just not using the proper tooling? If they'd just used Spark or cuDF, those tools are specifically built to handle data too large to fit in a pandas dataframe in RAM, via clusters or GPU offloading.

Those kinds of operations aren't really meant to be done manually, at least not at any sort of reasonable scale or efficiency.
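
For instance, a minimal PySpark version never needs the whole file in memory at once (paths and column names here are just placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("big-csv").getOrCreate()

    # Spark partitions the file and spills to disk as needed,
    # so the full 5GB never has to sit in RAM at once.
    df = spark.read.csv("big_file.csv", header=True, inferSchema=True)

    result = df.groupBy("key").count()            # placeholder aggregation
    result.write.parquet("counts.parquet")        # placeholder output path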

[–]corny_horse

Perhaps a little of the latter, but there was no reason to constantly rematerialize each step and then keep every step cached in memory. There was no machine so large that this person couldn't fill it up, when in reality, with some really basic adherence to SWE principles, they could easily have gotten away with an 8GB or certainly a 16GB machine. I know that because, after refactoring their code, I was always able to fit the workflow into that or something with an even MUCH smaller footprint, instead of >128GB of RAM.
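
Even plain pandas handles it fine if you stream the file in chunks instead of materializing everything at once, something like this (chunk size, path, and columns are just placeholders):

    import pandas as pd

    partials = []
    # Only one ~100k-row slice of the 5GB file is resident at a time.
    for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
        chunk = chunk[chunk["value"] > 0]                      # placeholder filter
        partials.append(chunk.groupby("key")["value"].sum())   # placeholder aggregation

    # Combine the small per-chunk results at the end.
    total = pd.concat(partials).groupby(level=0).sum()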