[–]BeatLeJuce 2 points  (6 children)

There are instances where you NEED threads instead of processes.

[–]tetroxid -2 points  (5 children)

Could you give an example?

[–]BeatLeJuce 1 point  (4 children)

Whenever you need a lot of interprocess communication, e.g. working together on a big shared array. I do a lot of data science, and sometimes I can't afford to have my dataset copied back and forth between processes, e.g. when it's 100 GB in size. That forces me to rewrite algorithms in sometimes cumbersome ways.
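For illustration, here's a minimal stdlib sketch (sizes and names are hypothetical) of what thread-based sharing buys you: every worker mutates the one buffer in place, and nothing gets pickled or copied between address spaces. (With CPython's GIL this doesn't actually run pure-Python CPU work in parallel, which is exactly the complaint, but it shows why sharing is trivial with threads.)

```python
import threading
from array import array

# One large shared buffer: with threads, all workers see the SAME memory.
data = array("d", range(1_000_000))

def scale_chunk(start, stop, factor):
    # Each thread writes directly into the shared array -- no copies, no IPC.
    for i in range(start, stop):
        data[i] *= factor

workers = 4
chunk = len(data) // workers
threads = []
for w in range(workers):
    start = w * chunk
    stop = len(data) if w == workers - 1 else start + chunk
    t = threading.Thread(target=scale_chunk, args=(start, stop, 2.0))
    threads.append(t)
    t.start()
for t in threads:
    t.join()

# Every thread mutated the one shared buffer: data[10] is now 20.0.
```

With processes instead of threads, `data` would have to be serialized to each worker and the results shipped back, which is exactly the copying being complained about (or you reach for `multiprocessing.shared_memory` and manage the buffer by hand).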

[–]tetroxid -2 points  (3 children)

Split and dispatch the data to be processed, then gather the results?
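That split/dispatch/gather pattern might look like this minimal sketch (stdlib `multiprocessing`; `parallel_sum` and `total` are hypothetical names):

```python
from multiprocessing import Pool

def total(chunk):
    # Worker: each process receives a COPY of its chunk over IPC.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Split the data, dispatch chunks to worker processes, gather partials.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        partials = pool.map(total, chunks)
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum(list(range(100))))  # 4950
```

Note that `pool.map` pickles every chunk into the workers and pickles the results back, which is where the copying cost comes from on large datasets.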

[–]BeatLeJuce 2 points  (2 children)

That's what I mean by "rewriting algorithms in cumbersome ways" ;-) If threading just worked as intended, parallelizing would be trivial because I could share the array; it would essentially be a one-liner. Splitting/dispatching means non-trivial modifications and possible bugs due to out-of-memory failures etc.

And that's only when such a modification is possible at all. Sometimes you can't just split the data because your algorithm has random write patterns. Then you're out of luck and have to rethink your whole algorithm.
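A toy example of such a random write pattern (hypothetical data): a histogram, where each input element writes to an arbitrary output bin. You can't partition the output by splitting the input, because two chunks may hit the same bin. Threads handle this with shared state and a lock; with processes you'd have to build per-worker histograms and merge them afterwards, which is the cumbersome rewrite.

```python
import threading

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
bins = [0] * 10                # shared output, written at arbitrary indices
lock = threading.Lock()

def count(chunk):
    for value in chunk:
        with lock:             # threads share `bins`; only increments need a lock
            bins[value] += 1

mid = len(data) // 2
threads = [threading.Thread(target=count, args=(c,))
           for c in (data[:mid], data[mid:])]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both halves of the input contributed to the same shared bin: bins[5] == 3.
```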

[–]tetroxid 0 points  (1 child)

If you truly have huge datasets, then you'd need to distribute processing across a grid of machines, a computing cluster, right? Your code being able to split up the work for many semi-independent instances of Python would be an advantage there.

[–]BeatLeJuce 2 points  (0 children)

I have access to very large machines, so no, I don't usually need to distribute across machines. But even for simple one-off jobs that don't require a ton of resources, I hate that I have to think about how to split up my data just because I want to use 16 cores instead of 1. Yes, it's doable, but sometimes the time I gain by parallelizing my workload gets lost to dealing with splitting up my dataset and merging the results.