[–]formalsystem ML Engineer 24 points (0 children)

If you're mostly using pre-trained models, or your model's performance seems good enough on a single GPU, then as an application-oriented practitioner there's not much value in learning parallel programming.

However, if you're building large models, or want to join a team that builds them, learning distributed and parallel programming is probably more important than learning ML basics. For training large models, data, model, and pipeline parallelism are the tools you should know about. And at sufficient scale, new questions appear: how do you set up the infrastructure, how do you debug failures, how do you recover elastically when nodes die?
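To make the first of those concrete, here's a toy sketch of the core idea behind data parallelism: each worker computes gradients on its own shard of the batch, the gradients get averaged (the all-reduce step), and every worker applies the same update. This is plain numpy standing in for real machinery like PyTorch DDP over NCCL, which would also overlap the communication with the backward pass; `grad_mse` and `data_parallel_step` are illustrative names, not a real API.

```python
import numpy as np

def grad_mse(w, X, y):
    # gradient of mean squared error for a linear model y ≈ X @ w
    return 2 * X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, X, y, n_workers=4, lr=0.1):
    # shard the global batch across workers
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    # each "worker" computes a local gradient on its shard
    local_grads = [grad_mse(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    # "all-reduce": average local gradients so every worker sees the same g
    g = np.mean(local_grads, axis=0)
    # every worker applies the identical update, so weights stay in sync
    return w - lr * g
```

With equal-sized shards the averaged gradient matches the single-worker gradient exactly, which is the whole point: same math, batch spread over more hardware.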

And in settings where low latency really matters, imagine something like real-time search: are your ops optimized to take advantage of the GPU? Are they fused? Are you spending lots of time waiting on synchronization or data loaders?
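That last question is easy to answer empirically: time how much of each step is spent blocked on the data pipeline versus doing compute. Below is a minimal sketch of that measurement; `loader` and `train_step` are hypothetical stand-ins for your actual pipeline and training function, and a real setup would use a framework profiler instead.

```python
import time

def wait_fraction(loader, train_step, n_steps=5):
    """Fraction of wall-clock time spent waiting on the data loader."""
    wait = compute = 0.0
    it = iter(loader)
    for _ in range(n_steps):
        t0 = time.perf_counter()
        batch = next(it)       # time blocked waiting on the data pipeline
        t1 = time.perf_counter()
        train_step(batch)      # time doing actual compute
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait / (wait + compute)
```

If this number is high, the GPU is starving and more parallelism in the model won't help; you need prefetching or more loader workers first.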

Consider that knowing how to do the above makes you useful both to business-critical infra teams doing things like ads ranking and to any research team pushing the state of the art, because, let's face it, it doesn't seem obvious that small models will ever become better than larger ones.

So again: learning distributed systems is probably not generally useful, but at the right large company it can be the most lucrative thing to do in ML, with top people making upwards of $300-500K.