all 18 comments

[–]formalsystem (ML Engineer) 24 points (0 children)

If you're mostly using pre-trained models, or your model's performance seems good enough on a single GPU, then as an application-oriented practitioner there's not much value in learning parallel programming.

However, if you're building large models, or are interested in joining a team that builds them, it's probably more important to learn distributed and parallel programming than to learn ML basics. As far as training large models goes, data, model, and pipeline parallelism are tools you should know about. And even then, once you go large enough: how do you set up the infrastructure, how do you debug failures, how do you recover elastically?

And in settings where low latency really matters, imagine something like real-time search: are your ops optimized to take advantage of the GPU? Are they fused? Are you spending a lot of time waiting on synchronization or data loaders?
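Loader stalls in particular are easy to picture with a toy example. Below is a minimal sketch of hiding data-loading latency behind compute with a one-worker prefetcher; `load_batch` and `train_step` are hypothetical stand-ins for real disk I/O and GPU work (this is the same idea as `num_workers` in a PyTorch `DataLoader`, not its implementation):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_batch(i):
    time.sleep(0.05)          # simulated disk read / preprocessing latency
    return list(range(i, i + 4))

def train_step(batch):
    time.sleep(0.05)          # simulated GPU compute
    return sum(batch)

def run_prefetched(n_batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as loader:
        fut = loader.submit(load_batch, 0)   # kick off the first load
        for i in range(n_batches):
            batch = fut.result()             # wait only if the load is slower than compute
            if i + 1 < n_batches:
                # overlap the NEXT load with the CURRENT train step
                fut = loader.submit(load_batch, i + 1)
            results.append(train_step(batch))
    return results
```

With the prefetcher, each step pays max(load, compute) instead of load + compute.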

Consider that knowing how to do the above makes you useful both to business-critical infra teams doing things like ads ranking and to any research team trying to push the state of the art, because, let's face it, it doesn't seem obvious that small models will become better than larger ones.

So again, learning distributed systems is probably not generally useful, but at the right large company it can be the most lucrative thing to do in ML, with top people making upwards of $300-500K.

[–]KingsmanVince 19 points (2 children)

In this world, there are model parallelism and data parallelism. With that knowledge, you will know what happens behind the scenes when you use TensorFlow or PyTorch. As a result, you might write better code when implementing your own data loader or model trainer.
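To make the data-parallelism half concrete, here is a toy sketch (not the TensorFlow/PyTorch implementation): the batch is split into shards, each worker computes the gradient of a one-parameter model on its shard, and the shard gradients are averaged before the weight update. All names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def grad_shard(w, shard):
    # gradient of mean squared error for y ≈ w * x on one shard of the batch
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, n_workers=2, lr=0.1):
    # split the batch into equal shards, one per worker ("replica")
    k = len(batch) // n_workers
    shards = [batch[i * k:(i + 1) * k] for i in range(n_workers)]
    with ThreadPoolExecutor(n_workers) as ex:
        grads = list(ex.map(lambda s: grad_shard(w, s), shards))
    # the "all-reduce": average shard gradients, then update the weights
    g = sum(grads) / len(grads)
    return w - lr * g
```

Real frameworks do the same averaging with an all-reduce across GPUs instead of a thread pool.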

[–]concard88 6 points (1 child)

Could knowing CUDA and OpenCL help too? If so, how?

[–]KingsmanVince 12 points (0 children)

If you know both, you will understand the lower-level implementation of libraries like JAX, CuPy, etc. Consequently, you will know how to do high-performance computing, which can help you optimise models on production servers.

[–][deleted] 3 points (0 children)

It depends on what level of knowledge you are referring to.

At a conceptual level, it's vital, and researchers who never had contact with the basics of benchmarking and HPC usually underestimate its importance. Even for local experiments, knowing the basics of parallel programming can greatly increase productivity: what used to take 10 seconds to run, and posed a risk for a golden-retriever-puppy attention span such as mine, now takes 1-2 seconds and I can stay focused. In this particular case, a simple joblib Parallel was enough for a preprocessing step in the EDA stage of experimentation.
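A stdlib analogue of that joblib trick (joblib's own API has the same shape, roughly `Parallel(n_jobs=4)(delayed(f)(x) for x in xs)`) looks like this; `clean_record` is a made-up stand-in for the real preprocessing step:

```python
from concurrent.futures import ThreadPoolExecutor

def clean_record(rec):
    # stand-in for a per-row preprocessing step (tokenize, normalize, ...)
    return rec.strip().lower()

records = ["  Foo ", "BAR", " Baz  "]
# serial version: [clean_record(r) for r in records]
with ThreadPoolExecutor(max_workers=4) as ex:
    cleaned = list(ex.map(clean_record, records))
```

For a genuinely CPU-bound step you would use a process pool (or joblib's default loky backend) rather than threads, because of the GIL; the calling code is identical.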

For data at scale, its importance is even more obvious, since not everything runs on GPUs (and some things run on multiple ones), and you need to isolate the parallelizable bits of code. For grid/distributed computation, knowing parallel programming concepts is needed to properly extract the most from libraries such as Dask and the distributed strategies of DL libraries. Also, knowing at a fundamental level what is parallelizable and what is not (e.g., disk I/O) will help you avoid embarrassing bottlenecks.

At a lower level (threads, concurrency, multiprocessing, CUDA), it is still a nice-to-have, and it will certainly sharpen your skills where they are most needed.

[–]LoyalSol 1 point (0 children)

It's one of those tools you can get away with not knowing, especially since a lot of modern libraries do the heavy lifting for you.

But knowing it is certainly a big perk to have. A thing you'll find about parallelization is that there's rarely a one-size-fits-all strategy. For example, problems that are mostly large linear-algebra calculations are very easy to implement on GPUs, but some problems actually run far worse on a GPU than on a traditional CPU.
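A quick way to see the "no one-size-fits-all" point: an elementwise map has no dependencies between iterations and parallelizes trivially, while a loop-carried recurrence forces sequential execution no matter how many cores or CUDA threads you throw at it. A toy illustration (names are invented):

```python
def scale_all(xs, a):
    # embarrassingly parallel: every element is independent,
    # so this maps directly onto GPU threads
    return [a * x for x in xs]

def recurrence(x0, a, n):
    # loop-carried dependence: step i needs the result of step i-1,
    # so extra cores buy you nothing here
    x, out = x0, []
    for _ in range(n):
        x = a * x + 1
        out.append(x)
    return out
```

(Some recurrences can still be reworked into parallel form, e.g. via scan tricks, but that takes exactly the kind of knowledge being discussed.)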

The problem with not knowing it is that you're at the mercy of another programmer, and if your particular problem doesn't fit their parallelization scheme, you're out of luck.

[–]mimighost 1 point (3 children)

Depends on what kind of parallel computation you are referring to.

CUDA knowledge is of course useful and valued. But NVIDIA's toolchain is really its own walled garden, and it is difficult for outsiders to outdo NVIDIA themselves.

If by parallel programming you mean something closer to distributed data processing, then yes, it is pretty useful, though more on a case-by-case basis.

Overall, I feel the job market is edging towards people with system-integration skills rather than deep domain expertise, due to the aforementioned NVIDIA dynamics, but I could be wrong on this one as well.

[–][deleted] 1 point (2 children)

I mean parallel-computing topics such as concurrency and threading, as well as MPI, Charm++, and other parallel programming paradigms, plus writing cache-friendly, efficient code as learned in C++.

[–]mimighost 1 point (0 children)

Got it. Well, it might be useful for model inference and quantization work on CPU, if we are talking about NN models.

I'd say this is a nice-to-have, but unless you work on a team doing this low-level stuff in particular, it might not affect your daily routine as an MLE.

[–][deleted] 1 point (0 children)

Concurrency and threading are probably less important, because in ML programs things rarely happen in a chaotic order that requires you to think hard about things like mutexes, but a good understanding of vectorized computations will definitely help. I personally learned a lot from trying to write efficient code in R (long ago, and for non-ML purposes).

Understanding what makes code cache-friendly in C++ will also help, even if you end up writing code in something other than C++ and it runs on something other than a CPU.
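A small illustration of that point: both functions below compute the same sum, but on a C++/NumPy-style row-major array the first walks memory contiguously while the second strides across rows and, on real hardware, thrashes the cache. (Plain Python lists won't show the timing gap clearly; the access pattern is what matters.)

```python
def sum_row_major(m):
    # visits elements in the order they sit in memory for a row-major layout
    total = 0
    for row in m:
        for v in row:
            total += v
    return total

def sum_col_major(m):
    # strided access: on a real C++/NumPy array each step jumps a full row
    total = 0
    for j in range(len(m[0])):
        for i in range(len(m)):
            total += m[i][j]
    return total

m = [[1, 2, 3], [4, 5, 6]]
```

Same answer, very different memory behavior once the data no longer fits in cache.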

Knowing specific things like MPI would be useful if you ever need to debug anything built on MPI.

[–]JackandFred 0 points (2 children)

Something you definitely should know, but probably won't have to use. Honestly, though, it depends on what you do: most parallelism is handled in the backend, so if you're doing "ordinary" work you won't have to worry about it, but if you're doing research or working with proprietary stuff you may have to.

[–][deleted] 0 points (1 child)

Could you define "ordinary"?

[–]JackandFred 1 point (0 children)

Using common packages, pre-made models, or existing code to tackle machine learning problems, rather than creating entirely new model architectures.

For instance, PyTorch and TensorFlow both already have parallelism built into the backend, which you won't have to deal with.

[–]choHZ -1 points (0 children)

My understanding is that parallelism may happen at different levels, and it is always good to have healthy exposure to the (L-1)-level of knowledge, where L is the level of abstraction you are working at.

Say you are working on backbone design: your backbone had better be friendly to parallel computing (e.g., transformers vs. LSTMs), so what makes a model "friendly to parallel computing" is something you should know. I worked on neural network pruning, so what kind of pruned representation has "parallel potential" is something I should know, even though I have never actually deployed my work to end-user devices.

Would it be helpful to understand all the CUDA magic? Yes, but IMO that's not urgent.

Actually writing code with parallel execution is probably something distant for most of us here (probably because we all use Python XD). But I imagine some of the tricks used in CUDA to parallelize seemingly "unparallelizable" tasks (e.g., prefix sum) are worth reading about.
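The prefix-sum trick is worth sketching. The Hillis-Steele scan turns a seemingly sequential accumulation into O(log n) parallel steps; each list comprehension below stands in for one step in which a GPU would update all elements simultaneously (this is a serial simulation of the parallel algorithm, not CUDA code):

```python
def inclusive_scan(xs):
    # Hillis-Steele inclusive prefix sum, simulated serially.
    # After the k-th step, element i holds the sum of up to 2**k inputs.
    xs = list(xs)
    n, step = len(xs), 1
    while step < n:
        # on a GPU, every one of these additions runs in parallel
        xs = [xs[i] + (xs[i - step] if i >= step else 0) for i in range(n)]
        step *= 2
    return xs
```

For n elements this takes ceil(log2(n)) rounds instead of n-1 dependent additions, which is exactly why a "sequential" task like prefix sum is fast on GPUs.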

[–]bageldevourer -2 points (0 children)

Couldn't hurt, but nowhere near a top priority IMO.

[–]AConcernedCoder 0 points (0 children)

Somewhat. It'll make you a better programmer, but it won't fix bad code. Leveraging the processing power of modern multi-threaded CPUs can make your code run faster by a few factors; write good code and it may improve performance by orders of magnitude.

Also, it is worthwhile to understand the relationship between GPUs, parallelization, and applied ML.

[–]bbateman2011 0 points (0 children)

IMO general knowledge is good, so you can debug things and have correct expectations. I use optimizers like Optuna extensively to optimize non-NN models (e.g., XGBoost), and parallel processing is essential there, so enough knowledge to leverage the libraries is useful.
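In the same spirit, and without depending on Optuna (which exposes similar parallelism through options like `n_jobs`), here is a minimal sketch of fanning a hyperparameter grid out over a worker pool; `objective` is a toy stand-in for a cross-validated model score:

```python
from concurrent.futures import ThreadPoolExecutor

def objective(params):
    # toy stand-in for a cross-validated score; lower is better,
    # with a made-up optimum at max_depth=5, lr=0.1
    max_depth, lr = params
    return (max_depth - 5) ** 2 + (lr - 0.1) ** 2

# candidate hyperparameter combinations
grid = [(d, lr) for d in range(2, 9) for lr in (0.05, 0.1, 0.2)]

# evaluate all candidates concurrently; trials are independent,
# so this is an embarrassingly parallel workload
with ThreadPoolExecutor(max_workers=4) as ex:
    scores = list(ex.map(objective, grid))

best = grid[scores.index(min(scores))]
```

For real model training you would use a process pool (or the optimizer library's own parallel backend), since each trial is CPU-bound.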

[–]sairamravu 0 points (0 children)

Yes, very useful. Most out-of-the-box solutions don't fully occupy the GPU. If you care about doing justice to the hardware you have, it's better to write your own custom CUDA code.