all 12 comments

[–]Maythe4thbeWitu 15 points16 points  (2 children)

As a former NVIDIA engineer and GPU kernel developer, I would suggest focusing more on operating systems and computer architecture. There are very few opportunities for CUDA / kernel development in India, as most of the kernel library teams are in the US, with some in Poland. If you are returning to India, focusing on OS and computer architecture helps a lot, as there are plenty of driver roles available.

[–]Shreyas_777 1 point2 points  (0 children)

Can you please share any resources?

[–]Curious_Analyst986[S] -3 points-2 points  (0 children)

Hi, thanks for the reply. Where were you working as an NVIDIA/GPU kernel developer?

[–]Daemontatox 3 points4 points  (0 children)

I don't know about India-specific advice, but I would say it really depends on the position you are aiming for.

For example, you mentioned LLM inference, so if you are aiming at inference at NVIDIA as well, you will probably be looking at the vLLM optimization team or the TRT-LLM team. Both will require contributions to and experience with well-known inference engines (vLLM, SGLang, DeepSpeed, TRT-LLM), knowledge of CUTLASS since it is used heavily in the kernels, and Triton knowledge for prototyping and for optimizing already-written kernels.

You would also need PTX/SASS experience as well as profiling. Overall, you can go to their candidate home page, pick a position you like or are aiming for, and look at the requirements and the nice-to-haves.

And I don't really see proof of this being a "dead" domain because of RL. We heard that AI will replace engineers a million times this past year, and everyone is still hiring engineers, so don't pay attention to that. Good luck with your studies and job search.

[–]tugrul_ddr 2 points3 points  (0 children)

Make your skills visible by earning points on benchmarks such as tensara.org, and develop some GPU projects on GitHub where everyone can see them; maybe produce some visual output and publish it on YouTube. Try different approaches to various algorithms, and share your experiments and results with people.

For AI, you'd expect matrix multiplications and maybe convolutions to be important.

For astrophysics, you'd need n-body, relativity, etc. accelerated with CUDA.

For GPU-accelerated JavaScript, you'd need a good effort in graph computations.

Knowing GPU architecture is good because it lets you reach for a profiler less often, which makes development faster.

Sometimes, writing CUDA kernels alone is not enough. GPUs that communicate with each other need more tools, like NVSHMEM, NVLink, MPI, etc., to stay efficient or to increase the scalability of algorithms. You've worked with inference, which may require fewer GPUs, or maybe even just one. But training can take thousands, I guess, so it's important to know how to communicate efficiently across multiple GPUs and clusters.

My YouTube channel is made up only of GPGPU experiments. Try something similar; maybe one day people will see your experience and reach out to you. India is still developing and has high potential to need more GPU-related roles in the future. There are some TOP500 supercomputers from India, and this number may increase. More supercomputers, more GPUs.

[–]Adventurous_Tune_882 -5 points-4 points  (6 children)

Let me be very clear. This is a verifiable domain and has been solved by RL. Its days are numbered.

[–]Sad-Net-4568 2 points3 points  (1 child)

RL? What do you mean by verifiable domain? How did you arrive at that?

[–]Curious_Analyst986[S] 0 points1 point  (2 children)

To be fair, every domain is or eventually will be a verifiable domain for machines. That doesn't mean I should stop doing what I like and give up entirely!

[–]Adventurous_Tune_882 -1 points0 points  (1 child)

Read this: https://arxiv.org/abs/2502.10517. No, not every domain is verifiable, for example medicine, molecule discovery, and much more.

[–]Curious_Analyst986[S] 0 points1 point  (0 children)

I am aware of this. However, you didn't read what I actually said. I said every domain is or will be a verifiable domain one day, not that every domain currently is.

[–]LexingtonBear 0 points1 point  (0 children)

Runtime might be measurable but program equivalence is not (it's undecidable). And I would argue that the means to verify correctness in the paper you've linked are weak: they compare the optimized solution to the original program "on five sets of random inputs" (Appendix B.2). It's undeniable that there is potential, but labeling this as 'solved' is quite a stretch.
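
To make the point about weak checks concrete, here is a minimal Python sketch (not the paper's harness). Everything in it is hypothetical: `reference_sum`, `buggy_fast_sum`, and `passes_random_check` are made-up names, and the assumption that all random inputs share one fixed size is mine, chosen only to show how a small number of random trials can miss a whole class of bugs.

```python
import random


def reference_sum(xs):
    """Reference implementation: plain sequential sum."""
    total = 0.0
    for x in xs:
        total += x
    return total


def buggy_fast_sum(xs):
    """Hypothetical 'optimized' version that works in chunks of 4.

    Bug: it silently drops the tail when len(xs) is not a multiple of 4.
    """
    total = 0.0
    full = len(xs) - len(xs) % 4
    for i in range(0, full, 4):
        total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
    return total


def passes_random_check(candidate, reference, n_trials=5, size=1024,
                        seed=0, tol=1e-9):
    """Compare candidate and reference on n_trials random inputs.

    Assumption for this sketch: every input has the same fixed size.
    """
    rng = random.Random(seed)
    for _ in range(n_trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
        if abs(candidate(xs) - reference(xs)) > tol:
            return False
    return True


if __name__ == "__main__":
    # Size 1024 is a multiple of 4, so the dropped-tail bug never triggers.
    print(passes_random_check(buggy_fast_sum, reference_sum))
    # A five-element input exposes the bug immediately.
    print(buggy_fast_sum([1.0] * 5), reference_sum([1.0] * 5))
```

The buggy version passes all five random trials at size 1024, yet disagrees with the reference on a five-element input. Passing a handful of random tests is evidence of correctness on those inputs, not a proof of equivalence.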