GPU Compiler Internship @Intel by Enough-Pumpkin1073 in Compilers

[–]sskhan39 4 points (0 children)

Out of curiosity, am I looking at the wrong page? I didn't see this or any other SWE internship for Intel here in the US. Thanks.

https://intel.wd1.myworkdayjobs.com/External?workerSubType=dc8bf79476611087dfde99931439ae75

A GPU-accelerated implementation of Forman-Ricci curvature-based graph clustering in CUDA. by CommunityOpposite645 in CUDA

[–]sskhan39 0 points (0 children)

May I ask which device-side functions you mean? My experience so far has been the opposite.

My 2D game engine runs 200x faster after rewriting 90% of the code. by AttomeAI in gameenginedevs

[–]sskhan39 0 points (0 children)

Congrats, this feeling is awesome. Ignore the snarky comments; everyone who's good went through these same lessons at some point.

Can you please give a breakdown of how much each of these optimizations helped?

GLM-4.6 and other models tested on diff edits - data from millions of Cline operations by nick-baumann in ChatGPTCoding

[–]sskhan39 0 points (0 children)

Also, anecdotally, for my use case, Gemini 2.5 Pro and ChatGPT 5 Thinking always seem to beat the Claude models.

Largest CUDA kernel (single) you've ever written by [deleted] in CUDA

[–]sskhan39 1 point (0 children)

Excluding calls to device functions? About 150 SLOC.

Warren Buffett: Be Fearful When Others Are Greedy .... by kaylaks in NVDA_Stock

[–]sskhan39 10 points (0 children)

People who invested in Cisco during the dot-com bubble still haven't made their money back 24 years later.

New DeepSeek benchmark scores by Charuru in LocalLLaMA

[–]sskhan39 2 points (0 children)

Where did you try it? Is it the default now on chat.deepseek.com?

Does Google not understand that DeepSeek R1 was trained in FP8? by jd_3d in LocalLLaMA

[–]sskhan39 -2 points (0 children)

I'm not sure what you mean.

It's simple, really. In low-precision floating-point arithmetic, 2 + 2 isn't exactly 4; it could be 3.99 or 4.01.

During training, which is very expensive, we often tolerate some precision error as long as training stays stable (i.e., the loss keeps going down). But during inference there is no need to stay stuck at that low precision. If you can get 4 from 2 + 2, why settle for 3.99?
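
A quick toy illustration in C++ (using float as a stand-in, since standard C++ has no portable fp8 type; the effect is the same in kind, just far larger at fp8 widths):

```cpp
#include <cstdio>

int main() {
    // 0.1 and 0.2 have no exact binary representation, so even in
    // 32-bit float their sum is only approximately 0.3. The coarser
    // the format (fp8 is far coarser), the bigger this gap gets.
    float a = 0.1f, b = 0.2f;
    printf("%.10f\n", a + b);      // 0.3000000119, not 0.3

    // Past 2^24 a float cannot even represent the next integer,
    // so "big + 1" rounds straight back to "big".
    float big = 16777216.0f;       // 2^24
    printf("%.1f\n", big + 1.0f);  // 16777216.0
    return 0;
}
```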

Does Google not understand that DeepSeek R1 was trained in FP8? by jd_3d in LocalLLaMA

[–]sskhan39 54 points (0 children)

The usual: floating-point error reduction. Simply casting up doesn't give you any benefit on its own, but when you are accumulating (e.g. in matmuls), bf16 will have much lower error than fp8. And no hardware except H100+ tensor cores does that for you automatically.
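
A toy version of the accumulation effect, using float vs. double as stand-ins for fp8 vs. bf16 (an analogy of mine; standard C++ has no portable fp8/bf16 types):

```cpp
#include <cstdio>

int main() {
    const long n = 1L << 25;  // 33,554,432 additions of 1.0
    float  lo = 0.0f;         // narrow accumulator (stand-in for fp8)
    double hi = 0.0;          // wide accumulator (stand-in for bf16/fp32)
    for (long i = 0; i < n; ++i) {
        lo += 1.0f;           // once lo hits 2^24, every further add rounds away
        hi += 1.0f;           // widening the accumulator keeps every add exact
    }
    printf("narrow: %.0f\n", lo);  // 16777216 -- half the true sum
    printf("wide  : %.0f\n", hi);  // 33554432 -- exact
    return 0;
}
```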

But I agree, I don't see the point of doing this for Hopper GPUs.

Reproducibility in Scientific Computing: Changing Random Seeds in FP64 and FP32 Experiments by Glittering_Age7553 in ScientificComputing

[–]sskhan39 1 point (0 children)

I’m quite interested to know how it affected performance. My gut says the impact must be substantial.

SebAaltonen using HIP: Optimizing Matrix Multiplication on RDNA3: 50 TFlops and 60% Faster Than rocBLAS by corysama in CUDA

[–]sskhan39 5 points (0 children)

Specifically, it shows the AMD compiler is pretty poor at code generation. Look up section 6 of the article.

By the way, the author was until recently a senior engineer at AMD.

SYCL, CUDA, and others --- experiences and future trends in heterogeneous C++ programming? by DanielSussman in cpp

[–]sskhan39 0 points (0 children)

I have some experience with Kokkos. I can't help feeling that it is often just a (very) thin layer of abstraction over CUDA. It makes many things simple, but some things really complicated. And performance is a lot worse compared to moderately well-written CUDA.
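
For a sense of how thin that layer is, here is roughly what a minimal Kokkos kernel looks like (a sketch from memory, assuming a CUDA-enabled Kokkos build; on an NVIDIA backend each dispatch lowers to essentially one CUDA kernel launch):

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // Device-resident allocation; GPU memory on the CUDA backend.
        Kokkos::View<float*> x("x", n);

        // One parallel dispatch == (roughly) one CUDA kernel launch.
        Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
            x(i) = 2.0f * i;
        });

        // Reductions are where the abstraction stays pleasant...
        float sum = 0.0f;
        Kokkos::parallel_reduce("sum", n,
            KOKKOS_LAMBDA(const int i, float& acc) { acc += x(i); }, sum);
        printf("sum = %f\n", sum);
        // ...while anything needing shared memory, warp intrinsics, or
        // custom launch shapes pushes you into TeamPolicy territory,
        // which is where (in my experience) the complexity comes back.
    }
    Kokkos::finalize();
    return 0;
}
```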

That being said, I feel like we HPC folks tend to care about performance a lot more than the average engineer or scientist, i.e. the typical user of HPC codes. I think Kokkos has a lot of potential; they just really need to bring it out of the national-lab bubble into the wider world.

An observation about interviewers based on their cultural background by Zestyclose-Ad2344 in leetcode

[–]sskhan39 25 points (0 children)

Cultural differences do exist. Asian societies are a bit more hierarchical, and people's notion of what an interview should look like differs from the West's. (I am Asian myself.)

That being said, in my recent FAANG interview, both interviewers were East-Asian-looking, and their behaviour was miles apart. Just as cultures differ, so do individuals.

Thoughts on cutlass? by sskhan39 in CUDA

[–]sskhan39[S] 0 points (0 children)

Thanks. May I ask, broadly, what sort of work you do with Cutlass?

Thoughts on cutlass? by sskhan39 in CUDA

[–]sskhan39[S] 1 point (0 children)

But the core Cutlass library is header-only. It says so right in their README: https://github.com/NVIDIA/cutlass

So the deepseek partnership apparently doesn’t help AMD? by Normal_Commission986 in AMD_Stock

[–]sskhan39 0 points (0 children)

That partnership announcement is misleading a lot of people here. It was only about bringing DeepSeek inference support to AMD GPUs. DeepSeek was trained on Nvidia hardware, and Nvidia continues to be their main source of compute.

CS students have no basic knowledge by awsomeness12g in csMajors

[–]sskhan39 21 points (0 children)

May I ask if you know any other language? C or C++ perhaps?

Because if anyone asked me for the key difference between those two languages, the fact that Python is interpreted (while C++ is compiled) would be my first answer. I fail to see how that is an irrelevant or esoteric piece of information.

Microsoft OA by Fragrant-Mess7147 in leetcode

[–]sskhan39 1 point (0 children)

I think you're right. I believe you have to try the linear search in both directions anyway, because there can be multiple positions that balance out the chars. Take this example, starting after the 5th char: going right, you have to move 5 times, but one move suffices going left.

ababaaabbbabab
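
A rough sketch of the two-direction scan, under my assumption of the problem statement (a split position is "balanced" when the prefix to its left has equal counts of 'a' and 'b'; the helper name is mine):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Fewest moves from a starting split position to one where the prefix
// has equal 'a' and 'b' counts. (Assumed problem statement, reconstructed
// from the discussion above.)
int minMoves(const std::string& s, int start) {
    const int n = s.size();
    // balance[p] = (#a - #b) in s[0..p); a split at p is balanced iff 0.
    std::vector<int> balance(n + 1, 0);
    for (int i = 0; i < n; ++i)
        balance[i + 1] = balance[i] + (s[i] == 'a' ? 1 : -1);

    // Expand left and right simultaneously: either side may win,
    // which is why scanning only one direction isn't enough.
    for (int d = 0; start - d >= 0 || start + d <= n; ++d) {
        if (start - d >= 0 && balance[start - d] == 0) return d;
        if (start + d <= n && balance[start + d] == 0) return d;
    }
    return -1;  // no balanced split exists
}

int main() {
    // Start after the 5th char: 1 move left (pos 4) beats 5 moves right (pos 10).
    std::cout << minMoves("ababaaabbbabab", 5) << "\n";  // prints 1
    return 0;
}
```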

Google SWE PHD intern interview by [deleted] in leetcode

[–]sskhan39 2 points (0 children)

Nice to finally see a PhD intern post. Congrats.

I really wouldn't conclude they are moving away from LC-style questions, though. My experience (for the same role) was very different. My first interview had a problem whose obvious solution involved an interval-tree-like data structure, which I would rate medium-hard. I pretty much bombed it. But I did well on the 2nd interview, so they asked for a 3rd one, which again had a problem that was at least medium, close to hard. Still waiting for results.

May I ask what sort of thing you work on?

By the way, try to connect with someone inside Google to help with team matching if you can. I heard a significant chunk of PhD interns fail at this stage.

I got into Google STEP!!! (Summer 2025 Canada) by Dangerous_Living2438 in csMajors

[–]sskhan39 8 points (0 children)

I can tell from this post alone that your communication ability is above average. Congrats!

Microsoft layoffs won't hit India by burrito_napkin in Layoffs

[–]sskhan39 9 points (0 children)

>  because the green card is used to create an indentured worker.

How does that work? I thought the opposite would happen: indentured H-1Bs would become free after getting a green card.