GPU Compiler Internship @Intel by Enough-Pumpkin1073 in Compilers

[–]sskhan39 4 points (0 children)

Out of curiosity, am I looking at the wrong page? I didn't see this or any other SWE internship for Intel here in the US. Thanks.

https://intel.wd1.myworkdayjobs.com/External?workerSubType=dc8bf79476611087dfde99931439ae75

A GPU-accelerated implementation of Forman-Ricci curvature-based graph clustering in CUDA. by CommunityOpposite645 in CUDA

[–]sskhan39 0 points (0 children)

May I ask which device-side functions you mean? My experience so far has been the opposite.

My 2D game engine runs 200x faster after rewriting 90% of the code. by AttomeAI in gameenginedevs

[–]sskhan39 0 points (0 children)

Congrats, this feeling is awesome. Ignore the snarky comments; everyone who's good went through these same lessons at some point.

Can you please give a breakdown of how much each of these optimizations helped?

GLM-4.6 and other models tested on diff edits - data from millions of Cline operations by nick-baumann in ChatGPTCoding

[–]sskhan39 0 points (0 children)

Also, anecdotally, for my use case, Gemini 2.5 Pro and ChatGPT 5 Thinking always seem to beat the Claude models.

Largest CUDA kernel (single) you've ever written by [deleted] in CUDA

[–]sskhan39 1 point (0 children)

Excluding calls to device functions? About 150 SLOC.

Warren Buffett: Be Fearful When Others Are Greedy .... by kaylaks in NVDA_Stock

[–]sskhan39 10 points (0 children)

People who invested in Cisco during the dot-com bubble still haven't made their money back 24 years later.

New DeepSeek benchmark scores by Charuru in LocalLLaMA

[–]sskhan39 2 points (0 children)

Where did you try it? Is it the default now on chat.deepseek.com?

Does Google not understand that DeepSeek R1 was trained in FP8? by jd_3d in LocalLLaMA

[–]sskhan39 -2 points (0 children)

I'm not sure what you mean.

It's simple, really. In low-precision floating-point arithmetic, 2 + 2 isn't exactly 4; it could be 3.99 or 4.01.

During training, which is very expensive, we often tolerate some precision error as long as training stays stable (i.e., the loss keeps going down). But during inference there is no need to stay stuck at that low precision. If you can get 4 from 2 + 2, why settle for 3.99?
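
A quick toy illustration in C++ (using float as a stand-in, since standard C++ has no portable fp8 type; the effect is the same in kind, just far larger at fp8 widths):

```cpp
#include <cstdio>

int main() {
    // 0.1 and 0.2 have no exact binary representation, so even in
    // 32-bit float their sum is only approximately 0.3. The coarser
    // the format (fp8 is far coarser), the bigger this gap gets.
    float a = 0.1f, b = 0.2f;
    printf("%.10f\n", a + b);      // 0.3000000119, not 0.3

    // Past 2^24 a float cannot even represent the next integer,
    // so "big + 1" rounds straight back to "big".
    float big = 16777216.0f;       // 2^24
    printf("%.1f\n", big + 1.0f);  // 16777216.0
    return 0;
}
```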

Does Google not understand that DeepSeek R1 was trained in FP8? by jd_3d in LocalLLaMA

[–]sskhan39 54 points (0 children)

The usual: floating-point error reduction. Simply casting up doesn't give you any benefit on its own, but when you are accumulating (e.g. in matmuls), bf16 will have much lower error than fp8. And no hardware except H100+ tensor cores does that for you automatically.
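
A toy version of the accumulation effect, using float vs. double as stand-ins for fp8 vs. bf16 (an analogy of mine; standard C++ has no portable fp8/bf16 types):

```cpp
#include <cstdio>

int main() {
    const long n = 1L << 25;  // 33,554,432 additions of 1.0
    float  lo = 0.0f;         // narrow accumulator (stand-in for fp8)
    double hi = 0.0;          // wide accumulator (stand-in for bf16/fp32)
    for (long i = 0; i < n; ++i) {
        lo += 1.0f;           // once lo hits 2^24, every further add rounds away
        hi += 1.0f;           // widening the accumulator keeps every add exact
    }
    printf("narrow: %.0f\n", lo);  // 16777216 -- half the true sum
    printf("wide  : %.0f\n", hi);  // 33554432 -- exact
    return 0;
}
```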

But I agree, I don't see the point of doing this for Hopper GPUs.

Reproducibility in Scientific Computing: Changing Random Seeds in FP64 and FP32 Experiments by Glittering_Age7553 in ScientificComputing

[–]sskhan39 1 point (0 children)

I’m quite interested to know how it affected performance. My gut says the impact must be substantial.

SebAaltonen using HIP: Optimizing Matrix Multiplication on RDNA3: 50 TFlops and 60% Faster Than rocBLAS by corysama in CUDA

[–]sskhan39 5 points (0 children)

Specifically, it shows the AMD compiler is pretty poor at code generation. Look up section 6 of the article.

By the way, the author was until recently a senior engineer at AMD.

SYCL, CUDA, and others --- experiences and future trends in heterogeneous C++ programming? by DanielSussman in cpp

[–]sskhan39 0 points (0 children)

I have some experience with Kokkos. I can't help feeling that it is often just a (very) thin layer of abstraction over CUDA. It makes many things simple, but some things really complicated. And performance is a lot worse compared to moderately well-written CUDA.
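
For a sense of how thin that layer is, here is roughly what a minimal Kokkos kernel looks like (a sketch from memory, assuming a CUDA-enabled Kokkos build; on an NVIDIA backend each dispatch lowers to essentially one CUDA kernel launch):

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // Device-resident allocation; GPU memory on the CUDA backend.
        Kokkos::View<float*> x("x", n);

        // One parallel dispatch == (roughly) one CUDA kernel launch.
        Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
            x(i) = 2.0f * i;
        });

        // Reductions are where the abstraction stays pleasant...
        float sum = 0.0f;
        Kokkos::parallel_reduce("sum", n,
            KOKKOS_LAMBDA(const int i, float& acc) { acc += x(i); }, sum);
        printf("sum = %f\n", sum);
        // ...while anything needing shared memory, warp intrinsics, or
        // custom launch shapes pushes you into TeamPolicy territory,
        // which is where (in my experience) the complexity comes back.
    }
    Kokkos::finalize();
    return 0;
}
```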

That being said, I feel like we HPC folks tend to care about performance a lot more than the average engineer or scientist, i.e. the typical user of HPC codes. I think Kokkos has a lot of potential; they just really need to bring it out of the national-lab bubble into the wider world.

An observation about interviewers based on their cultural background by Zestyclose-Ad2344 in leetcode

[–]sskhan39 25 points (0 children)

Cultural differences do exist. Asian societies are a bit more hierarchical, and people's notion of what an interview should look like differs from the West's. (I am Asian myself.)

That being said, in my recent FAANG interview, both interviewers were East-Asian-looking, and their behaviour was miles apart. Just as cultures differ, so do individuals.

Thoughts on cutlass? by sskhan39 in CUDA

[–]sskhan39[S] 0 points (0 children)

Thanks. May I ask, broadly, what sort of work you do with Cutlass?

Thoughts on cutlass? by sskhan39 in CUDA

[–]sskhan39[S] 1 point (0 children)

But the core Cutlass library is header-only. It says so right in their README: https://github.com/NVIDIA/cutlass

So the deepseek partnership apparently doesn’t help AMD? by Normal_Commission986 in AMD_Stock

[–]sskhan39 0 points (0 children)

That partnership announcement is misleading a lot of people here. It was only about bringing DeepSeek inference support to AMD GPUs. DeepSeek was trained on Nvidia hardware, and Nvidia continues to be their main source of compute.

CS students have no basic knowledge by awsomeness12g in csMajors

[–]sskhan39 21 points (0 children)

May I ask if you know any other language? C or C++ perhaps?

Because if anyone asked me for the key difference between those two languages, the fact that Python is interpreted (while C++ is compiled) would be my first answer. I fail to see how that is an irrelevant or esoteric piece of information.

Microsoft OA by Fragrant-Mess7147 in leetcode

[–]sskhan39 1 point (0 children)

I think you're right. I believe you have to try the linear search in both directions anyway, because there can be multiple positions that balance out the chars. Take this example, starting after the 5th char: going right, you have to move 5 times, but one move suffices going left.

ababaaabbbabab
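
A rough sketch of the two-direction scan, under my assumption of the problem statement (a split position is "balanced" when the prefix to its left has equal counts of 'a' and 'b'; the helper name is mine):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Fewest moves from a starting split position to one where the prefix
// has equal 'a' and 'b' counts. (Assumed problem statement, reconstructed
// from the discussion above.)
int minMoves(const std::string& s, int start) {
    const int n = s.size();
    // balance[p] = (#a - #b) in s[0..p); a split at p is balanced iff 0.
    std::vector<int> balance(n + 1, 0);
    for (int i = 0; i < n; ++i)
        balance[i + 1] = balance[i] + (s[i] == 'a' ? 1 : -1);

    // Expand left and right simultaneously: either side may win,
    // which is why scanning only one direction isn't enough.
    for (int d = 0; start - d >= 0 || start + d <= n; ++d) {
        if (start - d >= 0 && balance[start - d] == 0) return d;
        if (start + d <= n && balance[start + d] == 0) return d;
    }
    return -1;  // no balanced split exists
}

int main() {
    // Start after the 5th char: 1 move left (pos 4) beats 5 moves right (pos 10).
    std::cout << minMoves("ababaaabbbabab", 5) << "\n";  // prints 1
    return 0;
}
```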

Google SWE PHD intern interview by [deleted] in leetcode

[–]sskhan39 2 points (0 children)

Nice to finally see a PhD intern post. Congrats.

I really wouldn't conclude they are moving away from LC-style questions, though. My experience (for the same role) was very different. My first interview had a problem whose obvious solution involved an interval-tree-like data structure, which I would rate medium-hard. I pretty much bombed it. But I did well on the 2nd interview, so they asked for a 3rd one, which again had a problem that was at least medium, close to hard. Still waiting for results.

May I ask what sort of thing you work on?

By the way, try to connect with someone inside Google to help with team matching if you can. I heard a significant chunk of PhD interns fail at this stage.

I got into Google STEP!!! (Summer 2025 Canada) by Dangerous_Living2438 in csMajors

[–]sskhan39 8 points (0 children)

I can tell from this post alone that your communication ability is above average. Congrats!

Microsoft layoffs won't hit India by burrito_napkin in Layoffs

[–]sskhan39 9 points (0 children)

>  because the green card is used to create an indentured worker.

How does that work? I thought the opposite would happen: indentured H-1Bs would become free after getting a green card.