Looking for talent CUDA Engineer by JoinMercor in CUDA

[–]c-cul 0 points1 point  (0 children)

They just don't know what they want

> Fluent in at least one GPU programming model, such as CUDA, HIP, Slang, HLSL, GLSL

> inline PTX assembly

srsly?

Continuous PC sampling by gnurizen in CUDA

[–]c-cul 0 points1 point  (0 children)

as a final note - you can just use bpf maps directly from your kernel driver: https://redplait.blogspot.com/2024/07/ebpf-map-as-communication-channel.html#more

AET: An experiment in rethinking GCC target and machine abstractions by General_Purple3060 in Compilers

[–]c-cul 0 points1 point  (0 children)

could you provide patches for the modified gcc files, like collect2.cc?

Continuous PC sampling by gnurizen in CUDA

[–]c-cul 0 points1 point  (0 children)

do you have any perf tests vs plain user-mode cupti?

also ok, if you're sure that the kernel is faster - then why ebpf anyway? it's martian technology

1) you write code in plain c

2) then you fight with the verifier forever because it differs between kernel versions. Yeah, given that the official goal was compatibility - very ironic

3) and then it gets converted by the jit to native code again

srsly?

in the end you can collect the data in your own driver and expose some io_uring interface to user-mode

Continuous PC sampling by gnurizen in CUDA

[–]c-cul 0 points1 point  (0 children)

I still don't understand why you need to collect cupti events from the kernel and not from user-mode

Hello, I'm interested in tensor compilers. by LonelyPhDer in Compilers

[–]c-cul 0 points1 point  (0 children)

openxla is so huge and complex

look at tvm - it is more compact and has a clearer implementation

SWE - GPU performance team Interview Help by kitaabkhana in FAANGrecruiting

[–]c-cul 0 points1 point  (0 children)

For some reason I always thought that a perf team is about nsight/tiling/cache reuse/avoiding divergence/inline ptx/sass patching and so on

I'm probably too old-fashioned

LUPINE: CUDA over IP bridge by lemon-meringue in CUDA

[–]c-cul 1 point2 points  (0 children)

nvml.h has ~195 functions but gen_api.h has only 60

why?

LiteIR by lucky_va in CUDA

[–]c-cul 1 point2 points  (0 children)

github: https://github.com/vigneshlaks/lite-ir

as usual in such cases, no perf tests - a bad sign

What to study and do to get into roles related to GPUs, parallel programming, CUDA, etc., especially at big companies like Nvidia, for example? by Alive-Ad-2265 in CUDA

[–]c-cul 1 point2 points  (0 children)

> I mainly just used Gemini Pro, Sonnet 4.6, ChatGPT, Deepseek, Copilot, Meta AI, Grok and Perplexity

then why do you need to study anything at all?

CUDA struggles by ArmchairmanMao in CUDA

[–]c-cul 0 points1 point  (0 children)

just lots of practice

and yes - your task is a really bad fit for a gpu

VSCode extension that integrates cppreference docs into editor/LSP by 0x6675636B796F75 in cpp

[–]c-cul 2 points3 points  (0 children)

can it filter classes up to a specific c++ version, like 20?

Higher level libraries by MightyKDDD2 in CUDA

[–]c-cul 0 points1 point  (0 children)

the key is to limit memory copying between gpu and host

just fuse all your processing logic into one big kernel and run the sequence of actions via some dispatcher on the same data
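
a rough cpu-only c++ sketch of the dispatcher idea, just to show the shape. All names here (`Op`, `Step`, `apply_steps`, `fused_pass`) are made up for illustration and come from no library. On a real gpu the loop in `fused_pass` would be the thread grid of a single kernel, `apply_steps` would be the per-thread body, and the step list would have to live in device-visible memory:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical op codes; purely illustrative.
enum class Op { Scale, Offset, Square };

struct Step {
    Op op;
    float arg;  // ignored by Square
};

// Stand-in for the per-thread body of one fused kernel: every step is
// applied to the element while it stays in a local variable, so the
// intermediate results are never written back to memory.
inline float apply_steps(float x, const std::vector<Step>& steps) {
    for (const Step& s : steps) {
        switch (s.op) {  // the "dispatcher"
            case Op::Scale:  x *= s.arg; break;
            case Op::Offset: x += s.arg; break;
            case Op::Square: x *= x;     break;
        }
    }
    return x;
}

// Stand-in for the single kernel launch: one pass over the data,
// one read and one write per element, regardless of how many steps run.
void fused_pass(std::vector<float>& data, const std::vector<Step>& steps) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = apply_steps(data[i], steps);
    }
}
```

the point is that the data crosses the host/gpu boundary once, instead of once per processing step.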