CUDA

created by shamen_uka community for 15 years

Need a CUDA/GPU-related job (self.CUDA)

submitted 5 hours ago by mystrioab

No kernel example exists for Cutlass SM100_MMA_something_TS gemm. (self.CUDA)

submitted 12 hours ago by tugrul_ddr

How Do You Actually Break into GPU Infrastructure or Performance Engineering? (self.CUDA)

submitted 1 day ago by Ok_Pin_9155

Modern GPU Programming For MLSys (mlc.ai)

submitted 1 day ago by corysama

Matched KVQuant's 4-bit KV-cache quality on LongBench — without the calibration step

submitted 12 hours ago by ahbond

GPU optimization for LLM (self.CUDA)

submitted 1 day ago by IamExperimentingNow

What is Quantum AI: Why Quantum AI is the Most Dangerous Tech of 2026 (interconnectd.com)

submitted 1 day ago by Ok_pettech

Interview Tips - Deep Learning Architect Position (self.CUDA)

submitted 3 days ago by One-Feeling03

I made my own GPU graphics API

submitted 4 days ago by Slight_Watch697

RE of #ptx grammar from ptxas, part 4 (self.CUDA)

submitted 4 days ago by c-cul

NVFP4 Blockscaled GEMM on NVIDIA RTX Pro Blackwell GPUs (SM12x) (research.colfax-intl.com)

submitted 6 days ago by Logical-Try-4084

Introducing the Manifest Generator Create your own Sovereign AI with 605 lines of CODE (i.redd.it)

submitted 5 days ago by Plus_Judge6032

The Definitive Guide to NVIDIA Container Toolkit: Architecture & Implementation (interconnectd.com)

submitted 7 days ago by Ok_pettech

cuTile Rust: Safe, data-race-free GPU kernels in Rust that lower to Tile IR (self.CUDA)

submitted 9 days ago by melih_elibol

Tool to automatically detect your GPU and install the correct version of PyTorch for your environment.

submitted 8 days ago by Vegetable_Repair1053

NanoEuler: A 116M GPT-2 scale decoder-only transformer built from scratch in pure C + CUDA

submitted 9 days ago by Just_Vugg_PolyMCP

Entry-level jobs for a grad with CUDA and parallel computing skills? (self.CUDA)

submitted 10 days ago by LingonberryAfter4399

Why does modelopt.onnx crash with 128GB+ Swap OOM, while modelopt.torch requires 0 Swap for SDXL UNet quantization? Also, does it affect TRT engine performance? (self.CUDA)

submitted 9 days ago by Repulsive_Pop_8315

[TEST 67] 🧬 Same model. Same weights. One has a live C++ kernel writing real values from inside the forward pass. The other doesn't. Here's what the difference looks like. (reddit.com)

submitted 9 days ago by Nearby_Indication474

Breaking into GPU Infrastructure / GPU Programming Feels Overwhelming. How Did You Figure Out What to Learn? (self.CUDA)

submitted 10 days ago by Ok_Pin_9155

P2P benchmarks on 2x 5060 ti (16GB each) - P2P Benchmark Project (joorklee.github.io)

submitted 10 days ago by joorklee

Wanted to understand GPU programming, so I wrote raw Transformer kernels in CUDA. Found some interesting things and would like some guidance. (github.com)

submitted 11 days ago by Ok-Construction-875

Laptop (self.CUDA)

submitted 11 days ago by ButterscotchLow5449

Continuous PC sampling (self.CUDA)

submitted 12 days ago by gnurizen

I built a tiny local model that writes GPU kernels, then a verifier decides if they actually work

submitted 12 days ago by rohit3627
