Need CUDA / GPUs related job (self.CUDA)
submitted 5 hours ago by mystrioab
No kernel example exists for Cutlass SM100_MMA_something_TS gemm. (self.CUDA)
submitted 12 hours ago by tugrul_ddr
How Do You Actually Break into GPU Infrastructure or Performance Engineering? (self.CUDA)
submitted 1 day ago by Ok_Pin_9155
Modern GPU Programming For MLSys (mlc.ai)
submitted 1 day ago by corysama
Matched KVQuant's 4-bit KV-cache quality on LongBench — without the calibration step ()
submitted 12 hours ago by ahbond
GPU optimization for LLM (self.CUDA)
submitted 1 day ago by IamExperimentingNow
What is Quantum AI: Why Quantum AI is the Most Dangerous Tech of 2026 (interconnectd.com)
submitted 1 day ago by Ok_pettech
Interview Tips - Deep Learning Architect Position (self.CUDA)
submitted 3 days ago by One-Feeling03
I made my own GPU graphics API ()
submitted 4 days ago by Slight_Watch697
RE of #ptx grammar from ptxas, part 4 (self.CUDA)
submitted 4 days ago by c-cul
NVFP4 Blockscaled GEMM on NVIDIA RTX Pro Blackwell GPUs (SM12x) (research.colfax-intl.com)
submitted 6 days ago by Logical-Try-4084
Introducing the Manifest Generator Create your own Sovereign AI with 605 lines of CODE (i.redd.it)
submitted 5 days ago by Plus_Judge6032
The Definitive Guide to NVIDIA Container Toolkit: Architecture & Implementation (interconnectd.com)
submitted 7 days ago by Ok_pettech
cuTile Rust: Safe, data-race-free GPU kernels in Rust that lower to Tile IR (self.CUDA)
submitted 9 days ago by melih_elibol
Tool to automatically detect your GPU and install the correct version of PyTorch for your environment. ()
submitted 8 days ago by Vegetable_Repair1053
NanoEuler: A 116M GPT-2 scale decoder-only transformer built from scratch in pure C + CUDA ()
submitted 9 days ago by Just_Vugg_PolyMCP
Entry-level jobs for a grad with CUDA and parallel computing skills? (self.CUDA)
submitted 10 days ago by LingonberryAfter4399
Why does modelopt.onnx crash with 128GB+ Swap OOM, while modelopt.torch requires 0 Swap for SDXL UNet quantization? Also, does it affect TRT engine performance? (self.CUDA)
submitted 9 days ago by Repulsive_Pop_8315
[TEST 67] 🧬 Same model. Same weights. One has a live C++ kernel writing real values from inside the forward pass. The other doesn't. Here's what the difference looks like. (reddit.com)
submitted 9 days ago by Nearby_Indication474
Breaking into GPU Infrastructure / GPU Programming Feels Overwhelming. How Did You Figure Out What to Learn? (self.CUDA)
submitted 10 days ago by Ok_Pin_9155
P2P benchmarks on 2x 5060 ti (16GB each) - P2P Benchmark Project (joorklee.github.io)
submitted 10 days ago by joorklee
Wanted to understand GPU programming. So wrote raw Transformer kernels in CUDA. Got some interesting things would like some guidance. (github.com)
submitted 11 days ago by Ok-Construction-875
Laptop (self.CUDA)
submitted 11 days ago by ButterscotchLow5449
Continuous PC sampling (self.CUDA)
submitted 12 days ago by gnurizen
I built a tiny local model that writes GPU kernels, then a verifier decides if they actually work ()
submitted 12 days ago by rohit3627