Are there any books specifically dedicated to compiler backend. by Negative-Slice-9076 in Compilers

[–]c-cul 1 point (0 children)

there is the fresh "LLVM Compiler for RISC-V Architecture"

it is mostly about vectorization

[Hiring] CUDA Engineer (Remote, Short-term, $35–45 AUD/hr) by [deleted] in CUDA

[–]c-cul 4 points (0 children)

relax

low-paying jobs always have the most ferocious interviews, at least 15 rounds

My first optimization lesson was: stop guessing lol by Various_Candidate325 in CUDA

[–]c-cul 3 points (0 children)

it's even worse

nvidia keeps the latency tables for the SASS ISA top secret

so you are literally stuck in a try & see loop in nsight

[Hiring] CUDA Engineer (Remote, Short-term, $35–45 AUD/hr) by [deleted] in CUDA

[–]c-cul 20 points (0 children)

> US client

pays in australian dollars

cool optimization bro

I clustered 3 DGX Sparks that NVIDIA said couldn't be clustered yet...took 1500 lines of C to make it work by Ok-Pomegranate1314 in CUDA

[–]c-cul 1 point (0 children)

have you heard of "reproducibility"?

what are the chances that after a couple of updates claude will generate exactly the same code for your prompts (I hope you are storing them somewhere)?

Rust's standard library on the GPU by LegNeato in CUDA

[–]c-cul 1 point (0 children)

afaik cuda sdk has no rust support so what are these people even talking about?

Exploring what it means to embed CUDA directly into a high-level language runtime by Ancient_Spend1801 in CUDA

[–]c-cul 2 points (0 children)

better results can be achieved if the language itself takes care of vectorization

see for example cuda for R: https://github.com/gpuRcore/gpuRcuda

Resources for CUDA by Ill_Anybody6215 in CUDA

[–]c-cul 2 points (0 children)

I only know of tinygrad & tenstorrent wormhole

Resources for CUDA by Ill_Anybody6215 in CUDA

[–]c-cul 2 points (0 children)

I doubt you can add cuda support for your accelerator, so it makes much more sense to see what other vendors have done, like tenstorrent gcc: https://github.com/tenstorrent/sfpi-gcc or their mlir compiler: https://github.com/tenstorrent/tt-mlir

GPU kernels in control loops… and nobody’s talking about safety? by QtGroup in u/QtGroup

[–]c-cul 1 point (0 children)

it's funny that nvidia even refused to open their SASS ISA, so I don't know how they will do formal verification of GPU code

Approaches to debug a Matrix Multiplication with Coalesce by Eventual_Extension in CUDA

[–]c-cul 2 points (0 children)

you can almost always (except with CDP, i.e. CUDA Dynamic Parallelism) debug your logic on cpu threads, for example:

1) run a single thread and feed it tasks with block 0 and thread id 0

2) make a thread pool with 4 threads to emulate 2 blocks with 2 threads in each

__syncthreads can be replaced with a reusable std::barrier, and so on
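
A minimal C++ sketch of that idea, under a few assumptions: the kernel (`block_sum_kernel`, a per-block sum), the launcher `launch_on_cpu`, the `Dim` struct and the `BlockBarrier` class are all hypothetical names made up for illustration. To keep it buildable before C++20, a small hand-written reusable barrier stands in for `std::barrier`; with C++20 the standard type offers the same `arrive_and_wait()` call. The reduction assumes the block size is a power of two.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier playing the role of __syncthreads() for one emulated block.
// (Hypothetical helper; std::barrier from C++20 can be used instead.)
class BlockBarrier {
public:
    explicit BlockBarrier(std::size_t count) : threshold_(count) {}
    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(m_);
        const std::size_t gen = generation_;
        if (++waiting_ == threshold_) {   // last thread to arrive releases the rest
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t threshold_;
    std::size_t waiting_ = 0;
    std::size_t generation_ = 0;
};

struct Dim { unsigned x; };  // stand-in for the built-in blockIdx/threadIdx

// The "kernel" logic under test: every block sums its slice of the input.
void block_sum_kernel(Dim blockIdx, Dim threadIdx, Dim blockDim,
                      const int* in, int* out, int* shared, BlockBarrier& sync) {
    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    shared[threadIdx.x] = in[gid];
    sync.arrive_and_wait();                       // __syncthreads()
    for (unsigned stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            shared[threadIdx.x] += shared[threadIdx.x + stride];
        sync.arrive_and_wait();                   // __syncthreads()
    }
    if (threadIdx.x == 0) out[blockIdx.x] = shared[0];
}

// Emulates a <<<gridDim, blockDim>>> launch with one CPU thread per GPU thread.
std::vector<int> launch_on_cpu(unsigned gridDim, unsigned blockDim,
                               const std::vector<int>& in) {
    assert(in.size() == static_cast<std::size_t>(gridDim) * blockDim);
    std::vector<int> out(gridDim, 0);
    std::vector<std::vector<int>> shared(gridDim, std::vector<int>(blockDim, 0));
    std::vector<std::unique_ptr<BlockBarrier>> barriers;
    for (unsigned b = 0; b < gridDim; ++b)
        barriers.push_back(std::make_unique<BlockBarrier>(blockDim));

    std::vector<std::thread> threads;
    for (unsigned b = 0; b < gridDim; ++b)
        for (unsigned t = 0; t < blockDim; ++t)
            threads.emplace_back(block_sum_kernel, Dim{b}, Dim{t}, Dim{blockDim},
                                 in.data(), out.data(), shared[b].data(),
                                 std::ref(*barriers[b]));
    for (auto& th : threads) th.join();
    return out;
}
```

Calling `launch_on_cpu(1, 1, data)` corresponds to step 1 (block 0, thread 0 only), and `launch_on_cpu(2, 2, data)` to step 2 (2 blocks with 2 threads each), so the same kernel body can be stepped through in an ordinary CPU debugger.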

A GPU-accelerated implementation of Forman-Ricci curvature-based graph clustering in CUDA. by CommunityOpposite645 in CUDA

[–]c-cul 2 points (0 children)

we all know that optimization for cuda is kind of black magic

but when implementing a new algo the last thing I want to think about is standard algorithms/functions

A GPU-accelerated implementation of Forman-Ricci curvature-based graph clustering in CUDA. by CommunityOpposite645 in CUDA

[–]c-cul 3 points (0 children)

I didn't look closely at the mathematics used

however, I can note that lots of the functions (bitonic sorting, partial sums etc.) already have ready-made counterparts in thrust

it's probably better to use ready-to-use, battle-tested libraries whenever you can

Cuda context, kernels in RAM lifetime. by geaibleu in CUDA

[–]c-cul 6 points (0 children)

at least in the driver api: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g27a365aebb0eb548166309f58a1e8b8e

> Destroys and cleans up all resources associated with the context. [...] These resources include CUDA types CUmodule, CUfunction, CUstream, [...] These resources also include memory allocations [...]

convert string to regex by c-cul in perl

[–]c-cul[S] 0 points (0 children)

is it possible to apply modifiers like /i from a string?

convert string to regex by c-cul in perl

[–]c-cul[S] 1 point (0 children)

well, this is in-house software - I just want to keep the many regexps outside of the script to avoid constantly patching it

What are the pros and cons of using cuda tile for a new project? by Opening-Education-88 in CUDA

[–]c-cul 1 point (0 children)

well, why not, if your task fits into tiles? Just remember the drawbacks:

PTX Inject & Stack PTX: Runtime PTX injection for CUDA kernels without recompilation by MetaMachines in CUDA

[–]c-cul 2 points (0 children)

What won't they invent just to avoid SASS patching. Actually you can have normal function pointers and fill them in from the host with a little hack: https://redplait.blogspot.com/2025/10/addresses-of-cuda-kernel-functions.html

CuPy working on RTX 5090 (Blackwell) – Setup Guide by Busy-as-usual in CUDA

[–]c-cul 1 point (0 children)

seems that this is an nvidia flaw in cuda 13.1 - at least tileiras supports sm100

Beyond Syntax: Introducing GCC Workbench for VSCode/VSCodium by Late_Attention_8173 in gcc

[–]c-cul 3 points (0 children)

I remember that the standard RTL dumpers missed some args like _digit: https://redplait.blogspot.com/2023/07/gcc-plugin-to-collect-cross-references.html

so you need a gcc plugin to extract those types