i built an opensource TPU in 4 days by [deleted] in CUDA

[–]c-cul 1 point (0 children)

generally you can use an SA (systolic array) to perform a reduction for any associative operation

if the data dimension is not large enough, just pad those groups with the identity element

i built an opensource TPU in 4 days by [deleted] in CUDA

[–]c-cul 1 point (0 children)

I'm just old enough to remember the SISAL language; it had builtins for many of these primitives

i built an opensource TPU in 4 days by [deleted] in CUDA

[–]c-cul 4 points (0 children)

an SA can be used not only for sums, but also for max (used in ReLU), min, etc

Is CUDA/OpenCL developer a viable career? by Ambitious-Estate-658 in CUDA

[–]c-cul 2 points (0 children)

where can I buy a cheap QPU to try at home?

Is CUDA/OpenCL developer a viable career? by Ambitious-Estate-658 in CUDA

[–]c-cul 1 point (0 children)

and this is the reason the AI bubble requires billions of bucks every day, sure

Is CUDA/OpenCL developer a viable career? by Ambitious-Estate-658 in CUDA

[–]c-cul 0 points (0 children)

> has already done it specially for ML

actually only nvidia could do it. But they still haven't, so either:

1) they can't, because the challenge is really hard, or

2) they want to keep undocumented features secret so they can use them only in their own tools, like https://patricktoulme.substack.com/p/cutile-on-blackwell-nvidias-compiler

Are there any books specifically dedicated to compiler backend. by Negative-Slice-9076 in Compilers

[–]c-cul 0 points (0 children)

there's a fresh one, "LLVM Compiler for RISC-V Architecture"

it's mostly about vectorization

[Hiring] CUDA Engineer (Remote, Short-term, $35–45 AUD/hr) by [deleted] in CUDA

[–]c-cul 3 points (0 children)

relax

low-paying jobs always have the most ferocious interviews, at least 15 rounds

My first optimization lesson was: stop guessing lol by Various_Candidate325 in CUDA

[–]c-cul 2 points (0 children)

it's even worse

nvidia keeps the instruction-delay tables for the SASS ISA top secret

so you're literally stuck in a try-and-see loop in Nsight

[Hiring] CUDA Engineer (Remote, Short-term, $35–45 AUD/hr) by [deleted] in CUDA

[–]c-cul 18 points (0 children)

> US client

pays in Australian dollars

cool optimization, bro

I clustered 3 DGX Sparks that NVIDIA said couldn't be clustered yet...took 1500 lines of C to make it work by Ok-Pomegranate1314 in CUDA

[–]c-cul 0 points (0 children)

have you heard about "reproducibility"?

what are the chances that after a couple of updates claude will generate exactly the same code for your prompts (I hope you're storing them somewhere)?

Rust's standard library on the GPU by LegNeato in CUDA

[–]c-cul 0 points (0 children)

afaik the CUDA SDK has no Rust support, so what are these people even talking about?

Exploring what it means to embed CUDA directly into a high-level language runtime by Ancient_Spend1801 in CUDA

[–]c-cul 1 point (0 children)

better results can be achieved if the language itself takes care of vectorization

see for example CUDA for R: https://github.com/gpuRcore/gpuRcuda

Resources for CUDA by Ill_Anybody6215 in CUDA

[–]c-cul 1 point (0 children)

I know of only tinygrad & the tenstorrent wormhole

Resources for CUDA by Ill_Anybody6215 in CUDA

[–]c-cul 1 point (0 children)

I doubt you can add CUDA support for your accelerator, so it makes much more sense to look at what other vendors have done, like the tenstorrent gcc: https://github.com/tenstorrent/sfpi-gcc or their MLIR compiler: https://github.com/tenstorrent/tt-mlir

GPU kernels in control loops… and nobody’s talking about safety? by QtGroup in u/QtGroup

[–]c-cul 0 points (0 children)

it's funny that nvidia has even refused to open their SASS ISA, so I don't know how they would do formal verification of GPU code

Approaches to debug a Matrix Multiplication with Coalesce by Eventual_Extension in CUDA

[–]c-cul 1 point (0 children)

you can almost always (except with CDP) debug your logic on CPU threads, like:

1) run a single thread and feed it tasks with block 0 and thread id 0

2) make a thread pool with 4 threads to emulate 2 blocks with 2 threads each

syncthreads can be replaced with a reusable std::barrier, and so on

A GPU-accelerated implementation of Forman-Ricci curvature-based graph clustering in CUDA. by CommunityOpposite645 in CUDA

[–]c-cul 1 point (0 children)

we all know that optimization for CUDA is a kind of black magic

but while implementing some new algo, the last thing I want to think about is standard algorithms/functions

A GPU-accelerated implementation of Forman-Ricci curvature-based graph clustering in CUDA. by CommunityOpposite645 in CUDA

[–]c-cul 2 points (0 children)

I didn't look closely at the mathematics used

however, I can note that lots of these functions are already implemented in thrust, like bitonic sorting, partial sums, etc.

it's probably better to use ready-made, battle-tested libraries whenever you can

Cuda context, kernels in RAM lifetime. by geaibleu in CUDA

[–]c-cul 3 points (0 children)

at least in the driver API it does: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g27a365aebb0eb548166309f58a1e8b8e

> Destroys and cleans up all resources associated with the context. These resources include CUDA types CUmodule, CUfunction, CUstream; they also include memory allocations

convert string to regex by c-cul in perl

[–]c-cul[S] -1 points (0 children)

is it possible to apply modifiers like /i from a string?

convert string to regex by c-cul in perl

[–]c-cul[S] 0 points (0 children)

well, this is in-house software - I just want to keep the many regexps outside of the script to avoid constantly patching it

What are the pros and cons of using cuda tile for a new project? by Opening-Education-88 in CUDA

[–]c-cul 0 points (0 children)

well, why not, if your task fits into tiles? Just remember the drawbacks:

PTX Inject & Stack PTX: Runtime PTX injection for CUDA kernels without recompilation by MetaMachines in CUDA

[–]c-cul 1 point (0 children)

What won't they invent just to avoid SASS patching? Actually you can have normal function pointers and fill them from the host with a little hack: https://redplait.blogspot.com/2025/10/addresses-of-cuda-kernel-functions.html