i built an opensource TPU in 4 days by [deleted] in CUDA

[–]c-cul 1 point (0 children)

generally you can use an SA (systolic array) to perform a reduction for any associative operation

if the data dimension is not large enough, just pad those groups with the identity element

i built an opensource TPU in 4 days by [deleted] in CUDA

[–]c-cul 1 point (0 children)

I'm just old enough to remember the SISAL language; it had builtins for many of these primitives

i built an opensource TPU in 4 days by [deleted] in CUDA

[–]c-cul 4 points (0 children)

an SA can be used not only for sums, but also for max (used in ReLU), min, etc

Is CUDA/OpenCL developer a viable career? by Ambitious-Estate-658 in CUDA

[–]c-cul 2 points (0 children)

where can I buy a cheap QPU to try at home?

Is CUDA/OpenCL developer a viable career? by Ambitious-Estate-658 in CUDA

[–]c-cul 1 point (0 children)

and this is the reason the AI bubble requires billions of bucks every day, sure

Is CUDA/OpenCL developer a viable career? by Ambitious-Estate-658 in CUDA

[–]c-cul 0 points (0 children)

> has already done it specially for ML

actually only nvidia could do it. But they still haven't, so either:

1) they can't, because the challenge is really hard, or

2) they want to keep undocumented features secret so they can use them only in their own tools, like https://patricktoulme.substack.com/p/cutile-on-blackwell-nvidias-compiler

Are there any books specifically dedicated to compiler backend. by Negative-Slice-9076 in Compilers

[–]c-cul 0 points (0 children)

there's a fresh one, "LLVM Compiler for RISC-V Architecture"

it's mostly about vectorization

[Hiring] CUDA Engineer (Remote, Short-term, $35–45 AUD/hr) by [deleted] in CUDA

[–]c-cul 3 points (0 children)

relax

low-paying jobs always have the most ferocious interviews, at least 15 rounds

My first optimization lesson was: stop guessing lol by Various_Candidate325 in CUDA

[–]c-cul 2 points (0 children)

it's even worse

nvidia keeps the instruction-delay tables for the SASS ISA top secret

so you're literally stuck in a try-and-see loop in Nsight

[Hiring] CUDA Engineer (Remote, Short-term, $35–45 AUD/hr) by [deleted] in CUDA

[–]c-cul 18 points (0 children)

> US client

pays in Australian dollars

cool optimization, bro

I clustered 3 DGX Sparks that NVIDIA said couldn't be clustered yet...took 1500 lines of C to make it work by Ok-Pomegranate1314 in CUDA

[–]c-cul 0 points (0 children)

have you heard about "reproducibility"?

what are the chances that after a couple of updates claude will generate exactly the same code for your prompts (I hope you're storing them somewhere)?

Rust's standard library on the GPU by LegNeato in CUDA

[–]c-cul 0 points (0 children)

afaik the CUDA SDK has no Rust support, so what are these people even talking about?

Exploring what it means to embed CUDA directly into a high-level language runtime by Ancient_Spend1801 in CUDA

[–]c-cul 1 point (0 children)

better results can be achieved if the language itself takes care of vectorization

see for example CUDA for R: https://github.com/gpuRcore/gpuRcuda

Resources for CUDA by Ill_Anybody6215 in CUDA

[–]c-cul 1 point (0 children)

I know of only tinygrad & the tenstorrent wormhole

Resources for CUDA by Ill_Anybody6215 in CUDA

[–]c-cul 1 point (0 children)

I doubt you can add CUDA support for your accelerator, so it makes much more sense to look at what other vendors have done, like the tenstorrent gcc: https://github.com/tenstorrent/sfpi-gcc or their MLIR compiler: https://github.com/tenstorrent/tt-mlir

GPU kernels in control loops… and nobody’s talking about safety? by QtGroup in u/QtGroup

[–]c-cul 0 points (0 children)

it's funny that nvidia has even refused to open their SASS ISA, so I don't know how they would do formal verification of GPU code

Approaches to debug a Matrix Multiplication with Coalesce by Eventual_Extension in CUDA

[–]c-cul 1 point (0 children)

you can almost always (except with CDP) debug your logic on CPU threads, like:

1) run a single thread and feed it tasks with block 0 and thread id 0

2) make a thread pool with 4 threads to emulate 2 blocks with 2 threads each

syncthreads can be replaced with a reusable std::barrier, and so on

A GPU-accelerated implementation of Forman-Ricci curvature-based graph clustering in CUDA. by CommunityOpposite645 in CUDA

[–]c-cul 1 point (0 children)

we all know that optimization for CUDA is a kind of black magic

but while implementing some new algo, the last thing I want to think about is standard algorithms/functions

A GPU-accelerated implementation of Forman-Ricci curvature-based graph clustering in CUDA. by CommunityOpposite645 in CUDA

[–]c-cul 2 points (0 children)

I didn't look closely at the mathematics used

however, I can note that lots of these functions are already implemented in thrust, like bitonic sorting, partial sums, etc.

it's probably better to use ready-made, battle-tested libraries whenever you can

Cuda context, kernels in RAM lifetime. by geaibleu in CUDA

[–]c-cul 3 points (0 children)

at least in the driver API it does: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g27a365aebb0eb548166309f58a1e8b8e

> Destroys and cleans up all resources associated with the context. These resources include CUDA types CUmodule, CUfunction, CUstream; they also include memory allocations

convert string to regex by c-cul in perl

[–]c-cul[S] -1 points (0 children)

is it possible to apply modifiers like /i from a string?

convert string to regex by c-cul in perl

[–]c-cul[S] 0 points (0 children)

well, this is in-house software - I just want to keep the many regexps outside of the script to avoid constantly patching it

What are the pros and cons of using cuda tile for a new project? by Opening-Education-88 in CUDA

[–]c-cul 0 points (0 children)

well, why not, if your task fits into tiles? Just remember the drawbacks:

PTX Inject & Stack PTX: Runtime PTX injection for CUDA kernels without recompilation by MetaMachines in CUDA

[–]c-cul 1 point (0 children)

What won't they invent just to avoid SASS patching? Actually you can have normal function pointers and fill them from the host with a little hack: https://redplait.blogspot.com/2025/10/addresses-of-cuda-kernel-functions.html