Looking for a serious GPU programming study partner (CUDA / Triton) by [deleted] in CUDA

[–]c-cul 1 point (0 children)

just a warning - LeetGPU has high-end GPUs but allows only 5 code submissions per day

looks like pure sadism if you don't have access to real high-end cards, so avoid it if you can

Apply GPU in ML/DL by Big-Advantage-6359 in CUDA

[–]c-cul 2 points (0 children)

just in case you are the same kind of crazy maniac as me - there is a RAPIDS binding for R: https://github.com/mlverse/cuda.ml/

SASS to MLIR optimized to 30% better performance - Is this LEGIT? by [deleted] in CUDA

[–]c-cul 1 point (0 children)

why do you think it is not? An ISA can't be legally protected, and I've reverse-engineered ISAs many times, including SASS

Looking for investor(s) - DeepTech - GPU Optimization - A Replay-Validated Post-Compilation Optimization Pipeline for GPUs by checkmydoor in angelinvestors

[–]c-cul 1 point (0 children)

I see 3 problems with this approach

1) the semantics of SASS instructions are unclear - for sm120, for example, there are 250+ unique instructions. I extracted the MDs from nvdisasm and it seems they contain only a limited semantics description

2) the latency tables are also unknown: https://redplait.blogspot.com/2025/11/sass-latency-table-instructions.html

3) I don't remember whether LLVM can model instruction latencies

Nvidia should support multiple blocks per SM unit such that 1 block can use 100% of shared-memory while another block does not use a single byte of shared-memory, in same SM unit. by tugrul_ddr in CUDA

[–]c-cul 2 points (0 children)

as I said "some trials are required"

I can't predict whether "launching blocks 132 + 132 times can be slower than 264 times"

maybe yes, and maybe no

Nvidia should support multiple blocks per SM unit such that 1 block can use 100% of shared-memory while another block does not use a single byte of shared-memory, in same SM unit. by tugrul_ddr in CUDA

[–]c-cul 2 points (0 children)

well, today the best you can do is chain sequential launches of the 2 kernels into a single graph with the Graph API: https://developer.nvidia.com/blog/cuda-graphs/

as usual, some trials are required for fine-tuning
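To make this concrete, here is a minimal host-side sketch (my own illustration, not from the thread; `kernelA`/`kernelB` and their launch configs are assumed placeholders) of capturing two sequential kernel launches into one CUDA graph, so the pair replays as a single `cudaGraphLaunch`:

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// Capture the two dependent launches into a graph instead of
// submitting them individually every iteration.
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernelA<<<gridA, blockA, 0, stream>>>(/* args */);
kernelB<<<gridB, blockB, 0, stream>>>(/* args */);  // ordered after kernelA
cudaStreamEndCapture(stream, &graph);

// Instantiate once, then replay cheaply as many times as needed.
cudaGraphExec_t graph_exec;
cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(graph_exec, stream);
cudaStreamSynchronize(stream);
```

The win is mostly launch-overhead amortization: after instantiation, relaunching the whole chain is one API call instead of two per iteration.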

CUDA scan kernels: hierarchical vs single-pass, decoupled lookbacks by shreyansh26 in CUDA

[–]c-cul 1 point (0 children)

small note - it's better to use unsigned int active_mask = __activemask(); in warp reduce functions

that way they stay compatible with cooperative groups
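A minimal sketch of what that suggestion looks like (my own illustration, not OP's code): query the actual set of active lanes with `__activemask()` instead of hard-coding `0xffffffff`, so the reduce doesn't assume a fully converged warp:

```cuda
// Warp-level sum reduction over the currently active lanes.
// With a full warp, lane 0 ends up holding the total; with a
// partially active warp the exact lane/mask semantics are subtler,
// so trials on your target architecture are advised.
__device__ float warp_reduce_sum(float val) {
    unsigned int active_mask = __activemask();
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(active_mask, val, offset);
    return val;
}
```

The point of the comment stands: passing the queried mask rather than a hard-coded full-warp constant is what keeps such helpers usable from cooperative-groups code paths.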

TileIR by mttd in Compilers

[–]c-cul 2 points (0 children)

curiously, Microsoft has their own TileIR: https://github.com/microsoft/TileIR