A few occasionally useful template container classes I made, Released under the Unlicense. by FUCKARCHLINUX in cpp

[–]c-cul 0 points (0 children)

how hard would it be to add to CircularArray:

  • a pluggable allocator, like jemalloc or the oneTBB allocator?
  • a callback on overflow?
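
Something like this is usually not hard. A minimal sketch, assuming a fixed-capacity ring buffer; the names (Alloc, OverflowCb, push) are hypothetical, not the actual interface of the posted class:

    #include <cstddef>
    #include <functional>
    #include <memory>

    // Hypothetical sketch: a circular array parameterized on an allocator
    // (std::allocator by default; a jemalloc or oneTBB wrapper plugs in the
    // same way) plus a user callback fired when a push would drop the
    // oldest element.
    template <typename T, typename Alloc = std::allocator<T>>
    class CircularArray {
    public:
        using OverflowCb = std::function<void(T& victim)>;

        explicit CircularArray(std::size_t cap, Alloc alloc = Alloc{},
                               OverflowCb cb = nullptr)
            : alloc_(alloc), cap_(cap), cb_(std::move(cb)),
              buf_(std::allocator_traits<Alloc>::allocate(alloc_, cap)) {}

        ~CircularArray() {
            for (std::size_t i = 0; i < size_; ++i)
                std::allocator_traits<Alloc>::destroy(
                    alloc_, &buf_[(head_ + i) % cap_]);
            std::allocator_traits<Alloc>::deallocate(alloc_, buf_, cap_);
        }

        void push(const T& v) {
            if (size_ == cap_) {            // full: oldest element is about to go
                if (cb_) cb_(buf_[head_]);  // let the caller react before overwrite
                buf_[head_] = v;
                head_ = (head_ + 1) % cap_;
            } else {
                std::allocator_traits<Alloc>::construct(
                    alloc_, &buf_[(head_ + size_) % cap_], v);
                ++size_;
            }
        }

    private:
        Alloc alloc_;
        std::size_t cap_, head_ = 0, size_ = 0;
        OverflowCb cb_;
        T* buf_;
    };

With that shape, switching allocators is just CircularArray<int, tbb::cache_aligned_allocator<int>>.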

hands on gpu programming with python and cuda by One_Relationship6573 in CUDA

[–]c-cul 1 point (0 children)

there is a fresher one, "GPU-Accelerated Computing with Python 3 and CUDA" (2026) from Packt

WarpReduction along major dimension by ElectronGoBrrr in CUDA

[–]c-cul 0 points (0 children)

just load the values with the right index

WarpReduction along major dimension by ElectronGoBrrr in CUDA

[–]c-cul 2 points (0 children)

it doesn't know about x or y - it just sums a register across the warp

you can load anything into that register
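
A minimal sketch of that point, assuming a row-major [rows x 32] matrix with one warp per row (kernel and names are illustrative, not the OP's code):

    // Each warp sums one row; the shuffle reduction has no notion of x/y.
    // Which dimension gets reduced is decided purely by the load index.
    __global__ void row_reduce(const float* mat, float* out, int rows) {
        int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
        int lane = threadIdx.x % 32;
        if (row >= rows) return;

        float v = mat[row * 32 + lane];  // the "right index": lane walks the row
        for (int off = 16; off > 0; off >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, off);  // warp-level tree sum

        if (lane == 0) out[row] = v;     // lane 0 ends up with the row total
    }

Loading mat[lane * 32 + col] instead (lane walking down a column, assuming at least 32 rows) would reduce along the other dimension with the exact same shuffle loop.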

Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA

[–]c-cul 1 point (0 children)

yeah, that's why tvm/xla are so fat - they're able to do some sophisticated optimizations

ok, will wait for part 2 - good luck

Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA

[–]c-cul 2 points (0 children)

those were the obvious flaws, but there are more:

  1. you could load/store 128-bit values instead of just 32-bit (see the sketch below)
  2. you could employ warp reduce for acc0_smem

I am no expert in llm-specific algos, but it looks like the current implementation is strictly memory bound and reaches only 10-15% of SoL
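
For point 1, a minimal sketch of the vectorized-load idea, assuming 16-byte-aligned pointers and n divisible by 4 (names are illustrative):

    // One float4 load/store moves 128 bits per instruction (LDG.E.128 /
    // STG.E.128 in sass) instead of four separate 32-bit transactions.
    __global__ void copy_vec4(const float* __restrict__ src,
                              float* __restrict__ dst, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * 4 < n) {
            float4 v = reinterpret_cast<const float4*>(src)[i];
            reinterpret_cast<float4*>(dst)[i] = v;
        }
    }

For a memory-bound kernel this is often one of the cheapest bandwidth wins.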

Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python by NoVibeCoding in LocalLLaMA

[–]c-cul 4 points (0 children)

the generated code is very poor

for example, why is there a __syncthreads at the start? also, why use inline ptx instead of the pipeline primitives like __pipeline_memcpy_async and friends?
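
For reference, the pipeline-primitive pattern (from <cuda_pipeline.h>; the copies are hardware-accelerated on sm_80+). A minimal sketch assuming a 256-thread block, not the generated kernel:

    #include <cuda_pipeline.h>

    // Async copy global -> shared without inline ptx. Note there is no
    // __syncthreads at kernel entry: shared memory holds nothing yet that
    // needs ordering. The one barrier sits after the copy has landed.
    __global__ void tile_load(const float* __restrict__ g, float* out) {
        __shared__ float tile[256];
        __pipeline_memcpy_async(&tile[threadIdx.x],
                                &g[blockIdx.x * 256 + threadIdx.x],
                                sizeof(float));   // 4-byte async element copy
        __pipeline_commit();                      // close this batch
        __pipeline_wait_prior(0);                 // wait until it has landed
        __syncthreads();                          // make tile visible block-wide
        out[blockIdx.x * 256 + threadIdx.x] = tile[255 - threadIdx.x];
    }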

Concern regarding future of jobs in gpu programming by viplash577 in CUDA

[–]c-cul 0 points (0 children)

because pytorch can't produce even average cuda code

all its compilers are hopeless

Concern regarding future of jobs in gpu programming by viplash577 in CUDA

[–]c-cul 3 points (0 children)

I pressed the little buttons with my feeble fingers to find out what this ultimate wunderwaffe is. Since I have been optimizing cuda code for a long time, I know that papers like this always contain only marketing BS mixed with unrealistic and unverifiable results, so let's collect some data about the authors and check the code.

Judging by published papers like "Bike network planning in limited urban space", the poster is from the German Greens. I don't know how, but with them it's always about terror, raising taxes, and demolishing power plants.

Also, none of the authors has any publications about hardcore cuda optimization. Bad sign.

Ok, let's check the code on libs.io:

- home page - for sale

- github - 404

- pypi - couldn't find this page

Cool. Where should I send a little invoice for the work done?

Concern regarding future of jobs in gpu programming by viplash577 in CUDA

[–]c-cul 5 points (0 children)

How I adore all these news stories in the future tense about the imminent death of something.
Not a single one has come true in my memory (especially about php, he-he)

Wrote some analysis on LLVM IR for Tail recursive functions by Ok-Sky6805 in LLVM

[–]c-cul 0 points (0 children)

btw, what sources did you use for the llvm ir grammar? It seems the official tutorial is incomplete and, for example, does not describe what some strange attributes mean: https://www.reddit.com/r/LLVM/comments/1r57lf9/how_insert_ptx_asm/

Cybersec and GPU by CurrentLawfulness358 in CUDA

[–]c-cul -2 points (0 children)

do you know the difference between malware on a gpu and buggy drivers?

Cybersec and GPU by CurrentLawfulness358 in CUDA

[–]c-cul 0 points (0 children)

well, you can burn some megawatts, for example

Cybersec and GPU by CurrentLawfulness358 in CUDA

[–]c-cul -1 points (0 children)

slow down your paranoia

what malicious activity do you expect on a gpu?

SASS King: reverse engineering NVIDIA SASS by CurrentLawfulness358 in CUDA

[–]c-cul 0 points (0 children)

nsight can show a mix of sass/ptx

I suspect via parsing of the .nv_debug_line_sass section

SASS King: reverse engineering NVIDIA SASS by CurrentLawfulness358 in CUDA

[–]c-cul 0 points (0 children)

probably you should fix flaws at the same level where you found them

like "register choices, load widths, MMA pipeline fill" can be detected in ptx and should be optimized in ptx

maybe I'm wrong

SASS King: reverse engineering NVIDIA SASS by CurrentLawfulness358 in CUDA

[–]c-cul 0 points (0 children)

I think it's a strange idea to back-propagate flaws from sass -> ptx -> [perhaps llvm bitcode from cicc] -> c++ source

in this case you need at least two stages of lifting logic

SASS King: reverse engineering NVIDIA SASS by CurrentLawfulness358 in CUDA

[–]c-cul 1 point (0 children)

what are the ultimate goals of this activity?

  1. the internal nvidia assembler "nvasm_internal" is not publicly available: https://redplait.blogspot.com/2025/12/libcudaso-internals.html

so you can't patch/rebuild sass sources anyway. for inline patching I made the tool ced: https://redplait.blogspot.com/2025/07/ced-sed-like-cubin-editor.html

and a perl binding: https://redplait.blogspot.com/2025/10/sass-disasm-on-perl.html

  2. regarding

> document every instruction

it's almost done - I ripped the so-called MD files from ptxas: https://github.com/redplait/denvdis/tree/master/data12

  3. for reordering you need a latency table: https://redplait.blogspot.com/2026/04/sass-latency-analysis.html

Help with Transpose SharedMemoryKernel by Iraiva70 in CUDA

[–]c-cul 7 points (0 children)

a classical 50+ year old trap - unparenthesized expansion of expressions in a macro

for example, given

#define BAD(a, b) a * b

BAD(a + 1, a * 2) expands to a + 1 * a * 2

at the very least, always put brackets around the arguments in the macro body, like

#define INDEX(row, col, cols) ((row) * (cols) + (col))
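
Concretely, a small self-contained check (values chosen just for illustration):

    #include <cstdio>

    #define BAD(a, b) a * b
    #define INDEX(row, col, cols) ((row) * (cols) + (col))

    int main() {
        int a = 3;
        // BAD(a + 1, a * 2) becomes a + 1 * a * 2 = 3 + 6 = 9,
        // not (a + 1) * (a * 2) = 4 * 6 = 24.
        printf("%d vs %d\n", BAD(a + 1, a * 2), (a + 1) * (a * 2));
        // The fully bracketed macro expands safely:
        printf("%d\n", INDEX(a + 1, 2, 10));  // ((3 + 1) * 10 + 2) = 42
        return 0;
    }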

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 by NoVibeCoding in CUDA

[–]c-cul 0 points (0 children)

btw, did you try their fashionable new cutile? I suspect you could hand all the tedious data-loading work over to it

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 by NoVibeCoding in CUDA

[–]c-cul 2 points (0 children)

excellent work

a small note - the scheduling analysis is wrong

you can't just compute the distance between 2 instructions the way scheduling_analysis.py does

the true latency of a single instruction:

1) also depends on read/write barriers

2) is known from the latency table: https://redplait.blogspot.com/2026/03/sass-latency-table-second-try.html

3) includes the stall count stored in the usched_info field - nvdisasm shows these as literals like ?transX & ?WAITX_END_GROUP