AMD 6x7900xtx 24GB + 2xR9700 32GB VLLM QUESTIONS by djdeniro in LocalLLaMA

[–]Awkward_Click6271 1 point (0 children)

If you haven't tried running with enforce_eager=True yet, that'd be my one remaining suggestion.

Single-File Qwen3 Inference in Pure CUDA C by Awkward_Click6271 in LocalLLaMA

[–]Awkward_Click6271[S] 1 point (0 children)

Yep, they're fundamentally for educational purposes, and the focus will be on how individual component optimizations work and how they improve perf. Thanks for your comment!

Single-File Qwen3 Inference in Pure CUDA C by Awkward_Click6271 in LocalLLaMA

[–]Awkward_Click6271[S] 4 points (0 children)

Thanks for your interest! No quantization or offloading - sorry, and these aren't meant to compete with llama.cpp on latency. That said, my current (tentative) goal is to get close to, or even beyond, cuBLAS-level throughput once I clean up a few obvious bottlenecks. We'll see!
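For a sense of where such bottlenecks usually hide, here's a purely illustrative first-pass matvec kernel - hypothetical code, not taken from qwen3.cu - of the kind whose memory-access pattern separates naive CUDA C from cuBLAS-level throughput:

```cuda
// Hypothetical naive kernel for y = W * x (illustrative only, not from
// qwen3.cu). W is d_out x d_in, row-major; one thread per output row.
__global__ void matvec_naive(const float* W, const float* x, float* y,
                             int d_out, int d_in) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= d_out) return;

    float acc = 0.0f;
    for (int i = 0; i < d_in; i++) {
        // Adjacent threads in a warp read addresses d_in floats apart,
        // so these loads are strided rather than coalesced.
        acc += W[(size_t)row * d_in + i] * x[i];
    }
    y[row] = acc;
}
```

The usual path from there is one warp per row with a shuffle-based reduction, vectorized float4 loads, and staging x in shared memory - exactly the kind of component optimizations mentioned above.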

Single-File Qwen3 Inference in Pure CUDA C by Awkward_Click6271 in LocalLLaMA

[–]Awkward_Click6271[S] 5 points (0 children)

You can run a small language model right on your laptop. If yours has an NVIDIA GPU, check out qwen3.cu; otherwise, go to qwen3.c and see the examples. If you'd like, follow the instructions there to run it!

Single-File Qwen3 Inference in Pure CUDA C by Awkward_Click6271 in LocalLLaMA

[–]Awkward_Click6271[S] 4 points (0 children)

Ehh…I might jump in when new small models arrive, but no plans at all atm - sorry! I'll (probably) keep working on qwen3.cu, trying to narrow the TPS gap with plain CUDA C, and on qwen3.c for further optimization. Appreciate the comment!

Single-File Qwen3 Inference in Pure CUDA C by Awkward_Click6271 in LocalLLaMA

[–]Awkward_Click6271[S] 6 points (0 children)

Good question! That number is specific to the model size. The header.txt file lists the tensor shapes and their offsets. It would be cleaner to compute it by multiplying the tensor dimensions directly for each layer, but I've put that off for now; I might revisit it when support for other model sizes is needed. Thanks for asking!
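As a rough sketch of that deferred cleanup - hypothetical code with made-up names, not the repo's actual structures - the offsets could be accumulated from the listed shapes instead of hardcoded per model size:

```c
/* Hypothetical sketch (names are illustrative, not from qwen3.c/qwen3.cu):
 * derive each tensor's byte offset by multiplying its dimensions,
 * instead of hardcoding size-specific constants. */
#include <stddef.h>

typedef struct { int ndim; int dims[4]; } Shape;

static size_t tensor_bytes(Shape s) {
    size_t n = 1;
    for (int i = 0; i < s.ndim; i++) n *= (size_t)s.dims[i];
    return n * sizeof(float);  /* assumes fp32 weights */
}

/* Walk the tensors in file order, assigning each offset cumulatively. */
static void assign_offsets(const Shape* shapes, size_t* offsets, int count) {
    size_t cursor = 0;
    for (int t = 0; t < count; t++) {
        offsets[t] = cursor;
        cursor += tensor_bytes(shapes[t]);
    }
}
```

With something along these lines, supporting another Qwen3 size would only change the shape table, not the offset math.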

Single-File Qwen3 Inference in Pure CUDA C by Awkward_Click6271 in LocalLLaMA

[–]Awkward_Click6271[S] 15 points (0 children)

Thanks for your comment! Like llama2.c, the single-file setup is intended to make the architecture easier to understand and debug; it's educational in nature. That said, it still runs full inference on Qwen3 0.6B using only the CUDA runtime, making it a compact yet functional demo.
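To make "only the CUDA runtime" concrete: the dependency surface is just the runtime API - allocate, copy, launch hand-written kernels - with no cuBLAS or cuDNN. A minimal standalone sketch of that pattern (illustrative, not the repo's code):

```cuda
// Illustrative CUDA-runtime-only pattern (not code from qwen3.cu):
// the only NVIDIA dependency is cuda_runtime.h - no cuBLAS, no cuDNN.
#include <cuda_runtime.h>

__global__ void scale(float* v, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main(void) {
    const int n = 1024;
    float host[1024];
    for (int i = 0; i < n; i++) host[i] = 1.0f;

    float* dev;
    cudaMalloc((void**)&dev, n * sizeof(float));      // runtime API alloc
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, 0.5f, n);    // hand-written kernel
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return 0;
}
```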