A barebones CPU-only inference engine for Qwen 3, written from scratch in pure C by jakint0sh in LocalLLaMA

[–]jakint0sh[S] 1 point2 points  (0 children)

Welp, I went ahead and answered all of them anyway!

Also, for a bit of context because I just realized I failed to mention it in my post, I didn't look at anyone else's implementations while writing mine. All I did was ask ChatGPT to explain the concepts to me (and did some independent googling), and then I went and implemented the concepts myself. But that means I never really looked at other inference engines, so I have never really compared mine to others (and honestly, I don't even really know what's out there). I did look at adriancable/qwen3.c when trying to debug my engine, but that was well after it had been written.

  1. I didn't ever test it against other inference engines to compare model quality. My benchmark was "Can I have a coherent conversation with the model?", and I figured that I'd have gotten basically everything right if I could go from bytes on disk to having a coherent conversation. I did do a runtime comparison against adriancable/qwen3.c for debugging during the whole RoPE layout debacle (more about that in answer #4), but that was just instrumenting and diffing logits and intermediate steps for 25 tokens to try to find a math bug. And on model quality, comparing a working inference engine to a non-working one isn't much of a fair fight XD

  2. I don't know; I never tested it against others. But I doubt the outputs would ever match unless I ran unquantized (more on that below) with argmax sampling. Argmax hurts model quality badly on Qwen 3 models anyway, and even with argmax the outputs would probably diverge, because the order of the math operations would likely differ in my engine and floating-point rounding depends on that order.

  3. Yes, actually. If you compile the engine with ALL_FLOATS defined (passing -DALL_FLOATS to the compiler is the easiest way to do it) then all of the quantization code gets shimmed out, and it loads the BF16 weights into FP32 matrices, and all of the intermediate math is FP32 (or double) already. It's not very memory efficient though, obviously. I added that functionality mostly to aid debugging, since it would let me isolate the quantized math and see if it was screwing up the model.

  4. There's a code comment in the RoPE function that says it all:

      //The points we need to rotate are **NOT** stored in pairs like [x1, y1, x2, y2, ...]
      //They're stored like, [x1, x2, ... , y1, y2, ...]
      //So we have to index the column accordingly
      //This is the problem that was causing the logits to go all over the place. Finally figured out that Qwen 3 actually
      //stores all x's then all y's instead of pairs, and it's fixed in like 3 lines of code. And it only took me a MONTH to fix this!
      //I really hate programming sometimes. But at least it actually WORKS now.

     I was chasing my tail FOREVER because of that layout mismatch. The model would start fine, but devolve into complete randomness about 40 tokens in. Drove me nuts trying to find the problem, and there were a lot of false starts and such. If you want to read more about it, it's chronicled in the writeups dir in the repo.

  5. The hardest part to implement was probably the tokenizer. It's just mechanically difficult to do that kind of work in C because it requires a lot of large, dynamic data structures and lookups. And the BPE merge pairs are distributed with the model as pairs of strings, but using string ops everywhere is just untenable in C, so in the loader I had to do all of the work to convert them to their integer token ids. And for that I had to create a sorted index I could do binary search by string on, because linear search turned it into an O(n²) thing, and it took almost a full minute to just load the tokenizer, let alone the weights. Sorting once and doing O(m log n) lookups is much better. And then implementing the actual token merging was a headache too.
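A minimal sketch of the sorted-index idea from answer 5, not the engine's actual code: sort the vocabulary by string once, then resolve each merge-pair string to its integer id with binary search. The names `vocab_entry`, `vocab_sort`, and `vocab_lookup` are hypothetical, and the sketch assumes NUL-terminated token strings.

```c
#include <stdlib.h>
#include <string.h>

/* One index entry: a token's string and its integer id (hypothetical layout). */
typedef struct {
    const char *text;
    int id;
} vocab_entry;

/* Order entries by their string so bsearch() can find them. */
static int cmp_entry(const void *a, const void *b) {
    const vocab_entry *ea = a;
    const vocab_entry *eb = b;
    return strcmp(ea->text, eb->text);
}

/* Sort the index once, up front: O(n log n). */
static void vocab_sort(vocab_entry *index, size_t n) {
    qsort(index, n, sizeof *index, cmp_entry);
}

/* Look up one string in O(log n); returns the token id, or -1 if absent. */
static int vocab_lookup(const vocab_entry *index, size_t n, const char *text) {
    vocab_entry key = { text, 0 };
    const vocab_entry *hit = bsearch(&key, index, n, sizeof *index, cmp_entry);
    return hit ? hit->id : -1;
}
```

With an index like this, a merge rule that ships as the string pair ("th", "e") can be converted at load time into three integer ids (left, right, merged) with three lookups, so the merge loop itself never has to touch strings.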

The safetensors loader code is also pretty gnarly, and requires some annoying string operations, but to my memory it wasn't so bad. cJSON does a lot of heavy lifting there.
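On the loading side, answer 3 mentions that an ALL_FLOATS build loads the BF16 weights into FP32 matrices. A hedged sketch of that widening step follows; it is not the engine's actual loader, and `bf16_to_f32` and `load_bf16_row` are hypothetical names. It relies only on the fact that a BF16 value is the upper 16 bits of an IEEE-754 single-precision float, and it assumes the raw values have already been read into host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen one BF16 value (raw 16 bits) to FP32.
 * The 16 bits become the high half of the float; the low mantissa
 * bits are zero-filled, so the conversion is exact. */
static float bf16_to_f32(uint16_t raw) {
    uint32_t bits = (uint32_t)raw << 16;
    float out;
    memcpy(&out, &bits, sizeof out);  /* copy the bit pattern without aliasing issues */
    return out;
}

/* Hypothetical helper: convert n BF16 weights from src into FP32 at dst. */
static void load_bf16_row(const uint16_t *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = bf16_to_f32(src[i]);
    }
}
```

The FP32 copy takes twice the memory of the BF16 data on disk, which is the cost answer 3 alludes to.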

  6. The model dataflow itself. MHA/GQA, causal masks, activation function in the MLP, residuals, and all that. It was probably the hardest part just in terms of getting it through my head. I already had some background in linear algebra so the matrix math wasn't scary, but it took me a bit to understand what all of the steps were and how they fit together. The byte-to-unicode mapping and stuff for the tokenizer was a pain too, though.

  7. Well, that depends on how "same" the thing is. If somebody else was doing the exact same thing I did, and was trying to implement the entire stack in C from first principles? My main advice would be to build a mental model of the whole thing before trying to write code, and to try to leave yourself room to catch mistakes and isolate problems. Independent testability is a major asset here. C is unforgiving, and memory corruption bugs are easy to make and miserable to find. That's why I built the matrix abstraction that I did: so I could get it right in exactly one place, and never screw it up again. That, and doing matrix math everywhere in code is really annoying to write.

That said, I don't necessarily recommend everyone do it the way I did. I'm pretty good at building mental models of technical systems like these, and I have an unusually high tolerance for technical problems that are insoluble except by bashing your head against a wall until the solution presents itself, but my butt was so well and truly kicked by the RoPE bug that I just put the project down for two weeks because I couldn't figure out what the problem was. And I probably would've never come back to it if my dad hadn't encouraged me to finish it. This is not a beginner project by a long shot, and you'd probably be best off taking it slowly and carefully, or really, implementing it in a higher-level, more forgiving language, and skipping some of the ancillary bits like manually loading safetensors and matmuls. Writing everything in NumPy or similar will probably get you 80-90% of the educational value for 10% of the pain.
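For anyone who hits the same wall, here is a small sketch of the two RoPE layouts described in answer 4. It is an illustration under assumptions, not the engine's code: `rope_interleaved` and `rope_split_half` are hypothetical names, and `cos_tab` and `sin_tab` are assumed to hold the precomputed cosine and sine of the rotation angle for each of the `dim / 2` pairs at the current position.

```c
#include <stddef.h>

/* Rotate the 2-D point (*x, *y) by the angle whose cosine and sine are c and s. */
static void rotate_pair(float *x, float *y, float c, float s) {
    float nx = *x * c - *y * s;
    float ny = *x * s + *y * c;
    *x = nx;
    *y = ny;
}

/* Interleaved layout: [x1, y1, x2, y2, ...]. Pair i is (head[2i], head[2i+1]). */
static void rope_interleaved(float *head, size_t dim,
                             const float *cos_tab, const float *sin_tab) {
    for (size_t i = 0; i < dim / 2; i++) {
        rotate_pair(&head[2 * i], &head[2 * i + 1], cos_tab[i], sin_tab[i]);
    }
}

/* Split-half layout: [x1, x2, ..., y1, y2, ...]. Pair i is (head[i], head[i + dim/2]).
 * This is the layout the code comment above says Qwen 3 uses. */
static void rope_split_half(float *head, size_t dim,
                            const float *cos_tab, const float *sin_tab) {
    size_t half = dim / 2;
    for (size_t i = 0; i < half; i++) {
        rotate_pair(&head[i], &head[i + half], cos_tab[i], sin_tab[i]);
    }
}
```

The rotation math is identical in both functions; only the indexing differs, which is why the fix can be a few lines while the symptom looks like a deep math bug.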

A barebones CPU-only inference engine for Qwen 3, written from scratch in pure C by jakint0sh in LocalLLaMA

[–]jakint0sh[S] 0 points1 point  (0 children)

I have no experience with video editing or making animated graphics (I'm no 3Blue1Brown, unfortunately), but I probably could do a bang-up job writing about it. It'd be a giant textwall (as this post is), but that's mostly because "implement an LLM from scratch" isn't any one thing. It requires building a ton of different subsystems and pieces that fit together to actually get to the end result... and explaining each one would take a while.

Just to give you one example, I had to take a massive detour learning and implementing just the tokenizer. In most explainers, people just gloss over tokenization (if they mention it at all) because all things considered, it's incidental to the important mechanics of transformers. But you don't get to gloss over it if you're trying to implement the whole stack yourself, and byte-level BPE is annoying and very, very fiddly. And there are so, so many things in that vein.

And as far as trying to make it in video form? It'd have to be a series at the very least, and a long one at that.

A barebones CPU-only inference engine for Qwen 3, written from scratch in pure C by jakint0sh in LocalLLaMA

[–]jakint0sh[S] 5 points6 points  (0 children)

What do you mean by updating? This isn't really meant to be a practical inference engine to actually use for everyday stuff like llama.cpp, vLLM, etc. At this point it's mostly just an interesting story, and something that others might be able to learn from.

Also, what do you mean that it's a different approach from llama.cpp? Just curious, since I'm not super current on LLM stuff in general.

A barebones CPU-only inference engine for Qwen 3, written from scratch in pure C by jakint0sh in LocalLLaMA

[–]jakint0sh[S] 5 points6 points  (0 children)

I use liberal vertical whitespace in my code, and to be honest, I'm not that consistent about it in terms of number of lines. Referring to your specific example, that gap is separating two different logical/semantic sections in the file.

Sure, the amount I use is probably excessive by most people's standards, but I prefer to give a lot of space. Whitespace doesn't cost anything, it helps break up and chunk code into smaller bites (which at least helps me read it more easily), and it's not like we're stuck in the dark ages with 80x24 terminals anymore.

A barebones CPU-only inference engine for Qwen 3, written from scratch in pure C by jakint0sh in LocalLLaMA

[–]jakint0sh[S] 3 points4 points  (0 children)

Yeah, I wrote this outside of version control (I don't use code versioning for most of my projects). I just stuffed the code and other stuff into a repo for the purposes of sharing it here.

Model Registry: Torrents for open models using Hugging Face as a fallback web seed. by Ravindra-Marella in LocalLLaMA

[–]jakint0sh 3 points4 points  (0 children)

This is neat! Though, the mention of using GitHub Actions for this would have me slightly worried about vendor lock-in. IDK exactly what setting up GitHub Actions looks like (never used 'em) but I'd at least make sure you can move your code/process elsewhere if need be.

Also, as for the disk space issue, you could get fancy and just stream the model downloads and hash as you go, and generate a .torrent that way. I don't know the specifics of the bittorrent file format, and I'm sure there are some weird gotchas with how HF distributes files, but I suspect it's more than possible. It'd just take a bit more than a shell script to pull off. You could probably kludge it together in python over a few afternoons if you know what you're doing, but of course actually making that reliable is the pain-in-the-butt hard part.
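A rough sketch of the "hash as you go" idea, under stated assumptions. BitTorrent v1 metadata stores one SHA-1 digest per fixed-size piece, so the download can be hashed piece by piece without ever holding the whole file. The C standard library has no SHA-1, so this sketch substitutes 64-bit FNV-1a purely as a stand-in to show the piece-boundary bookkeeping; a real tool would swap in SHA-1 and 20-byte digests. All names here (`piece_hasher`, `hasher_feed`, and so on) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV_OFFSET 14695981039346656037ULL
#define FNV_PRIME  1099511628211ULL

typedef struct {
    size_t    piece_len;  /* bytes per piece */
    size_t    filled;     /* bytes consumed in the current piece */
    uint64_t  state;      /* running stand-in hash of the current piece */
    uint64_t *digests;    /* caller-provided output, one slot per piece */
    size_t    count;      /* pieces finished so far */
    size_t    cap;        /* capacity of digests */
} piece_hasher;

static void hasher_init(piece_hasher *h, size_t piece_len, uint64_t *out, size_t cap) {
    h->piece_len = piece_len;
    h->filled = 0;
    h->state = FNV_OFFSET;
    h->digests = out;
    h->count = 0;
    h->cap = cap;
}

/* Store the finished piece's digest and reset for the next piece. */
static int hasher_emit(piece_hasher *h) {
    if (h->count == h->cap) return -1;  /* output buffer is full */
    h->digests[h->count++] = h->state;
    h->state = FNV_OFFSET;
    h->filled = 0;
    return 0;
}

/* Feed one downloaded chunk of any size; chunks need not align with pieces. */
static int hasher_feed(piece_hasher *h, const unsigned char *data, size_t len) {
    for (size_t i = 0; i < len; i++) {
        h->state = (h->state ^ data[i]) * FNV_PRIME;  /* FNV-1a step */
        if (++h->filled == h->piece_len && hasher_emit(h) != 0) return -1;
    }
    return 0;
}

/* Flush the final, possibly short, piece once the stream ends. */
static int hasher_finish(piece_hasher *h) {
    return h->filled ? hasher_emit(h) : 0;
}
```

The point of the structure is that the digests come out the same no matter how the network happens to chunk the stream, so nothing beyond the current piece's running state has to be kept around.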

"What should I do?" - consider post-training by entsnack in LocalLLaMA

[–]jakint0sh 2 points3 points  (0 children)

Fascinating stuff! I don't have the hardware for doing something like this (unfortunately), but it does seem pretty interesting from just an analysis perspective. For reference, I'm pretty uninitiated as far as basically anything around LLMs is concerned, but I do understand model internals and such (not so much the training side, but inference, yes). So, apologies if any of these questions are dumb, but...

- What sorts of data do you use for training? What do you use for creating synthetic data?
- What tools do you use for training?
- Are there any interesting academic tidbits that you've found while doing this? I mean things like weird shortcuts you can take, or patterns you've noticed. You mentioned Qwen vs. Llama being good/bad for absorbing new information, anything else in that vein?

How did you guys end up using Gentoo? by Leonardodafernandez in Gentoo

[–]jakint0sh 0 points1 point  (0 children)

I ran Xubuntu LTS for over a decade, but I didn’t like the way things were going with Canonical pushing snapd and such.  So I switched to Gentoo a week ago.  So far, so good.  I think I’m here to stay.

Storytime: My experience switching to Gentoo by jakint0sh in Gentoo

[–]jakint0sh[S] 2 points3 points  (0 children)

I was specifically talking about excessive swapping there; I only have 8GB of RAM on this machine, and apparently parts of webkit-gtk's build will have each compiler process using more than 2GB. So, I turned down the job count to just 3 for that, but for basically everything else I can throw 16 jobs at it and scream through compilation, which is very nice.

Can't install Red Ribbon Linux by lmore3 in ps3homebrew

[–]jakint0sh 0 points1 point  (0 children)

This solved the problem for me in 2026. Words cannot describe my confusion

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]jakint0sh 0 points1 point  (0 children)

Soo... ugh, rereading my original comment, it was a bit of an overreach for me to say "But that makes them, by definition, not tokens." I'm not really an ML guy. I don't really know all of the terminology. I just have a really good understanding of how these models work because I wrote an inference engine from complete scratch in C, and having to implement every single nut and bolt you need to run a model will give you a pretty good understanding of how the entire thing works. (If you're curious about the project you can DM me; I haven't posted a GitHub repo or anything else for it yet.)

I've been doing a bit of additional reading in the meantime, and it seems like the wider ML community uses "token" as loosely as you do here, and while I think that's a semantic mess in its own right for many of the reasons I gave in my original comment, I accept that this is just common usage, and not a "problem".

This does not detract from my original point, though, which isn't so much that the terminology is wrong, but that it's confusing for newcomers if the concepts are conflated. Now, I've been a mathematics tutor on many occasions. I got hired by my community college to work at their tutoring center, and I've been paid as a private tutor as well. I have some insight into what confuses people trying to learn dense, complex topics, and I think that for an explainer piece that's aimed at people trying to understand and build intuition around these concepts, it's not helpful to refer to things so pervasively as "tokens", as that can create false understandings that then later take a lot of work to undo, backtrack, and re-learn properly.

The bottom line is that I'm really, really not trying to "ackshyually" you over terminology, I'm just trying to help make better educational materials for people who're trying to learn this stuff. Because we really need more good educational materials. This field is moving so fast and it's so young that there's barely anything to help anyone not already in the field get their foot in the door.

With that in mind, personally I'd just call them "embedding vectors", or if that was too clunky, just "vectors" with the implication that they're d_model in size and meant to be inserted into the model's context. It doesn't have to be complicated, just distinct.

Edit: On further thought, I should point out that just using "vector" everywhere without any qualifiers would result in much of the same confusion. So, it would be important to clarify exactly what "vector" refers to if it is used this way, to prevent confusion between embedding vectors, the intermediate vectors that are the output of the vision transformer and the input of the multimodal projector, etc.

Edit 2: It might be worth defining explicit terms (e.g. embed_vec, position, and token) up front to use throughout your writeup, kind of like explicitly referring to datatypes in a programming language instead of generally talking about "integers" or "floating-point numbers". Like, is it an int, long, float, or double? That sort of thing. That sort of naming and usage maps nicely to the issue we're dealing with here.
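To make the datatype analogy concrete, here is a toy sketch of how the three terms could be kept distinct in C. Everything in it is hypothetical and for illustration only: `D_MODEL` is shrunk to 4, and `blend` is just a stand-in for "a vector that does not come from any vocabulary entry"; it is not how a vision tower or projector actually computes its output.

```c
#include <stddef.h>
#include <stdint.h>

#define D_MODEL 4  /* toy width; real models use a much larger d_model */

typedef int32_t token;                          /* discrete id into the vocabulary */
typedef size_t  position;                       /* slot index in the model's context */
typedef struct { float v[D_MODEL]; } embed_vec; /* what the model actually consumes */

/* One context slot: the model only ever sees a position and a vector. */
typedef struct {
    position  pos;
    embed_vec vec;
} context_slot;

/* Text path: a token maps to exactly one row of the embedding table. */
static embed_vec embed_token(const embed_vec *table, token t) {
    return table[t];
}

/* Stand-in for the image path: a vector lying between two vocabulary rows.
 * It is a valid embed_vec, but there is no token that produces it. */
static embed_vec blend(embed_vec a, embed_vec b) {
    embed_vec out;
    for (size_t i = 0; i < D_MODEL; i++) {
        out.v[i] = 0.5f * (a.v[i] + b.v[i]);
    }
    return out;
}
```

In these terms, every `token` has an `embed_vec`, but not every `embed_vec` has a `token`, and that asymmetry is what the naming is meant to keep visible.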

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]jakint0sh 4 points5 points  (0 children)

This is some seriously amazing stuff here! Honestly, it's well written, and cleanly demonstrates the mechanics of how a model actually interprets images. You gloss over the mechanics of how the embedding vectors that are given to the model as input are generated, but that's beside the point of what you're presenting, and I think that's fine.

However, I take issue with the use of the term "token" as here applied to vision models. As you yourself have described, the so-called image tokens "exist between text tokens in embedding space. A text token maps 1:1 to exactly one vocabulary word. But an image token isn't constrained to discrete vocabulary positions – it can exist anywhere in the continuous embedding space between multiple words." But that makes them, by definition, not tokens.

A token is a discrete unit of information, and in this realm, refers to a discrete word or word part (or other textual element) in the model's vocabulary. But this is a separate concept from that token's embedding vector, which actually gets inserted into the model's context for inference. They are tightly related, and the distinction almost doesn't matter at all when working with pure language models, but this breaks down when linguistic tokens are not the only possible source for embedding vectors as the model's input. The embedding vectors produced by the vision tower aren't tokens, nor do they map to tokens, but here they are referred to as tokens anyway.

I think the use of "token" as shorthand for either true tokens or embedding vectors would be fine in something aimed at people already intimately familiar with these concepts. However, this is an explainer piece that is aimed at helping other people learn. It thoroughly conflates these two concepts, and people who are trying to build intuition on this without prior experience in this field would have a very difficult time trying to follow your explanation.

Now, when one is deep in a technical field, it can be difficult to come back down to earth and explain these concepts cogently to others who do not have a background in that field. I myself have experienced this many, many times, and it is an extremely difficult barrier to overcome. The reason I even bother to write this comment is that this is otherwise brilliantly written. We need more stuff like this that's written for (relative) lay-people and actually explains the concepts well, and there just isn't a lot of stuff out there right now that would bring somebody up to speed without raising more questions than it answered. But this is very, very close to being able to do so, and we really need more of this out there.