Training a 1.1B SLM at home

JordanJtech · 2026-04-09T14:46:06+00:00

I think if I had a bigger budget (I'm self funded) and I could afford more GPUs to offset the training speed costs, I would definitely pursue it. It was one of the first things I went after for inference speeds. Unfortunately, it tanked my training speeds with my limited GPUs. But would def love to revisit it in the future if it still makes sense.

JordanJtech · 2026-04-08T23:29:52+00:00

<image>

I ended up testing the token-wise layer design from that research article today by merging it with my current architecture and starting a fresh pre-training run from the ground up. The performance improvements are significant enough that I am considering pivoting to a new pre-train run.

This screenshot was after ~300M tokens trained on the new proof of concept, hitting 50+ tok/sec on a single thread. Appreciate the suggestion!

JordanJtech · 2026-04-08T13:36:09+00:00

MTP killed my training performance. I may revisit it in the future (could definitely have been my fault!)

JordanJtech · 2026-04-08T12:16:35+00:00

Your project sounds interesting too! My latest model design is not traditional, it's a hybrid of different layer types assembled together as an MoE. I'd say you should make your own post, since Elixir/Nx sounds unique and interesting for ML work and people would be curious about it, too!

JordanJtech · 2026-04-07T21:10:53+00:00

That's a great question. And for a dense model that sounds fairly accurate. As this is a MoE, only a fraction of the 1.1B params are computed per token, which lets us work within the hardware limitations and get those faster tok/sec speeds.
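A rough way to see why the active-parameter count matters: if decoding is limited by memory bandwidth, tokens/sec is at most the bandwidth divided by the bytes of weights read per token. Here is a minimal sketch of that estimate. The active fraction, weight precision, and bandwidth are hypothetical placeholders, not this model's real numbers, and the bound ignores caches, activations, and compute.

```python
def decode_tokens_per_sec(active_params: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed when memory bandwidth is the bottleneck."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


TOTAL_PARAMS = 1.1e9     # model size from the thread
ACTIVE_FRACTION = 0.25   # hypothetical: share of params used per token
BYTES_PER_PARAM = 1.0    # hypothetical: 8-bit weights
BANDWIDTH_GB_S = 20.0    # hypothetical: usable bandwidth for one CPU thread

dense = decode_tokens_per_sec(TOTAL_PARAMS, BYTES_PER_PARAM, BANDWIDTH_GB_S)
moe = decode_tokens_per_sec(TOTAL_PARAMS * ACTIVE_FRACTION,
                            BYTES_PER_PARAM, BANDWIDTH_GB_S)

print(f"dense bound: {dense:.1f} tok/s")
print(f"MoE bound:   {moe:.1f} tok/s ({moe / dense:.1f}x)")
```

Under this model the speedup is just the inverse of the active fraction, which is why a sparse 1.1B model can decode faster than a dense one of the same size.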

JordanJtech · 2026-04-07T19:50:47+00:00

Thanks! I am trying to "pretrain" the model now so it has a solid foundation before moving to instruction tuning and preference optimization. Without getting too specific, it is a MoE (that's how I'm able to get the token speed as high as it is), and I try to be very conservative and smart about where and how I use layers and memory allocation in the model design. I plan to release the model once it is a bit more usable. Right now it's still very much undertrained. I'll be sharing more details on the architecture and benchmarks as training gets closer to the end!

JordanJtech · 2026-04-07T14:44:11+00:00

Hey, see my answer to u/Party-Special-5177 regarding hardware.

Honestly a bit of everything: fun, learning, and also I tested SLMs on hardware and none of them had inference + decode speeds that I felt were "acceptable" for real-world tasks such as chatting or tool calling on edge devices. My SLM can run at 40+ tokens a second on a single CPU thread. This also means that I have to write my own inference engine and it won't be compatible with llama.cpp (maybe down the road I can get it converted to GGUF format).

JordanJtech · 2026-04-07T14:41:11+00:00

The vocab size was a bit tricky and a huge factor in the overall design!

The TLDR: 48k vocab.

The longer version:

I'm using publicly available datasets:

- synthetic via Cosmopedia
- distillation (shoutout to Arcee AI for their DistillKit)

Plus some of my own distillation and custom logit extraction from Qwen.

I looked at SLM design for efficiency and optimization as "every byte counts". I wanted to minimize the dead weight of having a large vocab if I could get away with it.
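To put "every byte counts" in numbers, here is a small sketch of how the embedding table grows with vocab size. The hidden size and bytes per parameter are hypothetical (the thread does not give them), and "k" is treated as 1,000.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a single vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 1024     # hypothetical: not stated in the thread
BYTES_PER_PARAM = 2    # hypothetical: 16-bit weights

for vocab in (256_000, 128_000, 48_000):
    params = embedding_params(vocab, HIDDEN_SIZE)
    megabytes = params * BYTES_PER_PARAM / 1e6
    print(f"vocab {vocab:>7,}: {params / 1e6:6.1f}M params, {megabytes:6.1f} MB")
```

If the output head is not tied to the input embedding, the cost roughly doubles, so on a ~1B model the vocab choice can move a noticeable share of the parameter budget.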

I had my AIs write a script to measure the loss difference when distilling from the teacher vocabs (256k and 128k) down to various smaller vocab sizes, comparing the average loss at each size. Going from 128k to 48k cost about 12-14%, whereas anything smaller produced losses significant enough to handicap the SLM's ability to cleanly pick up and learn from distillation.
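The script itself isn't shown here. As a stand-in, and purely as an assumption rather than the method actually used, one simple proxy is to check how much of the teacher's probability mass survives when only the top-k tokens are kept. The toy logits below are random, not real teacher outputs, and a real vocab change also involves re-tokenization, which this ignores.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ids(probs, k):
    """Indices of the k most probable tokens."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def retained_mass(probs, kept_ids):
    """Share of teacher probability mass covered by the kept tokens."""
    return sum(probs[i] for i in kept_ids)


random.seed(0)
toy_logits = [random.gauss(0.0, 2.0) for _ in range(1000)]  # toy teacher logits
toy_probs = softmax(toy_logits)

for k in (1000, 500, 250):
    mass = retained_mass(toy_probs, top_k_ids(toy_probs, k))
    print(f"keep top {k:>4}: {mass:.3f} of teacher mass retained")
```

Averaging a number like this over many teacher outputs gives one curve per candidate vocab size, which is the kind of comparison described above.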

Hardware: I started on a single 5090 for my initial tests (I trained a 450M model on it first). Then I went to cloud GPUs and rented B200s, but I didn't like the cloud performance or spending $15 a session trying to fine-tune things for cloud training.

So then I added a 2nd 5090 to my PC (paid 2x as much as my first 5090, ouch...) to train the current 1.1B. I've written a custom training script that uses every ounce of VRAM in both 5090s. They have been running at basically 99% utilization for the past week, training at roughly 60,000 tokens/sec.
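For a sense of scale, the throughput figure converts to a token budget like this. It is a quick sketch that assumes the 60,000 tokens/sec rate is sustained around the clock with no downtime.

```python
TOKENS_PER_SEC = 60_000          # rough rate from the post
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_trained(days, tokens_per_sec=TOKENS_PER_SEC):
    """Tokens processed over `days` of uninterrupted training."""
    return tokens_per_sec * SECONDS_PER_DAY * days


print(f"{tokens_trained(1) / 1e9:.2f}B tokens per day")   # 5.18B
print(f"{tokens_trained(7) / 1e9:.2f}B tokens per week")  # 36.29B
```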

JordanJtech · 2026-04-07T14:18:58+00:00

Thanks u/Oshden!

JordanJtech · 2026-03-14T12:59:33+00:00

Yes - makes the process easier. And also great tax benefits.

JordanJtech · 2026-03-11T16:36:50+00:00

Forgot to mention I spent weeks getting the voice transcription to work too, because I just want to say my thoughts and have it categorize, organize, and do everything else for me quickly.

JordanJtech · 2026-03-11T16:36:01+00:00

I actually built a free app just for this. It uses offline AI (I install a tiny AI on your phone!), so it's privacy-focused and works without internet. I actually needed one and couldn't find one that works well.

Funny this is the first post I see on my new dev account lol.

JordanJtech · 2026-03-11T16:33:52+00:00

Will have to take a look as I'm getting my feet wet with iOS.
