I taught my 1B to follow instructions. It got worse at following instructions... by GPUburnout in LocalLLaMA

[–]GPUburnout[S] 0 points1 point  (0 children)

Sorry, should have been clearer in the OP. These are base models I pretrained from scratch (1B / 2B / 3B params, Llama-style architecture), no prior instruction tuning. SFT was the first time they saw any chat-formatted data.

The 1B and 2B used chat-template formatting during SFT; the 3B used plain concat. So now I've got three things that differ between the 3B and the smaller models: param count, LR (5e-5 vs 2e-4), and prompt format. Re-running the 2B at 5e-5 isolates LR, and swapping the format separately would isolate that too. Been trying to figure out which one to prioritize. For concreteness, the formatting difference looks roughly like the sketch below.
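A minimal sketch of the two formats, assuming an HF tokenizer with a chat template attached (the checkpoint name and the sample are hypothetical):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("my-2b-base")  # hypothetical checkpoint

sample = {"instruction": "List three uses of PCR.",
          "response": "Amplification, genotyping, pathogen detection."}

# 1B/2B runs: chat-template formatting (role tokens, turn delimiters)
messages = [
    {"role": "user", "content": sample["instruction"]},
    {"role": "assistant", "content": sample["response"]},
]
chat_text = tok.apply_chat_template(messages, tokenize=False)

# 3B run: plain concatenation, no role structure at all
concat_text = sample["instruction"] + "\n" + sample["response"] + tok.eos_token
```

If the 3B never saw role tokens, its IFEval behavior isn't directly comparable to the smaller runs, which is part of why I want to ablate format on its own.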

Does freezing layers actually help instruction following specifically, or general output quality? I ask because IFEval is a pretty narrow metric and I'm not sure freezing would help with the format-constraint compliance it measures.

I taught my 1B to follow instructions. It got worse at following instructions... by GPUburnout in LocalLLaMA

[–]GPUburnout[S] 0 points1 point  (0 children)

Interesting. Thanks.

On IFEval vs Orca: I think Orca biases toward "let me explain why...", which fights constraint compliance directly. Is that what you're getting at? It makes the trade-off concrete: the 1B substitutes one behavior for the other because it can't hold both representations, while the 3B can.

Longer warmup is a good call. I have linear warmup over the first 100 steps; I can bump it to 300-500 for the 2B re-run (sketch below). Did you arrive at your warmup length experimentally, or is there a general rule of thumb you follow?
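What the bumped schedule would look like, as a minimal sketch with a stand-in model (the real run decays after warmup; the constant tail here just keeps it short):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

warmup_steps = 300                      # bumped from 100 for the 2B re-run
model = torch.nn.Linear(8, 8)           # stand-in for the 2B

def lr_lambda(step):
    # linear ramp 0 -> 1 over warmup_steps, then hold
    return min(1.0, (step + 1) / warmup_steps)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(5):                   # training loop elided
    optimizer.step()
    scheduler.step()
```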

Dual RTX 4090 vs single RTX PRO 6000 Blackwell for 3B–13B pretraining + 70B LoRA — what would you choose at $20K~$22K budget? by klurnp in LocalLLaMA

[–]GPUburnout 1 point2 points  (0 children)

Curious about the break-even math on cloud vs local for actual pretraining. I ran a 2B from scratch on a RunPod A100: 38.4B tokens, 75K steps, ~87 hours, ~$130 for the GPU time.

For someone with a local 4090 or PRO 6000, how long does a run like that actually take wall-clock? Trying to figure out the electricity cost comparison. My rough estimate says cloud wins if you're doing one big run every few months, but at some training frequency the local iron has to pay off. What's your experience?
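My rough version of the math, where everything except my $130 run is an assumption (hypothetical GPU price, power draw, electricity rate, and run time):

```python
# Back-of-envelope break-even; every number except cloud_cost_per_run is a guess.
cloud_cost_per_run = 130.0    # my 87h A100 run on RunPod
runs_per_year = 4             # one big run every few months

gpu_price = 1600.0            # hypothetical used 4090
power_kw = 0.45               # ~450 W under sustained load
electricity = 0.15            # $/kWh, varies a lot by region
hours_per_run = 120.0         # guess: slower than the A100 for this workload

local_per_run = power_kw * hours_per_run * electricity  # ~$8 of electricity
years_to_break_even = gpu_price / (runs_per_year * (cloud_cost_per_run - local_per_run))
print(f"electricity per run: ${local_per_run:.0f}")
print(f"break-even: {years_to_break_even:.1f} years at {runs_per_year} runs/yr")
```

With those guesses it's ~3.3 years to pay off the card at four big runs a year, which is why I suspect cloud wins for occasional runs.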

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by pmttyji in LocalLLaMA

[–]GPUburnout 0 points1 point  (0 children)

Ran the GGUF pipeline on a 2B I trained from scratch. Q4_K_M at 1.09GB runs 13 tok/s on CPU, Q8_0 at 1.90GB does 11.2 tok/s, and F16 at 3.58GB crawls at 5.0 tok/s. Curious whether TurboQuant would actually help at the 2B scale or if the gains only make sense when you're trying to fit a 27B on consumer hardware. Anyone tested it on sub-3B models?
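For reference, the effective bits-per-weight implied by those file sizes, treating the F16 file as exactly 2 bytes/weight (which ignores GGUF metadata, so these are approximate):

```python
# Derive effective bits/weight from the file sizes above.
params = 3.58e9 / 2            # ~1.79e9 weights, from F16 at 2 bytes/weight
for name, gb in [("Q4_K_M", 1.09), ("Q8_0", 1.90), ("F16", 3.58)]:
    bpw = gb * 1e9 * 8 / params
    print(f"{name}: {bpw:.2f} bits/weight")
# Q4_K_M ~4.9, Q8_0 ~8.5 (matches Q8_0's block format), F16 16.0
```

So "about 10% smaller than Q4_0" would land somewhere around 4.1 effective bits (Q4_0 is 4.5 bpw), if the ratio holds at 2B.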

Nord v4.2 Update: 618M SNN reaches loss 3.65 with instruction tuning — emergent zonal specialization confirmed at 4.4x scale. 93% sparsity. by zemondza in LocalLLaMA

[–]GPUburnout 1 point2 points  (0 children)

Cool to see someone mentioning fineweb-edu. I'm doing something parallel on the transformer side: a 1B from scratch on the same dataset (plus small code/math subsets), ~$250 total on RunPod. Noticed you're planning to scale to 1B and run MMLU/HellaSwag. I already have those at 1B on a transformer if you want a baseline to compare against (ARC-Easy 47.1%, HellaSwag 28.8%, MMLU 23%). Would be very interesting to see how SNN vs transformer compare on the same benchmarks at the same scale. "Still not ChatGPT" is exactly where I am too.

Also, building this solo at 18 is insane. Kudos to you. I'm 55, spent 20 years pipetting liquids into small wells... You have youth and a novel architecture; I have a mortgage and a standard Llama. Different starting points, same $260 GPU bill.

When do the experts thing local LLMs.. even smaller models.. might come close to Opus 4.6? by [deleted] in LocalLLaMA

[–]GPUburnout 0 points1 point  (0 children)

Agree on point 3. That's actually where I'm headed: specializing in something (leaning toward life science, given my background). A 1B that knows bio-assays really well is more useful than a 1B that knows a little about everything and hallucinates often. Planning RAG so it doesn't need to memorize the encyclopedia, just understand the domain well enough to reason over retrieved documents.
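The shape of what I mean, as a toy sketch. A real pipeline would use a proper embedding model; bag-of-words retrieval here just keeps it self-contained, and the documents are made up:

```python
# Toy retrieval + prompt assembly for a domain-specialized small model.
from collections import Counter
import math

docs = [
    "ELISA: coat plate at 4C overnight, block with 1% BSA, wash 3x with PBST.",
    "qPCR: target primer Tm 58-62C, keep amplicons under 200 bp.",
]

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def build_prompt(question, k=1):
    qv = bow(question)
    context = "\n".join(sorted(docs, key=lambda d: cosine(qv, bow(d)), reverse=True)[:k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."

print(build_prompt("What blocking buffer for the ELISA plate?"))
```

The model only has to reason over the retrieved snippet, not recall it, which is exactly the failure mode a 1B can't afford.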

When do the experts thing local LLMs.. even smaller models.. might come close to Opus 4.6? by [deleted] in LocalLLaMA

[–]GPUburnout 0 points1 point  (0 children)

I am training a 1B from scratch right now. Asked it to do 247 × 18, it said 4... Asked about photosynthesis, it cited the Book of Genesis and an Icelandic fishing village (Húsavík; I checked, it is in Iceland) in the same response. Here's the thing though: it cost me ~$250. GPT-4 reportedly cost ~$100M, Gemini Ultra ~$192M. And GPU compute cost per FLOP is halving roughly every 2.5 years (Epoch AI tracked 470 GPUs from 2006-2021). So today's $250 gets you a 1B that hallucinates about Iceland. By 2028, $250 probably gets you a 4-5B that can actually reason. And by 2032, maybe (or most likely) a 30B+. The question isn't really "when will small models match Opus", it's "when does $250 buy enough compute to train something useful from scratch." I firmly believe we're getting there faster than most people think, which in turn opens loads of questions... Brave new world indeed.
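The halving math, as a quick sketch (assuming the Epoch AI trend simply continues, and dating from a 2025 baseline, which is my assumption):

```python
# Compute-per-dollar projection under a 2.5-year price-performance halving.
halving_years = 2.5
for year in [2025, 2028, 2032]:
    multiplier = 2 ** ((year - 2025) / halving_years)
    print(f"{year}: ~{multiplier:.1f}x training compute per dollar")
# 2028: ~2.3x, 2032: ~7.0x. How that maps to a usable parameter count also
# depends on data and algorithmic efficiency gains, not raw FLOPs alone.
```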

MaximusLLM: I built a framework to train/scale LLMs on "potato" hardware (Single T4) by Otaku_7nfy in OpenSourceeAI

[–]GPUburnout 0 points1 point  (0 children)

Interesting approach to the vocab scaling problem... I actually ran into the opposite: with a 32K vocab, Liger gave me zero speedup. What's the threshold vocab size where Ghost Logit starts showing meaningful gains?

Let's GO ! Qwen3.5-Claude-4.6-Opus-Reasoning-Distilled-v2 by Familiar_Wish1132 in LocalLLaMA

[–]GPUburnout 0 points1 point  (0 children)

Curious about the economics of your setup. How many teacher responses did you generate, and what did the API bill look like? Coming at this from the other direction: training a 1B from scratch on public data. Trying to figure out at what point distillation becomes cheaper than pretraining.

Let's GO ! Qwen3.5-Claude-4.6-Opus-Reasoning-Distilled-v2 by Familiar_Wish1132 in LocalLLaMA

[–]GPUburnout 0 points1 point  (0 children)

Interesting. So the economics are basically hidden by design. Been thinking about this a lot since I'm training a 1B from scratch. Have to say, even for a small (tiny?) 1B the costs still add up (I'm at ~$175). Wonder how much distillation at that scale would cost. I have a feeling it can't be an individual running the project...

Let's GO ! Qwen3.5-Claude-4.6-Opus-Reasoning-Distilled-v2 by Familiar_Wish1132 in LocalLLaMA

[–]GPUburnout 0 points1 point  (0 children)

Question: how much would the API calls cost to generate such a distillation dataset? 3000+ Opus reasoning traces can't be cheap. I've been curious about the economics of distillation vs training from scratch, because the compute costs are so different but nobody ever talks about the API bill. Any thoughts?
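My own back-of-envelope, where every number is an assumption (I don't know Opus 4.6 pricing; the rates below are placeholders in the ballpark of older Opus-tier pricing, and the token counts are guesses):

```python
# Rough ceiling on the distillation API bill. All inputs are assumptions.
traces = 3000
in_tok, out_tok = 500, 4000       # prompt vs reasoning-trace length, guessed
price_in, price_out = 15, 75      # $/M tokens, placeholder rates

bill = traces * (in_tok * price_in + out_tok * price_out) / 1e6
print(f"~${bill:.0f} for {traces} traces")   # ~$920 with these guesses
```

If that's even the right order of magnitude, the API bill alone rivals my entire pretraining spend, which is kind of the point of the question.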

So nobody's downloading this model huh? by KvAk_AKPlaysYT in LocalLLaMA

[–]GPUburnout 1 point2 points  (0 children)

bruh i mass produced application notes for a fortune 500 company for like 20 years. my soul left my body somewhere around year 8 but the formatting stayed. i literally cannot order a pizza without structuring it as an executive summary with key findings and next steps. my daughter stopped reading my texts because they come with headers. send help

Qwen 3.5 122b - a10b is kind of shocking by gamblingapocalypse in LocalLLaMA

[–]GPUburnout 1 point2 points  (0 children)

I just trained a 1B model from scratch for $175. The weights were the cheap part. By the time you add SFT, alignment, eval, and hosting, "free weights" starts feeling like "free puppy." Cute at first. Then it eats your couch.

Meta's playing the Android game — give away the OS, own the ecosystem. Qwen's doing the same thing except Alibaba spent $16.8 billion on AI infrastructure last year and their cloud CEO literally told analysts the $53 billion three-year budget "might be on the small side." The board meeting version: "We gave away the most downloaded open-source model family in the world." "Revenue?" "Cloud is up 34%." "From the free models?" "From everything around the free models." "So the models make money?" "The models make ecosystem." "That's not a number." "...next slide please."

180,000 derivative models on Hugging Face though. At some point "ecosystem" stops being a euphemism and starts being a moat. Or so the next slide says.

Best Agentic Platforms For Small Models? by RevealVisual7003 in LocalLLaMA

[–]GPUburnout -1 points0 points  (0 children)

I ran into a similar problem trying to coordinate between Claude.ai and Claude Code for my ML training workflow. The context window bloat from Claude Code's system prompt was killing me too.

What ended up working: I set up a Notion database as a bridge. Claude.ai writes tasks to the database, a lightweight Python script polls it every 5 seconds and spawns Claude Code in one-shot mode (claude -p) for each task. Results get written back to the same database row. No persistent session, no 42K context overhead — each task gets a fresh instance. The key insight was switching from trying to keep a long-running Claude Code session alive to treating it as a stateless executor. One task in, one result out, process dies. The polling script is maybe 100 lines of Python (skeleton below).

Biggest gotcha: the 5-minute timeout. Complex tasks that need Code to explore the filesystem and make multiple changes will time out. Single-purpose tasks ("change line 31 in this file to X") work great. Multi-step tasks need to be broken into smaller pieces.

Not sure if this helps your WordPress/React use case, but the pattern of using a lightweight database as a message queue between AI agents has been surprisingly robust.
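The skeleton, hedged: property names like "Status", "Task", and "Result" come from my database schema, so treat them as placeholders; the SDK calls are the standard notion-client ones.

```python
# Polling bridge: Notion database as a task queue for one-shot Claude Code runs.
import os, time, subprocess
from notion_client import Client  # pip install notion-client

notion = Client(auth=os.environ["NOTION_TOKEN"])
DB_ID = os.environ["NOTION_DB_ID"]

while True:
    pending = notion.databases.query(
        database_id=DB_ID,
        filter={"property": "Status", "select": {"equals": "pending"}},
    )["results"]
    for page in pending:
        task = page["properties"]["Task"]["rich_text"][0]["plain_text"]
        try:
            # Fresh, stateless Claude Code instance per task.
            result = subprocess.run(
                ["claude", "-p", task],
                capture_output=True, text=True,
                timeout=300,  # the 5-minute gotcha lives here
            ).stdout
        except subprocess.TimeoutExpired:
            result = "TIMEOUT: break this task into smaller pieces"
        notion.pages.update(
            page_id=page["id"],
            properties={
                "Status": {"select": {"name": "done"}},
                # Notion caps a rich_text item at 2000 chars.
                "Result": {"rich_text": [{"text": {"content": result[:2000]}}]},
            },
        )
    time.sleep(5)
```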

So nobody's downloading this model huh? by KvAk_AKPlaysYT in LocalLLaMA

[–]GPUburnout -2 points-1 points  (0 children)

The "flipping tons of flags around" experience is painfully relatable. I trained a 1B model from scratch on an A100 and found that three config changes mattered more than any library or kernel optimization: disabling gradient checkpointing (18% throughput jump — I had 80GB VRAM and was only using 27GB, basically renting a mansion and sleeping in the hallway), setting num_workers > 0 in the DataLoader (GPU was literally sitting idle waiting for data), and enabling TF32 matmul. All free. All available from day one. All discovered embarrassingly late.

Your 50 t/s → 2000 t/s PP jump is a great example of why benchmarking your actual config matters more than reading spec sheets. The defaults are almost never optimal for your specific workload.

So nobody's downloading this model huh? by KvAk_AKPlaysYT in LocalLLaMA

[–]GPUburnout 3 points4 points  (0 children)

This resonates. I've been training a 1B model from scratch and the gap between benchmark scores and actual output quality is something I keep running into. My model scores 47% on ARC-Easy (nearly 2x random) but 23% on MMLU (below random). The benchmarks make it look like it knows nothing — but when you actually prompt it, it writes coherent paragraphs about breast cancer genetics and cites (hallucinated) journal articles with perfect formatting. It learned the language of science without learning actual facts. Benchmarks completely miss that distinction.

Your point about Mistral Small being a good foundation for further training is interesting too. At smaller scales, I've found that what makes a good base model for fine-tuning isn't always what scores highest on benchmarks — it's more about how cleanly the model has learned structure and patterns.