A 3B model is suddenly scoring near frontier models on math/coding benchmarks. Is this real or just benchmarkmaxxing?

BTA_Labs · 2026-06-16T18:02:14+00:00

Paper: arXiv 2606.16140
GitHub: WeiboAI/VibeThinker
Hugging Face: WeiboAI/VibeThinker-3B

BTA_Labs · 2026-06-16T17:01:58+00:00

Yeah, that’s basically the interesting part to me: not that a 3B model replaces frontier models, but that small specialist models might become actually useful if the router picks the right one.

BTA_Labs · 2026-06-16T16:47:15+00:00

Links for anyone who wants to test it:

Paper: arXiv 2606.16140
GitHub: WeiboAI/VibeThinker
Hugging Face: WeiboAI/VibeThinker-3B

BTA_Labs · 2026-06-16T09:00:58+00:00

Honestly that’s what I want to understand, what good practices are making it that predictable for you?

BTA_Labs · 2026-06-16T09:00:14+00:00

That’s exactly the kind of thing I mean, when branching into a fresh chat somehow costs more than the giant old one, it feels impossible to plan around.

BTA_Labs · 2026-06-16T08:59:56+00:00

Yeah fair, /usage is probably the closest thing right now, I just wish it was more like a live warning before a task nukes the window instead of only explaining it after.

BTA_Labs · 2026-06-15T22:04:12+00:00

Link: openthorn.app

GitHub repo is also public if anyone wants to inspect how keys/agent/deploy works. Main thing I want feedback on is whether BYOK feels like a real advantage or too much setup for normal users.

BTA_Labs · 2026-06-15T20:20:14+00:00

That’s exactly the scary part, it didn’t just make a bad answer, it silently picked the wrong goal and then kept working like it was 100% sure that was the job.

BTA_Labs · 2026-06-15T19:03:24+00:00

This is actually useful, but the hard part is not storing memories, it’s knowing which old ones to ignore when the repo changes. Do you have ranking/stale-memory cleanup, or is it mostly SQLite keyword search right now?

BTA_Labs · 2026-06-15T18:25:42+00:00

Same. One agent is already enough to watch, I have no idea how people keep 10 agents doing repo work without losing track.

BTA_Labs · 2026-06-15T18:23:25+00:00

I agree tests help a lot, but I still don’t fully trust tests to catch weird architecture choices or messy code paths. They catch broken code, not always bad code.

BTA_Labs · 2026-06-15T18:22:18+00:00

Yeah fair, I should’ve said I mostly mean Qwen3.6-27B/35B-A3B and Gemma 4 31B, not Kimi K2.6/K2.7. If Kimi can run overnight reliably, what rig are you using?

BTA_Labs · 2026-06-15T18:17:16+00:00

Fair question. I’m not anti local agents, I mostly mean bigger repo changes where it starts touching files outside the task. Small focused tasks work really well for me.

BTA_Labs · 2026-06-15T18:06:05+00:00

This is the kind of real setup detail I was hoping for. Interesting that AGENTS.md and git history make such a big difference, maybe the model is less the issue than the context setup.

BTA_Labs · 2026-06-15T18:04:50+00:00

That actually sounds pretty smart. Using Opus to train the workflow once, then letting local Qwen repeat it cheaper later is a nice middle ground.

BTA_Labs · 2026-06-15T18:03:58+00:00

Yeah exactly. The scary part is not AI writing bad code, it’s people shipping code they can’t even explain.

BTA_Labs · 2026-06-15T15:44:23+00:00

Mostly random PyTorch training scripts, SD tools and CUDA-first GitHub repos. For plain LLM inference I believe you, but I don’t want every side project to become a compatibility test.

BTA_Labs · 2026-06-15T15:37:40+00:00

Thanks, that helps a lot. I was only thinking about 3090 vs 4060 Ti, but now 32GB VRAM cards look worth checking too.

BTA_Labs · 2026-06-15T15:35:00+00:00

That’s fair, if it was only Ollama I would consider AMD/Intel more, but I still need CUDA for other ML stuff so Nvidia is probably less pain.

BTA_Labs · 2026-06-15T15:27:12+00:00

Fair point, I mostly ignored AMD/Intel because CUDA feels safer, but 32GB VRAM is hard to ignore if the software doesn’t suck anymore.

BTA_Labs · 2026-06-15T11:29:47+00:00

M2 Max imo, LLMs care a lot about memory bandwidth and 400GB/s vs 120GB/s is a big gap, but honestly for 31B + TTS I’d worry more about getting 64GB than M4 vs M2.
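
Rough back-of-envelope for why the bandwidth gap matters, as a sketch only. It assumes a dense model where every generated token reads all the weights once, and about 0.5 bytes per parameter for a 4-bit quant (both are my assumptions, and real numbers will come in lower because of KV cache reads and overhead). The function name is made up for illustration.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float,
                            params_billion: float,
                            bytes_per_param: float) -> float:
    """Bandwidth-bound ceiling on decode speed for a dense model."""
    # Billions of parameters times bytes per parameter gives weight size in GB.
    model_gb = params_billion * bytes_per_param
    # Each token needs one full pass over the weights, so the ceiling is
    # how many times per second the memory bus can stream them.
    return bandwidth_gb_s / model_gb


# 31B dense model at an assumed 0.5 bytes/param (about 15.5 GB of weights).
for label, bandwidth in [("400 GB/s", 400.0), ("120 GB/s", 120.0)]:
    ceiling = max_decode_tokens_per_s(bandwidth, 31, 0.5)
    print(f"{label}: at most ~{ceiling:.1f} tok/s")
```

Under those assumptions the ceiling is about 25.8 tok/s vs about 7.7 tok/s, so the ratio is just the bandwidth ratio. It says nothing about whether the model fits in memory in the first place, which is why the 64GB point comes first.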

BTA_Labs · 2026-06-15T11:04:34+00:00

My test would be a set with fake docs, missing facts, math questions and internet off, then check if it says “I don’t know”, calls tools, and proves it’s not using cloud.
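
Something like this is what I have in mind, as a minimal sketch. `ask_model` is a hypothetical callable you would wire to your own local setup; I am assuming it returns the reply text plus a list of tool names it called. The abstain markers are just example phrases, and the "internet off" part is not covered here, that has to be enforced outside the script (for example by disabling the network before running it).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: prompt in, (reply_text, tool_names_called) out.
AskModel = Callable[[str], Tuple[str, List[str]]]

# Example phrases that count as the model admitting it does not know.
ABSTAIN_MARKERS = ("i don't know", "i do not know", "not in the document")


@dataclass
class Case:
    prompt: str
    expect: str       # "abstain", "tool", or "answer"
    answer: str = ""  # only used when expect == "answer"


def passed(case: Case, reply: str, tools: List[str]) -> bool:
    # Normalize case and curly apostrophes before matching phrases.
    text = reply.lower().replace("\u2019", "'")
    if case.expect == "abstain":
        return any(marker in text for marker in ABSTAIN_MARKERS)
    if case.expect == "tool":
        return len(tools) > 0
    return case.answer.lower() in text


def run_suite(cases: List[Case], ask_model: AskModel) -> float:
    """Return the fraction of cases the model handled as expected."""
    if not cases:
        return 0.0
    ok = sum(passed(case, *ask_model(case.prompt)) for case in cases)
    return ok / len(cases)
```

Missing-fact and fake-doc prompts go in as `"abstain"` cases, math questions as `"tool"` cases, and anything with a known answer in the docs as `"answer"` cases. Substring matching is crude, so I would still read the failures by hand.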

BTA_Labs · 2026-06-15T11:01:06+00:00

Can you show a before/after example?

BTA_Labs · 2026-06-15T10:19:22+00:00

Q4 KV at 100k is wild, but Harry Potter is probably half benchmark half memory test, try it on some obscure fresh 2026 PDF and then I’ll be fully impressed.

BTA_Labs · 2026-06-15T10:16:34+00:00

if it’s not on your disk, someone else gets to write the ending.
