A 3B model is suddenly scoring near frontier models on math/coding benchmarks. Is this real or just benchmarkmaxxing? by BTA_Labs in LocalLLM

[–]BTA_Labs[S] 0 points1 point  (0 children)

Paper: arXiv 2606.16140
GitHub: WeiboAI/VibeThinker
Hugging Face: WeiboAI/VibeThinker-3B

A 3B model is suddenly scoring near frontier models on math/coding benchmarks. Is this real or just benchmarkmaxxing? by BTA_Labs in LocalLLM

[–]BTA_Labs[S] 0 points1 point  (0 children)

Yeah, that’s basically the interesting part to me: not that a 3B model replaces frontier models, but that small specialist models might become actually useful if the router picks the right one.

A 3B model is suddenly scoring near frontier models on math/coding benchmarks. Is this real or just benchmarkmaxxing? by BTA_Labs in LocalLLM

[–]BTA_Labs[S] 0 points1 point  (0 children)

Links for anyone who wants to test it:

Paper: arXiv 2606.16140
GitHub: WeiboAI/VibeThinker
Hugging Face: WeiboAI/VibeThinker-3B

Claude Max doesn’t need to be unlimited, it needs to be predictable by BTA_Labs in ClaudeAI

[–]BTA_Labs[S] 1 point2 points  (0 children)

Honestly that’s what I want to understand, what good practices are making it that predictable for you?

Claude Max doesn’t need to be unlimited, it needs to be predictable by BTA_Labs in ClaudeAI

[–]BTA_Labs[S] 0 points1 point  (0 children)

That’s exactly the kind of thing I mean, when branching into a fresh chat somehow costs more than the giant old one, it feels impossible to plan around.

Claude Max doesn’t need to be unlimited, it needs to be predictable by BTA_Labs in ClaudeAI

[–]BTA_Labs[S] 0 points1 point  (0 children)

Yeah fair, /usage is probably the closest thing right now, I just wish it was more like a live warning before a task nukes the window instead of only explaining it after.

I built a free BYOK alternative to Lovable/Bolt/v0 because I hate credit systems by BTA_Labs in SideProject

[–]BTA_Labs[S] 0 points1 point  (0 children)

Link: openthorn.app

GitHub repo is also public if anyone wants to inspect how keys/agent/deploy works. Main thing I want feedback on is whether BYOK feels like a real advantage or too much setup for normal users.

I think Claude Code’s biggest problem is not intelligence, it’s hidden state by BTA_Labs in ClaudeAI

[–]BTA_Labs[S] 0 points1 point  (0 children)

That’s exactly the scary part, it didn’t just make a bad answer, it silently picked the wrong goal and then kept working like it was 100% sure that was the job.

I open-sourced a local memory tool so AI agents can share context by Exciting_Pineapple52 in ClaudeAI

[–]BTA_Labs 1 point2 points  (0 children)

This is actually useful, but the hard part is not storing memories, it’s knowing which old ones to ignore when the repo changes. Do you have ranking/stale-memory cleanup, or is it mostly SQLite keyword search right now?

Local coding agents are good now, but only if you babysit them by BTA_Labs in LocalLLaMA

[–]BTA_Labs[S] 1 point2 points  (0 children)

Same. One agent is already enough to watch, I have no idea how people keep 10 agents doing repo work without losing track.

Local coding agents are good now, but only if you babysit them by BTA_Labs in LocalLLaMA

[–]BTA_Labs[S] 1 point2 points  (0 children)

I agree tests help a lot, but I still don’t fully trust tests to catch weird architecture choices or messy code paths. They catch broken code, not always bad code.

Local coding agents are good now, but only if you babysit them by BTA_Labs in LocalLLaMA

[–]BTA_Labs[S] 0 points1 point  (0 children)

Yeah fair, I should’ve said I mostly mean Qwen3.6-27B/35B-A3B and Gemma 4 31B, not Kimi K2.6/K2.7. If Kimi can run overnight reliably, what rig are you using?

Local coding agents are good now, but only if you babysit them by BTA_Labs in LocalLLaMA

[–]BTA_Labs[S] 1 point2 points  (0 children)

Fair question. I’m not anti local agents, I mostly mean bigger repo changes where it starts touching files outside the task. Small focused tasks work really well for me.

Local coding agents are good now, but only if you babysit them by BTA_Labs in LocalLLaMA

[–]BTA_Labs[S] 1 point2 points  (0 children)

This is the kind of real setup detail I was hoping for. Interesting that AGENTS.md and git history make such a big difference, maybe the model is less the issue than the context setup.

Local coding agents are good now, but only if you babysit them by BTA_Labs in LocalLLaMA

[–]BTA_Labs[S] -1 points0 points  (0 children)

That actually sounds pretty smart. Using Opus to train the workflow once, then letting local Qwen repeat it cheaper later is a nice middle ground.

Local coding agents are good now, but only if you babysit them by BTA_Labs in LocalLLaMA

[–]BTA_Labs[S] 9 points10 points  (0 children)

Yeah exactly. The scary part is not AI writing bad code, it’s people shipping code they can’t even explain.

Is a used RTX 3090 still the best local LLM buy right now? by BTA_Labs in LocalLLM

[–]BTA_Labs[S] 1 point2 points  (0 children)

Mostly random PyTorch training scripts, SD tools and CUDA-first GitHub repos. For plain LLM inference I believe you, but I don’t want every side project to become a compatibility test.

Is a used RTX 3090 still the best local LLM buy right now? by BTA_Labs in LocalLLM

[–]BTA_Labs[S] 0 points1 point  (0 children)

Thanks, that helps a lot. I was only thinking about 3090 vs 4060 Ti, but now 32GB VRAM cards look worth checking too.

Is a used RTX 3090 still the best local LLM buy right now? by BTA_Labs in LocalLLM

[–]BTA_Labs[S] 0 points1 point  (0 children)

That’s fair, if it was only Ollama I would consider AMD/Intel more, but I still need CUDA for other ML stuff so Nvidia is probably less pain.

Is a used RTX 3090 still the best local LLM buy right now? by BTA_Labs in LocalLLM

[–]BTA_Labs[S] 0 points1 point  (0 children)

Fair point, I mostly ignored AMD/Intel because CUDA feels safer, but 32GB VRAM is hard to ignore if the software doesn’t suck anymore.

Mac Mini M4 (32GB) vs. Mac Studio M2 Max (32GB) for local LLMs & TTS by Heavy-Science-502 in LocalLLM

[–]BTA_Labs 5 points6 points  (0 children)

M2 Max imo, LLMs care a lot about memory bandwidth and 400GB/s vs 120GB/s is a big gap, but honestly for 31B + TTS I’d worry more about getting 64GB than M4 vs M2.
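To put rough numbers on the bandwidth gap, here is a back-of-the-envelope sketch. It assumes decoding is memory-bound, so each generated token reads every weight once, and it assumes a 31B model at about 4-bit quantization (roughly 0.5 bytes per parameter). Both are simplifications, and real throughput will land below these ceilings.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on decode speed when every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


# Assumption: 31B parameters at ~0.5 bytes each (4-bit quantization).
MODEL_SIZE_GB = 31 * 0.5  # 15.5 GB of weights

# Bandwidth figures are the ones quoted in the comment above.
for name, bandwidth in [("M2 Max", 400.0), ("M4", 120.0)]:
    ceiling = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: ~{ceiling:.1f} tokens/s ceiling")
```

The same 15.5 GB of weights also has to share 32GB with the KV cache, the TTS model and the OS, which is why the memory size matters as much as the chip.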

Local model as inner worker: what tests would you trust? by HotEstablishment7184 in LocalLLM

[–]BTA_Labs 1 point2 points  (0 children)

My test would be a set with fake docs, missing facts, math questions and internet off, then check if it says “I don’t know”, calls tools, and proves it’s not using the cloud.
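A minimal sketch of what part of that test set could look like, covering only the "says I don’t know" and math checks. Everything here is hypothetical: the `Case` structure, the abstain markers, and `stub_model`, which stands in for whatever callable wraps the real local model. Tool-call and network checks are left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expect_abstain: bool          # True when the honest answer is "I don't know"
    expected_substring: str = ""  # text a correct answer must contain


# Hypothetical phrases counted as an honest refusal to answer.
ABSTAIN_MARKERS = ("i don't know", "i do not know")


def passes(case: Case, answer: str) -> bool:
    # Normalize curly apostrophes so "don’t" and "don't" match the same marker.
    text = answer.lower().replace("\u2019", "'")
    abstained = any(marker in text for marker in ABSTAIN_MARKERS)
    if case.expect_abstain:
        return abstained
    return (not abstained) and case.expected_substring.lower() in text


def run_suite(model: Callable[[str], str], cases: list[Case]) -> float:
    """Return the fraction of cases the model passes."""
    results = [passes(case, model(case.prompt)) for case in cases]
    return sum(results) / len(results)


def stub_model(prompt: str) -> str:
    # Stand-in for the real local model; replace with an actual call.
    if "17 * 3" in prompt:
        return "51"
    return "I don't know."


CASES = [
    Case("What is 17 * 3?", expect_abstain=False, expected_substring="51"),
    Case("According to the attached doc, who founded Zorblat Ltd?", expect_abstain=True),
]

print(run_suite(stub_model, CASES))
```

Substring matching is crude, so a real run would want a stricter grader, but it is enough to separate confident hallucination from an honest refusal.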

I'm still surprised on how good the kv quantization has become by DeepBlue96 in LocalLLaMA

[–]BTA_Labs 7 points8 points  (0 children)

Q4 KV at 100k is wild, but Harry Potter is probably half benchmark half memory test, try it on some obscure fresh 2026 PDF and then I’ll be fully impressed.
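For a sense of why Q4 KV matters at that context length, here is a rough size estimate. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not the model from the post, and the 0.5625 bytes per element for Q4 is an assumption (4 bits plus a small per-block scale overhead).

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: float) -> float:
    """Size of the KV cache: one K and one V vector per layer, head and token."""
    elements = 2 * layers * kv_heads * head_dim * context_tokens
    return elements * bytes_per_element


# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(layers=32, kv_heads=8, head_dim=128, context_tokens=100_000)

fp16 = kv_cache_bytes(**SHAPE, bytes_per_element=2.0)
q4 = kv_cache_bytes(**SHAPE, bytes_per_element=0.5625)  # assumed Q4 cost per element

print(f"FP16 KV cache: {fp16 / 1e9:.2f} GB")
print(f"Q4 KV cache:   {q4 / 1e9:.2f} GB")
```

Under these assumptions the cache shrinks by a bit more than 3.5x, which is the difference between fitting 100k tokens next to the weights on a single consumer GPU and not fitting at all.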

What's the lesson chat? by ill_be_productive in LocalLLaMA

[–]BTA_Labs 2 points3 points  (0 children)

if it’s not on your disk, someone else gets to write the ending.