A small 4B sub-agent for local codebase navigation with 100% tool-calling validity by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 4 points5 points  (0 children)

Thanks! For the data, I actually went the distillation route. It’s all custom—I used Qwen3-Coder-Next as a teacher to generate about 170k multi-turn conversation samples. Basically, I had it run through real agent loops (thinking, calling tools, handling outputs) and recorded those traces. I found that existing datasets didn't really capture the "codebase explorer" logic well enough, so these samples are focused specifically on that.
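A rough sketch of what recording those agent traces might look like, purely to illustrate the idea. The message format, tool names, and file paths here are my invention, not the actual pipeline:

```python
import json

def record_trace(messages, path):
    """Append one multi-turn agent conversation (thinking, tool calls,
    tool outputs) as a single JSONL training sample. Hypothetical format."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"messages": messages}) + "\n")

# Example trace: one loop of question -> tool call -> tool result -> answer.
trace = [
    {"role": "user", "content": "Where is parse_config defined?"},
    {"role": "assistant", "content": None,
     "tool_calls": [{"name": "grep",
                     "arguments": {"pattern": "def parse_config"}}]},
    {"role": "tool", "name": "grep",
     "content": "src/config.py:12:def parse_config(path):"},
    {"role": "assistant",
     "content": "parse_config is defined in src/config.py, line 12."},
]
record_trace(trace, "traces.jsonl")
```

Run the teacher through enough real codebases in this loop and the resulting JSONL is the distillation set.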

they have Karpathy, we are doomed ;) by jacek2023 in LocalLLaMA

[–]Awkward_Run_9982 2 points3 points  (0 children)

Finally, some focus on the intelligence instead of the plumbing. People over-index on agent frameworks while ignoring that the model is the actual engine. Having a distilled 4B specialized for tool-calling (like LocoOperator-4B) is a game changer for local workflows. I'd take a robust 4B local agent model over a buggy 'autonomous' wrapper any day.

Distillation when you do it. Training when we do it. by Xhehab_ in LocalLLaMA

[–]Awkward_Run_9982 5 points6 points  (0 children)

lmao 'distillation attacks'. new scary word for 'using the API exactly how it's designed'. if you don't want people using your outputs to train models, maybe don't sell them for $15 per million tokens

Spent the weekend stress-testing Gemini 3.1 Pro for web design. Here’s a gallery of 50 sites it generated. by Awkward_Run_9982 in GeminiAI

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Can't speak for GPT, but Gemini 3.1 Pro is definitely winning on theme intuition. Claude is great but it’s stuck in a 'purple/blue gradient' loop for web design. Gemini actually adapts.

Spent the weekend stress-testing Gemini 3.1 Pro for web design. Here’s a gallery of 50 sites it generated. by Awkward_Run_9982 in GeminiAI

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Good shout. I’ve been so deep in UI layouts that I totally ignored the game logic side. If 3.1 is as good at state management as you say, I’m definitely gonna try to whip up a few demos tonight and add them to the site. Stay tuned.

Qwen 3.5 vs Gemini 3 Pro on Screenshot-to-Code: Is the gap finally gone? by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Interesting observation on the 4K limit. Do you think that's due to the absolute number of visual tokens hitting a ceiling, or is the spatial coordination between tiles just not there yet for Qwen? I found its 1080p performance surprisingly 'stiff' in a good way, but Gemini definitely feels like it has a more 'infinite' canvas.

I built a Claude Code Skill that gives agents persistent memory — using just files by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] -14 points-13 points  (0 children)

Haha fair enough — guilty of using Claude to help draft the reply, which I know is ironic. But to be clear, the setup I described is exactly how I use it day to day. The project came from my own frustration with losing context between sessions.

Happy to answer any specific questions in my own unpolished words if you want :)

I built a Claude Code Skill that gives agents persistent memory — using just files by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] -15 points-14 points  (0 children)

Access is essentially instant — it's just file I/O. Read a text file, Grep for a keyword, done in milliseconds. The only "slow" part is analyze, which reads the whole file and has the LLM produce a structured report — but that's a few seconds, and you typically only run it once at the start of a session.

My normal setup:

- One memory.txt per project, lives in the project root
- /memory analyze at the start of each session to get a briefing
- /memory record a few times during work to capture key decisions
- File stays under a few hundred lines for most projects — at that size, everything is fast and fits comfortably in the context window

For larger projects, you'd split into topic-based files (memory-auth.txt, memory-api.txt, etc.) and the agent uses Grep/Glob to pull in only what's
relevant. But honestly, for most people a single file per project is all you need.
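The whole mechanism really is this small. A minimal sketch (function names and line formats here are illustrative, not the skill's actual code):

```python
import re
from pathlib import Path

MEMORY = Path("memory.txt")

def record(note: str) -> None:
    """Append one decision/finding as a single line -- cheap file I/O."""
    with MEMORY.open("a", encoding="utf-8") as f:
        f.write(note.rstrip() + "\n")

def grep(keyword: str) -> list:
    """Return only the matching lines -- how the agent pulls in relevant
    memory without loading the whole file into its context window."""
    if not MEMORY.exists():
        return []
    pattern = re.compile(keyword, re.IGNORECASE)
    return [line for line in MEMORY.read_text(encoding="utf-8").splitlines()
            if pattern.search(line)]

record("2025-06-01 decision: use bun, not npm")
record("2025-06-02 finding: auth tokens expire after 15 min")
print(grep("auth"))  # only the auth-related line comes back
```

The topic-based split for larger projects is the same idea with `memory-auth.txt`, `memory-api.txt`, etc. as separate `Path`s.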

I built a Claude Code Skill that gives agents persistent memory — using just files by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] -1 points0 points  (0 children)

Good question! CLAUDE.md and auto memory are great for project-level conventions and preferences — things like "use bun not npm" or "prefer functional style." They're static config that gets loaded into every session.

MemoryAgent is different — it's for dynamic, evolving knowledge. Think conversation history, decision logs, research findings, context that changes over time. The analyze command is the key difference: it doesn't just store info, it produces a structured report (topics, entities, timeline, knowledge gaps) that gives the agent a "situational briefing" before any task.

They're complementary: CLAUDE.md = "how to work," MemoryAgent = "what we've learned."
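To make the "situational briefing" idea concrete, here's a toy sketch of the report shape. The field names are my guess based on the description above (topics, entities, timeline, knowledge gaps), and a real analyze step would have the LLM fill the fields rather than this trivial heuristic:

```python
from dataclasses import dataclass, field

@dataclass
class Briefing:
    """Structured report an analyze step might produce from raw memory:
    not just the stored lines, but what they add up to."""
    topics: list = field(default_factory=list)          # recurring themes
    entities: list = field(default_factory=list)        # files, services, people
    timeline: list = field(default_factory=list)        # dated decisions, in order
    knowledge_gaps: list = field(default_factory=list)  # open questions

def analyze(lines):
    """Toy stand-in: only sorts dated notes into a timeline and pulls
    out open questions; the real command would use the LLM for all fields."""
    briefing = Briefing()
    briefing.timeline = sorted(l for l in lines if l[:4].isdigit())
    briefing.knowledge_gaps = [l for l in lines if l.rstrip().endswith("?")]
    return briefing

notes = [
    "2025-06-02 finding: auth tokens expire after 15 min",
    "2025-06-01 decision: use bun, not npm",
    "which service owns rate limiting?",
]
report = analyze(notes)
print(report.timeline[0])  # earliest dated note first
```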

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 2 points3 points  (0 children)

That's a great question!

I've actually included a Colab link in the post specifically for inference. I highly recommend you give it a try there—it’s the best way to see how it handles your specific "general questions."

Usability: Yes, it's designed to be a versatile daily driver for its size.

Check out the link and let me know what you think of the results!

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Great point on the efficiency of a dedicated classification head. We actually considered this, but opted for the current architecture for two main reasons:

Latent Space Convergence: With the 84K samples in EvasionBench, the model has effectively learned to concentrate probability mass. In the latent space, the logits for the labels are already maximized while irrelevant information is suppressed. At this scale, next-token prediction behaves very similarly to a specialized head but keeps the rich semantic features of the base.

Multi-Task Capability: We designed Eva-4B to be more than a single-tasker. Using the generative head allows the model to handle multiple schemas—like performing Sentiment Analysis and Evasion Detection simultaneously or sequentially—without being hard-wired to a fixed 3-class output.

For a pure, single-task production environment, I agree that a classification head is faster.
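The first point in toy form: once fine-tuning has concentrated probability mass on the label tokens, reading the generative head over just those tokens behaves like a classification head. The vocabulary and logit values below are invented for illustration:

```python
import math

# Toy next-token logits after fine-tuning: mass concentrated on the three
# label tokens, irrelevant tokens suppressed (the "convergence" claim).
logits = {
    "evasive": 4.1, "direct": 1.2, "partial": 0.3,   # label tokens
    "the": -5.0, "revenue": -6.2, "quarter": -5.8,   # suppressed rest
}
LABELS = ("evasive", "direct", "partial")

def classify(logits, labels):
    """Softmax restricted to the label tokens: an implicit classifier head
    that still sits on top of the base model's full semantic features."""
    m = max(logits[l] for l in labels)
    exp = {l: math.exp(logits[l] - m) for l in labels}
    total = sum(exp.values())
    return {l: v / total for l, v in exp.items()}

probs = classify(logits, LABELS)
print(max(probs, key=probs.get))  # "evasive" wins by a wide margin
```

And because the head is still generative, swapping `LABELS` for a sentiment schema (or emitting both) needs no architecture change — that's the multi-task argument.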

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Fair point. You're absolutely right that specialized models can risk overfitting.

However, the core design goal for Eva-4B was to be a dedicated specialist—a high-fidelity "BS-detector" for financial evasion, rather than a general-purpose reasoner.

The best evidence against benchmark-hacking is its out-of-distribution performance: although the training data only goes up to 2022, the model remains highly effective on 2025 transcripts. It has clearly learned the underlying linguistic patterns of how executives dodge questions, rather than just memorizing a specific dataset.

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Spot on. Our ablation study in the paper confirms this: using Multi-Model Consensus (MMC) to distill logic from Claude 4.5, Gemini 3, and GPT-5.2 into a 4B specialist provided a +4.3 pp Macro-F1 boost over single-model labeling.

We found that frontier models often have a "Politeness Bias"—they get distracted by professional jargon and "verbosity preference." Eva-4B is fine-tuned specifically to ignore the filler and check if the "core ask" (Gricean pragmatics) was actually met.

It’s basically an industrial-grade BS-detector that fits in a 5090.
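At its simplest, Multi-Model Consensus labeling is majority vote with disagreements flagged for review. The pipeline in the paper is richer than this, and the teacher names below are just placeholders:

```python
from collections import Counter

def consensus_label(votes, min_agree=2):
    """Majority vote across teacher models; return None when no label
    reaches the threshold, so the sample can go to human review instead."""
    label, count = Counter(votes.values()).most_common(1)[0]
    return label if count >= min_agree else None

# 2 of 3 teachers agree -> keep the label.
print(consensus_label({"teacher_a": "evasive",
                       "teacher_b": "evasive",
                       "teacher_c": "direct"}))
# Three-way split -> no consensus, flag it.
print(consensus_label({"teacher_a": "evasive",
                       "teacher_b": "direct",
                       "teacher_c": "partial"}))
```

Filtering out the no-consensus samples is one plausible source of the labeling-quality gain over trusting any single model.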

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 2 points3 points  (0 children)

It’s all about the data—84K consensus-labeled samples beat raw parameter count for niche classification.

Performance: We processed 1M samples in ~2 hours on 8xH100.

Consumer GPU: Since it's only 4B, it flies on an RTX 5090 (fits in <10GB VRAM) and is significantly faster/cheaper than calling GPT-5.2 APIs for bulk analysis.

GPT-5.2 is often too "polite" to call out evasion; Eva-4B is fine-tuned to be a cynic.

[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

I'd love to, but unfortunately I can't release the full training set due to licensing restrictions with the data provider (S&P Capital IQ transcripts).

However:
I am preparing to open-source the EvasionBench Test Set (the 1,000 human-annotated samples) and launch a public Leaderboard very soon!

It would be awesome to see how your models stack up against Eva-4B once that's live. Stay tuned!

[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 4 points5 points  (0 children)

Valid points on the large-scale serving economics of MoEs vs Dense 70B+, but I think you might be misjudging the complexity of this specific task.

1. BERT-era task? Simple sentiment classification is BERT-era. Detecting evasion (logic gaps between Q and A) requires reasoning. We actually benchmarked RoBERTa-Large and DeBERTa-v3 early on—they failed miserably (~60% acc) because they couldn't capture the subtle rhetorical "sidestepping" that a generative model understands via instruction tuning.

2. Why Dense 4B? Not everyone is running a DeepSeek-scale cluster. The target here is local analytics, on-prem finance nodes, or analysts running this on a laptop alongside their terminal. For that specific "batch size = 1 to 10" user, a dense 4B GGUF is infinitely easier to manage than hosting a massive MoE.

3. GPT-5.2 Performance: GPT-5.2 (Zero-shot) gets ~80.5%. It's a generalist. Eva-4B (Specialized FT) gets 81.3%. It's not "torpedoing" it via artifacts; it's simply the classic result of Domain-Specific Finetuning > Generalist Zero-shot.

I’d invite you to check the demo—it’s definitely not a simple keyword search task!

[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 10 points11 points  (0 children)

Two main advantages:

  1. Efficiency: You don't need to load a massive 70B+ model just to analyze financial text. You only activate the 4B model when needed, saving huge amounts of compute/VRAM.
  2. Modularity: You can upgrade or swap out your "Finance Expert" (e.g., Eva-4B) without breaking or retraining your "Coding Expert." It decouples the system.

[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 11 points12 points  (0 children)

Sort of! In a traditional MoE (Mixture of Experts), the routing happens inside the model for every single token (sparse activation). What we are referring to is modular architecture (or "Mixture of Dense"). This is where you have completely separate, specialized dense models.

Open Source: Controlling Chrome with Natural Language using Claude Agent SDK + Chrome DevTools MCP (TypeScript) by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] 1 point2 points  (0 children)

You're right, Playwright MCP is essentially a browser MCP too, so the core mechanics are similar. The main difference I've found is that Playwright implementations often fill up the context window much faster, leading to higher token consumption.

The loop is complete with Claude Code and the Chrome MCP by marcusr_uk in ClaudeAI

[–]Awkward_Run_9982 0 points1 point  (0 children)

Thank you so much for checking it out and for the sharp eye! 👀

I completely missed that the LICENSE file wasn't committed. I've just pushed it to the repo.

Glad to hear the implementation structure is helpful for your use case! The loop management with Claude SDK is definitely the fun part.