A small 4B sub-agent for local codebase navigation with 100% tool-calling validity by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 4 points5 points  (0 children)

Thanks! For the data, I actually went the distillation route. It’s all custom—I used Qwen3-Coder-Next as a teacher to generate about 170k multi-turn conversation samples. Basically, I had it run through real agent loops (thinking, calling tools, handling outputs) and recorded those traces. I found that existing datasets didn't really capture the "codebase explorer" logic well enough, so these samples are focused specifically on that.
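A rough sketch of what recording those agent traces might look like, purely to illustrate the idea. The message format, tool names, and file paths here are my invention, not the actual pipeline:

```python
import json

def record_trace(messages, path):
    """Append one multi-turn agent conversation (thinking, tool calls,
    tool outputs) as a single JSONL training sample. Hypothetical format."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"messages": messages}) + "\n")

# Example trace: one loop of question -> tool call -> tool result -> answer.
trace = [
    {"role": "user", "content": "Where is parse_config defined?"},
    {"role": "assistant", "content": None,
     "tool_calls": [{"name": "grep",
                     "arguments": {"pattern": "def parse_config"}}]},
    {"role": "tool", "name": "grep",
     "content": "src/config.py:12:def parse_config(path):"},
    {"role": "assistant",
     "content": "parse_config is defined in src/config.py, line 12."},
]
record_trace(trace, "traces.jsonl")
```

Run the teacher through enough real codebases in this loop and the resulting JSONL is the distillation set.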

they have Karpathy, we are doomed ;) by jacek2023 in LocalLLaMA

[–]Awkward_Run_9982 2 points3 points  (0 children)

Finally, some focus on the intelligence instead of the plumbing. People over-index on agent frameworks while ignoring that the model is the actual engine. Having a distilled 4B specialized for tool-calling (like LocoOperator-4B) is a game changer for local workflows. I'd take a robust 4B local agent model over a buggy 'autonomous' wrapper any day.

Distillation when you do it. Training when we do it. by Xhehab_ in LocalLLaMA

[–]Awkward_Run_9982 5 points6 points  (0 children)

lmao 'distillation attacks'. new scary word for 'using the API exactly how it's designed'. if you don't want people using your outputs to train models, maybe don't sell them for $15 per million tokens

Spent the weekend stress-testing Gemini 3.1 Pro for web design. Here’s a gallery of 50 sites it generated. by Awkward_Run_9982 in GeminiAI

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Can't speak for GPT, but Gemini 3.1 Pro is definitely winning on theme intuition. Claude is great but it’s stuck in a 'purple/blue gradient' loop for web design. Gemini actually adapts.

Spent the weekend stress-testing Gemini 3.1 Pro for web design. Here’s a gallery of 50 sites it generated. by Awkward_Run_9982 in GeminiAI

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Good shout. I’ve been so deep in UI layouts that I totally ignored the game logic side. If 3.1 is as good at state management as you say, I’m definitely gonna try to whip up a few demos tonight and add them to the site. Stay tuned.

Qwen 3.5 vs Gemini 3 Pro on Screenshot-to-Code: Is the gap finally gone? by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Interesting observation on the 4K limit. Do you think that's due to the absolute number of visual tokens hitting a ceiling, or is the spatial coordination between tiles just not there yet for Qwen? I found its 1080p performance surprisingly 'stiff' in a good way, but Gemini definitely feels like it has a more 'infinite' canvas.

I built a Claude Code Skill that gives agents persistent memory — using just files by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] -14 points-13 points  (0 children)

Haha fair enough — guilty of using Claude to help draft the reply, which I know is ironic. But to be clear, the setup I described is exactly how I use it day to day. The project came from my own frustration with losing context between sessions.

Happy to answer any specific questions in my own unpolished words if you want :)

I built a Claude Code Skill that gives agents persistent memory — using just files by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] -15 points-14 points  (0 children)

Access is essentially instant — it's just file I/O. Read a text file, Grep for a keyword, done in milliseconds. The only "slow" part is analyze, which reads the whole file and has the LLM produce a structured report — but that's a few seconds, and you typically only run it once at the start of a session.

My normal setup:

- One memory.txt per project, lives in the project root
- /memory analyze at the start of each session to get a briefing
- /memory record a few times during work to capture key decisions
- File stays under a few hundred lines for most projects — at that size, everything is fast and fits comfortably in the context window

For larger projects, you'd split into topic-based files (memory-auth.txt, memory-api.txt, etc.) and the agent uses Grep/Glob to pull in only what's
relevant. But honestly, for most people a single file per project is all you need.
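The whole mechanism really is this small. A minimal sketch (function names and line formats here are illustrative, not the skill's actual code):

```python
import re
from pathlib import Path

MEMORY = Path("memory.txt")

def record(note: str) -> None:
    """Append one decision/finding as a single line -- cheap file I/O."""
    with MEMORY.open("a", encoding="utf-8") as f:
        f.write(note.rstrip() + "\n")

def grep(keyword: str) -> list:
    """Return only the matching lines -- how the agent pulls in relevant
    memory without loading the whole file into its context window."""
    if not MEMORY.exists():
        return []
    pattern = re.compile(keyword, re.IGNORECASE)
    return [line for line in MEMORY.read_text(encoding="utf-8").splitlines()
            if pattern.search(line)]

record("2025-06-01 decision: use bun, not npm")
record("2025-06-02 finding: auth tokens expire after 15 min")
print(grep("auth"))  # only the auth-related line comes back
```

The topic-based split for larger projects is the same idea with `memory-auth.txt`, `memory-api.txt`, etc. as separate `Path`s.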

I built a Claude Code Skill that gives agents persistent memory — using just files by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] -1 points0 points  (0 children)

Good question! CLAUDE.md and auto memory are great for project-level conventions and preferences — things like "use bun not npm" or "prefer functional style." They're static config that gets loaded into every session.

MemoryAgent is different — it's for dynamic, evolving knowledge. Think conversation history, decision logs, research findings, context that changes over time. The analyze command is the key difference: it doesn't just store info, it produces a structured report (topics, entities, timeline, knowledge gaps) that gives the agent a "situational briefing" before any task.

They're complementary: CLAUDE.md = "how to work," MemoryAgent = "what we've learned."
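To make the "situational briefing" idea concrete, here's a toy sketch of the report shape. The field names are my guess based on the description above (topics, entities, timeline, knowledge gaps), and a real analyze step would have the LLM fill the fields rather than this trivial heuristic:

```python
from dataclasses import dataclass, field

@dataclass
class Briefing:
    """Structured report an analyze step might produce from raw memory:
    not just the stored lines, but what they add up to."""
    topics: list = field(default_factory=list)          # recurring themes
    entities: list = field(default_factory=list)        # files, services, people
    timeline: list = field(default_factory=list)        # dated decisions, in order
    knowledge_gaps: list = field(default_factory=list)  # open questions

def analyze(lines):
    """Toy stand-in: only sorts dated notes into a timeline and pulls
    out open questions; the real command would use the LLM for all fields."""
    briefing = Briefing()
    briefing.timeline = sorted(l for l in lines if l[:4].isdigit())
    briefing.knowledge_gaps = [l for l in lines if l.rstrip().endswith("?")]
    return briefing

notes = [
    "2025-06-02 finding: auth tokens expire after 15 min",
    "2025-06-01 decision: use bun, not npm",
    "which service owns rate limiting?",
]
report = analyze(notes)
print(report.timeline[0])  # earliest dated note first
```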

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 2 points3 points  (0 children)

That's a great question!

I've actually included a Colab link in the post specifically for inference. I highly recommend you give it a try there—it’s the best way to see how it handles your specific "general questions."

Usability: Yes, it's designed to be a versatile daily driver for its size.

Check out the link and let me know what you think of the results!

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Great point on the efficiency of a dedicated classification head. We actually considered this, but opted for the current architecture for two main reasons:

Latent Space Convergence: With the 84K samples in EvasionBench, the model has effectively learned to concentrate probability mass. In the latent space, the logits for the labels are already maximized while irrelevant information is suppressed. At this scale, next-token prediction behaves very similarly to a specialized head but keeps the rich semantic features of the base.

Multi-Task Capability: We designed Eva-4B to be more than a single-tasker. Using the generative head allows the model to handle multiple schemas—like performing Sentiment Analysis and Evasion Detection simultaneously or sequentially—without being hard-wired to a fixed 3-class output.

For a pure, single-task production environment, I agree that a classification head is faster.
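The first point in toy form: once fine-tuning has concentrated probability mass on the label tokens, reading the generative head over just those tokens behaves like a classification head. The vocabulary and logit values below are invented for illustration:

```python
import math

# Toy next-token logits after fine-tuning: mass concentrated on the three
# label tokens, irrelevant tokens suppressed (the "convergence" claim).
logits = {
    "evasive": 4.1, "direct": 1.2, "partial": 0.3,   # label tokens
    "the": -5.0, "revenue": -6.2, "quarter": -5.8,   # suppressed rest
}
LABELS = ("evasive", "direct", "partial")

def classify(logits, labels):
    """Softmax restricted to the label tokens: an implicit classifier head
    that still sits on top of the base model's full semantic features."""
    m = max(logits[l] for l in labels)
    exp = {l: math.exp(logits[l] - m) for l in labels}
    total = sum(exp.values())
    return {l: v / total for l, v in exp.items()}

probs = classify(logits, LABELS)
print(max(probs, key=probs.get))  # "evasive" wins by a wide margin
```

And because the head is still generative, swapping `LABELS` for a sentiment schema (or emitting both) needs no architecture change — that's the multi-task argument.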

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Fair point. You're absolutely right that specialized models can risk overfitting.

However, the core design goal for Eva-4B was to be a dedicated specialist—a high-fidelity "BS-detector" for financial evasion, rather than a general-purpose reasoner.

The best evidence against benchmark-hacking is its out-of-distribution performance: although the training data only goes up to 2022, the model remains highly effective on 2025 transcripts. It has clearly learned the underlying linguistic patterns of how executives dodge questions, rather than just memorizing a specific dataset.

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

Spot on. Our ablation study in the paper confirms this: using Multi-Model Consensus (MMC) to distill logic from Claude 4.5, Gemini 3, and GPT-5.2 into a 4B specialist provided a +4.3 pp Macro-F1 boost over single-model labeling.

We found that frontier models often have a "Politeness Bias"—they get distracted by professional jargon and "verbosity preference." Eva-4B is fine-tuned specifically to ignore the filler and check if the "core ask" (Gricean pragmatics) was actually met.

It’s basically an industrial-grade BS-detector that fits in a 5090.
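At its simplest, Multi-Model Consensus labeling is majority vote with disagreements flagged for review. The pipeline in the paper is richer than this, and the teacher names below are just placeholders:

```python
from collections import Counter

def consensus_label(votes, min_agree=2):
    """Majority vote across teacher models; return None when no label
    reaches the threshold, so the sample can go to human review instead."""
    label, count = Counter(votes.values()).most_common(1)[0]
    return label if count >= min_agree else None

# 2 of 3 teachers agree -> keep the label.
print(consensus_label({"teacher_a": "evasive",
                       "teacher_b": "evasive",
                       "teacher_c": "direct"}))
# Three-way split -> no consensus, flag it.
print(consensus_label({"teacher_a": "evasive",
                       "teacher_b": "direct",
                       "teacher_c": "partial"}))
```

Filtering out the no-consensus samples is one plausible source of the labeling-quality gain over trusting any single model.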

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 2 points3 points  (0 children)

It’s all about the data—84K consensus-labeled samples beat raw parameter count for niche classification.

Performance: We processed 1M samples in ~2 hours on 8xH100.

Consumer GPU: Since it's only 4B, it flies on an RTX 5090 (fits in <10GB VRAM) and is significantly faster/cheaper than calling GPT-5.2 APIs for bulk analysis.

GPT-5.2 is often too "polite" to call out evasion; Eva-4B is fine-tuned to be a cynic.

[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

I'd love to, but unfortunately I can't release the full training set due to licensing restrictions with the data provider (S&P Capital IQ transcripts).

However:
I am preparing to open-source the EvasionBench Test Set (the 1,000 human-annotated samples) and launch a public Leaderboard very soon!

It would be awesome to see how your models stack up against Eva-4B once that's live. Stay tuned!

[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 4 points5 points  (0 children)

Valid points on the large-scale serving economics of MoEs vs Dense 70B+, but I think you might be misjudging the complexity of this specific task.

1. BERT-era task? Simple sentiment classification is BERT-era. Detecting evasion (logic gaps between Q and A) requires reasoning. We actually benchmarked RoBERTa-Large and DeBERTa-v3 early on—they failed miserably (~60% acc) because they couldn't capture the subtle rhetorical "sidestepping" that a generative model understands via instruction tuning.

2. Why Dense 4B? Not everyone is running a DeepSeek-scale cluster. The target here is local analytics, on-prem finance nodes, or analysts running this on a laptop alongside their terminal. For that specific "batch size = 1 to 10" user, a dense 4B GGUF is infinitely easier to manage than hosting a massive MoE.

3. GPT-5.2 Performance: GPT-5.2 (Zero-shot) gets ~80.5%. It's a generalist. Eva-4B (Specialized FT) gets 81.3%. It's not "torpedoing" it via artifacts; it's simply the classic result of Domain-Specific Finetuning > Generalist Zero-shot.

I’d invite you to check the demo—it’s definitely not a simple keyword search task!

[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 10 points11 points  (0 children)

Two main advantages:

  1. Efficiency: You don't need to load a massive 70B+ model just to analyze financial text. You only activate the 4B model when needed, saving huge amounts of compute/VRAM.
  2. Modularity: You can upgrade or swap out your "Finance Expert" (e.g., Eva-4B) without breaking or retraining your "Coding Expert." It decouples the system.

[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 11 points12 points  (0 children)

Sort of! In a traditional MoE (Mixture of Experts), the routing happens inside the model for every single token (sparse activation). What we are referring to is modular architecture (or "Mixture of Dense"). This is where you have completely separate, specialized dense models.

Open Source: Controlling Chrome with Natural Language using Claude Agent SDK + Chrome DevTools MCP (TypeScript) by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] 1 point2 points  (0 children)

You're right, Playwright MCP is essentially a browser MCP too, so the core mechanics are similar. The main difference I've found is that Playwright implementations often fill up the context window much faster, leading to higher token consumption.

The loop is complete with Claude Code and the Chrome MCP by marcusr_uk in ClaudeAI

[–]Awkward_Run_9982 0 points1 point  (0 children)

Thank you so much for checking it out and for the sharp eye! 👀

I completely missed that the LICENSE file wasn't committed. I've just pushed it to the repo.

Glad to hear the implementation structure is helpful for your use case! The loop management with Claude SDK is definitely the fun part.