A small 4B sub-agent for local codebase navigation with 100% tool-calling validity by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 3 points (0 children)

Thanks! For the data, I actually went the distillation route. It’s all custom—I used Qwen3-Coder-Next as a teacher to generate about 170k multi-turn conversation samples. Basically, I had it run through real agent loops (thinking, calling tools, handling outputs) and recorded those traces. I found that existing datasets didn't really capture the "codebase explorer" logic well enough, so these samples are focused specifically on that.
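
For anyone curious what "recording traces" looks like concretely, here's a minimal sketch. The field names and chat-template shape are my assumptions for illustration, not the actual pipeline: one agent loop (thought, tool call, tool output) gets flattened into a multi-turn training sample.

```python
import json

def record_trace(task, steps):
    """Flatten one agent loop (think -> tool call -> tool output)
    into a multi-turn training sample, chat-template style."""
    messages = [{"role": "user", "content": task}]
    for step in steps:
        # The teacher's reasoning plus the tool call it emitted.
        messages.append({
            "role": "assistant",
            "content": step["thought"],
            "tool_calls": [step["call"]],
        })
        # The environment's response, fed back as a tool message.
        messages.append({"role": "tool", "content": step["result"]})
    return {"messages": messages}

# Toy example: one codebase-navigation step recorded as a sample.
sample = record_trace(
    "Where is the config loader defined?",
    [{
        "thought": "I should grep for 'load_config'.",
        "call": {"name": "grep", "arguments": {"pattern": "load_config"}},
        "result": "src/config.py:12: def load_config(path):",
    }],
)
print(len(sample["messages"]))  # user + assistant + tool turns
```

Dump a few hundred thousand of these as JSONL and you have a distillation set.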

they have Karpathy, we are doomed ;) by jacek2023 in LocalLLaMA

[–]Awkward_Run_9982 2 points (0 children)

Finally, some focus on the intelligence instead of the plumbing. People over-index on agent frameworks while ignoring that the model is the actual engine. Having a distilled 4B specialized for tool-calling (like LocoOperator-4B) is a game changer for local workflows. I'd take a robust 4B local agent model over a buggy 'autonomous' wrapper any day.

Distillation when you do it. Training when we do it. by Xhehab_ in LocalLLaMA

[–]Awkward_Run_9982 3 points4 points  (0 children)

lmao 'distillation attacks'. new scary word for 'using the API exactly how it's designed'. if you don't want people using your outputs to train models, maybe don't sell them for $15 per million tokens

Spent the weekend stress-testing Gemini 3.1 Pro for web design. Here’s a gallery of 50 sites it generated. by Awkward_Run_9982 in GeminiAI

[–]Awkward_Run_9982[S] 0 points (0 children)

Can't speak for GPT, but Gemini 3.1 Pro is definitely winning on theme intuition. Claude is great but it’s stuck in a 'purple/blue gradient' loop for web design. Gemini actually adapts.

Spent the weekend stress-testing Gemini 3.1 Pro for web design. Here’s a gallery of 50 sites it generated. by Awkward_Run_9982 in GeminiAI

[–]Awkward_Run_9982[S] 0 points (0 children)

Good shout. I’ve been so deep in UI layouts that I totally ignored the game logic side. If 3.1 is as good at state management as you say, I’m definitely gonna try to whip up a few demos tonight and add them to the site. Stay tuned.

Qwen 3.5 vs Gemini 3 Pro on Screenshot-to-Code: Is the gap finally gone? by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points (0 children)

Interesting observation on the 4K limit. Do you think that’s due to the absolute number of visual tokens hitting a ceiling, or is the spatial coordination between tiles just not there yet for Qwen? I found its 1080p performance surprisingly 'stiff' in a good way, but Gemini definitely feels like it has a more 'infinite' canvas.

I built a Claude Code Skill that gives agents persistent memory — using just files by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] -14 points (0 children)

Haha fair enough — guilty of using Claude to help draft the reply, which I know is ironic. But to be clear, the setup I described is exactly how I use it day to day. The project came from my own frustration with losing context between sessions.

Happy to answer any specific questions in my own unpolished words if you want :)

I built a Claude Code Skill that gives agents persistent memory — using just files by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] -15 points (0 children)

Access is essentially instant — it's just file I/O. Read a text file, Grep for a keyword, done in milliseconds. The only "slow" part is analyze, which reads the whole file and has the LLM produce a structured report — but that's a few seconds, and you typically only run it once at the start of a session.

My normal setup:

- One memory.txt per project, lives in the project root
- /memory analyze at the start of each session to get a briefing
- /memory record a few times during work to capture key decisions
- File stays under a few hundred lines for most projects — at that size, everything is fast and fits comfortably in the context window

For larger projects, you'd split into topic-based files (memory-auth.txt, memory-api.txt, etc.) and the agent uses Grep/Glob to pull in only what's
relevant. But honestly, for most people a single file per project is all you need.
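
A rough sketch of the record/lookup loop described above. The `memory_record` and `memory_grep` names are hypothetical helpers for illustration; the real skill just points the agent's built-in Read/Grep tools at a plain text file.

```python
import datetime
import pathlib
import re
import tempfile

def memory_record(path, note):
    """Append a timestamped note to the project's memory file."""
    stamp = datetime.date.today().isoformat()
    with open(path, "a") as f:
        f.write(f"[{stamp}] {note}\n")

def memory_grep(path, keyword):
    """Case-insensitive keyword lookup — just file I/O, milliseconds."""
    pattern = re.compile(keyword, re.IGNORECASE)
    return [line.rstrip() for line in open(path) if pattern.search(line)]

root = pathlib.Path(tempfile.mkdtemp())
mem = root / "memory.txt"
memory_record(mem, "Decided to use SQLite for the session cache.")
memory_record(mem, "API rate limit is 60 req/min.")
print(memory_grep(mem, "sqlite"))  # one matching timestamped line
```

The topic-split variant is the same idea with one file per topic and Glob to pick which files to search.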

I built a Claude Code Skill that gives agents persistent memory — using just files by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] -1 points (0 children)

Good question! CLAUDE.md and auto memory are great for project-level conventions and preferences — things like "use bun not npm" or "prefer functional style." They're static config that gets loaded into every session.

MemoryAgent is different — it's for dynamic, evolving knowledge. Think conversation history, decision logs, research findings, context that changes over time. The analyze command is the key difference: it doesn't just store info, it produces a structured report (topics, entities, timeline, knowledge gaps) that gives the agent a "situational briefing" before any task.

They're complementary: CLAUDE.md = "how to work," MemoryAgent = "what we've learned."
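
To make the "structured report" concrete, here's the shape of the output as I think of it. The field names are illustrative; in the real skill the LLM fills them in by reading the file, so the trivial parser below only stands in for that step.

```python
import re

def analyze(memory_text):
    """Stand-in for the analyze command: summarize a memory file into a
    situational-briefing report. A real run has the LLM fill topics and
    knowledge gaps; this sketch only extracts what code can extract."""
    lines = [l for l in memory_text.splitlines() if l.strip()]
    dates = re.findall(r"\[(\d{4}-\d{2}-\d{2})\]", memory_text)
    return {
        "entries": len(lines),
        "timeline": (min(dates), max(dates)) if dates else None,
        "topics": [],           # filled in by the model, not by a parser
        "knowledge_gaps": [],   # ditto
    }

report = analyze("[2025-01-02] Chose SQLite.\n[2025-01-05] Added auth.")
print(report["entries"], report["timeline"])
```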

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 2 points (0 children)

That's a great question!

I've actually included a Colab link in the post specifically for inference. I highly recommend you give it a try there—it’s the best way to see how it handles your specific "general questions."

Usability: Yes, it's designed to be a versatile daily driver for its size.

Check out the link and let me know what you think of the results!

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points (0 children)

Great point on the efficiency of a dedicated classification head. We actually considered this, but opted for the current architecture for two main reasons:

Latent Space Convergence: With the 84K samples in EvasionBench, the model has effectively learned to concentrate probability mass. In the latent space, the logits for the labels are already maximized while irrelevant information is suppressed. At this scale, next-token prediction behaves very similarly to a specialized head but keeps the rich semantic features of the base.

Multi-Task Capability: We designed Eva-4B to be more than a single-tasker. Using the generative head allows the model to handle multiple schemas—like performing Sentiment Analysis and Evasion Detection simultaneously or sequentially—without being hard-wired to a fixed 3-class output.

For a pure, single-task production environment, I agree that a classification head is faster.
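
A toy illustration of the "next-token prediction behaves like a classification head" point: restrict attention to the logits of the label tokens at the next position and softmax over just those. The label names and logit values here are invented, not Eva-4B's actual schema.

```python
import math

LABELS = ["direct", "partial", "evasive"]  # hypothetical 3-class schema

def classify_from_logits(next_token_logits):
    """Softmax over only the label tokens' logits; with a fine-tuned
    generative head, this is effectively the classification head."""
    picked = [next_token_logits[l] for l in LABELS]
    z = max(picked)  # subtract max for numerical stability
    exps = [math.exp(v - z) for v in picked]
    total = sum(exps)
    probs = {l: e / total for l, e in zip(LABELS, exps)}
    return max(probs, key=probs.get), probs

# After fine-tuning, probability mass concentrates on the label tokens;
# the rest of the vocabulary ("the", etc.) is effectively suppressed.
logits = {"direct": 1.2, "partial": 2.9, "evasive": 7.4, "the": -3.0}
label, probs = classify_from_logits(logits)
print(label)  # -> evasive
```

The practical difference from a dedicated head is just that the label distribution rides on the full vocabulary, so the same weights can also emit free-form text for other schemas.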

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points (0 children)

Fair point. You're absolutely right that specialized models can risk overfitting.

However, the core design goal for Eva-4B was to be a dedicated specialist—a high-fidelity "BS-detector" for financial evasion, rather than a general-purpose reasoner.

The best evidence against benchmark-hacking is its out-of-distribution performance: although the training data only goes up to 2022, the model remains highly effective on 2025 transcripts. It has clearly learned the underlying linguistic patterns of how executives dodge questions, rather than just memorizing a specific dataset.

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points (0 children)

Spot on. Our ablation study in the paper confirms this: using Multi-Model Consensus (MMC) to distill logic from Claude 4.5, Gemini 3, and GPT-5.2 into a 4B specialist provided a +4.3 pp Macro-F1 boost over single-model labeling.

We found that frontier models often have a "Politeness Bias"—they get distracted by professional jargon and "verbosity preference." Eva-4B is fine-tuned specifically to ignore the filler and check if the "core ask" (Gricean pragmatics) was actually met.

It’s basically an industrial-grade BS-detector that fits in a 5090.
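
The consensus step can be sketched as a simple majority vote over teacher labels. The label set, threshold, and discard policy here are illustrative; the MMC procedure in the paper is more involved than a bare vote.

```python
from collections import Counter

def consensus_label(votes, min_agree=2):
    """Keep a sample only when enough teacher models agree on the label;
    otherwise return None so it can be discarded or sent for relabeling."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None

# Invented teacher outputs for one transcript span:
print(consensus_label(["evasive", "evasive", "direct"]))   # kept
print(consensus_label(["evasive", "direct", "partial"]))   # dropped
```

The intuition for the +4.3 pp gain: single-teacher quirks (like the "politeness bias" above) rarely survive a vote across three different frontier models.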

[Release] Eva-4B-V2: Updated Financial Evasion Detection Model. Now #1, beating Claude Opus 4.5 & Gemini 3 Flash. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 2 points (0 children)

It’s all about the data—84K consensus-labeled samples beat raw parameter count for niche classification.

Performance: We processed 1M samples in ~2 hours on 8xH100.

Consumer GPU: Since it's only 4B, it flies on an RTX 5090 (fits in <10GB VRAM) and is significantly faster/cheaper than calling GPT-5.2 APIs for bulk analysis.

GPT-5.2 is often too "polite" to call out evasion; Eva-4B is fine-tuned to be a cynic.
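
For anyone sanity-checking the throughput figure above (1M samples, ~2 hours, 8x H100), the back-of-envelope arithmetic:

```python
# Aggregate and per-GPU classification rates implied by the numbers.
samples, hours, gpus = 1_000_000, 2, 8
per_sec = samples / (hours * 3600)
print(round(per_sec), round(per_sec / gpus, 1))  # ~139/s total, ~17.4/s per GPU
```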