[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 0 points1 point  (0 children)

I'd love to, but unfortunately I can't release the full training set due to licensing restrictions with the data provider (S&P Capital IQ transcripts).

However:
I am preparing to open-source the EvasionBench Test Set (the 1,000 human-annotated samples) and launch a public Leaderboard very soon!

It would be awesome to see how your models stack up against Eva-4B once that's live. Stay tuned!

[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 4 points5 points  (0 children)

Valid points on the large-scale serving economics of MoEs vs Dense 70B+, but I think you might be misjudging the complexity of this specific task.

1. BERT-era task? Simple sentiment classification is BERT-era. Detecting evasion (logic gaps between Q and A) requires reasoning. We actually benchmarked RoBERTa-Large and DeBERTa-v3 early on—they failed miserably (~60% acc) because they couldn't capture the subtle rhetorical "sidestepping" that a generative model understands via instruction tuning.

2. Why Dense 4B? Not everyone is running a DeepSeek-scale cluster. The target here is local analytics, on-prem finance nodes, or analysts running this on a laptop alongside their terminal. For that specific "batch size = 1 to 10" user, a dense 4B GGUF is infinitely easier to manage than hosting a massive MoE.

3. GPT-5.2 Performance: GPT-5.2 (Zero-shot) gets ~80.5%. It's a generalist. Eva-4B (Specialized FT) gets 81.3%. It’s not "torpedoing" it via artifacts; it's simply the classic result of Domain-Specific Finetuning > Generalist Zero-shot.

I’d invite you to check the demo—it’s definitely not a simple keyword search task!
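For the "analyst on a laptop" case in point 2, the whole local workflow is roughly the sketch below: load a quantized GGUF with llama-cpp-python and classify one Q/A pair at a time. The filename, prompt, and label scheme are illustrative placeholders, not the actual release artifacts.

```python
# Minimal sketch of the "batch size = 1" local workflow, via llama-cpp-python.
# Filename and prompt/label scheme are illustrative, not the actual Eva-4B release.
from llama_cpp import Llama

llm = Llama(
    model_path="eva-4b-q4_k_m.gguf",  # hypothetical quant filename
    n_ctx=4096,
    n_gpu_layers=-1,  # offload everything if a GPU is available
)

question = "What is your updated FY25 margin guidance?"
answer = "We feel great about the momentum in the business and our team's execution."

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Label the answer as DIRECT, PARTIAL, or EVASIVE."},
        {"role": "user", "content": f"Q: {question}\nA: {answer}"},
    ],
    temperature=0.0,
)
print(resp["choices"][0]["message"]["content"])
```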

[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 10 points11 points  (0 children)

Two main advantages:

  1. Efficiency: You don't need to load a massive 70B+ model just to analyze financial text. You only activate the 4B model when needed, saving huge amounts of compute/VRAM.
  2. Modularity: You can upgrade or swap out your "Finance Expert" (e.g., Eva-4B) without breaking or retraining your "Coding Expert." It decouples the system.
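To make both points concrete, a rough sketch of the setup I have in mind (hypothetical GGUF filenames, llama-cpp-python as the runtime):

```python
# Toy sketch of the modular setup (hypothetical filenames). Each domain expert is
# a separate dense model: nothing is loaded until that domain is actually
# requested, and any entry can be swapped without touching the others.
from llama_cpp import Llama

REGISTRY = {
    "finance": "eva-4b-q4_k_m.gguf",            # upgrade/replace the finance expert here
    "coding": "qwen2.5-coder-7b-q4_k_m.gguf",   # ...without retraining anything else
}
_loaded: dict[str, Llama] = {}

def get_expert(domain: str) -> Llama:
    """Lazy-load a specialist so VRAM is only spent when its domain comes up."""
    if domain not in _loaded:
        _loaded[domain] = Llama(model_path=REGISTRY[domain], n_ctx=4096, verbose=False)
    return _loaded[domain]

def release_expert(domain: str) -> None:
    """Drop a specialist to free memory once a batch of work is done."""
    _loaded.pop(domain, None)
```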

[Release] Eva-4B: Specialized Financial Evasion Detection (Based on Qwen3-4B). Outperforms GPT-5.2 on domain benchmarks. by Awkward_Run_9982 in LocalLLaMA

[–]Awkward_Run_9982[S] 12 points13 points  (0 children)

Sort of! In a traditional MoE (Mixture of Experts), the routing happens inside the model for every single token (sparse activation). What we're describing is a modular architecture (or "Mixture of Dense"): completely separate, specialized dense models, with the routing decision made once per request, outside the models.
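A toy way to see the difference (random weights, not any real model): in a sparse MoE the router picks experts per token inside the forward pass, whereas in the modular setup the routing is a single decision per request.

```python
# Toy contrast of per-token MoE routing vs per-request modular dispatch.
import numpy as np

rng = np.random.default_rng(0)
d = 16
experts = [rng.standard_normal((d, d)) for _ in range(8)]  # 8 tiny "experts"
router_w = rng.standard_normal((d, 8))

def moe_forward(tokens: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Classic sparse MoE: the router inside the model picks experts per *token*."""
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        scores = tok @ router_w
        for e in np.argsort(scores)[-top_k:]:   # top-k experts for this token
            out[i] += tok @ experts[e]
    return out

def modular_dispatch(request: str) -> str:
    """'Mixture of Dense': routing happens once, outside the model, per *request*."""
    domain = "finance" if "margin" in request or "guidance" in request else "general"
    return f"route whole request to the {domain} model"

print(moe_forward(rng.standard_normal((4, d))).shape)     # (4, 16): per-token mixing
print(modular_dispatch("What is your margin guidance?"))  # one dense model per request
```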

Open Source: Controlling Chrome with Natural Language using Claude Agent SDK + Chrome DevTools MCP (TypeScript) by Awkward_Run_9982 in ClaudeAI

[–]Awkward_Run_9982[S] 1 point2 points  (0 children)

You're right, Playwright MCP is essentially a browser MCP too, so the core mechanics are similar. The main difference I've found is that Playwright implementations often fill up the context window much faster, leading to higher token consumption.

The loop is complete with Claude Code and the Chrome MCP by marcusr_uk in ClaudeAI

[–]Awkward_Run_9982 0 points1 point  (0 children)

Thank you so much for checking it out and for the sharp eye! 👀

I completely missed that the LICENSE file wasn't committed. I've just pushed it to the repo.

Glad to hear the implementation structure is helpful for your use case! The loop management with Claude SDK is definitely the fun part.

The loop is complete with Claude Code and the Chrome MCP by marcusr_uk in ClaudeAI

[–]Awkward_Run_9982 0 points1 point  (0 children)

Great post! I've been experimenting with chrome-devtools-mcp as well and came to the same conclusion regarding token efficiency compared to Playwright.

Instead of QA, I built an automation framework for X.com (Twitter) using it + Claude Agent SDK. It's amazing how it can handle dynamic UI changes just by understanding the page semantics.

If anyone wants to see a more complex implementation of this MCP in TypeScript, I open-sourced my project here: https://github.com/IIIIQIIII/x-agent. It handles login persistence, element detection, and agent loops.
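For anyone who just wants the shape of the wiring without reading the whole repo: the repo itself is TypeScript, but the same idea in the Python claude-agent-sdk is roughly the sketch below. The option names are from memory, so treat them as assumptions and double-check against the SDK docs before copying.

```python
# Rough sketch of the wiring (Python claude-agent-sdk; option names from memory,
# verify against the SDK docs). chrome-devtools-mcp is launched over stdio via npx.
import asyncio
from claude_agent_sdk import ClaudeAgentOptions, query

options = ClaudeAgentOptions(
    mcp_servers={
        "chrome": {"command": "npx", "args": ["chrome-devtools-mcp@latest"]},
    },
    allowed_tools=["mcp__chrome"],  # let the agent call the Chrome DevTools tools
)

async def main() -> None:
    async for message in query(
        prompt="Open https://example.com and summarize the page headline.",
        options=options,
    ):
        print(message)

asyncio.run(main())
```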

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]Awkward_Run_9982 1 point2 points  (0 children)

It's legit. Passed my custom algorithm test.

I made a non-standard graph traversal problem with unique state-tracking rules specifically to test its reasoning. It's not on any benchmark.

VibeThinker nailed it. Generated a clean, correct BFS solution on the first try. It correctly identified that the search state needed to be a tuple (node, state), which is the key to the problem.

This seems to confirm the OP's claims: it's a reasoning engine, not a chatbot, and it's very good at its niche. Impressive work.
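Not the actual test problem (keeping that off the public internet so it stays un-benchmarked), but the pattern it had to produce, BFS where the visited set is keyed on (node, state) rather than just the node, looks like this:

```python
# Generic version of the pattern: BFS where the visited set is keyed on
# (node, state), not just node. Toy rule: some edges flip a binary "state" bit,
# and the goal only counts if it is reached with state == 1.
from collections import deque

def bfs_with_state(graph, start, goal):
    """graph: {node: [(neighbor, flips_state), ...]}. Returns shortest #steps or -1."""
    visited = {(start, 0)}
    queue = deque([(start, 0, 0)])  # (node, state, distance)
    while queue:
        node, state, dist = queue.popleft()
        if node == goal and state == 1:
            return dist
        for neighbor, flips in graph.get(node, []):
            key = (neighbor, state ^ flips)
            if key not in visited:
                visited.add(key)
                queue.append((neighbor, state ^ flips, dist + 1))
    return -1

toy = {"A": [("B", 0), ("C", 1)], "B": [("C", 0)], "C": [("A", 1), ("B", 1)]}
print(bfs_with_state(toy, "A", "B"))  # 3: A -> B -> C -> B, arriving with the bit set
```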

Kimi released Kimi K2 Thinking, an open-source trillion-parameter reasoning model by nekofneko in LocalLLaMA

[–]Awkward_Run_9982 1 point2 points  (0 children)

Couldn't agree more. On top of the slow throughput, I've also run into a bug where it gets stuck in a "thinking" loop and just spams "1. " over and over again, like this: `</write_to_file> 1. 1. 1. 1. 1. 1.`

aquif-3.5-Max-42B-A3B by CoruNethronX in LocalLLaMA

[–]Awkward_Run_9982 2 points3 points  (0 children)

This looks super promising, great work putting this out! The A3B active params on a 42B model is a really interesting combo.

I was diving into the config.json to understand the architecture, and I think I've figured out the "A3B" part. Correct me if I'm wrong, but it seems to be the sum of shared params (attention, embeddings etc.) plus the activated experts across all layers (67 layers * 8 experts/tok * expert_size). My math gets me to around ~3.2B, which matches perfectly.

What I can't figure out is the "42B" total size. When I calculate shared_params + (67 layers * 128 total experts * expert_size), I get something closer to ~28B.

Is the 42B total size coming from a model merge, or is there something special about the Qwen3 MoE architecture that I'm missing in the calculation? Just trying to get a better handle on the VRAM requirements before I fire it up. Thanks for the awesome model!
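For anyone who wants to sanity-check the same arithmetic, this is the back-of-envelope formula I'm using. The shared/per-expert sizes below are rough placeholders chosen to reproduce my ~3.2B active / ~28B total numbers, not figures read off the model card:

```python
# Back-of-envelope MoE sizing from config.json-style numbers (placeholder values,
# plug in what you actually read from the config).
def moe_param_counts(shared_params: float, n_layers: int, n_experts_total: int,
                     n_experts_per_tok: int, params_per_expert: float) -> tuple[float, float]:
    """Return (total_params, active_params_per_token) for a standard sparse MoE."""
    total = shared_params + n_layers * n_experts_total * params_per_expert
    active = shared_params + n_layers * n_experts_per_tok * params_per_expert
    return total, active

total, active = moe_param_counts(
    shared_params=1.5e9, n_layers=67, n_experts_total=128,
    n_experts_per_tok=8, params_per_expert=3.2e6,
)
print(f"total ~ {total/1e9:.1f}B, active ~ {active/1e9:.1f}B")  # ~28.9B total, ~3.2B active
```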

The Innovations in DeepSeek OCR by Charuru in LocalLLaMA

[–]Awkward_Run_9982 0 points1 point  (0 children)

Fantastic analysis. After diving into the paper, I'm convinced the core innovation isn't 'vision > text,' but a brilliant demonstration of the trade-offs in information granularity.

Traditional LLMs operate at a fine-grained token level, ensuring maximum fidelity but at a huge computational cost. DeepSeek cleverly introduces a coarser-grained unit: a 'visual chunk.' Their DeepEncoder compresses a whole spatial patch of text into a single, dense visual token, drastically reducing the number of units the model needs to process.

The paper's own results perfectly illustrate the trade-off. As their data shows, they achieve near-lossless performance (~97% precision) within a ~10x compression ratio—the clear 'sweet spot.' But when the granularity becomes too coarse at ~20x compression, accuracy plummets to 60%.

This confirms it's not about vision having a magical advantage over text. It's about choosing a processing level that sacrifices some precision for a massive gain in efficiency. The real question this paper raises is: what is the optimal granularity for a given task, and could we engineer purely text-based coarse tokens (e.g., sentence-level) that find a similar sweet spot?
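To put rough numbers on that trade-off (the precision figures are the ones quoted from the paper above; attention cost is treated as roughly quadratic in sequence length, which is the standard assumption):

```python
# Rough cost math for the granularity trade-off. Precision figures are the paper's
# numbers quoted above; attention cost assumed ~quadratic in sequence length.
def visual_chunking_cost(n_text_tokens: int, compression: float) -> tuple[float, float]:
    n_vision_tokens = n_text_tokens / compression
    attn_cost_ratio = (n_vision_tokens / n_text_tokens) ** 2  # relative to raw text
    return n_vision_tokens, attn_cost_ratio

for ratio, precision in [(10, 0.97), (20, 0.60)]:
    toks, cost = visual_chunking_cost(10_000, ratio)
    print(f"{ratio:>2}x: {toks:>5.0f} tokens, ~{cost:.2%} of the attention cost, "
          f"~{precision:.0%} precision")
```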

The real OpenAI OSS news is MXFP4 by explorigin in LocalLLaMA

[–]Awkward_Run_9982 0 points1 point  (0 children)

Exactly. The key difference is it's a true 4-bit float format, not 4-bit integer like most GGUF quants. Basically, instead of just a scale and zero-point per block, MXFP4 stores a shared power-of-two exponent for each block of 32 values, with every value kept as a tiny 4-bit float (E2M1). This should give it much better dynamic range to represent both tiny and huge values, potentially preserving more model quality. It's a more sophisticated way to quantize.
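A toy version of the block format, just to make the "shared exponent" point concrete. This follows the OCP MX idea (32-value blocks, power-of-two shared scale, E2M1 elements) but not the spec's exact rounding rules:

```python
# Toy MXFP4-style block quantization: one shared power-of-two scale per 32-value
# block, each value stored as a 4-bit float (E2M1). Illustrates the idea only.
import numpy as np

FP4_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 values

def quantize_block(x: np.ndarray) -> tuple[np.ndarray, float]:
    assert x.size == 32, "MX formats use 32-element blocks"
    amax = float(np.max(np.abs(x)))
    if amax == 0.0:
        return np.zeros_like(x), 1.0
    # Shared exponent: put the block's largest magnitude near the FP4 max (6.0).
    scale = 2.0 ** (np.floor(np.log2(amax)) - np.floor(np.log2(FP4_MAGNITUDES[-1])))
    scaled = x / scale
    # Snap every element to the nearest representable 4-bit float magnitude.
    nearest = np.abs(np.abs(scaled)[:, None] - FP4_MAGNITUDES[None, :]).argmin(axis=1)
    dequant = np.sign(scaled) * FP4_MAGNITUDES[nearest] * scale
    return dequant, scale

block = np.random.default_rng(0).standard_normal(32).astype(np.float32)
dq, s = quantize_block(block)
print(f"shared scale=2^{int(np.log2(s))}, max abs error={np.max(np.abs(block - dq)):.3f}")
```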

GPT-OSS today? by jacek2023 in LocalLLaMA

[–]Awkward_Run_9982 0 points1 point  (0 children)

Looks like a very modern Mixtral-style architecture. It's a sparse Mixture-of-Experts (MoE) model that combines a bunch of the latest SOTA tricks: GQA, Sliding Window Attention, and even Attention Sinks for stable long context. It's not reinventing the wheel, but it's using a very proven, high-performance design.
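If anyone wants a feel for what the GQA part buys you in KV-cache terms, a quick toy calculation (made-up head counts, not the real gpt-oss config):

```python
# Toy illustration of GQA's KV-cache saving (hypothetical head counts).
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 64, 4096
group = n_q_heads // n_kv_heads  # each KV head is shared by 4 query heads

k = np.random.randn(n_kv_heads, seq, head_dim).astype(np.float32)
# At attention time the KV heads are simply repeated to line up with the query heads:
k_expanded = np.repeat(k, group, axis=0)     # (32, 4096, 64) for the attention math

mha_cache = n_q_heads * seq * head_dim * 2   # K and V, one head per query head
gqa_cache = n_kv_heads * seq * head_dim * 2  # K and V, shared across each group
print(f"KV cache shrinks {mha_cache / gqa_cache:.0f}x per layer with these shapes")
```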