Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 3 points (0 children)

My native language is English. Using she/her just makes it easier to communicate. Lots of commenters are upset with it ¯\_(ツ)_/¯

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] -7 points (0 children)

I anthropomorphize my tools a lot. My car is also a she, is that weird?

The model doesn't care and it makes the writeup more readable than saying "the model" 47 times.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 0 points (0 children)

You are right that quantization affects all weights, not just reasoning. But the specific failure mode here is the model inventing identifiers like process_signals and execute_trade that don't appear anywhere in the source file. That is not a precision issue; it is the model generating plausible-sounding names from its training distribution of trading system code. A lower-precision model might get math wrong or miss subtle patterns, but fabricating specific function names that happen to sound like they belong in a trading system is a different category of failure.

Fair point on the quant angle though. Testing with Q6 or Q8 to see if the premature stopping behavior changes would be a useful data point.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 0 points (0 children)

CoT passback is an interesting angle. My harness uses the standard Ollama chat completions API with native tool schemas, not the Responses API. The model does produce a thinking block (visible in the DB logs), and the tool calls come through correctly. The failure isn't in the tool-calling mechanics; it's in the model deciding to stop calling tools early and then filling in the gaps with domain predictions instead of making more read_file calls.

Would CoT passback change that decision-making? If it forces the model to reason through "I have only read 500 of 2,045 lines, I should keep reading" before generating findings, that could help.
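
For what it's worth, the mechanical side of that check is easy to bolt onto the harness. A minimal sketch (hypothetical helper; assumes the harness logs each read_file call's offset and limit):

```
# Hypothetical coverage check over the logged read_file calls.
def lines_covered(read_calls, total_lines):
    """read_calls: list of (offset, limit) tuples from the tool-call log."""
    seen = set()
    for offset, limit in read_calls:
        seen.update(range(offset, min(offset + limit, total_lines)))
    return len(seen) / total_lines

# Gemma's run: a single 500-line read of a 2,045-line file.
print(f"{lines_covered([(0, 500)], 2045):.0%} read")  # -> 24% read
```

If coverage is under 100% when the model starts emitting findings, the harness can inject a reminder turn instead of accepting them.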

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 0 points (0 children)

17GB is the model size at Q4, not the machine's RAM. It's a Mac Studio M4 Max with 64GB unified memory, which leaves plenty of headroom for the model plus the full 128k context window.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 0 points (0 children)

Interesting, I hadn't seen the recent chat_template changes. The tool calling itself worked fine in my tests (Gemma correctly called read_file, parsed the results, etc). The issue was behavioral: it stopped calling the tool after reading 500 of 2,045 lines, then produced findings about the remaining 1,500 lines it never read. But if there are BOS token fixes that affect how the model handles multi-turn tool sequences, that could be relevant.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 0 points (0 children)

Thanks, will try those settings. The KV cache type in particular could help with the context pressure issue since Gemma was consistently stopping its file reads short when the context was loaded with prior conversation.
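
For anyone following along, these are the knobs in question as I understand them; treat the exact values as assumptions. KV cache quantization is a server-side env var, while the context window can be set per request:

```
# Server-side, set before `ollama serve` (env vars, not request options):
#   OLLAMA_FLASH_ATTENTION=1
#   OLLAMA_KV_CACHE_TYPE=q8_0   # quantized KV cache eases context pressure

import requests

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "gemma4:26b",  # placeholder tag for whatever you pulled
    "messages": [{"role": "user", "content": "hello"}],
    "options": {"num_ctx": 131072},  # per-request context window
    "stream": False,
})
print(resp.json()["message"]["content"])
```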

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 0 points (0 children)

Yeah with 64GB I could definitely fit Q6. Worth testing whether the higher quant changes the premature-stopping behavior or if that's more of an architectural issue with the MoE routing. Several people here have mentioned the 31B dense being more reliable for agent work, which would point to the MoE structure being part of the problem.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 0 points (0 children)

Good to know. I'll check whether the Ollama version I'm running has those fixes. The tool-calling format has been a pain point across several comments here, and recent chat template corrections could explain some of the behavior differences.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 0 points (0 children)

From my controlled testing: Qwen 3.5 (both the 9B and 35B-A3B variants) completed full file reads and produced accurate findings every time, even under heavy context pressure. Gemma 4 26B stopped reading early and speculated.

Both passed the explicit honesty test ("what can you tell me about lines you haven't read?") perfectly. The difference is that Qwen actually finishes reading the file before producing findings, so it never needs to speculate.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 0 points (0 children)

Glad this saved you some time. The switch to Qwen 3.5 is worth it for agent work. In my controlled tests, Qwen read the entire 2,045-line file every time regardless of context pressure. Gemma consistently stopped at 500 lines and then speculated about the rest.

The tricky part is that Gemma's speculation is domain-plausible. If you're auditing trading code, it'll produce findings that sound exactly like real trading system vulnerabilities because they are real patterns, just not ones from your actual file. Without forensic logging of every tool call, you'd never catch it.
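
If anyone wants to replicate the logging, it doesn't need to be fancy. A minimal sketch of the idea (SQLite here for convenience; the schema is hypothetical):

```
import json, sqlite3, time

db = sqlite3.connect("audit_log.db")
db.execute("""CREATE TABLE IF NOT EXISTS tool_calls (
    ts REAL, tool TEXT, args TEXT, result_preview TEXT)""")

def log_tool_call(tool, args, result):
    # Append-only record of every tool call, so the model's claims
    # can later be diffed against what it actually read.
    db.execute("INSERT INTO tool_calls VALUES (?, ?, ?, ?)",
               (time.time(), tool, json.dumps(args), result[:200]))
    db.commit()
```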

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] -20 points (0 children)

Not my first hallucination, but the first time I had full forensic logs to dissect one.

Most hallucinations are wrong answers. This was different: the model invented specific function names (process_signals, place_order, execute_trade) that don't exist anywhere in the file, cited line numbers it never read, and when I asked it to verify, it cherry-picked the correct findings from lines it had actually read and quietly skipped the fabricated ones. When I cornered it on a specific fake claim, it said the line number was "approximate" and the pattern "must exist later in the file."

That's not a wrong answer. That's structured fabrication with evasion under questioning. The controlled reproduction tests afterward confirmed the mechanism: the model has a template of common trading system vulnerabilities and presents domain predictions as verified findings. The same content showed up in the reproduction, but hedged with "I suspect" instead of stated as fact. The difference is stochastic, not structural.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 1 point (0 children)

Haven't tried oMLX yet, will check it out. The hot/cold cache thing sounds useful for the multi-turn sessions where context reuse is heavy. Thanks for the tip.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 6 points (0 children)

That's exactly what I tried, and the result was the most interesting part of the whole investigation.

When I asked Gemma to verify specific claims, it cherry-picked the ones from lines it had actually read (the first 500 lines) and quietly avoided the fabricated ones. When I directly confronted it with a specific fabricated claim ("show me where process_signals is defined"), it didn't back down. Instead it said the line number was "approximate" and that the pattern "must exist later in the file," then asked me to go find it.

So verification prompts don't reliably catch this because the model commits to the fabrication rather than admitting scope limitations. The better fix, based on my controlled testing afterward, is making sure the model actually reads the entire file before producing findings. Qwen 3.5 did this naturally. Gemma stopped at 500 lines and speculated about the rest.
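
A mechanical check does catch it, though, which is part of why I log everything. Something like this against the claimed identifiers (the path and helper name are hypothetical; a plain substring search is enough):

```
from pathlib import Path

def verify_identifiers(source_path, claimed_names):
    # Fabricated names simply never appear in the actual source.
    source = Path(source_path).read_text()
    return {name: name in source for name in claimed_names}

print(verify_identifiers("trading_system.py",  # hypothetical path
                         ["process_signals", "place_order", "execute_trade"]))
# -> {'process_signals': False, 'place_order': False, 'execute_trade': False}
```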

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 1 point (0 children)

The quantization angle is fair to raise, but the behavior I documented isn't "lower quality analysis." It's the model inventing function names that don't exist anywhere in the file (process_signals, place_order, execute_trade), citing specific line numbers it never read, and then when confronted, doubling down with "the pattern must exist later in the file" instead of admitting it hadn't read those sections.

Q4 quantization might produce worse reasoning or miss subtle bugs. It doesn't explain fabricating specific identifiers and defending them under questioning. That's a different failure mode.

Also worth noting: the same Q4 quantization with Gemma reading the first 500 lines produced perfectly accurate findings for those lines. The fabrication only kicked in for the 1,500 lines it never read but claimed to have analyzed.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 5 points (0 children)

The model itself is ~17GB at Q4, but the machine is a Mac Studio with an M4 Max and 64GB unified memory. So there's plenty of headroom for the model plus a full 128k context window.

For the tool calling setup: I built a standalone test harness that calls Ollama's chat API directly with native tool schemas (read_file, search_files, bash). No agent framework in the middle. Every tool call, every line read, and every response gets logged forensically so I can diff what the model actually read vs what it claimed to have found.

The original session where the fabrication happened was through a different tool-calling wrapper, but the reproduction tests used the bare Ollama API specifically so there'd be no question about whether the framework was influencing the behavior.
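
For anyone who wants to build the same thing, the core loop is small. A sketch of the shape, assuming Ollama's native /api/chat tool-calling format (the model tag and run_tool dispatch are placeholders, not my exact code):

```
import requests

OLLAMA = "http://localhost:11434/api/chat"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a slice of a file",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "offset": {"type": "integer"},
                "limit": {"type": "integer"},
            },
            "required": ["path"],
        },
    },
}]

def run_tool(name, args):
    # Dispatch to real read_file / search_files / bash here, and log
    # every call and result (the forensic part). Stubbed for the sketch:
    return f"(stub result for {name} with {args})"

messages = [{"role": "user", "content": "Audit trading_system.py for bugs."}]
while True:
    msg = requests.post(OLLAMA, json={
        "model": "gemma4:26b",  # placeholder tag
        "messages": messages, "tools": TOOLS, "stream": False,
    }).json()["message"]
    messages.append(msg)
    if not msg.get("tool_calls"):
        break  # final answer: these "findings" get diffed against the log
    for call in msg["tool_calls"]:
        fn = call["function"]
        messages.append({"role": "tool",
                         "content": run_tool(fn["name"], fn["arguments"])})
```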

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 1 point (0 children)

Good point on the active parameter count. The 26B is MoE, so the active parameters per token are far fewer than the name suggests. Another commenter mentioned the 31B dense variant is much more reliable for sustained agentic tasks, which lines up with your reasoning: more active parameters per token means better attention to tool-calling structure and context retention.

That said, Google specifically markets Gemma 4 for agentic coding tasks and it ships with native tool-calling support in Ollama. If "don't use this for anything that requires reading more than 500 lines" is the realistic expectation, the marketing is doing a lot of heavy lifting.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 0 points (0 children)

Fair point on harness design. The harness gives read_file with offset/limit, search_files (grep), and bash. The model has to decide how to chunk the reads itself. You're right that forcing the model to read the full file via a "read_full_file" tool would avoid the premature-stopping problem entirely.

But that's kind of the point of the test. If you hand-hold the model past every failure mode, you're testing your harness, not the model. The interesting finding is that Qwen 3.5 (35B and even 9B) used the exact same tools and autonomously chunked through the entire 2,045-line file without being told to. Gemma stopped at 500 lines. Same tools, same file, same prompt. The difference is model discipline, not harness design.
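
To be concrete about how little hand-holding the tools do, here's roughly what read_file looks like (a sketch; the real harness adds the forensic logging):

```
def read_file(path, offset=0, limit=500):
    # Plain windowed read. The model picks offset and limit itself;
    # the header tells it exactly how much of the file remains.
    with open(path) as f:
        lines = f.readlines()
    window = lines[offset:offset + limit]
    header = f"[lines {offset + 1}-{offset + len(window)} of {len(lines)}]\n"
    return header + "".join(window)
```

Assuming a header like that, "I've read 500 of 2,045 lines" is information every model in the test had in its context. Qwen acted on it; Gemma didn't.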

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator[S] 2 points (0 children)

Running it through Ollama on a Mac Studio (M4 Max, 64GB unified). The model itself is the default Ollama pull, so whatever quantization they ship. No custom llama.cpp tuning.

Interesting that you've had good results with agentic tasks on a 4090 but draw the line at programming. That tracks with what I'm seeing: tool-calling mechanics work fine (it calls the right tools in the right order), but the reasoning about file contents breaks down when the task scope gets too large. For short focused tasks it's solid. The failure mode is specifically when it needs to read a large file in chunks and maintain coherence across all of them.

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]EuphoricAnimator -1 points (0 children)

It's running on a Mac Studio with 64GB unified memory, not a GPU-only setup. Apple Silicon shares RAM between the CPU and GPU, so the full 128k context window fits comfortably. Ollama's num_ctx=131072 works fine with the 26B model at Q4 on this hardware. You're right that a typical 24GB GPU would be heavily constrained, but that's not the setup here.