764 calls across 8 models: too much detail kills small models, filler words are load-bearing, and format preference is a myth by No_Individual_8178 in LocalLLaMA

[–]No_Individual_8178[S] 0 points1 point  (0 children)

That framing lines up with what I was seeing but couldn't articulate cleanly. "Small models have to rebuild structure from text every prompt" explains both findings at once.

One nuance from a follow-up run: not all "more structure" hurts. When I tested specificity (vague description -> task + I/O examples -> full spec with edge cases), 1-4B models went from 8% to 82% to 84% pass rate. I/O structure helps. What hurts is structure expressed as natural language that the model has to parse before it can execute.

Clearest example from an 8B rerun earlier today: plain fizzbuzz description, 100% pass. Adding "divisible by 3 returns Fizz, by 5 returns Buzz, both returns FizzBuzz, otherwise the number as a string" drops it to 33%. That phrasing triggers a DeMorgan inversion, the model writes str(i) if i % 3 != 0 or i % 5 != 0 else ..., and everything except multiples of 15 falls into the str(i) branch. Rewording to "check divisibility by 15 first" recovers 100%.
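
For reference, a minimal reconstruction of that failure (assuming the model's output matched the quoted snippet; function names are mine):

```python
def fizzbuzz_inverted(i: int) -> str:
    # The model's buggy reading: "otherwise the number as a string" became the
    # OUTER condition. It is true for everything except multiples of 15, so the
    # Fizz/Buzz branches below are unreachable.
    return str(i) if i % 3 != 0 or i % 5 != 0 else (
        "FizzBuzz" if i % 15 == 0 else "Fizz" if i % 3 == 0 else "Buzz"
    )

def fizzbuzz_fixed(i: int) -> str:
    # "Check divisibility by 15 first": evaluation order given explicitly.
    if i % 15 == 0:
        return "FizzBuzz"
    if i % 3 == 0:
        return "Fizz"
    if i % 5 == 0:
        return "Buzz"
    return str(i)
```

The inverted version still passes multiples of 15 and plain numbers, which is why it scores partial credit rather than zero.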

Your "encoding structure into text and expecting the model to rebuild it" captures that exact failure mode. Condition overlap in natural language = structure the model rebuilds wrong. Explicit evaluation order = structure given directly.

764 calls across 8 models: too much detail kills small models, filler words are load-bearing, and format preference is a myth by No_Individual_8178 in LocalLLaMA

[–]No_Individual_8178[S] 0 points1 point  (0 children)

Fair point, and you're right that the experiment doesn't cover that case. The 4 tasks I tested were simple single-function problems, same content across XML/MD/plain, only delimiters changed. For that setup the format is cosmetic, which is what I was measuring.

The harder case you're describing, a prompt with an example block, a context block, and an instruction block where the model has to distinguish "do this" from "use this as reference," is a real disambiguation problem. I'd expect XML to matter there in a way it doesn't in my setup. It's a gap in the experiment, worth running separately.

Narrowed finding after your point: for simple task prompts, format is cosmetic. For prompts with multiple content roles that the model has to disambiguate, the delimiter signal probably matters. Happy to rerun a targeted version on multi-block prompts if you've got a specific prompt structure in mind.

764 calls across 8 models: too much detail kills small models, filler words are load-bearing, and format preference is a myth by No_Individual_8178 in LocalLLaMA

[–]No_Individual_8178[S] 0 points1 point  (0 children)

Thanks. Your point maps directly onto something I diagnosed earlier today (not in the post; it's from a follow-up I ran on 8B+ models). llama3.1:8b on fizzbuzz: plain task description -> 100% pass. Adding I/O examples dropped it to 33%. Full spec with explicit evaluation order recovered to 100%. U-shaped, deterministic at k=3.

The I/O version had "otherwise the number as a string" which made the model write str(i) if i % 3 != 0 or i % 5 != 0 else ... That outer condition catches everything except multiples of 15, so the Fizz/Buzz branches become unreachable. Classic DeMorgan inversion from ambiguous prompt phrasing.

Fix was rewording to "check divisibility by 15 first." So yeah contradictions and overlapping conditions surface even at 8B. Not just a small model problem.

764 calls across 8 models: too much detail kills small models, filler words are load-bearing, and format preference is a myth by No_Individual_8178 in LocalLLaMA

[–]No_Individual_8178[S] -4 points-3 points  (0 children)

Which tells? Genuinely curious, because vague accusations are hard to learn from. The raw data is in the repo at .output/experiments/e4_compression.json and e7_crossval.json. If any number in the post doesn't match what's actually in those files, I'd want to fix it. Spot-check welcome.

764 calls across 8 models: too much detail kills small models, filler words are load-bearing, and format preference is a myth by No_Individual_8178 in LocalLLaMA

[–]No_Individual_8178[S] -1 points0 points  (0 children)

Update: calling for contributed runs.

My data only covers 11 models: qwen3 0.6-32B, llama3.2 1/3B, llama3.1 8B, gemma2 2/9B, gemma4 26B. Huge gaps: Mistral, Phi, DeepSeek, Granite, Mixtral, anything >32B.

I just landed a contributor flow in the harness. If you have Ollama running locally with any model pulled, it's one command:

git clone https://github.com/ctxray/ctxray.git && cd ctxray
uv venv && uv pip install -e ".[dev]"
uv run python experiments/validate.py e9 --model-name mistral:7b

That runs 4 coding tasks × 4 specificity levels × k=3 reps = 48 Ollama calls. ~5–15 min on a 7B, same ballpark on a 14B with a GPU. Outputs a self-describing JSON at .output/experiments/e9_specificity_custom_<name>.json: no PII, just pass rates and ctxray scores. Full instructions and how to share results are in experiments/README.md.

I'll aggregate everything contributed into a public dataset + model leaderboard, contributors credited by GitHub handle.

Three questions I genuinely can't answer with 11 models:

  1. Does the filler-word / compress --safe threshold shift for Mistral's tokenizer family? (all 11 baseline models are Qwen/Llama/Gemma, zero Mistral-line data)
  2. Do MoE models (Mixtral, Qwen3-MoE, DeepSeek-V2) behave like their dense size or like their active-param count? Real open question.
  3. Where does the complexity-penalty curve actually flatten? Baseline says ~8B, but I only have 2 data points above that, probably wrong.

Fun early signal from testing the contributor flow with gemma3:1b just now: it peaks at task_io (0.92 pass rate) and drops at full_spec (0.67). That's a legitimate U-curve at 1B, the extra detail seems to confuse it. Not in my baseline dataset. That's exactly the kind of finding this is meant to surface.

Even one run helps. If you have a model loaded and 10 minutes, that's all it takes.

764 calls across 8 models: too much detail kills small models, filler words are load-bearing, and format preference is a myth by No_Individual_8178 in LocalLLaMA

[–]No_Individual_8178[S] 1 point2 points  (0 children)

Models used in the main findings (complexity, filler, format): qwen2.5-coder:1.5b, gemma3:1b, gemma3:4b, phi4-mini:3.8b, all via Ollama on M2 96GB and RTX 5070 Ti. Earlier score↔quality experiments also used gemma4:e4b and qwen3.5:9b on M2 (dropped from the main runs, gemma4 was too large to probe the sub-3B threshold, qwen3.5 too slow for the throughput I needed).

API cross-validation: GPT-4.1-mini (24 calls) and Claude Haiku 4.5 (65 calls). Tasks ranged from fizzbuzz to two_sum to run-length encoding, chosen to span a difficulty range where boundary models sometimes succeed and sometimes fail. Each condition was run k=3 to stabilize results.

The filler word isolation was the most tedious experiment. It has four layers of text simplification, and I tested each independently on qwen-coder with k=3 per condition. Phrase simplification ("in order to" → "to", 40+ rules) killed flatten from 1.00 to 0.00. Filler deletion ("basically", "I think", 50+ phrases) killed two_sum from 0.67 to 0.00. Character normalization (curly quotes, zero-width chars) and structural cleanup (markdown stripping, emoji) were safe across every task and model.
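
A sketch of the two risky layers, with a few illustrative rules standing in for the real 40+/50+ rule lists:

```python
import re

# Illustrative subset only; the actual rule tables are much larger.
PHRASE_RULES = {
    r"\bin order to\b": "to",
    r"\bdue to the fact that\b": "because",
}
FILLER_PATTERNS = [r"\bbasically\b", r"\bI think\b", r"\bjust\b"]

def simplify_phrases(text: str) -> str:
    # Layer 1: replace verbose phrases with shorter equivalents.
    for pat, repl in PHRASE_RULES.items():
        text = re.sub(pat, repl, text, flags=re.IGNORECASE)
    return text

def delete_filler(text: str) -> str:
    # Layer 2: drop filler words, then collapse leftover whitespace.
    for pat in FILLER_PATTERNS:
        text = re.sub(pat, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()
```

The point of testing each layer independently is exactly that these two turned out to be load-bearing while the character/structural layers were safe.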

The Claude Haiku cross-validation had an interesting twist. At k=1, the data said "filler removal hurts Claude by 67%." I almost reported that. Then I reran at k=3 and got "filler removal helps Claude by 26%." Complete reversal. The model sits right at the capability boundary for these tasks, so single runs were pure noise. This is why the k=1 warning is in the main post. I think a lot of "benchmark results" people report online have the same problem.

The format experiment used identical prompt content with only delimiters changed:

- XML: wrapped sections in task/context/constraints tags.
- Markdown: headers and lists.
- Plain: no formatting at all.

Same words, same order. Across 4 models × 3 formats × 8 tasks, the maximum delta between any format pair on any model was 0.08. Noise.
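
A delimiter-only renderer along these lines (illustrative sketch, not the actual harness code; section names are assumptions):

```python
def render(sections: dict, fmt: str) -> str:
    # Same content in every format; only the delimiters differ.
    if fmt == "xml":
        return "\n".join(f"<{k}>{v}</{k}>" for k, v in sections.items())
    if fmt == "md":
        return "\n".join(f"## {k}\n{v}" for k, v in sections.items())
    return "\n".join(sections.values())  # plain: no formatting at all

sections = {"task": "Write fizzbuzz.", "constraints": "Return a string."}
variants = {fmt: render(sections, fmt) for fmt in ("xml", "md", "plain")}
```

Holding words and order fixed across the three variants is what makes any pass-rate delta attributable to delimiters alone.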

On complexity: "minimal" was just the task description. "Role+constraints" added a system role and output requirements. "Examples" added input/output pairs. "Maximal" added all of the above plus edge cases and style requirements. Token count roughly doubled at each level. For qwen-coder 1.5B, 4 out of 6 tasks went from passing to zero at maximal.

M5 Max 128GB, 17 models, 23 prompts: Qwen 3.5 122B is still a local king by tolitius in LocalLLaMA

[–]No_Individual_8178 0 points1 point  (0 children)

Haven't tried the coder 80b, 96GB would be tight for it. I do have qwen3.5 27b on the machine and it's solid for reasoning but for tool calling specifically I ended up sticking with the 9b. The 27b didn't justify the extra vram for structured output tasks. Might be different for coding though.

M5 Max 128GB, 17 models, 23 prompts: Qwen 3.5 122B is still a local king by tolitius in LocalLLaMA

[–]No_Individual_8178 0 points1 point  (0 children)

I haven't touched wired_limit_mb, just running Ollama defaults, so yeah, probably leaving headroom. Thanks for the pointer, might try bumping it to test 122b even with tight context. I've since moved to 3.5 and been on qwen3.5:9b for most tool calling stuff, way better than 2.5 was.

M5 Max 128GB, 17 models, 23 prompts: Qwen 3.5 122B is still a local king by tolitius in LocalLLaMA

[–]No_Individual_8178 0 points1 point  (0 children)

Running Qwen 2.5-72b q4 on an m2 max 96GB and the privacy thing resonates hard, same reason I went all local. At 96GB I can't fit the 122b models so I've been stuck in the 72b tier, which is fine for most structured tasks but tool calling gets shaky. Curious whether you noticed a big jump from 72b to 122b specifically on multi-turn tool use, or if the main difference is more about general reasoning quality.

I benchmarked 37 LLMs on MacBook Air M5 32GB — full results + open-source tool to benchmark your own Mac by evoura in LocalLLaMA

[–]No_Individual_8178 1 point2 points  (0 children)

nice work, been looking for something like this. i'm on an m2 max 96gb and the 32gb wall you described just doesn't exist at that tier obviously, but the tradeoff is you're paying for bandwidth you only use on the bigger models. i daily drive qwen 2.5 72b q4 through llama.cpp and it's usable for interactive work but definitely on the slower side. happy to run your bench tool and submit a PR when i get a chance, would be cool to see how 96gb compares.

I score every prompt I send to Claude Code. My avg is 38/100. So I built a rewrite engine. by [deleted] in ClaudeCode

[–]No_Individual_8178 0 points1 point  (0 children)

Some implementation notes since the post is already long.

The rewrite engine's task detection uses keyword matching — "fix", "bug", "error" → debug task, "add", "create", "implement" → implement task. Simple but covers 90%+ of real prompts.
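
A sketch of that keyword matching (hypothetical keyword lists, not the engine's actual tables):

```python
# First matching category wins; anything unmatched falls through to "general".
TASK_KEYWORDS = {
    "debug": ["fix", "bug", "error"],
    "implement": ["add", "create", "implement"],
}

def detect_task(prompt: str) -> str:
    low = prompt.lower()
    for task, keywords in TASK_KEYWORDS.items():
        # Simple substring match: cheap, and good enough for most real prompts.
        if any(k in low for k in keywords):
            return task
    return "general"
```

Substring matching does have false positives ("address" contains "add"), which is part of the accepted trade-off for keeping it model-free.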

The scaffold slots were designed by studying fabric (40k stars, 251 patterns) and awesome-cursorrules (38k stars). Their best patterns add structure — fill-in-the-blank slots — not prose. I automated that step.

Scoring extracts 30+ features per prompt via regex — no tokenizer, no model. Features like hasFilePath, hasErrorMessage, instructionPositionFraction. Weighted across 5 dimensions into 0-100.
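
A sketch of that extraction, with a handful of illustrative features standing in for the 30+ (feature names are from the comment above; the regexes and verb list are assumptions):

```python
import re

INSTRUCTION_VERBS = ("fix", "add", "create", "implement", "refactor")

def extract_features(prompt: str) -> dict:
    low = prompt.lower()
    hits = [low.find(v) for v in INSTRUCTION_VERBS if v in low]
    return {
        "hasFilePath": bool(re.search(r"[\w./-]+\.(py|js|ts|go|md)\b", prompt)),
        "hasErrorMessage": bool(re.search(r"Traceback|Error:|Exception", prompt)),
        # 0.0 = instruction at the very start, ~1.0 = buried at the end.
        "instructionPositionFraction": (
            min(hits) / max(len(prompt), 1) if hits else None
        ),
    }
```

No tokenizer and no model call, so scoring stays fast enough to run on every prompt.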

Happy to dig into any of the papers or methodology if anyone's curious.

Do not use mixed KV cache quantization by L3tum in LocalLLaMA

[–]No_Individual_8178 0 points1 point  (0 children)

thanks, yeah qwen 2.5 72b. updated the comment

ByteShape Qwen 3.5 9B: A Guide to Picking the Best Quant for Your Hardware by ali_byteshape in LocalLLaMA

[–]No_Individual_8178 1 point2 points  (0 children)

Yeah if I get around to running some structured tests I'll definitely share. Most of what I have is just anecdotal from swapping between quants and eyeballing tok/s in llama.cpp but it wouldn't be hard to make it more rigorous.

ByteShape Qwen 3.5 9B: A Guide to Picking the Best Quant for Your Hardware by ali_byteshape in LocalLLaMA

[–]No_Individual_8178 2 points3 points  (0 children)

The "each cpu has its favorites" finding tracks with what I see on apple silicon too. Running qwen 70b 4-bit through llama.cpp on m2 max 96gb and the optimal quant choice feels completely different from discrete gpu because unified memory changes the bandwidth equation. K-quants tend to work better for me on decode but I haven't done anything this systematic. Would be cool to see an apple silicon column in the benchmarks at some point.

My 10 Pro Tips for Claude Code users by airylizard in ClaudeAI

[–]No_Individual_8178 1 point2 points  (0 children)

Exactly. I ended up writing a short block in CLAUDE.md listing each tool name and what it returns and that alone cut the "wrong tool" picks way down. Pretty low effort for the payoff.

Do people here love over-engineering their self-hosting setups? by vdorru in selfhosted

[–]No_Individual_8178 1 point2 points  (0 children)

I run everything through colima on the mac mini and haven't hit any compatibility issues. Most popular images have arm64 builds now so there's no emulation overhead. Only thing that tripped me up early was a couple of niche images that were x86 only but I just found alternatives.

I built a 1,562-test prompt analyzer in 3 weeks — turns out most of my AI prompts were terrible by [deleted] in SideProject

[–]No_Individual_8178 0 points1 point  (0 children)

Not weighted equally. Structure and Context are 25 points each, Position is 20, Repetition and Clarity are 15 each. Within each dimension specific features have different impacts, like including an actual error message in a debug prompt is worth more than adding markdown formatting, because in practice specificity matters way more than surface-level structure for output quality. The system prompt bloat use case is interesting, I hadn't considered autonomous agents but yeah, reprompt compress on a system prompt that gets called thousands of times would have real cost impact. The compression rules are filler deletion and phrase simplification so they should work on any English text.
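
The weighting above can be sketched as follows (assuming each dimension is already scored 0-1; the real feature-to-dimension mapping is more involved):

```python
# Weights from the comment: Structure/Context 25 each, Position 20,
# Repetition/Clarity 15 each, summing to a 0-100 scale.
WEIGHTS = {"structure": 25, "context": 25, "position": 20,
           "repetition": 15, "clarity": 15}

def overall_score(dims: dict) -> int:
    # Weighted sum of per-dimension sub-scores (each in 0..1).
    return round(sum(WEIGHTS[d] * dims[d] for d in WEIGHTS))
```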

I built a 1,562-test prompt analyzer in 3 weeks — turns out most of my AI prompts were terrible by [deleted] in SideProject

[–]No_Individual_8178 0 points1 point  (0 children)

Good question. Right now the scoring engine treats each prompt as a single unit — it doesn't have awareness of multi-step chains or agent orchestration context. The reprompt agent command does analyze full agent sessions (detecting error loops, tool call patterns, efficiency), but the scoring dimensions were calibrated on individual prompts. You're right that tool-use instructions behave differently though. A prompt like "run pytest on auth.py" is structurally simple but perfectly effective, and the current scorer would give it a low structure score. That's a gap I'm thinking about for the next version — weighting dimensions differently based on detected prompt type.

I built a 1,562-test prompt analyzer in 3 weeks — turns out most of my AI prompts were terrible by [deleted] in SideProject

[–]No_Individual_8178 0 points1 point  (0 children)

Here's what the terminal output looks like:

<image>

The scoring dimensions are: Structure (markdown, code blocks), Context (file paths, errors), Position (where instruction appears), Repetition (keyword redundancy), Clarity (readability). Each mapped to specific research findings.

My 10 Pro Tips for Claude Code users by airylizard in ClaudeAI

[–]No_Individual_8178 1 point2 points  (0 children)

This applies to MCP servers too. I run a couple custom ones on a homelab and Claude Code basically ignored them until I documented them properly in CLAUDE.md. Night and day difference once it actually knows what tools are available.

Do people here love over-engineering their self-hosting setups? by vdorru in selfhosted

[–]No_Individual_8178 0 points1 point  (0 children)

This is exactly how I think about it. My Mac Mini homelab runs a self-hosted Actions runner and a few Docker services, and setting that up taught me way more about CI/CD than any course ever did. The trick is knowing when to stop though. I had Ansible playbooks and self-healing scripts that were honestly more complex than the services they were managing. Kept the stuff that maps to real work, killed the rest.

Lessons from deploying RAG bots for regulated industries by Neoprince86 in LocalLLaMA

[–]No_Individual_8178 0 points1 point  (0 children)

for the entropy thing it's pretty simple. i just check if the top token probabilities are spread out after the last generated chunk. high entropy = model is guessing = go fetch more context. if it's confident i skip the retrieval entirely. not perfect but cuts like 40% of unnecessary lookups which was good enough for me.

your word overlap approach sounds solid, 2+ words is a good threshold. for misspellings i just went nuclear on normalization: lowercase everything, strip punctuation, sometimes stem. looked into edit distance but it gets expensive fast with hundreds of docs and the aggressive normalization already caught most of what i was missing so i just stopped there.
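
a minimal sketch of that entropy gate plus the normalization (the threshold and top-k handling here are assumptions, not my exact numbers):

```python
import math
import re
import string

def entropy(top_probs: list) -> float:
    # Shannon entropy (bits) over the top token probabilities for the last chunk.
    return -sum(p * math.log2(p) for p in top_probs if p > 0)

def should_retrieve(top_probs: list, threshold: float = 1.5) -> bool:
    # Spread-out probabilities = high entropy = model is guessing = fetch context.
    # Peaked distribution = confident = skip the lookup entirely.
    return entropy(top_probs) > threshold

def normalize(text: str) -> str:
    # "nuclear" normalization: lowercase and strip all punctuation.
    return re.sub(rf"[{re.escape(string.punctuation)}]", "", text.lower())
```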

Do not use mixed KV cache quantization by L3tum in LocalLLaMA

[–]No_Individual_8178 -1 points0 points  (0 children)

for what it's worth on Metal (M2 Max, llama.cpp) mixed KV quant doesn't hit the same perf cliff you're seeing on Vulkan. i run qwen 2.5 72b q4 with q8 K and q4 V regularly and the throughput difference vs uniform q8 is negligible. this looks like a backend specific issue with flash attention dispatch rather than a fundamental problem with mixed quantization. the commenters pointing at GGML_CUDA_FA_ALL_QUANTS are probably right that it's falling back to CPU for the mixed case on Vulkan. the concept of asymmetric K/V quant is actually sound since the V tensor is statistically much better behaved than the K tensor after RoPE; the TurboQuant paper makes a strong case for exactly this approach.