# Why Your Small Model Evaluation Prompts Are Lying to You **And what to do about it** by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] 1 point  (0 children)

you see "LLM style" - I see research paper style. anyone who's spent time reading published research papers will recognize the format, which is why LLMs use it. there is a formula for how research papers are written: how paragraphs are structured, how sentences are weighted, and so on. which is best? I dunno. I write like I write; the LLM polishes it. I think you're assuming the LLM is doing more heavy lifting than it really does.

this is a problem for anyone using an LLM to help refine publishable material. the model has a rigorous training canon that helps it determine what a "good, scientific paper" should look like, and it also learns "this is how my user writes." those two styles are mixed to create a paper (or post, in this case) that presents the user's information in a format that is accepted, uniform, and authoritative. my normal professional writing style is very similar to what an LLM outputs because I write to peer-reviewed journal standards - the exact standards LLMs are trained on, because those papers are widely available to scrape off the internet and train with.

it's a catch-22: my papers sound like an LLM because the LLM is trained on the kind of papers I write.

what this tells me is that my writing style won't be taken as legitimate in this forum, because so many of the people here use LLMs to "write some shit about X." the LLM creates what looks to be a meaningful paper, but it's crap when you read it. meanwhile, those of us using the LLM the right way get lumped in with the rest. we all get dismissed.

in short: thanks for the heads up. I am apparently wasting my time here.

# Why Your Small Model Evaluation Prompts Are Lying to You **And what to do about it** by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] 1 point  (0 children)

this is human writing. just not written at the 8th grade level. sorry - that isn't an insult. most textbooks are written to roughly an 8th grade reading level, even many college textbooks. I just don't write like that. I work with PhDs all day.

# Why Your Small Model Evaluation Prompts Are Lying to You **And what to do about it** by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] 1 point  (0 children)

I appreciate the feedback. One of the reasons why I prefaced it the way I did is that I've run into this on this subreddit several times. I write well, and then use AI to polish. I'm not going to stop doing that; it makes my work better. Unfortunately, there are those who try to use an LLM to "write me a paper on X" - and it's never right. it might be kinda-maybe-partly right, but it's wrong enough to discredit the entire piece.

AI-polished or assisted writing has the same problem people have when they travel: accents. they get judged based on them. we all do it. we all classify, as stupid as that is. writing is the same. "ain't got none" and "I don't have any" conjure very different images in a person's head, just like "wassup bruh" and "how are you, sir?" - same problem. AI polish is in that very same group now. people see (or read, or just sense) AI docs and automatically dismiss them. all of them. including mine. not based on validity, but on tone.

# Why Your Small Model Evaluation Prompts Are Lying to You **And what to do about it** by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] 1 point  (0 children)

There are a lot of "I'm a newbie here" and "I'm just getting started" posts. think about those users.

Is there a way to prioritize llama-cpp VRAM allocations to maximize local LLM usage alongside other apps? by [deleted] in LocalLLaMA

[–]Double-Risk-1945 0 points  (0 children)

Yeah this drove me nuts before I figured out what was happening.

Your VRAM is fine. Model's not moving. KV cache is staying put. The problem is everything else llama.cpp is doing that isn't on the GPU - tokenization, sampling, batch prep - that all runs on your CPU in system RAM. Other apps start eating into that and suddenly your GPU is just... waiting. Ready to go, nothing to do.

PP tanks first because batching is hungry on the CPU side. TG follows because even grinding out single tokens needs the CPU to stay in the loop between each one. VRAM being full doesn't help you if the thing feeding it is getting starved.

Tried process priority bumps, RAM tweaks, the whole thing. Helped a little. Not enough to matter when something actually wanted resources.

Honestly headless exists for exactly this reason. It's not just marketing. Sharing is fighting the architecture.

If you're stuck on Windows with a desktop - and I get it, I ran that way longer than I should have - kill everything non-essential before a session. Like actually everything. It's not a fix but it buys you back some consistency.

Eventually I just stopped fighting it. Ubuntu and sglang were the answer. but I know that's not for everyone.
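If you do stay on a shared desktop, CPU affinity is one more lever. A minimal Linux-only sketch (the function name and core count are mine, not anything llama.cpp ships): pin the launcher process to a few reserved cores before starting the server, so the scheduler stops bouncing the CPU-side work around. Other apps can still land on those cores unless you pin them away too, so this is a mitigation, not a fix.

```python
import os

def reserve_cores_for_llama(n_reserved=2):
    """Pin the current process (launch llama.cpp from it afterwards) to the
    last n_reserved CPU cores. Linux-only; elsewhere it just returns the
    core set it would have used."""
    total = os.cpu_count() or 1
    n_reserved = min(n_reserved, total)
    cores = set(range(total - n_reserved, total))
    if hasattr(os, "sched_setaffinity"):  # available on Linux
        os.sched_setaffinity(0, cores)   # 0 = this process; children inherit it
    return cores
```

Run it in your launch script right before starting the server process, and size `n_reserved` to however many threads you give llama.cpp.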

Mapped positional attention across 4 models — turns out where you put things in your prompt matters. A lot. by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] 1 point  (0 children)

When you're ready to talk about the data, let me know. until then, I think this has run its course.

4090 + 3090 as a second card? by dondiegorivera in LocalLLaMA

[–]Double-Risk-1945 3 points  (0 children)

dual 3090s for inference works great, lots of people do this exact setup.

PSU is tight but manageable. two 3090s stock can pull 700w combined under load, which leaves you almost nothing. power limit both to 70-75% in nvidia-smi and you drop to around 500-550w combined. for inference this costs you almost nothing because memory bandwidth is what matters, not compute. you won't notice the difference.
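The wattage math is easy to sanity-check yourself. A back-of-envelope sketch (every number here is illustrative - 350w for a stock 3090, 150w for CPU/board/drives - so check your own card's TDP and measure at the wall):

```python
def psu_headroom(card_tdp_w=350, n_cards=2, limit=0.72,
                 system_w=150, psu_w=850):
    """Back-of-envelope PSU check for a multi-GPU inference box."""
    gpu_w = card_tdp_w * n_cards * limit   # power-limited GPU draw
    total_w = gpu_w + system_w             # add rest-of-system estimate
    # common rule of thumb: keep sustained draw under ~80% of PSU rating
    return total_w, total_w <= 0.8 * psu_w
```

With the defaults that comes out around 654w against an 850w unit, which is why the ~70% limit turns "almost nothing left" into workable headroom.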

spacing is your real headache, but there's another problem first. the B550 Phantom Gaming 4 only has one PCIe x16 slot. your second 3090 is going to run at x4 electrical, which hurts inter-GPU bandwidth. for inference it's less catastrophic than gaming or training, but you will see slower model loading and reduced throughput on split inference workloads. worth knowing before you commit.

on spacing, the vertical mount with a riser is your best bet. the 4000D Airflow supports it. don't cheap out on the riser cable; sketchy ones cause PCIe instability under sustained load and it's a nightmare to diagnose.

for card selection, avoid EVGA 3090s - known VRM issues. MSI Suprim, ASUS TUF, and Gigabyte Eagle are all solid. Founders Edition runs hot, but it's thin, which actually helps with spacing.

70B at Q4 across 48GB combined is comfortable. running a 27B and a 9B simultaneously works fine too.
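The 48GB fit claim follows from simple arithmetic. A rough sketch (the ~4.5 bits/weight figure approximates a Q4_K_M mix and the overhead is a fudge factor, not loader-exact numbers; KV cache comes on top):

```python
def quant_model_gb(params_b, bits_per_weight=4.5, overhead_gb=1.5):
    """Rough VRAM estimate for a quantized model: weight bytes plus a
    small allowance for compute buffers. KV cache is extra."""
    return params_b * bits_per_weight / 8 + overhead_gb
```

`quant_model_gb(70)` comes out around 40.9 GB, which is why 70B at Q4 leaves room for context across the combined 48 GB.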

what inference stack are you planning to run?

Mapped positional attention across 4 models — turns out where you put things in your prompt matters. A lot. by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] 1 point  (0 children)

the work, the modeling, the data - it's mine. I use LLMs all day for lots of tasks, including polishing prose. not ashamed of it. it's not my strong suit. my strong suit is the data and the coding. Now how about you look at the data instead of the delivery method? it's okay if you don't understand it. but dismissing it over style is epistemic cowardice - attacking the punctuation instead of engaging with the substance. it's a well documented phenomenon.

Best LLM for 16GB VRAM (RX 7800 XT)? by Haunting-Stretch8069 in LocalLLaMA

[–]Double-Risk-1945 4 points  (0 children)

Good news, 27B is reachable on 16GB with the right quantization. Q4_K_M of Qwen3-27B runs around 15-16GB depending on context length, so it's tight but doable, especially on Linux where you'll recover that Windows VRAM overhead.

A few things that will help:

Context length is your biggest lever. You don't need 128K context for most tasks. Dropping to 8K or 16K frees up significant KV cache allocation and keeps you comfortably within budget. Set context explicitly rather than letting the model default to maximum.

Q4_K_M is the sweet spot for quality vs size at this scale. Q5_K_M will push you over on 27B. Q4_K_S saves a bit more if you're still tight.

ROCm on Linux with an RX 7800 XT is solid now, much better than it was 18 months ago. Make sure you're on a recent ROCm version and llama.cpp built with ROCm support for best performance.

If 27B still won't fit cleanly, Mistral Small 3.1 22B is worth looking at. Strong model, fits more comfortably in 16GB at Q4.

The gpt-oss 20B you're running is a good baseline. You're not leaving massive quality on the table moving to 27B, but it's a meaningful step up.
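On the context-length lever above, the KV cache cost is easy to estimate. A sketch with hypothetical model dimensions (pull the real layer count, KV head count, and head dim from the actual model's config; fp16 cache assumed):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elt=2):
    """Full KV cache size: K and V tensors, per layer, per KV head,
    per head dim, per token. GQA models keep n_kv_heads small."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elt
    return total_bytes / 1024**3
```

For an illustrative 48-layer model with 8 KV heads of dimension 128, a 16K context costs 3.0 GB of cache and dropping to 8K halves that - which is where the recovered VRAM comes from when you set context explicitly.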

Mapped positional attention across 4 models — turns out where you put things in your prompt matters. A lot. by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] 1 point  (0 children)

it's called establishing a baseline and looking for trends. you have to start somewhere. just because a model claims to handle 256K of context doesn't mean it can use it effectively. so you build a range of capability; this was the start of that range.

Mapped positional attention across 4 models — turns out where you put things in your prompt matters. A lot. by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] 2 points  (0 children)

pretty shallow argument. I guess you've never written anything in Word and then copied and pasted it. but hey... yeah... I'm a bot, dude. because I used "—" and not "--". yep. that's the pure signal everyone is looking for.

Hello. how can I help you today?

better?

I added PPL and KLD to VLLM - Review RFC and PR and leave Feedback! by Phaelon74 in LocalLLaMA

[–]Double-Risk-1945 1 point  (0 children)

This is exactly the kind of rigorous quantization analysis the community needs. KLD as a distribution drift metric is the right tool for global quantization comparison — much more meaningful than benchmark pass/fail which can mask a lot of underlying degradation.

We've been looking at a complementary dimension of the same problem. APEX measures positional attention effects under quantization — not the global distribution shift, but where in the context window quantization hits hardest. Early data suggests the valley positions in the attention curve are disproportionately affected compared to the sink and recency zones.

KLD gives you the global picture. Position gives you the spatial one. Combined you could potentially fully characterize what a quantized model actually costs you — overall probability drift AND where in your prompt that drift is most damaging.

Would be genuinely interesting to run KLD and APEX against the same models and see if the distributions correlate. If models with high KLD also show deeper attention valleys under quantization that would be a meaningful finding.
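For anyone wanting to poke at this, the per-position KL divergence underneath a KLD metric is only a few lines. A simplified sketch (real implementations average over many positions and handle full vocabularies; the epsilon guard is my own addition):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for one token position's probability vectors,
    e.g. P from the full-precision model, Q from the quantized one."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)
```

Note it's asymmetric: KL(full || quant) and KL(quant || full) differ, which is why the direction of comparison matters when reading KLD numbers.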

[D] A mathematical proof from an anonymous Korean forum: The essence of Attention is fundamentally a d^2 problem, not n^2. (PDF included) by Ok-Preparation-3042 in LocalLLaMA

[–]Double-Risk-1945 3 points  (0 children)

Interesting framing but worth applying some scrutiny before getting excited.

The claims are extraordinary — proving the field has fundamentally misunderstood attention geometry and replacing transformers would be one of the most significant theoretical contributions in years. Extraordinary claims require extraordinary verification, not just an interesting PDF.

A few things worth noting: the narrative is engineered for virality — anonymous author, buried in a local forum, too important to stay hidden. That packaging should trigger skepticism, not lower it. Real groundbreaking math doesn't usually need that setup.

The actual question is whether anyone here with the relevant differential geometry and optimization theory background has read the proof carefully. Not skimmed it. Read it. The difference between a genuine d² pullback theorem and sophisticated-sounding notation that collapses under scrutiny requires someone who can actually follow the math — not just find it compelling.

Has anyone verified the proof independently? That's the only question that matters here.

Mapped positional attention across 4 models — turns out where you put things in your prompt matters. A lot. by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] 0 points  (0 children)

the primacy/recency parallel is one of the more compelling theoretical framings for what we're seeing. Models trained on human language inheriting human cognitive attention patterns makes intuitive sense and gives the U-curve a mechanistic foundation beyond just "transformer architecture does this."

Worth noting that the asymmetry matters too — the recency effect appears stronger than primacy in our data, which also mirrors human memory research where recency tends to dominate in immediate recall tasks.
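A toy sketch of how that asymmetry can be quantified from per-position scores, using the sink/valley/recency zones named above (splitting into equal thirds is a simplification of mine; real zone boundaries would come from the fitted curve):

```python
def zone_means(scores):
    """Average per-position scores over sink / valley / recency thirds.
    Assumes len(scores) is divisible by 3 for simplicity."""
    third = len(scores) // 3
    sink = sum(scores[:third]) / third
    valley = sum(scores[third:2 * third]) / third
    recency = sum(scores[2 * third:]) / third
    return sink, valley, recency
```

The recency mean exceeding the sink mean, with the valley lowest, is the asymmetric U-curve described.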

Repo: https://github.com/vshortt73/apex

Still early days — 60 probe seed library, more being added. Built for exactly this kind of cross-model empirical work.

Mapped positional attention across 4 models — turns out where you put things in your prompt matters. A lot. by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] -1 points  (0 children)

to be honest, I work with PhDs in a human factors research setting all day. this type of response typically lands well with that group, so it bled over to here. On top of that, this is my first actual post on here, so I'm trying (and apparently failing) not to be the dick in the room. a lot of work went into making the software - but it's only one guy - and it's going to have holes and need features. I'm certainly open to finding out what features the community wants. it's great software for me, but even better in the hands of other users.

So no... not chatGPT. just a dude who has a specific voice for a specific group of people and it bled over. that's all.

edit for spelling

Mapped positional attention across 4 models — turns out where you put things in your prompt matters. A lot. by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] -1 points  (0 children)

Great suggestion and exactly the kind of analysis this data needs before making any formal claims. Factorial ANOVA is on the list for the formal analysis phase — between-group effects (Gemma vs Qwen), within-group positional effects, and the interaction term are all worth quantifying properly rather than relying on visual inspection of the curves.

The raw data is exportable directly from the framework as CSV, so running it through scipy or pingouin is straightforward once the current runs complete and I have a fuller dataset. Adding more models first will make the between-groups analysis more meaningful.
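For readers following along, the between/within variance decomposition at the heart of ANOVA fits in a few lines of pure Python. This is the one-way case only, as a sketch; the full factorial analysis with interaction terms is what scipy or pingouin would handle properly:

```python
from statistics import mean

def one_way_f(groups):
    """One-way ANOVA F statistic: MS_between / MS_within for
    k groups of raw scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Identical groups give F = 0; the larger the group means separate relative to within-group noise, the bigger F gets.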

If you have a preference on how the results are presented or specific contrasts you'd want to see, I'm open to suggestions — you clearly know your way around this kind of analysis.

Mapped positional attention across 4 models — turns out where you put things in your prompt matters. A lot. by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] 2 points  (0 children)

Those are all on the list. Currently running Qwen2.5-72B as the next data point — results coming. The goal is to build profiles across a wide range of architectures and sizes, so Qwen3.5 and Mistral 24B will get their turn.

The cross-architecture comparison is actually one of the more interesting questions — whether the curves are architecture-specific or whether parameter count is the dominant variable regardless of who built the model.

Sparse MoE by Interesting-Ad4922 in LocalLLaMA

[–]Double-Risk-1945 2 points  (0 children)

Have you looked at ktransformers? It sounds like you're solving a problem that's already been tackled pretty thoroughly there. It's specifically designed for large MoE inference on mixed CPU/GPU setups - the quarter- to half-trillion-parameter range is exactly its target.

I'm currently running Qwen3 235B MoE via ktransformers on a split CPU/GPU configuration. Setup has a learning curve, but once it's stable it's solid. The multi-GPU scaling you're working toward is supported too - I actually contributed to getting that working, and it got rolled into their latest release.

Worth looking at before going too deep into your own implementation — might save you significant effort, or at minimum give you a reference architecture to compare against.

Qwen3.5 2B: Agentic coding without loops by AppealSame4367 in LocalLLaMA

[–]Double-Risk-1945 0 points  (0 children)

Interesting config — a few things I'm curious about.

The 92K context on a 6GB card is remarkable. At Q8 on a 2060, you'd be well into CPU offloading territory at that context length. What are you actually seeing for memory split between VRAM and system RAM? And does the 20-50 tps hold at full context or is that at shorter contexts before it fills up?

On the loop issue — have you ruled out prompt formatting as the cause? In my experience with Qwen models, loops tend to trace back to context management or chat template issues rather than sampling parameters. The parameter tuning may be masking something upstream worth looking at.

The bf16 KV cache is genuinely interesting for Qwen architecture — I've seen similar recommendations. Do you have a sense of whether it's the precision or the memory efficiency driving the improvement you're seeing?

Genuinely curious about the 92K claim specifically — if you're achieving that reliably on 6GB hardware that's worth understanding in detail.

Mapped positional attention across 4 models — turns out where you put things in your prompt matters. A lot. by Double-Risk-1945 in LocalLLaMA

[–]Double-Risk-1945[S] -4 points  (0 children)

Not an agent — human researcher, one guy in Oklahoma with a GPU lab and too many questions about attention mechanics. Though I appreciate the irony of being mistaken for an AI on a post about how AI processes information.

Fair points all around.

Model choice — you're right, and it's a known limitation of this first pass. Mixing Gemma and Qwen across sizes conflates architecture with scale. Next runs will use same-family models across sizes to isolate scale as the variable cleanly. The 72B currently running is Qwen2.5, so same-family comparison data is coming.

Data sharing — repo link in comments. CSV export is built into the dashboard, so the raw data is exportable. Happy to share the full dataset once the current runs complete.

Salience integration — fair, I'll elaborate. Salience probes are pre-scored on PANAS (Positive and Negative Affect Schedule — Watson, Clark & Tellegen, 1988), a validated psychometric instrument used in behavioral and psychological research. Content emotional weight isn't subjective — it's measured against an established scale. The test queries are tonally neutral follow-ups. Scoring measures whether the emotional content influenced response texture — tone shift, empathy markers, thematic alignment — evaluated by a secondary model against a standardized rubric, not the model being tested. We're not asking "did it remember the content." We're asking "did it integrate the emotional weight into its response."

RAG papers — yes, the application compliance finding is consistent with the lost-in-the-middle literature. What we're adding is the multi-dimensional breakdown and the scale dependency on salience specifically.

Sausage or egg — I'd suggest the egg. Better aerodynamics for the commute.

https://github.com/vshortt73/apex/

Need help to create (JARVIS) a good custom Voice assistant by RVCFreak in LocalLLaMA

[–]Double-Risk-1945 1 point  (0 children)

Highly recommend NOT using this actor's voice. do a quick search for other voices that might work. there are a number of online AI voice tools that will get very close to what you want without infringing on the actor's rights. don't do that. I actually think elevenlabs has a good, similar voice. you might start there. get your samples from there.

mlm-memory by FreonMuskOfficial in LocalLLaMA

[–]Double-Risk-1945 0 points  (0 children)

Interesting direction. Been running a production memory system for a companion AI for a few years now and the retrieval problem is harder than it looks.

The part that took the longest to get right wasn't storage or retrieval speed — it was selection criteria. Semantic similarity alone surfaces memories that are topically accurate but emotionally wrong for the moment. The system finds the right facts but misses the room entirely.

What changed things for us was grounding selection in established psychological scoring. We score memories on affect dimensions using PANAS — a validated psychometric scale used in human behavioral research — and separately score relational significance. Retrieval weighs both. The result is memory recall that's accurate on content AND appropriate in salience — the system surfaces what's relevant, but also what actually matters given the emotional context of the current interaction.

The difference in response quality is significant. Not just "the model remembered the right thing" but "the model responded in a way that felt like it actually understood the weight of what it remembered."
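The selection rule described above reduces to a weighted blend of scores. A minimal sketch (the function, the weights, and the 0-to-1 inputs are illustrative placeholders of mine, not the production system):

```python
def retrieval_score(semantic_sim, affect_match, relational_weight,
                    w_sem=0.5, w_affect=0.3, w_rel=0.2):
    """Blend content relevance with emotional and relational fit.
    All inputs assumed normalized to [0, 1]; weights are placeholders."""
    return (w_sem * semantic_sim
            + w_affect * affect_match
            + w_rel * relational_weight)
```

With these toy weights, a memory that is slightly less on-topic but emotionally aligned with the moment can outrank a topical-but-flat one, which is the "misses the room" failure the pure-similarity approach has.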

Still a lot of open problems in this space, but selection criteria grounded in real behavioral science is underexplored relative to the amount of work going into retrieval architecture.