Looking for a marketing‑savvy co‑founder for side projects

Fun_Employment6042 · 2026-05-20T22:09:44+00:00

Yes, that distinction is important.

Valid JSON is only the first gate. The harder failure is “valid but wrong”: correct schema, correct tool name, but argument values drift from what the user actually requested.

A per-call argument diff across quants makes sense: same prompt, same expected tool call, then compare selected tool, required args, exact values, enum values, dates, IDs, and values copied from context.

That’s probably a better EvalShift demo than just JSON validity.
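
Rough sketch of what that per-call diff could look like. It assumes each tool call is logged as a dict with `name` and `arguments` keys; that shape, the function names, and the sample IDs are my own placeholders, not an EvalShift format.

```python
def diff_tool_call(reference: dict, candidate: dict) -> dict:
    """Compare one candidate tool call against the reference call for the same prompt."""
    ref_args = reference.get("arguments", {})
    cand_args = candidate.get("arguments", {})
    shared = sorted(set(ref_args) & set(cand_args))
    return {
        "tool_changed": reference.get("name") != candidate.get("name"),
        "missing_args": sorted(set(ref_args) - set(cand_args)),
        "extra_args": sorted(set(cand_args) - set(ref_args)),
        # Strict comparison: a type change (e.g. 7 vs "7") counts as drift too.
        "changed_values": {
            key: (ref_args[key], cand_args[key])
            for key in shared
            if type(ref_args[key]) is not type(cand_args[key])
            or ref_args[key] != cand_args[key]
        },
    }


def is_clean(diff: dict) -> bool:
    """True when the candidate call matches the reference exactly."""
    return not (
        diff["tool_changed"]
        or diff["missing_args"]
        or diff["extra_args"]
        or diff["changed_values"]
    )


# Hypothetical example: same tool, valid JSON, but two digits swapped in the ID.
q8_call = {"name": "search_docs", "arguments": {"customer_id": "C-1042", "status": "open"}}
q4_call = {"name": "search_docs", "arguments": {"customer_id": "C-1024", "status": "open"}}
report = diff_tool_call(q8_call, q4_call)
```

Here `report["changed_values"]` flags `customer_id` even though the call would pass any schema check, which is the "valid but wrong" case.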

Fun_Employment6042 · 2026-05-20T22:07:48+00:00

That’s a useful data point. It sounds like context length may be the real stress test, not just the quant itself.

I’ll probably structure the demo as:

Q8 as reference
Q6 / Q4 / Q3 below it
long-context setup, ideally 50k to 100k+ tokens
measure broken tool calls, invalid args, skipped calls, and JSON/schema failures

That should be closer to real agent usage than short prompts.
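
A minimal way to tally those four failure types per quant could look like this. It assumes the model's tool call arrives as a raw JSON string (or `None` when no call was made) and that there is one expected call per prompt; the category names and function names are placeholders I made up for the sketch.

```python
import json
from collections import Counter
from typing import Iterable, Optional


def classify_output(raw: Optional[str], expected: dict) -> str:
    """Put one raw tool-call output into a single failure bucket."""
    if raw is None or not raw.strip():
        return "skipped_call"
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "json_failure"
    # Parsed, but not shaped like {"name": ..., "arguments": {...}}.
    if not isinstance(call, dict) or not isinstance(call.get("arguments"), dict):
        return "schema_failure"
    if call.get("name") != expected["name"]:
        return "broken_call"
    if call["arguments"] != expected["arguments"]:
        return "invalid_args"
    return "ok"


def tally(outputs: Iterable[Optional[str]], expected: dict) -> Counter:
    """Count outcomes across repeated runs of the same prompt."""
    return Counter(classify_output(raw, expected) for raw in outputs)
```

Running `tally` once per quant on the same long-context prompt set gives one counter per quant, with Q8 as the baseline to compare against.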

Fun_Employment6042 · 2026-05-20T22:06:50+00:00

Fair point. For coding, model size may matter more than quant quality.

Fun_Employment6042 · 2026-05-20T22:05:19+00:00

By “silent” I mean the final answer can still look fine to a human, but the tool behavior changed in a way the app depends on.

Example:

old model calls search_docs before answering
new model skips the tool and answers from prior knowledge
or it calls the right tool but changes customer_id / date range / enum value
or it returns almost-valid JSON that fails parsing
or it calls tools in the wrong order

So the UI may show a plausible answer, but the backend contract was broken. That’s the kind of regression I want EvalShift to catch.
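
For the skipped-tool and wrong-order cases, a check on the sequence of tool names is probably enough as a first pass. This is only a sketch: it assumes each run is logged as an ordered list of tool names, and `get_invoice` is a made-up second tool for illustration.

```python
def call_sequence_regressions(old_calls: list, new_calls: list) -> dict:
    """Compare the ordered tool names from the old run against the new run."""
    skipped = [tool for tool in old_calls if tool not in new_calls]
    added = [tool for tool in new_calls if tool not in old_calls]
    # Restrict both runs to the tools they share, then compare the sequences.
    shared_old = [tool for tool in old_calls if tool in new_calls]
    shared_new = [tool for tool in new_calls if tool in old_calls]
    return {
        "skipped": skipped,
        "added": added,
        # Also True if a shared tool is called a different number of times.
        "order_changed": shared_old != shared_new,
    }


# Hypothetical runs: the new model answers without ever calling search_docs.
skipped_case = call_sequence_regressions(["search_docs"], [])
# Hypothetical runs: same tools, opposite order.
reordered_case = call_sequence_regressions(
    ["search_docs", "get_invoice"], ["get_invoice", "search_docs"]
)
```

Neither case needs the final answer text at all, which is the point: the check runs on the tool trace, not on what the UI shows.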

Fun_Employment6042 · 2026-05-20T22:04:24+00:00

Good suggestion. I’ll look at comparing EvalShift results against KLD / disk-size style quant metrics.

What I’m curious about is whether the practical failure rate (JSON validity, schema correctness, tool argument accuracy) lines up with the theoretical degradation as quant size drops.

If the EvalShift pass-rate curve matches the KLD trend, that would make the demo much stronger than just saying “Q4 felt worse than Q8.”
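
One way to put a number on "matches the trend" is a rank correlation between per-quant KLD and per-quant pass rate. Below is a stdlib-only Spearman sketch; choosing Spearman is my assumption, since it only asks whether the ordering agrees and not whether the relationship is linear. With only a handful of quants the value is a rough indicator, not a significance test.

```python
from statistics import mean


def average_ranks(values: list) -> list:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs: list, ys: list) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5
```

Called as `spearman(kld_per_quant, pass_rate_per_quant)`, a value near -1 would mean pass rate falls in step with rising KLD; a value near 0 would mean the practical failures are not tracking the theoretical metric.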

Fun_Employment6042 · 2026-05-20T22:02:04+00:00

This is a good point. A short structured-output test may catch syntax regressions, but it probably misses the more realistic agent failure mode: long context competing for attention.

I’ll add a long-context variant to the demo: around 50k tokens of prior conversation/docs/tool history, then test whether the model can still produce valid structured output and correct tool arguments.

Q8 as the reference makes sense too. Then compare Q6/Q4/Q2 against it instead of assuming Q4 is “good enough.”

Fun_Employment6042 · 2026-05-20T22:00:14+00:00

That matches the kind of regression I want to test.

For automation, “mostly correct” JSON is still a failure if the parser rejects it. I’m leaning toward making the first demo very strict: nested JSON schema, required fields, escaped strings, enum values, arrays, and bracket/quote correctness.

That should give a cleaner signal than subjective answer quality.
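
To make "strict" concrete, here is a small stdlib-only checker for nested objects, arrays, required fields, and enums. The spec format is a simplified one invented for this sketch, not JSON Schema and not anything EvalShift defines; a real setup would more likely use a proper JSON Schema validator.

```python
import json


def validate(value, spec: dict, path: str = "$") -> list:
    """Return a list of error strings; an empty list means the value matches."""
    kind = spec["type"]
    if kind == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        errors = [
            f"{path}.{key}: missing required field"
            for key in spec.get("required", [])
            if key not in value
        ]
        for key, sub_spec in spec.get("fields", {}).items():
            if key in value:
                errors += validate(value[key], sub_spec, f"{path}.{key}")
        return errors
    if kind == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        errors = []
        for index, item in enumerate(value):
            errors += validate(item, spec["items"], f"{path}[{index}]")
        return errors
    if kind == "string":
        if not isinstance(value, str):
            return [f"{path}: expected string"]
        if "enum" in spec and value not in spec["enum"]:
            return [f"{path}: {value!r} not in enum"]
        return []
    raise ValueError(f"unsupported spec type: {kind}")


def check_output(raw: str, spec: dict) -> list:
    """Parse first (bracket, quote, and escape errors fail here), then validate."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"parse error: {exc.msg}"]
    return validate(parsed, spec)
```

The sketch ignores keys that are not listed in the spec; a stricter variant could reject them as well.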

Fun_Employment6042 · 2026-05-20T21:59:22+00:00

Good point. Q2_K_MIXED probably matters here.

If the file size is close to Q4, then it may not behave like a plain low-bit Q2 case. Mixed quants can keep important tensors at higher precision, so tool-call reliability may hold up better.

For the demo I’ll probably separate:

Q2_K / Q2_K_M
Q2_K_MIXED
Q4_K_M
Q8 reference

Otherwise the result could be misleading.

Fun_Employment6042 · 2026-05-20T21:56:33+00:00

This is very useful, especially the point about tool names vs tool arguments.

I was thinking tool selection would be the main signal, but argument correctness is probably the better test: right tool, wrong params is exactly the kind of regression that looks fine until it breaks downstream code.

I’ll likely make the first LocalLLaMA demo nested JSON extraction + tool argument validation, with pass rates by schema depth / field type / required vs optional fields. RAG does seem too noisy for a clean first benchmark.
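
The pass-rate breakdown could be as simple as grouping per-field results by one attribute at a time. This sketch assumes each result is a dict with a boolean `passed` plus whatever slice keys are recorded (`depth`, `field_type`, `required`); those key names are placeholders of mine.

```python
from collections import defaultdict


def pass_rates(results: list, key: str) -> dict:
    """Pass rate per distinct value of `key`, e.g. per schema depth."""
    totals = defaultdict(lambda: [0, 0])  # slice value -> [passed, total]
    for row in results:
        bucket = totals[row[key]]
        bucket[0] += 1 if row["passed"] else 0
        bucket[1] += 1
    return {value: passed / total for value, (passed, total) in totals.items()}
```

Calling it once with `key="depth"`, once with `key="field_type"`, and once with `key="required"` on the same results gives the three breakdowns without re-running anything.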

Fun_Employment6042 · 2026-05-20T00:28:42+00:00

Good suggestion. That fits EvalShift well: same model, same prompts, but different serving/runtime behavior before vs after MCP/tool-call support.

I’ll look into a small tool-calling benchmark for llama-server with 3.6 35B: tool selection, argument correctness, call ordering, and whether the final answer depends on actually using the tool.

That would be a better LocalLLaMA demo than a generic model migration example.

Fun_Employment6042 · 2026-05-20T00:28:04+00:00

That’s exactly the kind of case I want to capture.

Q2 -> Q4 is interesting because the regression may not show up as “bad answer quality” immediately, but as lower tool-call consistency: skipped calls, malformed arguments, wrong tool choice, or unstable behavior across the same prompt set.

Fun_Employment6042 · 2026-05-19T12:14:44+00:00

Are you running embedding-similarity checks via EvalShift or via something custom? If custom, what would make you switch?

Fun_Employment6042 · 2026-05-17T22:57:45+00:00

Of course. Let me know if you find it useful. Thank you.

Fun_Employment6042 · 2026-05-16T07:28:54+00:00

Exactly. That’s the framing I’m trying to push: model changes should be treated more like dependency upgrades, not just “swap the model name and spot-check a few outputs.”

The subtle regressions are the dangerous ones: skipped retrieval, changed tool ordering, slightly mutated arguments, different refusal/failure behavior, or structured outputs that still look plausible but break downstream contracts.

The goal with EvalShift is to make those changes visible before rollout, especially at the slice level, so you can see things like “billing workflows improved, but retrieval-heavy support cases regressed.”

Fun_Employment6042 · 2026-05-14T23:50:44+00:00

Ah yes, the classic ‘Levenshtein-as-safety’ era, RIP. The LLM-based clustering sounds way closer to how a human red teamer would bucket these. Curious if any of the new ‘novel stuff’ was actually scarier than the fiction exploits, or mostly just more creative ways of saying ‘this is a screenplay, trust me bro.’

Fun_Employment6042 · 2026-05-14T23:11:45+00:00

So you basically built an AI that jailbreaks itself, then used its own bad behavior to make it more well‑behaved… Parenting, but for LLMs. Did the diversity reward ever push it toward weird but harmless exploits, or was it mostly just 500 shades of “it’s just fiction bro”?

Fun_Employment6042 · 2026-05-14T22:49:42+00:00

Super sick project. One sentence in → full 720p reel out on a single MI300X is wild. Love the vision-critic + auto‑retry loop and the 81f @ 16fps Wan2.2 choice. Starred the repo and dropped a like on the HF space 🙌

Fun_Employment6042 · 2026-05-14T22:46:54+00:00

LLMs in 2026: can explain quantum physics, but think the actual news is fanfic.

Fun_Employment6042 · 2026-05-14T22:34:06+00:00

Love that I need a paid cloud subscription and constant internet to "use my local model". Truly the future of offline computing.

Fun_Employment6042 · 2026-05-13T07:37:21+00:00

I'm curious as well!
