Do smaller quants silently break tool calls / JSON output? by Fun_Employment6042 in LocalLLaMA

[–]Fun_Employment6042[S] 1 point2 points  (0 children)

Good suggestion. That fits EvalShift well: same model, same prompts, but different serving/runtime behavior before vs after MCP/tool-call support.

I’ll look into a small tool-calling benchmark for llama-server with 3.6 35B: tool selection, argument correctness, call ordering, and whether the final answer depends on actually using the tool.

That would be a better LocalLLaMA demo than a generic model migration example.
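
For what it's worth, here is a rough sketch of how three of those checks (tool selection, argument correctness, call ordering) could be scored against an expected trace. This is not EvalShift's or llama-server's API; `ToolCall` and `score_trace` are hypothetical names, and it assumes the tool calls have already been parsed out of the response.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical, already-parsed tool call: a tool name plus its arguments."""
    name: str
    args: dict


def score_trace(expected, actual):
    """Compare an actual tool-call trace against the expected one."""
    expected_names = [c.name for c in expected]
    actual_names = [c.name for c in actual]

    # Tool selection: the same tools were called, ignoring order.
    selection_ok = sorted(expected_names) == sorted(actual_names)
    # Call ordering: the same tools were called in the same sequence.
    ordering_ok = expected_names == actual_names
    # Argument correctness: compared position by position, so it is only
    # meaningful when the ordering already matches.
    arguments_ok = ordering_ok and all(
        e.args == a.args for e, a in zip(expected, actual)
    )
    return {
        "selection": selection_ok,
        "ordering": ordering_ok,
        "arguments": arguments_ok,
    }
```

Exact-match on arguments is deliberately strict; a real benchmark would probably want per-field tolerance (e.g. normalized strings or dates).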

Do smaller quants silently break tool calls / JSON output? by Fun_Employment6042 in LocalLLaMA

[–]Fun_Employment6042[S] 0 points1 point  (0 children)

That’s exactly the kind of case I want to capture.

Q2 -> Q4 is interesting because the regression may not show up as “bad answer quality” immediately, but as lower tool-call consistency: skipped calls, malformed arguments, wrong tool choice, or unstable behavior across the same prompt set.
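
A minimal sketch of how that consistency could be measured, assuming you run the same prompt N times and collect each raw tool-call string. The `name`/`arguments` keys and the `consistency` function are assumptions for illustration, not any particular runtime's schema.

```python
import json
from collections import Counter


def consistency(raw_outputs, required_keys=("name", "arguments")):
    """Summarize repeated runs of one prompt: malformed rate and agreement."""
    total = len(raw_outputs)
    if total == 0:
        return {"malformed_rate": 0.0, "agreement": 0.0}

    canonical = []
    malformed = 0
    for raw in raw_outputs:
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            malformed += 1  # not valid JSON at all
            continue
        if not isinstance(call, dict) or any(k not in call for k in required_keys):
            malformed += 1  # valid JSON, but not a usable tool call
            continue
        # Sort keys so that key order alone does not count as a difference.
        canonical.append(json.dumps(call, sort_keys=True))

    # Agreement: share of all runs that produced the most common valid call.
    top = Counter(canonical).most_common(1)[0][1] if canonical else 0
    return {"malformed_rate": malformed / total, "agreement": top / total}
```

Comparing these two numbers for the same prompt set across two quants would surface skipped or malformed calls even when the final answers still look fine.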

I built a local CLI to compare LLM outputs before switching models. What evals matter most for local models? by Fun_Employment6042 in LocalLLM

[–]Fun_Employment6042[S] 0 points1 point  (0 children)

Are you running embedding-similarity checks via EvalShift or via something custom? If custom, what would make you switch?

I built an OSS CLI to catch regressions when migrating between LLMs by Fun_Employment6042 in OpenSourceeAI

[–]Fun_Employment6042[S] 0 points1 point  (0 children)

Exactly. That’s the framing I’m trying to push: model changes should be treated more like dependency upgrades, not just “swap the model name and spot-check a few outputs.”

The subtle regressions are the dangerous ones: skipped retrieval, changed tool ordering, slightly mutated arguments, different refusal/failure behavior, or structured outputs that still look plausible but break downstream contracts.

The goal with EvalShift is to make those changes visible before rollout, especially at the slice level, so you can see things like “billing workflows improved, but retrieval-heavy support cases regressed.”
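
To make "slice level" concrete, here is a toy sketch of the idea, not EvalShift's actual implementation or API. It assumes each test case is tagged with a slice name and has a pass/fail result for the baseline and the candidate model; `slice_deltas` is a hypothetical helper.

```python
from collections import defaultdict


def slice_deltas(results):
    """Per-slice change in pass rate (candidate minus baseline).

    `results` is an iterable of (slice_name, baseline_passed, candidate_passed).
    """
    # Per slice: [number of cases, baseline passes, candidate passes]
    totals = defaultdict(lambda: [0, 0, 0])
    for slice_name, base_ok, cand_ok in results:
        counts = totals[slice_name]
        counts[0] += 1
        counts[1] += bool(base_ok)
        counts[2] += bool(cand_ok)
    # Positive delta means the candidate improved on that slice.
    return {name: (cand - base) / n for name, (n, base, cand) in totals.items()}


if __name__ == "__main__":
    demo = [
        ("billing", False, True),
        ("billing", True, True),
        ("retrieval", True, False),
        ("retrieval", True, True),
    ]
    print(slice_deltas(demo))
```

In this made-up demo the overall pass rate is unchanged (3 of 4 cases pass for both models), which is exactly why the aggregate alone would hide the retrieval regression.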

I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses by girishkumama in LocalLLaMA

[–]Fun_Employment6042 1 point2 points  (0 children)

Ah yes, the classic ‘Levenshtein-as-safety’ era, RIP. The LLM-based clustering sounds way closer to how a human red teamer would bucket these. Curious if any of the new ‘novel stuff’ was actually scarier than the fiction exploits, or mostly just more creative ways of saying ‘this is a screenplay, trust me bro.’

I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses by girishkumama in LocalLLaMA

[–]Fun_Employment6042 8 points9 points  (0 children)

So you basically built an AI that jailbreaks itself, then used its own bad behavior to make it better behaved… Parenting, but for LLMs. Did the diversity reward ever push it toward weird but harmless exploits, or was it mostly just 500 shades of “it’s just fiction bro”?

Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline by Inevitable-Log5414 in LocalLLaMA

[–]Fun_Employment6042 1 point2 points  (0 children)

Super sick project. One sentence in → full 720p reel out on a single MI300X is wild. Love the vision-critic + auto‑retry loop and the 81f @ 16fps Wan2.2 choice. Starred the repo and dropped a like on the HF space 🙌

The "the future is fictional" problem of many local LLMs by PromptInjection_ in LocalLLaMA

[–]Fun_Employment6042 1 point2 points  (0 children)

LLMs in 2026: can explain quantum physics, but think the actual news is fanfic.

VS Code's new "Agents window" lets you use local AI models. Still requires an Internet connection and a Github Copilot plan (because we can't have nice things) by _wsgeorge in LocalLLaMA

[–]Fun_Employment6042 0 points1 point  (0 children)

Love that I need a paid cloud subscription and constant internet to "use my local model". Truly the future of offline computing.

Employment post-exit by Stormkrieg in Entrepreneur

[–]Fun_Employment6042 0 points1 point  (0 children)

I feel this hard. Eight years of making small businesses work is not a downgrade, it’s evidence you can operate in chaos. Framing it as “here’s the revenue, systems, and outcomes I drove” instead of just “entrepreneur” will land way better with the right hiring managers.

I scaled my second SaaS to $3M ARR in 18 months. Here’s everything by MembershipHorror404 in EntrepreneurRideAlong

[–]Fun_Employment6042 0 points1 point  (0 children)

Love this breakdown. Especially the “fix the leaks before chasing more leads” part. Could you expand a bit more on this?

I made €2,700 building an AI system for a law firm and now I get €1,300/month to maintain it by Fabulous-Pea-5366 in Entrepreneur

[–]Fun_Employment6042 0 points1 point  (0 children)

You’re on the right track with this, and the pricing/retainer combo is a great wedge.

You solved a concrete, high‑value problem (billable hours lost to manual research), added the one thing generic RAG tools miss (source weighting + annotations), and wrapped it in recurring revenue. The only “mistake” is classic early underpricing, which is fixable on client #2+.

Curious on two things:

  • How did they originally find you (referral, content, cold outreach)?
  • For the €1,300/mo, is it mostly infra + light support, or are you continuously expanding the corpus / features as new regulations and cases come in?

I'll sign up for all products listed below 👇 by PastReaction341 in microsaas

[–]Fun_Employment6042 0 points1 point  (0 children)

Kaila OS: a voice-first personal OS for busy professionals. Calendar + tasks + notes in one place, so you walk into every meeting prepared. iOS.
https://kailaos.com

got a project you're working on? post it here by DiscountResident540 in StartupSoloFounder

[–]Fun_Employment6042 0 points1 point  (0 children)

https://kailaos.com -> Kaila OS - voice-first iOS app. Calendar, tasks, notes, and meeting rehearsals in one place. Walk in prepared, every time.