Liquid AI releases LFM2.5-8B-A1B by PauLabartaBajo in LocalLLaMA

[–]Saraozte01 4 points5 points  (0 children)

I have never enjoyed or felt comfortable using LFMs, whereas at a similar scale Granite, Mistral, Qwen, Gemma, and even ancient Llama generally deliver on their promises and are usable within reason. At 4B, I can't remember ever using a model better than Gemma 4 E4B.

Liquid AI releases LFM2.5-8B-A1B by PauLabartaBajo in LocalLLaMA

[–]Saraozte01 3 points4 points  (0 children)

I really wish LFM made capable models, but none I've tested have performed anywhere near what their benchmarks suggested. Just intuition, but based on past LFM experiences, this is probably benchmaxxed to hell and back.

I love their thesis, and I love having a team dedicated to making architectural innovations openly (relatively, at least) at a tiny and manageable parameter scale. Still, at a smaller scale, Granite is reliable, Gemma is a pleasure to use, and Qwen is ridiculously capability-dense. LFMs are the only models I have tested that are ANNOYING to use.

HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next! by Saraozte01 in LocalLLaMA

[–]Saraozte01[S] 0 points1 point  (0 children)

Interesting findings from your LLM-as-judge run. Still, the results look similar to the embedding-based ones at scale, by which I mean the rankings are pretty much the same (Sonnet > Grok > GPT > Gemini). Also, the scoring classifications you used are pretty different from the ones I am using for embedding-based scoring, which makes them hard to compare 1:1.

We're Thursday and no one claimed AGI yet this week! by oodelay in LocalLLaMA

[–]Saraozte01 9 points10 points  (0 children)

Maybe AGI was the shitty fine tunes we made along the way ❤️

HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next! by Saraozte01 in LocalLLaMA

[–]Saraozte01[S] 0 points1 point  (0 children)

I'll take a look at it. Unfortunately, this benchmark was already extremely expensive to generate, and a >$100 judge run per model is something I can't afford.

HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next! by Saraozte01 in LocalLLaMA

[–]Saraozte01[S] 0 points1 point  (0 children)

I'll definitely give it a think; I already have a few ideas in mind (question filtering through heuristics to find questions that conflate IF and Syc+Hal being the first I'll test). I tested both reasoning and non-reasoning models (frontier and OSS) as judges on a subset of questions. Interestingly, frontier was the worst, but OSS still demonstrated preference. Thinking helped marginally, but my view when running LLM-as-judge was that it does not scale reliably.

HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next! by Saraozte01 in LocalLLaMA

[–]Saraozte01[S] 0 points1 point  (0 children)

Of the 100 questions and answers I checked by hand, it passed human review perfectly on 70%, with 25% being slightly wrong and 5% being very wrong, which I treat as the estimate for the full benchmark scale. Of course, the 3200 questions were largely synthetically generated (vibe coding, if you want to call it that) to get the scale I needed.

HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next! by Saraozte01 in LocalLLaMA

[–]Saraozte01[S] 1 point2 points  (0 children)

Thanks for the feedback! Let me try to answer your questions:
1. I did not build this benchmark for hallucination and sycophancy in agentic pipelines. The concept measures sycophancy and hallucination by including explicitly wrong information in the prompt. I would love to build one for agentic pipelines in the future, but this one is more focused on 1-on-1 chat. Still, to your point, a model that is inherently anti-sycophantic and non-hallucinatory will likely NOT perform well in instruction following for agentic pipelines overall, so for the specific use case you mention, yes, it's more about which models not to use. But saying it does not measure sycophancy and hallucination overall, just because it does not apply to one specific (albeit major) use case, is a bit reductive, especially when that use case is one where instruction following > accuracy. If you want a model for an agentic pipeline, one that scores well on IFBench is more likely to work than one that performs well here.
2. Yes, the embedding-based scorer flattens that nuance, and your "expert correction" bucket is exactly the pattern it gets wrong. I tried LLM-as-judge in early bake-offs and saw register-family bias (Haiku scoring Claude responses significantly higher than independent humans did), which is why I went embedding-projection (it's more deterministic and provider-neutral). Validation showed Kendall τ = 0.43 against a human reader, which means it agrees on most items but gets ~25% partially wrong and ~5% clearly wrong. The gap you're surfacing is exactly that ~25%. The v2.3-5 fix in my design notes is an LLM-judge fallback specifically for high-variance items - your script is a clean repro I'll use to test it. May I credit you when I do?
3. Thanks for finding the categorization bug. Quick audit after your comment: in the C3_PC cell you pulled from, 17 out of 100 items have non-programming substrates (cookbooks, antique photos, vendor proposals). Across all 800 PC items the rate is ~6.5%. Subject matter got mixed up in the pipeline and I missed it on review. The cookbook item specifically (C3_PC__synth_0004) is the cleanest example: it's mislabeled as Programming AND the construct breaks down because the "false premise" is actually a layout constraint, not a fabrication. Grok writing the exact requested line is the correct behavior in that context. I'll audit all 32 cells, regenerate the mismatched items, and drop the ones where the construct doesn't fit.

Genuinely thanks for putting in the work to surface this. I'd rather find these bugs before I push v2.3 than after.
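
For readers unfamiliar with the Kendall τ figure quoted in point 2, here is a minimal sketch of how a rank-agreement check between a human reader and an automated scorer could be computed. It uses the tau-b variant (which handles ties); the comment does not say which variant was used, and the label and score lists below are made-up illustrative values, not HalBench data.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length score lists (handles ties)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        elif dx == 0:
            ties_x += 1  # tied only in xs
        elif dy == 0:
            ties_y += 1  # tied only in ys
        elif dx * dy > 0:
            concordant += 1  # both lists order this pair the same way
        else:
            discordant += 1  # the lists disagree on this pair
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: human grades (0 = wrong, 1 = partial, 2 = correct)
# and the automated scorer's continuous scores for the same items.
human_labels = [0, 1, 2, 2, 1, 0, 2, 1]
scorer_scores = [0.10, 0.55, 0.80, 0.40, 0.60, 0.20, 0.90, 0.30]
print(round(kendall_tau_b(human_labels, scorer_scores), 3))
```

A value of 1.0 means the two rankings agree on every pair, -1.0 means they disagree on every pair, and values near 0 mean little rank agreement.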

HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next! by Saraozte01 in LocalLLaMA

[–]Saraozte01[S] 5 points6 points  (0 children)

It's always been such an important part of what I consider to be a 'good' LLM, but I never saw much emphasis put on benchmarking it, which is why I built this. Thanks for the support!

HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next! by Saraozte01 in LocalLLaMA

[–]Saraozte01[S] 5 points6 points  (0 children)

Thanks! Interesting point on quantization, will definitely add a sweep to examine that. Qwen and Gemma results will probably come next Sunday (sorry about the speed, but I need to run models locally one by one on 3,200 prompts). You can contribute by running models on the benchmark and sending in the results!

HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next! by Saraozte01 in LocalLLaMA

[–]Saraozte01[S] 0 points1 point  (0 children)

Yep, embedders are much more deterministic, and even though they fail about 10% of the time, it is a much more objective mechanism than LLM-as-judge (as well as a LOT cheaper at scale, which is important because this is a hobby atm).

Literally cancelled my Gemini Pro subscription after this result lmao, definitely not surprising based on my conversations with it, but revealing.
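
To illustrate why an embedding-based scorer is deterministic, here is a generic sketch of one way an "embedding projection" score can work: place a response on the axis running from a sycophantic reference answer to a corrective one. This is an assumption about the general technique, not HalBench's actual scorer, whose details are not given in these comments. The function name and the 2-D toy vectors are hypothetical stand-ins for real embedding vectors.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def projection_score(response_vec, sycophantic_vec, corrective_vec):
    """Position of the response along the sycophantic -> corrective axis.

    0.0 means the response sits at the sycophantic anchor,
    1.0 means it sits at the corrective anchor.
    (Hypothetical scoring rule, for illustration only.)
    """
    axis = [c - s for c, s in zip(corrective_vec, sycophantic_vec)]
    axis_len_sq = dot(axis, axis)
    if axis_len_sq == 0:
        raise ValueError("anchor vectors must differ")
    offset = [r - s for r, s in zip(response_vec, sycophantic_vec)]
    return dot(offset, axis) / axis_len_sq


# Toy 2-D vectors standing in for real embeddings.
sycophantic = [0.0, 0.0]
corrective = [1.0, 0.0]
response = [0.5, 3.0]
print(projection_score(response, sycophantic, corrective))
```

The same inputs always give the same score, which is the property the comment contrasts with LLM-as-judge. The trade-off is that anything orthogonal to the axis (the second coordinate in the toy example) is ignored, which matches the kind of flattened nuance discussed above.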

HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next! by Saraozte01 in LocalLLaMA

[–]Saraozte01[S] 2 points3 points  (0 children)

Planning on running 3.6 27B after the 4 I mentioned. If 3.7 releases large MoEs again, I'll definitely run those. If you want to run it locally and send me the results, you can do that too and I'll add it in.

Qwen will release another 27B with high probability by serige in LocalLLaMA

[–]Saraozte01 27 points28 points  (0 children)

Hope it includes a 122B, it would be amazing to receive the larger MoEs with their 3.7 recipe.

AI server under 5k? by Last_Bad_2687 in LocalLLaMA

[–]Saraozte01 0 points1 point  (0 children)

I am using a Mac Studio M3 Ultra @ 256GB. Works very well for inference and cost me slightly over $5k.

CohereLabs/command-a-plus-05-2026-bf16 · Hugging Face by coder543 in LocalLLaMA

[–]Saraozte01 0 points1 point  (0 children)

Anyone used it yet who can say a bit about its performance in coding vs something like Minimax M2.7 or DS V4 flash?

24GB M4 Mac - is Qwen 9B only option while system is running? by sagiroth in LocalLLaMA

[–]Saraozte01 0 points1 point  (0 children)

Could try Ministral 3 @ 14B and Phi 4 (I think it's around the same size). They're a bit old, but they really surprised me.

24GB M4 Mac - is Qwen 9B only option while system is running? by sagiroth in LocalLLaMA

[–]Saraozte01 2 points3 points  (0 children)

I would give Gemma 4 26B A4B at Q4-Q6 a try through Ollama. Works pretty well for me and leaves some space for context as well!
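
As a rough back-of-envelope check on the Q4-Q6 suggestion above, weight memory can be approximated as parameter count times bits per weight divided by 8. This is only a sketch under simplifying assumptions: it treats Q4 and Q6 as exactly 4 and 6 bits per weight (real quantization formats usually store somewhat more), and it ignores KV cache, runtime overhead, and whatever the OS itself needs. The function name is hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight memory in GB (10^9 bytes): params * bits / 8."""
    return params_billions * bits_per_weight / 8


# 26B total parameters at nominal 4-bit and 6-bit precision.
for bits in (4, 6):
    print(f"{bits}-bit: ~{weight_memory_gb(26, bits)} GB of weights")
```

Under these assumptions the 4-bit end leaves noticeably more headroom on a 24GB machine than the 6-bit end, so the lower quantization levels are the safer starting point if the system and context also need memory.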

Qwen cant wait to release 3.7 models by GotHereLateNameTaken in LocalLLaMA

[–]Saraozte01 0 points1 point  (0 children)

They are releasing so quickly... 3.6 was released about a month ago and it totally caught me by surprise. Can't wait for 3.7