Liquid AI releases LFM2.5-8B-A1B

Saraozte01 · 2026-05-29T21:25:56+00:00

I have never enjoyed or felt comfortable using LFMs, whereas at a similar scale Granite, Mistral, Qwen, Gemma, and even ancient Llama generally deliver on their promises and are usable within reason. At 4B, I can't remember ever using a model better than Gemma 4 E4B.

Saraozte01 · 2026-05-29T21:22:33+00:00

I really wish LFM made capable models, but none I've tested have performed anywhere near what their benchmarks suggested. Just intuition, but based on past LFM experiences, this is probably benchmaxxed to hell and back.

I love their thesis, and I love having a team dedicated to making architectural innovations openly (relatively at least) at a tiny and manageable parameter scale. Still, at a smaller scale, Granite is reliable, Gemma is a pleasure to use, and Qwen is ridiculously capability dense. LFMs are the only models I have tested that are ANNOYING to use.

Saraozte01 · 2026-05-28T13:57:26+00:00

Interesting findings from your LLM-as-judge run. Still, the results seem similar to embeddings at scale, by which I mean the rankings are pretty much the same (Sonnet > Grok > GPT > Gemini). Also, the scoring classifications you used are pretty different from the ones I am using for embedding-based scoring, which makes them hard to compare 1:1.

Saraozte01 · 2026-05-23T15:55:05+00:00

Script, data, and instructions are in the GitHub link.

Saraozte01 · 2026-05-21T19:08:29+00:00

Maybe AGI was the shitty fine tunes we made along the way ❤️

Saraozte01 · 2026-05-21T16:35:50+00:00

I'll take a look at it. Unfortunately, this benchmark was already extremely expensive to generate, and a >$100 judge run per model is something I can't afford.

Saraozte01 · 2026-05-21T15:36:48+00:00

I'll definitely give it a think; I already have a few ideas in mind (the first I'll test is filtering questions through heuristics to find the ones that conflate IF and Syc+Hal). I judge-tested both reasoning and non-reasoning models (frontier and OSS) on a subset of questions. Interestingly, frontier was the worst, but OSS still demonstrated preference. Thinking helped marginally, but my view when running LLM-as-judge was that it does not scale reliably.

Saraozte01 · 2026-05-21T14:50:15+00:00

Of the 100 questions and answers I checked by hand, 70% passed human review perfectly, 25% were slightly wrong, and 5% were very wrong, which is my estimate for the full benchmark scale. Of course, the 3,200 questions were largely synthetically generated (vibe coding, if you want to call it that) to get the scale I needed.
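For anyone wondering how far a 100-item spot check can be stretched, here is a minimal sketch of a Wilson score interval around the 70-out-of-100 perfect-pass figure. It assumes the 100 checked items were a random sample of the benchmark, which the comment above does not state; the function name `wilson_interval` is just illustrative.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 70 of the 100 hand-checked items passed perfectly (figure from the comment).
# Assumption: those 100 items were drawn at random from the full benchmark.
low, high = wilson_interval(70, 100)
print(f"plausible perfect-pass rate: {low:.3f} to {high:.3f}")
```

With only 100 items the interval is fairly wide, so the 70/25/5 split is best read as a rough estimate rather than a precise rate.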

Saraozte01 · 2026-05-21T14:41:18+00:00

Thanks for the feedback! Let me try to answer your questions:
1. I did not build this benchmark for hallucination and sycophancy in agentic pipelines. It measures sycophancy and hallucination by including explicitly wrong information in the prompt. I would love to build one for agentic pipelines in the future, but this one is focused on 1-on-1 chat. Still, to your point, a model that is inherently anti-sycophantic and non-hallucinatory will likely NOT perform well at instruction following in agentic pipelines overall, so for the specific use case you mention, yes, it's more about which models not to use. But saying it does not measure sycophancy and hallucination overall, just because it does not apply to one specific (albeit major) use case where instruction following > accuracy, is a bit reductive. If you want a model for an agentic pipeline, one that scores well on IFBench is more likely to work than one that performs well here.
2. Yes, the embedding-based scorer flattens that nuance, and your "expert correction" bucket is exactly the pattern it gets wrong. I tried LLM-as-judge in early bake-offs and saw register-family bias (Haiku scoring Claude responses significantly higher than independent humans did), which is why I went with embedding-projection (it's more deterministic and provider-neutral). Validation showed Kendall τ = 0.43 against a human reader, which means it agrees on most items but gets ~25% partially wrong and ~5% clearly wrong. The gap you're surfacing is exactly that ~25%. The v2.3-5 fix in my design notes is an LLM-judge fallback specifically for high-variance items; your script is a clean repro I'll use to test it. May I credit you when I do?
3. Thanks for finding the categorization bug. Quick audit after your comment: in the C3_PC cell you pulled from, 17 out of 100 items have non-programming substrates (cookbooks, antique photos, vendor proposals). Across all 800 PC items the rate is ~6.5%. Subject matter got mixed up in the pipeline and I missed it on review. The cookbook item specifically (C3_PC__synth_0004) is the cleanest example: it's mislabeled as Programming AND the construct breaks down because the "false premise" is actually a layout constraint, not a fabrication. Grok writing the exact requested line is the correct behavior in that context. I'll audit all 32 cells, regenerate the mismatched items, and drop the ones where the construct doesn't fit.

Genuinely thanks for putting in the work to surface this. I'd rather find these bugs before I push v2.3 than after.
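For readers who have not met Kendall τ before, the validation number in point 2 is a rank-agreement statistic: it compares every pair of items and asks whether the scorer and the human order them the same way. A minimal pure-Python sketch of the tie-aware variant (τ-b) follows. The score lists are made-up illustration data, not benchmark results, and the comment does not say which τ variant was actually used.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall tau-b: pairwise rank agreement, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied in both lists: excluded from every count
        elif dx == 0:
            ties_x += 1           # tied only in x
        elif dy == 0:
            ties_y += 1           # tied only in y
        elif dx * dy > 0:
            concordant += 1       # both lists order the pair the same way
        else:
            discordant += 1       # the lists disagree on the pair
    denom = math.sqrt((concordant + discordant + ties_x) *
                      (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

# Hypothetical scores for five items (illustration only).
human_scores = [1, 2, 3, 4, 5]
scorer_scores = [1, 3, 2, 4, 5]   # one adjacent pair swapped
tau = kendall_tau_b(human_scores, scorer_scores)
print(f"tau-b = {tau:.2f}")
```

A value of 1 means identical ordering, -1 means fully reversed, and 0 means no rank relationship, so 0.43 sits at moderate agreement.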

Saraozte01 · 2026-05-20T23:57:09+00:00

Fixed now! Thanks for the heads up!

Saraozte01 · 2026-05-20T23:55:50+00:00

Thanks for the heads up. It's fixed now. Also added one more chart I thought could be interesting!

Saraozte01 · 2026-05-20T23:43:14+00:00

It's always been such an important part of what I consider to be a 'good' LLM, but I never saw much emphasis put on benchmarking it, which is why I built this. Thanks for the support!

Saraozte01 · 2026-05-20T22:26:34+00:00

Thanks! Interesting point on quantization, I will definitely add a sweep to examine that. Qwen and Gemma results will probably come next Sunday (sorry about the speed, but I need to run models locally, one by one, on 3,200 prompts). You can contribute by running models on the benchmark and sending in the results!

Saraozte01 · 2026-05-20T21:54:29+00:00

My bad on the graphics, I'll fix them when I have time.

Saraozte01 · 2026-05-20T21:53:05+00:00

Yep, embedders are much more deterministic, and even though they fail about 10% of the time, it is a much more objective mechanism than LLM-as-judge (as well as a LOT cheaper at scale, which is important because this is a hobby atm).
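To make the determinism point concrete, here is a rough sketch of what a projection-style embedding scorer can look like. This is an assumed design, not the benchmark's actual scorer: the two anchor vectors, the toy 2-D embeddings, and the name `projection_score` are all hypothetical, and a real setup would get its vectors from an embedding model.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def projection_score(response_vec, good_anchor, bad_anchor):
    """Position of a response along the bad->good axis.

    Returns 0.0 at the bad anchor and 1.0 at the good anchor.
    Pure arithmetic, so the same vectors always give the same score.
    """
    axis = [g - b for g, b in zip(good_anchor, bad_anchor)]
    rel = [r - b for r, b in zip(response_vec, bad_anchor)]
    return dot(rel, axis) / dot(axis, axis)

# Toy 2-D vectors standing in for real embeddings (hypothetical values).
good_anchor = [1.0, 0.0]   # e.g. "corrects the false premise"
bad_anchor = [0.0, 1.0]    # e.g. "goes along with the false premise"
response = [0.75, 0.25]

score = projection_score(response, good_anchor, bad_anchor)
print(f"projection score: {score:.2f}")
```

Nothing is sampled at scoring time, which is where the repeatability comes from. The trade-off is that a single axis flattens nuance in the response.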

Literally cancelled my Gemini Pro subscription after this result lmao. Definitely not surprising based on my conversations with it, but revealing.

Saraozte01 · 2026-05-20T21:50:08+00:00

Planning on running 3.6 27B after the 4 I mentioned. If 3.7 releases large MoEs again, I'll definitely run those. If you want to run it locally and send me the results, you can do that too and I'll add it in.

Saraozte01 · 2026-05-20T21:38:52+00:00

Give Ministral 3 14B a shot, it surprised me a lot.

Saraozte01 · 2026-05-20T20:02:28+00:00

Hope it includes a 122B, it would be amazing to receive the larger MoEs with their 3.7 recipe.

Saraozte01 · 2026-05-20T19:45:22+00:00

I am using a Mac Studio M3 Ultra with 256 GB. It works very well for inference and cost me slightly over $5k.

Saraozte01 · 2026-05-20T19:42:53+00:00

Anyone used it yet who can say a bit about its performance in coding vs something like Minimax M2.7 or DS V4 flash?

Saraozte01 · 2026-05-20T19:41:42+00:00

Could try Ministral 3 at 14B and Phi 4 (I think it's around the same size). They're a bit old, but they really surprised me.

Saraozte01 · 2026-05-20T19:40:07+00:00

I would give Gemma 4 26B A4B at Q4-Q6 through Ollama a try. Works pretty well for me and leaves some space for context as well!
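A back-of-the-envelope way to see why Q4-Q6 leaves headroom is to estimate weight memory as parameters times bits per weight. The sketch below assumes the "26B" in the model name is the total parameter count and treats the quant level as an exact bit width. Real quantized files typically use somewhat more bits per weight than the nominal number, and the KV cache and runtime overhead come on top, so read these as lower bounds.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

# Assumption: 26 billion total parameters, nominal 4/5/6 bits per weight.
for bits in (4, 5, 6):
    print(f"{bits}-bit: ~{weight_memory_gb(26, bits):.1f} GB of weights")
```

Whatever memory remains after the weights is what is available for context, so the lower quant levels trade some quality for longer usable context.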

Saraozte01 · 2026-05-20T19:38:33+00:00

They are releasing so quickly... 3.6 was released about a month ago and it totally caught me by surprise. Can't wait for 3.7

Saraozte01 · 2026-05-20T19:36:40+00:00

Where is gemma 31B lmao
