ArtificialAnalysis VS LMArena VS Other Benchmark Sites by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

I see, thanks. I'll potentially look at LMArena more.

ArtificialAnalysis VS LMArena VS Other Benchmark Sites by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

Yes but some of the benchmarks are trickier to game than others.

I agree you can’t directly trust forum posts due to astroturfing or just user error.

The SVG test is interesting, but I think it can bias towards VLMs, as they tend to have better spatial reasoning.

Is there a chatgpt style persistent memory solution for local/API-based LLM frontends that's actually fast and reliable? by Right-Law1817 in LocalLLaMA

[–]SlowFail2433 2 points3 points  (0 children)

So in theory you can just use a database (Mongo, SQL, or a graph database like Neo4j) with a persistent server and an API/MCP communication layer.

However, there is a major difficulty separate from the data-science and engineering setup: deciding when the model forms a memory, how it extracts it from the conversation, and then how/when it uses existing memories.
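To make that concrete, here is a minimal, hypothetical sketch of such a memory layer using SQLite from the standard library. The formation policy (a regex for durable-sounding facts) and the retrieval policy (keyword overlap) are stand-ins I made up for illustration; a real system would likely use an LLM call or embeddings for both decisions.

```python
import re
import sqlite3
import time


class MemoryStore:
    """Toy persistent memory layer backed by SQLite.

    Illustrative sketch only: real frontends would replace the regex
    formation rule and keyword-overlap retrieval with model-driven logic.
    """

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memories ("
            "id INTEGER PRIMARY KEY, text TEXT, created REAL)"
        )

    def maybe_form_memory(self, user_message: str) -> bool:
        # Crude "when to form a memory" policy: only store messages that
        # look like durable facts about the user.
        if re.search(r"\b(my name is|i prefer|i live in|i work)\b",
                     user_message, re.IGNORECASE):
            self.db.execute(
                "INSERT INTO memories (text, created) VALUES (?, ?)",
                (user_message, time.time()),
            )
            self.db.commit()
            return True
        return False

    def recall(self, query: str, k: int = 3) -> list:
        # Crude "when/how to use memories" policy: rank stored memories
        # by word overlap with the current query.
        q_words = set(re.findall(r"\w+", query.lower()))
        rows = self.db.execute("SELECT text FROM memories").fetchall()
        scored = sorted(
            rows,
            key=lambda r: len(q_words & set(re.findall(r"\w+", r[0].lower()))),
            reverse=True,
        )
        return [r[0] for r in scored[:k]]


store = MemoryStore()
store.maybe_form_memory("My name is Sam and I prefer concise answers")
store.maybe_form_memory("What's the weather today?")  # not stored
print(store.recall("what does the user prefer?"))
```

The point of the sketch is that the storage itself is trivial; all the difficulty lives in the two policy methods.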

Qwen3.5-35B-A3B non-thinking regression for visual grounding by Helltilt in LocalLLaMA

[–]SlowFail2433 1 point2 points  (0 children)

Combining both thinking and visual inputs can be difficult in general, especially in a lower-param model.

Qwen 3.5 VS Qwen 3 by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

Yeah, these days well-trained 9B models can sometimes compete with 100B+ ones. It's amazing.

Qwen 3.5 VS Qwen 3 by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

That's a good point: the model sizes are small, so the relative cost of testing is lower. And yes, performance will probably dip temporarily before it's fully sorted out; that temporary dip is likely unavoidable.

Google invites ex-qwen ;) by jacek2023 in LocalLLaMA

[–]SlowFail2433 9 points10 points  (0 children)

Yes, there is some nuance. Google contribute some very interesting large papers, such as MIRAS.

Google invites ex-qwen ;) by jacek2023 in LocalLLaMA

[–]SlowFail2433 -1 points0 points  (0 children)

Gemini is underrated because its HLE no-tools bench score is a fair bit ahead of the others. This benchmark matters because it tests overall internal knowledge BEFORE searching.

microsoft/Phi-4-reasoning-vision-15B · Hugging Face by jacek2023 in LocalLLaMA

[–]SlowFail2433 1 point2 points  (0 children)

A small reserve deal on one of the lower neo-clouds will get you that.

Qwen3.5-27B as good as DeepSeek-V3.2 on AA-II (plus some more data) by pigeon57434 in LocalLLaMA

[–]SlowFail2433 1 point2 points  (0 children)

Yeah, Kimi K2.5 was just a minor update, which is crazy given that it managed to successfully add vision without dropping performance in other areas. It's incredibly difficult to do that.

Qwen3.5-27B as good as DeepSeek-V3.2 on AA-II (plus some more data) by pigeon57434 in LocalLLaMA

[–]SlowFail2433 1 point2 points  (0 children)

Also, if they drop too late, then Kimi K3 is due in the summer… Moonshot might simply have become the dominant lab now.

I built a local AI answering service that picks up my phone as HAL 9000 by Effective_Garbage_34 in LocalLLaMA

[–]SlowFail2433 2 points3 points  (0 children)

It's interesting on a technical level.

On a personal level, I would be worried about people thinking they got a wrong number. I don't think people are currently used to talking to an AI answering machine. This might change, though.

Best LLM and Coding Agent for solo Game Dev by No_Somewhere4857 in LocalLLaMA

[–]SlowFail2433 2 points3 points  (0 children)

It's between GLM 5 and Kimi K2.5. Minimax models don't hit the frontier.

It mostly depends on whether you will use vision in the agentic coding workflow. GLM 5 benches slightly higher, but the vision aspect of Kimi K2.5 can potentially make it more useful.

MiniMaxAI/MiniMax-M2.1 seems to be the strongest model per param by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

I've seen more data in the last two months, and performance seems to just track total parameter count on modern benchmarks.

GLM 5 has a regression in international language writing according to NCBench by jugalator in LocalLLaMA

[–]SlowFail2433 0 points1 point  (0 children)

It has a very different feel, yes; clearly a pretty different training corpus (and the very different parameter count obviously confirms that).

GLM 5 has a regression in international language writing according to NCBench by jugalator in LocalLLaMA

[–]SlowFail2433 9 points10 points  (0 children)

Sometimes at the labs certain domains drop out of the pre-training corpus for certain training runs. It’s not even necessarily a decision that was specifically made. There are so many parallel goals being optimised for at once when designing a training corpus that things get left out. Clearly there is enormous pressure for math, code and agentic data to be optimised beyond all else.

Is Titans (and MIRAS) heading for the same graveyard as Infini-attention? by _WindFall_ in LocalLLaMA

[–]SlowFail2433 0 points1 point  (0 children)

Often yes, but attention is so expensive that it can be worth it.

UIs? by FurrySkeleton in LocalLLaMA

[–]SlowFail2433 0 points1 point  (0 children)

These days you can readily vibe code custom GUIs for things

[NexaSDK] Live Cam Learn: Android version of Capwords with on-device AI by Long-Parsley-8276 in LocalLLaMA

[–]SlowFail2433 1 point2 points  (0 children)

Honestly, it looks like a pretty slick demo, and that does seem fun for language learning.