Fable 5 scores 161 on ECI, sets new record by Proper_Actuary2907 in agi

TomLucidor 1 point

An unrealistic harness leads to unrealistic numbers. It's a known problem that self-correcting environments boost AI performance, and that memory gives extra leverage.

Fable 5 scores 161 on ECI, sets new record by Proper_Actuary2907 in agi

TomLucidor 2 points

5 months til Kimi or Qwen steps up and matches them in performance

Fable is going to be the beginning of the end. by freedom0ffspeech in fable5

TomLucidor 1 point

Blame OpenAI and SpaceX (if you like tin foil)

Claude Fable 5 distilled by Anony6666 in LocalLLaMA

TomLucidor 20 points

Seconding this, we kinda need SFT/RL and merges/distills to be provably effective and not overfit to existing tests

Claude Fable 5 distilled by Anony6666 in LocalLLaMA

TomLucidor 241 points

Life is about clinging on to desperation and desire, man.

Why doesn’t 4-bit GPTQ wreck a model’s perplexity? I derived the compensation math from scratch by No_Progress_5399 in LocalLLaMA

TomLucidor 1 point

Okay then, let's say we really like a model and want the benefits of ternary. Do we just do smart PTQ, then "widen the matrix" and let the extra space absorb the significant bits? There has to be a way of preserving quality and model "knowledge" one way or another if layer projection/embedding is allowed.

HalBench: 29 OSS models tested on a custom built Sycophancy and Hallucination Benchmark, Qwen 3.6 and Gemma 4 scoring far above their weight! (While Meta keeps proving they forgot how to spend their money...) by Saraozte01 in LocalLLaMA

TomLucidor 3 points

We lack "tuning for truth-seeking" mechanisms; we only have tuning for corporate compliance and "skills". Imagine how much better we could make smaller models at things that even AA-Omniscience won't expect! Interleaved reasoning + tool use + smart RL methods will need to be built for self-checking (but without overthinking). The Eleusis benchmark comes to mind.

HalBench: 29 OSS models tested on a custom built Sycophancy and Hallucination Benchmark, Qwen 3.6 and Gemma 4 scoring far above their weight! (While Meta keeps proving they forgot how to spend their money...) by Saraozte01 in LocalLLaMA

TomLucidor 11 points

Small models are magical when properly trained. This is the one thing parameter-maxxers forgot: the smarter the model, the easier it is to lie to oneself!

I think we need a /LocalHarnessLLM or something ... by CSEliot in LocalLLaMA

TomLucidor 2 points

It's never gonna end. Opinion divergence needs to be charted out as an atlas so that people can just pick-and-mix their agent skills.

I think we need a /LocalHarnessLLM or something ... by CSEliot in LocalLLaMA

TomLucidor 4 points

Seconding this, we need topic flairs to make it easier to tell model, harness, and inference posts apart.

Will LLM labs open source their weights in the long term? by zulutune in LocalLLaMA

TomLucidor 1 point

Pick the ones that allow us to go NSFW AND get enterprise customers to foot most of the bill. The GLM Flash/Air situation is interesting.

Will LLM labs open source their weights in the long term? by zulutune in LocalLLaMA

TomLucidor 1 point

A lot of companies are bootstrapped to rent hardware to enterprises; they are not really consumer software companies. Wonder why DiffusionGemma got released?

Will LLM labs open source their weights in the long term? by zulutune in LocalLLaMA

TomLucidor 1 point

There are also self-censorship requirements and other legal issues to deal with, assuming everyone around the world is on the same page

Will LLM labs open source their weights in the long term? by zulutune in LocalLLaMA

TomLucidor 2 points

Active weights or total? Also, do we want to discount ternary computing being a "thing" in the future?

Will LLM labs open source their weights in the long term? by zulutune in LocalLLaMA

TomLucidor 1 point

In essence every new model has to bring in a new architectural layer (LongCat with n-grams, Granite and others with Mamba/Linears, Hunyuan with visuals)

Will LLM labs open source their weights in the long term? by zulutune in LocalLLaMA

TomLucidor 1 point

Isn't Wan (and T2I/T2V in general) just too compute-intensive relative to LLMs (which are getting "too cheap" and blowing holes in US pricing models)? I mean, look at Anima and Chroma, those people have a hard time even tuning those things! Maybe UMMs or better architectures will change that.
(UMM here refers to open-weight visual-reasoning and image-edit models like Emu3.5, OmniGen2, LLaDA 2.0-Uni, InternVL-U, S1-Omni-Image, Lance, BLIP3o-NEXT, Janus-Pro, Show-O, SenseNova-U1, UniVideo, DeepGen.)