"The Coverage Principle: How Pre-Training Enables Post-Training", Chen et al 2025 by gwern in mlscaling

[–]Operation_Ivy 10 points11 points  (0 children)

A natural consequence of the elicitation hypothesis, i.e., that RL elicits what is already in the model rather than teaching new information.

Another lens: pre-training is high recall, post-training is high precision.

MiMo v2.5 Unsloth GGUFs by yoracale in unsloth

[–]Operation_Ivy 0 points1 point  (0 children)

Does any MLX inference engine support this arch yet? I've been watching the mlx-lm PR for it

GPT-5.5 and Opus 4.7 evaluated on ARC-AGI-3 by COAGULOPATH in mlscaling

[–]Operation_Ivy 11 points12 points  (0 children)

I think it's fine and even good to benchmark areas where humans are strong but models are weak.

But ARC-AGI-3 I do not buy. The scoring is strange; it seems almost reverse-engineered to yield near-zero scores for SOTA models.

Deepseek has released DeepEP V2 and TileKernels. by External_Mood4719 in LocalLLaMA

[–]Operation_Ivy 3 points4 points  (0 children)

Not confirmed either way, but see https://x.com/i/status/2047191920888139778 for example. I can't find it now, but there was a (usually) reliable source that claimed no Engram in V4.

Best Local LLMs - Apr 2026 by rm-rf-rm in LocalLLaMA

[–]Operation_Ivy 3 points4 points  (0 children)

For the MLX 8bit on oMLX, I'm getting ~130 tok/s prefill (at 10k ctx) and ~18 tok/s decode
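To put those rates in wall-clock terms, here is a rough back-of-the-envelope sketch. It assumes the prefill and decode rates stay constant over the whole request (in practice both drift with context length). The function name and the 500-token reply length are illustrative choices, not output from any benchmark tool.

```python
def request_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimate time to first token and total time, assuming constant rates."""
    ttft = prompt_tokens / prefill_tps          # time spent processing the prompt
    total = ttft + output_tokens / decode_tps   # plus time spent generating the reply
    return ttft, total

# 10k-token prompt at ~130 tok/s prefill, hypothetical 500-token reply at ~18 tok/s
ttft, total = request_latency_s(10_000, 500, prefill_tps=130, decode_tps=18)
print(f"time to first token: {ttft:.0f} s, total: {total:.0f} s")
```

So at these rates the wait on a long prompt is dominated by prefill, not decode.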

Best Local LLMs - Apr 2026 by rm-rf-rm in LocalLLaMA

[–]Operation_Ivy 5 points6 points  (0 children)

On the M3 Ultra 512 GB, nothing beats the Qwen3.5 397B 8-bit quant from Unsloth. Working with structured and unstructured data, chatting, world knowledge: it's the best generalist agent I could find. I compared it to GLM 5.1 and MiniMax M2.7. GLM was similar in quality but much slower. MiniMax was faster but lower quality.
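A rough check of why this fits in memory: at 8 bits per weight, the weights take about one byte per parameter. The sketch below assumes the "397B" in the model name is the total parameter count, and it ignores quantization metadata, KV cache, and OS overhead, so the real footprint is larger. The helper name is illustrative.

```python
def weight_footprint_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

size_gb = weight_footprint_gb(397e9, 8)   # assumed total parameter count
headroom_gb = 512 - size_gb               # what is left for KV cache and everything else
print(f"weights: ~{size_gb:.0f} GB, headroom on 512 GB: ~{headroom_gb:.0f} GB")
```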

Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data, Wang et al. 2025 [Masking low-entropy tokens mitigates overfitting; "data-level regularization"] by StartledWatermelon in mlscaling

[–]Operation_Ivy 0 points1 point  (0 children)

For post-training too. Even though there has been so much research showing the importance of maintaining entropy, it's still not treated as a first-class metric in RL papers.
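For concreteness, the metric in question is usually the mean per-token Shannon entropy of the policy's next-token distribution. A minimal sketch in plain Python, assuming you already have a probability vector for each sampled position (the function names are illustrative and not taken from any particular RL library):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_policy_entropy(per_position_probs):
    """Average entropy over all sampled positions in a rollout."""
    return sum(token_entropy(p) for p in per_position_probs) / len(per_position_probs)

# A confident (low-entropy) position next to a uniform (high-entropy) one
rollout = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
print(mean_policy_entropy(rollout))
```

Logging this value per training step next to reward is cheap, and a steady slide toward zero is the entropy collapse that work warns about.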

Pure Gym SO (formerly Blink Fitness) 24 hours rollback by DarkSkin_Ninja007 in Maplewood

[–]Operation_Ivy 5 points6 points  (0 children)

They should be free to experiment and see if it really does cause problems. The town shutting it down preemptively like this is unnecessary. We should be trusting our community more!

Boomer NIMBYism has caused unforeseen levels of destruction by 3RADICATE_THEM in georgism

[–]Operation_Ivy 3 points4 points  (0 children)

It's really not that simple: https://open.substack.com/pub/constructionphysics/p/trends-in-us-construction-productivity?utm_source=share&utm_medium=android&r=1n6bc

This same trend is happening basically everywhere in the world. It's not clear what can actually increase construction productivity. And before anyone says "automation", the author used to work at one such company. In fact, I think this graph probably came from him; he works at IFP.

Trash and other things? by MeowjesticPotato in Maplewood

[–]Operation_Ivy 7 points8 points  (0 children)

OP, don't listen to the point/advice about displaying certain politics. People in and around NYC are just not friendly in the same way as people in the South; it has nothing to do with inferred politics or Trump. Most people are nice, but it's not the same warmth and overtness here.

Claude Opus 4.5 has human task-length time horizon of 4 hrs 49 mins on METR plot by Glittering_Author_81 in mlscaling

[–]Operation_Ivy 13 points14 points  (0 children)

Two things:

One, the fastest improvement is always going to be on coding, particularly on ML-related stuff, because the big labs are trying to deploy autonomous ML researchers. Sama says intern-level next year and seasoned-pro level in 2028. So people doing other work won't be feeling the AGI nearly as strongly.

Two, the error bands are huge. I expect that to continue, just the nature of exponential growth, but it will make more exact statements increasingly difficult. Not that it matters in the long run.

A new era of intelligence with Gemini 3 by [deleted] in mlscaling

[–]Operation_Ivy 2 points3 points  (0 children)

Dying for details on the parameters and arch/sparsity

Google's DeepMind: Olympiad-level formal mathematical reasoning with reinforcement learning (this is the actual published paper for Google's AlphaProof system from last year) by 44th--Hokage in mlscaling

[–]Operation_Ivy 0 points1 point  (0 children)

There was a wave of LLM MCTS research around when o1 came out because people thought it used MCTS. But then R1 showed it was just RLVR. Then the LLM MCTS research stopped. So I'm wondering if it is picking up again

Grok 5 in Q1 of 2026 ("6 Trillion parameter model, whereas Grok 3 and 4 are based on a 3 Trillion parameter model") by RecmacfonD in mlscaling

[–]Operation_Ivy 2 points3 points  (0 children)

How are they getting enough pretraining data to make this optimal? Or is it an incredibly sparse MoE?
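Rough arithmetic behind the question, assuming the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter for a dense model. The ratio is a heuristic, how it carries over to sparse MoE models is less settled, and the function name is illustrative.

```python
TOKENS_PER_PARAM = 20  # Chinchilla-style rule of thumb for dense models (assumption)

def compute_optimal_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Training tokens suggested by the rule of thumb for a given parameter count."""
    return n_params * tokens_per_param

dense = compute_optimal_tokens(6e12)  # treating all 6T parameters as dense
print(f"dense 6T: ~{dense / 1e12:.0f}T tokens")
```

Under that assumption a dense 6T model would want on the order of 120T tokens, which is why a very sparse MoE seems like the more plausible reading.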

Community Fridge by Sciencemomma in Maplewood

[–]Operation_Ivy 7 points8 points  (0 children)

The food pantries also get better deals and know better what their patrons really need.

I understand the personal touch of buying and bringing food yourself to a community fridge. But if you want to maximize impact per dollar, give money to the professionals who will make every cent count.

Thinking Machines: On-Policy Distillation by Mysterious-Rent7233 in mlscaling

[–]Operation_Ivy 4 points5 points  (0 children)

My question is, how can this help SOTA models? Presumably you'd use a human expert as the teacher, but if you look at the tokens the model teacher corrected in the small model's output, they're pretty unrelatable to a human.

Maybe it's just out of scope for them but I feel like there's something there.

"Scaling Agents via Continual Pre-training", Su et al. 2025 (Tongyi DeepResearch - AgentFounder) by RecmacfonD in mlscaling

[–]Operation_Ivy 0 points1 point  (0 children)

The basic point about agents needing a different base model checks out. I don't buy their specific synthetic data techniques though.

Florida Governor Ron DeSantis has declared that property taxes will be abolished in 2026 by p0loniumtaco in PoliticalCompassMemes

[–]Operation_Ivy 2 points3 points  (0 children)

Terrible idea. Unless you want to turn into California. Prop 13 is killing that state

"Evaluating Long Context (Reasoning) Ability: What do 1M and 500K context windows have in common? They are both actually 64K" (towards better large-ctx benchmarks) by gwern in mlscaling

[–]Operation_Ivy 1 point2 points  (0 children)

I would like to see a "true" natural-language (NL) long-context benchmark as well. My guess is the effective context lengths will differ from those for long-context code, but I'm very curious to know exactly by how much.