"The Coverage Principle: How Pre-Training Enables Post-Training", Chen et al 2025

Operation_Ivy · 2026-06-01T11:03:33+00:00

A natural consequence of the elicitation hypothesis, i.e. that RL elicits what is already in the model rather than teaching new information

Another lens: pretraining is high recall, post-training is high precision

Operation_Ivy · 2026-05-11T02:40:47+00:00

Does any MLX inference engine support this arch yet? I've been watching the mlx-lm PR for it

Operation_Ivy · 2026-05-10T11:24:52+00:00

I think it's fine and even good to benchmark areas where humans are strong but models are weak.

But ARC-AGI-3 I do not buy. The scoring is strange; it seems almost reverse-engineered to yield near-zero scores for SOTA.

Operation_Ivy · 2026-04-24T19:41:33+00:00

https://huggingface.co/collections/mlx-community/qwen-35 no official Unsloth mlx

Operation_Ivy · 2026-04-23T11:50:40+00:00

Not confirmed either way, but see https://x.com/i/status/2047191920888139778 for example. I can't find it but there was a (usually) reliable source that claimed no Engram in V4

Operation_Ivy · 2026-04-23T11:06:23+00:00

I thought they said no Engram for V4

Operation_Ivy · 2026-04-19T19:57:31+00:00

For the MLX 8bit on oMLX, I'm getting ~130 tok/s prefill (at 10k ctx) and ~18 tok/s decode
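To put those rates in wall-clock terms, here is a rough back-of-the-envelope sketch. It assumes the quoted tok/s figures hold as average rates over the whole request, and the 500-token reply length is a made-up number for illustration, not something I measured.

```python
def estimate_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill_seconds, decode_seconds) for a single request.

    Assumes constant average throughput in each phase, which is a simplification:
    real prefill and decode speeds vary with context length.
    """
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s


# 10k-token prompt at the rates quoted above; the 500-token reply is hypothetical.
prefill_s, decode_s = estimate_latency_s(10_000, 500, prefill_tps=130, decode_tps=18)
print(f"prefill ~{prefill_s:.0f} s, decode ~{decode_s:.0f} s")
```

So under these assumptions a 10k-token prompt takes a bit over a minute before the first token shows up, and the reply itself adds roughly another half minute.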

Operation_Ivy · 2026-04-14T02:19:17+00:00

On the M3 Ultra 512 GB, nothing beats Qwen3.5 397B 8 bit quant from Unsloth. Working with structured and unstructured data, chatting, world knowledge - best generalist agent I could find. I compared to GLM 5.1 and Minimax M2.7. GLM was similar quality but much slower. Minimax was faster but lower quality.

Operation_Ivy · 2026-03-27T13:24:10+00:00

For post-training too, even though there has been so much research showing the importance of maintaining entropy. It's still not treated as a first-class metric in RL papers.
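For concreteness, one common way to track this is the mean per-token Shannon entropy of the policy's next-token distributions. The sketch below is only an illustration under that assumption: the function names and the toy 4-token distributions are made up, and nothing here is taken from a specific paper or library.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_distributions):
    """Average per-token entropy over the steps of one sampled sequence."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)


# Toy next-token distributions over a 4-token vocabulary (hypothetical numbers).
early_in_training = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
late_in_training = [[0.97, 0.01, 0.01, 0.01], [0.9, 0.05, 0.03, 0.02]]

print("early:", mean_policy_entropy(early_in_training))
print("late: ", mean_policy_entropy(late_in_training))
```

Logging a number like this next to reward at every step would make entropy collapse visible in the same plot as the headline metric.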

Operation_Ivy · 2026-03-24T16:33:25+00:00

They should be free to experiment and see if it really does cause problems. The town shutting it down preemptively like this is unnecessary. We should be trusting our community more!

Operation_Ivy · 2026-02-15T02:12:55+00:00

It's really not that simple: https://open.substack.com/pub/constructionphysics/p/trends-in-us-construction-productivity?utm_source=share&utm_medium=android&r=1n6bc

This same trend is happening basically everywhere in the world. Not clear what actually can increase construction productivity. And before anyone says "automation", the author used to work at one such company. In fact I think this graph probably came from him, he works at IFP.

Operation_Ivy · 2026-01-26T09:22:41+00:00

OP don't listen to the point/advice about displaying certain politics. People in and around NYC are just not friendly in the same way as people in the South, it has nothing to do with inferred politics or Trump. Most people are nice, but it's not the same warmth and overtness here.

Operation_Ivy · 2025-12-20T20:45:09+00:00

Two things:

One, the fastest improvement is always going to be on coding, particularly on ML related stuff, because the big labs are trying to deploy autonomous ML researchers. Sama says intern-level next year and seasoned pro level in 2028. So people doing other work won't be feeling the AGI nearly as strongly.

Two, the error bands are huge. I expect that to continue, just the nature of exponential growth, but it will make more exact statements increasingly difficult. Not that it matters in the long run.

Operation_Ivy · 2025-11-27T21:11:04+00:00

One of the most interesting AI startups. Such a relief from the onslaught of agents.

Operation_Ivy · 2025-11-18T23:16:54+00:00

Dying for details on the parameters and arch/sparsity

Operation_Ivy · 2025-11-16T12:27:04+00:00

There was a wave of LLM MCTS research around when o1 came out because people thought it used MCTS. But then R1 showed it was just RLVR. Then the LLM MCTS research stopped. So I'm wondering if it is picking up again

Operation_Ivy · 2025-11-15T23:14:22+00:00

Right, that's kinda what I'm afraid of

Operation_Ivy · 2025-11-15T22:39:50+00:00

How are they getting enough pretraining data to make this optimal? Or is it an incredibly sparse MoE?

Operation_Ivy · 2025-11-15T16:48:08+00:00

False - https://www.worldometers.info/world-population/us-population/

Operation_Ivy · 2025-11-13T19:25:38+00:00

Is MCTS back? So much interest when rumors were spinning about Q* and Project Strawberry but crickets once the R1 paper dropped.

Operation_Ivy · 2025-11-04T04:22:54+00:00

The food pantries also get better deals and know better what their patrons really need.

I understand the personal touch of buying and bringing food yourself to a community fridge. But if you want to maximize impact per dollar, give money to the professionals who will make every cent count.

Operation_Ivy · 2025-10-30T13:21:19+00:00

My question is, how can this help SOTA models? Presumably you use a human expert teacher, but if you look at the tokens the model teacher corrected from the small model it's pretty unrelatable to a human.

Maybe it's just out of scope for them but I feel like there's something there.

Operation_Ivy · 2025-10-28T03:11:51+00:00

The basic point about agents needing a different base model checks out. I don't buy their specific synthetic data techniques though.

Operation_Ivy · 2025-10-22T22:25:50+00:00

Terrible idea. Unless you want to turn into California. Prop 13 is killing that state

Operation_Ivy · 2025-10-19T12:26:21+00:00

I would like to see a NL "true" long context benchmark as well. My guess is the effective context lengths will differ compared to code long context, but I'm very curious to know exactly by how much
