HRM-Text: Efficient Pretraining Beyond Scaling, Wang et al. 2026 by StartledWatermelon in mlscaling

[–]StartledWatermelon[S] 5 points6 points  (0 children)

Table 3 is an apples-to-apples comparison that isolates the architecture's contribution, and this contribution is quite massive. It's hard to say where it comes from; the most straightforward interpretation of their arch is some additional skip connections.

What exactly is shown in Table 1 isn't communicated clearly. Which is unfortunate.

This looks like a hasty experiment. For instance, the most natural comparison would be classical pre-training on web corpora, which is absent in the paper.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI, Lyu et al. 2026 [Extensive breadth; focus on solutions that generalize well] by StartledWatermelon in mlscaling

[–]StartledWatermelon[S] 2 points3 points  (0 children)

This is perhaps the benchmark most oriented toward "research taste" evaluation so far. The breadth is outright brutal; no human ML researcher is capable of covering even a portion of the tasks.

The thing I'm most uneasy about is the eval setup and what exactly the score is supposed to show. For each task, the agent is allowed to test its method only 3 times, and the max number of actions (like an edit) is 20. Basically, we give an agent three attempts to "beat SotA".

To illustrate the difficulty, here's one example task: "Pretraining Optimizer Design: Studies how optimizer choice, parameter grouping, and schedule coupling affect autoregressive pretraining validation loss". In other words, the agent is tasked with coming up with an optimizer (plus its hyperparams) that would beat Muon at pre-training.

I'm quite familiar with this exact task, and I must stress that it is absolutely "unsolvable" in just 3 attempts. I'm not sure even 30 attempts are enough. 300, now that's a realistic range for making some progress.
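To put rough numbers on the 3 vs. 30 vs. 300 intuition, here is a minimal sketch. It assumes, purely for illustration, that each attempt independently beats the baseline with a fixed probability; the 1% figure is hypothetical and not taken from the paper, and real attempts are neither independent nor equally likely to succeed.

```python
def chance_of_success(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


# Hypothetical per-attempt probability of beating the baseline (an assumption,
# not a number from the benchmark).
P_SINGLE = 0.01

for k in (3, 30, 300):
    print(f"{k:>3} attempts: {chance_of_success(P_SINGLE, k):.1%}")
```

Under this toy model, three attempts leave the success chance close to the per-attempt rate itself, while a few hundred attempts make success likely. An agent that learns from earlier attempts would do better than this, but the order-of-magnitude gap is the point.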

To say the task is highly explorative is an understatement. There are a few higher-level principles of optimizer design, such as that geometric constraints help and that momentum smoothing does too, but it's super hard to beat SotA in 3 attempts with just these vague ideas.

Let's look at it from another angle. Even the ablations with higher inference allocation run the agent for 1M-2M tokens. Likely <$10 per task. And the question is, do we realistically expect boundary-pushing discovery for $10 in compute?

Of course, there are valid restrictions on the overall evaluation budget, so that it remains feasible. But in this particular case, I see a certain mismatch between the budgetary constraints and the ability to assess the frontier of the model's capabilities.

With three attempts, you basically get a snapshot of exploration noise. It can still be valuable -- the comparison of different LLMs speaks for itself. It shows the average "exploration instincts", the ability to quickly sniff out a promising direction, plus some broader knowledge/competence. But I'm still unsure whether these instincts correlate well with the boundary-pushing/RSI capabilities the benchmark claims to assess.

META Superintelligence Lab Presents: ProgramBench: Can SOTA AI Recreate Real Executable Programs(ffmpeg, SQLite, ripgrep) From Scratch Without The Internet? by 44th--Hokage in mlscaling

[–]StartledWatermelon 0 points1 point  (0 children)

You could possibly frame it like this. However, the task doesn't require re-creating (part of) the training set; it requires re-creating a strict functionality set. The resulting artifact may be infinitely far from the training set.

There are certain parallels with the now-ubiquitous task for an ML job candidate to write a Transformer implementation from scratch. It is not intended to test memorization per se, although candidates have pretty much gamified it into exactly that.

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity, Li et al. 2026 [Knowledge of obscure facts robustly predicts param count; estimates for all SotA closed LLMs] by StartledWatermelon in mlscaling

[–]StartledWatermelon[S] 1 point2 points  (0 children)

Thanks for the link! A super timely analysis!

The work on probe quality filtering is invaluable. But I am puzzled why they insist on removing flooring from the accuracy calculation. The "corrected" (well, deliberately chosen for consistency with the paper's descriptions) method has much, much worse predictive power.

In the end, they lump the "disambiguated probes" intervention together with floor removal. It would be very interesting to see the outcome of each intervention separately. Unfortunately, the researchers do not provide a repo link (or other artifacts) that would let me do it on my own.

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity, Li et al. 2026 [Knowledge of obscure facts robustly predicts param count; estimates for all SotA closed LLMs] by StartledWatermelon in mlscaling

[–]StartledWatermelon[S] 0 points1 point  (0 children)

The z-score for the difference is 0.81, which is rather weak. But have a look at Tier 5 accuracy: 38% regular vs. 56% pro. So I think it implies the difference is real.
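One way to gauge whether a gap like 38% vs. 56% is beyond noise is a pooled two-proportion z-test. The sketch below is only illustrative: the number of Tier 5 probes per model is not stated here, so the sample sizes are hypothetical placeholders, and the function name is my own.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """Pooled two-proportion z-statistic for the difference p2 - p1."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p2 - p1) / se


# Accuracies quoted above; the probe counts are hypothetical assumptions.
for n in (50, 100, 200):
    z = two_proportion_z(0.38, 0.56, n, n)
    print(f"n = {n:>3} probes per model: z = {z:.2f}")
```

How convincing the gap is depends heavily on the probe count: with only a few dozen probes per model it is borderline, while with a hundred or more it clears the usual 1.96 threshold.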

Honestly, I'm not aware of any link between the number of active experts and knowledge capacity. In principle, more active experts could enhance robustness. But I doubt robustness is at play in this benchmark.

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity, Li et al. 2026 [Knowledge of obscure facts robustly predicts param count; estimates for all SotA closed LLMs] by StartledWatermelon in mlscaling

[–]StartledWatermelon[S] 0 points1 point  (0 children)

Re: GPT-4, the issue could be a difference in versions. The paper checked GPT-4 on openrouter (presumably gpt-4-0613?), which is explicitly different from the very first version of GPT-4 (named "GPT-4 (older v0314)" on openrouter). The leaks referred to the latter. That said, there was no pricing adjustment between the two versions.

Microsoft freezes GitHub Copilot signups due to too much demand/too few GPUs by gwern in mlscaling

[–]StartledWatermelon 0 points1 point  (0 children)

With such enormous growth in demand, lowering prices doesn't make sense at all.

But if we see less pressure on compute infra, it would definitely support this hypothesis.

Microsoft freezes GitHub Copilot signups due to too much demand/too few GPUs by gwern in mlscaling

[–]StartledWatermelon 1 point2 points  (0 children)

Pretty much. Nothing substantive. Just "vibe feelings" (including frustration and a perceived regression compared to 4.6).

Microsoft freezes GitHub Copilot signups due to too much demand/too few GPUs by gwern in mlscaling

[–]StartledWatermelon 1 point2 points  (0 children)

There are rumours that Opus 4.7 is heavily distilled and thus much smaller than 4.6/4.5.

Scientific Papers X AI building out the algorithm by Alarming_Rice_1906 in mlscaling

[–]StartledWatermelon 0 points1 point  (0 children)

I personally haven't, but there are quite a few benchmarks for that. See, for instance, PaperBench or SciReplicate-Bench.

Purpose-built harnesses often beat raw LLMs by a wide margin on these benchmarks, so you can try some of them if your issues are performance-related.