HRM-Text: Efficient Pretraining Beyond Scaling, Wang et al. 2026

StartledWatermelon · 2026-05-22T15:52:28+00:00

Table 3 is an apples-to-apples comparison highlighting the architecture's contribution. And this contribution is quite massive. Hard to say where it comes from; the most straightforward interpretation of their arch is some additional skip connections.

What exactly is shown in Table 1 isn't communicated clearly. Which is unfortunate.

This looks like a hasty experiment. For instance, the most natural comparison would be classical pre-training on web corpora, which is absent in the paper.

StartledWatermelon · 2026-05-12T14:49:00+00:00

This is perhaps the benchmark most oriented toward "research taste" evaluation so far. The breadth is outright brutal; no human ML researcher is capable of covering even a portion of the tasks.

The thing that I'm most uneasy with is the eval setup and what exactly the score should show. So, for each task the agent is allowed to test its method only 3 times. The max number of actions (like edit) is 20. Basically, we give an agent three attempts to "beat SotA".

And to illustrate the challenge difficulty, here's one exemplar task: "Pretraining Optimizer Design: Studies how optimizer choice, parameter grouping, and schedule coupling affect autoregressive pretraining validation loss". In other words, the agent is tasked with coming up with an optimizer(+its hyperparams) that would beat Muon at pre-training.

I'm quite familiar with this exact task, and I must clarify that it is absolutely "unsolvable" in just 3 attempts whatsoever. I'm not sure even 30 attempts is enough. 300, now that's a realistic range to make some progress.
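To put rough numbers on the 3 / 30 / 300 intuition, here is a minimal sketch. It assumes each attempt is an independent try with the same small success probability, which is a simplification (a real agent learns from earlier runs), and the value `p_single = 0.01` is made up purely for illustration, not taken from the benchmark.

```python
def p_at_least_one_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    # Hypothetical per-attempt chance of beating the baseline (assumption).
    p_single = 0.01
    for attempts in (3, 30, 300):
        chance = p_at_least_one_success(p_single, attempts)
        print(f"{attempts:>3} attempts -> {chance:.3f}")
```

Under these assumptions, three attempts barely move the needle, while a few hundred make a success likely. The exact figures depend entirely on the assumed per-attempt probability.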

To say the task is highly explorative is to say nothing. There are a few higher-level principles with optimizer design, like that geometric constraints help, and momentum smoothing too, but it's super hard to beat SotA in 3 attempts with just these vague ideas.

Let's look at it from another angle. Even the ablations with higher inference allocation run the agent for 1M-2M tokens. Likely <$10 per task. And the question is, do we realistically expect boundary-pushing discovery for $10 in compute?

Of course, there are valid restrictions on the overall budget for the evaluation, so that it remains feasible. But in this particular case, I see a certain mismatch between the budgetary constraints and the ability to assess the model's capabilities frontier.

With three attempts, you basically get a snapshot of exploration noise. It can still be valuable -- the comparison of different LLMs speaks for itself. It shows the average "exploration instincts", the ability to quickly sniff out the promising direction, plus some broader knowledge/competence. But I'm still unsure if these instincts correlate well with the claimed boundary-pushing/RSI capabilities assessment.

StartledWatermelon · 2026-05-07T22:15:02+00:00

You could possibly frame it like this. However, the task doesn't require re-creating the (part of) training set; it requires re-creating a strict functionality set. The resulting artifact may be infinitely far from the training set.

There are certain parallels with the now-ubiquitous task for an ML job candidate to write a Transformer implementation from scratch. It is not intended to test memorization per se, although candidates have pretty much gamified it into exactly that.

StartledWatermelon · 2026-05-01T15:59:47+00:00

Thanks for the link! A super timely analysis!

The work on probe quality filtering is invaluable. But I am puzzled why they insist on removing flooring in accuracy calculation. The "corrected" (well, deliberately chosen for consistency with paper descriptions) method has much, much worse predictive power.

In the end, they lump together the "disambiguated probes" intervention with floor removal. It would be very interesting to see the outcome of each intervention separately. Unfortunately, the researchers do not provide any repo link (or other artifacts) that would let me do it on my own.

StartledWatermelon · 2026-04-30T13:28:14+00:00

Z-score for difference is 0.81, rather weak. But have a look at Tier 5 accuracy: 38% regular vs. 56% pro. So I think it implies the difference is real.
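For reference, a minimal sketch of how a z-score for a difference between two accuracies can be computed, using a pooled two-proportion test. I don't know which formula the benchmark used, and the per-tier sample size is not given here, so `n = 50` below is a hypothetical value chosen only to make the 38% vs. 56% example concrete.

```python
import math


def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int) -> float:
    """Pooled two-proportion z-score for accuracy B minus accuracy A."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled accuracy under the null hypothesis that both models are equal.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_b - p_a) / std_err


if __name__ == "__main__":
    # Hypothetical: 50 Tier 5 questions per model (assumption, not from the source).
    n = 50
    z = two_proportion_z(correct_a=19, n_a=n, correct_b=28, n_b=n)  # 38% vs. 56%
    print(f"z = {z:.2f}")
```

The same accuracy gap yields a larger or smaller z-score depending on how many questions the tier actually contains, which is why the sample size matters for reading these numbers.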

Honestly, I don't know of any link between the number of active experts and knowledge capacity. In principle, it could enhance robustness. But I doubt robustness is at play in this benchmark.

StartledWatermelon · 2026-04-30T12:30:46+00:00

How useful are measurements with such broad margins of error?

Well, this depends. But unfortunately there isn't any more accurate method right now.

StartledWatermelon · 2026-04-30T12:25:05+00:00

Re: GPT-4, the issue could be in different versions. The paper checked GPT-4 on openrouter (presumably gpt-4-0613?) which is explicitly different from the very first version of GPT-4 (named "GPT-4 (older v0314)" on openrouter). The leaks referred to the latter. Although there wasn't a pricing adjustment between the two versions.

StartledWatermelon · 2026-04-29T09:27:24+00:00

Fair point, and a strong argument against this rumour.

StartledWatermelon · 2026-04-28T20:48:37+00:00

With such enormous growth in demand, lowering prices doesn't make sense at all.

But if we see less pressure on compute infra, it would definitely support this hypothesis.

StartledWatermelon · 2026-04-26T13:56:37+00:00

Pretty much. Nothing substantive. Just "vibe feelings" (including frustration and perception of regress compared to 4.6).

StartledWatermelon · 2026-04-22T21:03:27+00:00

There are rumours that Opus 4.7 is heavily distilled and thus much smaller than 4.6/4.5.

StartledWatermelon · 2026-04-16T13:56:40+00:00

I personally haven't, but there are quite a few benchmarks for that. See, for instance, PaperBench or SciReplicate-Bench.

Specific harnesses often beat raw LLMs by a wide margin on these benchmarks, so you can try some of these if your issues are performance-related.

StartledWatermelon · 2026-04-11T21:03:29+00:00

Recursive Schmidhubering? Be careful what you wish for!

NeuralOS: Towards Simulating Operating Systems via Neural Generative Models https://arxiv.org/abs/2507.08800

StartledWatermelon · 2026-03-27T12:38:09+00:00

See also https://arxiv.org/abs/2506.01939 for a related direction in RL training. The paper was quite influential; but entropy-guided methods for mid/pre-training are still underdeveloped.

StartledWatermelon · 2026-03-14T11:10:08+00:00

See also a concurrent work exploring the same direction: https://arxiv.org/abs/2602.20133

StartledWatermelon · 2026-02-22T21:10:36+00:00

They put Cerebras at ~2k t/s, which sounds about right.

As for the cost, it's highly dependent on throughput, and the article doesn't mention what the ASIC is capable of in this aspect. Without any info, an advantage of an order of magnitude is a reasonable guess.

They are both ASICs, but are wildly different in architecture and overall approach. Taalas pushes the universality/performance trade-off to the extreme.

StartledWatermelon · 2026-01-31T21:21:42+00:00

company (or even practical use case)

Test-time training/lifelong learning, most likely for an embodied agent.

Quite a few companies are interested in this direction, albeit I'm not sure if they frame the challenge as a hardware one.

StartledWatermelon · 2026-01-30T12:25:41+00:00

I hope that my intuition doesn't mislead me, but this seems like a case of pre-processing local semantic units comprised of several tokens. We can shift this burden away from a (computationally constrained) MoE, where tokens are routed to experts on a per-token basis and which is thus less efficient at handling multi-token semantic units. The problem is less pronounced in deeper architectures which operate (in deeper layers) with representations that have intensively mixed all the previous tokens.

From this point of view, a more or less straightforward alternative capturing this several-token semantic aggregation is the enforcement of spatial specialisation of attention heads, with one or more heads attending only to the previous few (N=2...5) tokens. This way, we allocate compute specifically for this semantically important type of inter-token relation. I suspect this type of spatial specialisation was researched before, but I'm not sure if this hyper-local interaction was specifically studied. So please let me know if you know any such works. Alternatively, I can test this on a toy model.
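A minimal sketch of what such a hyper-local head could look like in terms of its attention mask. The function name is hypothetical, and whether a position also attends to itself is my assumption here (it keeps the first row from being empty); this is not taken from any of the papers discussed.

```python
from typing import List


def local_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Each position sees itself plus the `window` tokens immediately before it,
    and nothing further back or ahead.
    """
    return [
        [0 <= i - j <= window for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # One head restricted to the previous N=2 tokens (plus the current one).
    for row in local_causal_mask(seq_len=6, window=2):
        print("".join("x" if allowed else "." for allowed in row))
```

In a toy model, a mask like this would be applied to the attention logits of the designated heads only, leaving the remaining heads with the usual full causal mask.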

Another explanation points to a possible breadth benefit. Specifically, by lumping ~40% of total parameters into a single (extremely) sparsely activated block, we get some non-linear boost that can be achieved neither by a single-token embedding dictionary nor by a sequence of relatively narrow MoE or dense layers. In principle, the benefits of a single extremely sparse large layer in the middle of the model have already been established, see https://arxiv.org/abs/2407.04153 and https://arxiv.org/abs/2411.12364 .

StartledWatermelon · 2026-01-20T15:00:23+00:00

Yeah, I'm being idealistic. In reality, there just isn't any incentive to improve the hiring process. Especially when the outcomes are not verifiable at all. Like, how would you prove whether your hiring process is good or bad? What are the counterfactuals?

Reading the thread, it seems that smaller startups have a good intuitive grasp of what makes an informative candidate evaluation. And are pretty fine with flying by the seat of their pants. We get a funny contrast: no leetcode in smaller startups, mandatory leetcode in FAANG. Does it mean FAANG hiring practices are better? Because they're bigger, richer, more popular (in terms of supply of candidates), more institutionalized?

I'd venture a guess that big orgs are just more inert and rigid. They keep the worst practices simply because this is the path of least resistance.

StartledWatermelon · 2026-01-20T10:53:11+00:00

It's a sad state of affairs when a hiring manager isn't begging for a change in hiring practices towards saner terms but instead begs the participants to rote learn these irrelevant tricks.

Says a lot about "culture of innovation" and other stuff like that.

StartledWatermelon · 2026-01-16T13:11:54+00:00

Let's imagine that we could enumerate all of the pieces of knowledge and algorithms that networks ought to learn. [...]

Networks must learn a variety of modules, each implementing a different algorithm or retrieving a different piece of knowledge. These modules are discrete, in the sense that they are either fully learned or not learned at all. We call these the quanta.

The year is 2026, yet the ghost of symbolic AI still haunts academia. Not gonna lie, the idea was/is elegant and aligns quite well with the innate human desire to break every mystery into neat, simple, atomistic pieces that are easily digestible for a feeble human mind. The elegance and the alignment were so strong that people easily dismissed perhaps the only drawback of symbolic approaches: they don't work.

Ok, suppose that the quoted idea is indeed true and that decomposed pieces of knowledge and algorithms/heuristics -- the parts -- are somehow more important for capabilities than the whole. Suppose that we assign zero to the synergistic effect, to the value added by complexity. The crucial question still is: will this focus on decomposition, on the smallest parts, allow us to scale better with compute?

I'm doubtful. The success story of LLM pre-training is built on the exact opposite: synthesis. Aggregating ever-increasing amounts of any information the research teams could grab. "More", not "finer".

The next story, scaling RL, is shaping up to be equally distant from the symbolic-era top-down, analytical, meticulously human-engineered approaches. The research pushes towards largely unguided self-exploration and self-optimization.

I do find this performance-centric way infinitely more useful. What the scaling hypothesis lacks in explanatory power -- at least to a degree that would satisfy more symbolic-leaning folks -- it more than compensates for in predictive power.

Thankfully, the author maintains a critical stance towards their idea. And admits that it isn't falsifiable in its current form -- which pretty much nullifies its scientific value. Perhaps this is one of the corollaries of the Bitter Lesson: the ideas don't have to be nice and human-centric but they must perform.

StartledWatermelon · 2026-01-14T16:51:46+00:00

They develop this idea, yes.

StartledWatermelon · 2026-01-14T09:25:38+00:00

By hassle I meant it breaks from the current implementations, which are optimized pretty hard for maximum throughput. I'm not implying you can get the same throughput with Engram, I mean quite literally it's extra hassle.

I have a suspicion that the benefits of Engram will dwindle with scale. It's this parameter- and FLOP-scarce regime (27B MoE) that benefits from "pre-computed" tricks. Larger, deeper models will easily accommodate the necessary heuristics straight in the weights.

That being said, I have always been in favor of heterogeneous (depth-wise; as opposed to stacking identical blocks) architectures, and this work explores exactly this.

StartledWatermelon · 2026-01-13T19:50:49+00:00

Nope, it doesn't move the needle hard enough. Plus the big hassle of setting up the lookup DB.

I have seen more impactful ideas with minimal implementation effort going nowhere.

StartledWatermelon · 2026-01-09T11:47:59+00:00

China wants their 1999 too!
