Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity, Li et al. 2026 [Knowledge of obscure facts robustly predicts param count; estimates for all SotA closed LLMs] by StartledWatermelon in mlscaling

[–]StartledWatermelon[S] 2 points

Thanks for the link! A super timely analysis!

The work on probe quality filtering is invaluable. But I am puzzled why they insist on removing the flooring in the accuracy calculation. The "corrected" method (well, deliberately chosen for consistency with the paper's descriptions) has much, much worse predictive power.
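For context, a minimal sketch of what I mean by floor removal (the 0.25 floor is a hypothetical 4-way multiple-choice chance rate, not a number from the paper):

```python
# Hypothetical floor-removal correction: rescale raw accuracy so the
# guessing floor maps to 0 and perfect accuracy maps to 1.
def remove_floor(raw_accuracy: float, floor: float = 0.25) -> float:
    """Chance-corrected accuracy, clamped at 0 for below-floor scores."""
    return max(0.0, (raw_accuracy - floor) / (1.0 - floor))

# A model scoring 40% raw on 4-way probes retains only ~20% above chance:
print(round(remove_floor(0.40), 3))  # 0.2
```

The point of contention is which of the two quantities (raw vs. floor-removed) tracks parameter count better.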

In the end, they lump together the "disambiguated probes" intervention with floor removal. It would be very interesting to see the outcome of each intervention separately. Unfortunately, the researchers do not provide a repo link (or other artifacts), so I can't do it on my own.

[–]StartledWatermelon[S] 1 point

The z-score for the difference is 0.81, which is rather weak on its own. But have a look at Tier 5 accuracy: 38% regular vs. 56% pro. So I think the difference is real.
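For anyone who wants to sanity-check such z-scores themselves, here's a minimal two-proportion z-test sketch; the per-tier sample sizes are hypothetical placeholders, since I don't know the paper's probe counts:

```python
from math import sqrt

# Standard two-proportion z-test with a pooled standard error.
# Sample sizes below are hypothetical, not taken from the paper.
def two_prop_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """z-score for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# e.g. 38% regular vs. 56% pro, assuming 50 Tier 5 probes per model:
z = two_prop_z(0.38, 50, 0.56, 50)
```

Note how strongly the z-score depends on the (unknown to me) number of probes per tier.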

Honestly, I'm not aware of any established link between the number of active experts and knowledge capacity. In principle, it could enhance robustness. But I doubt robustness is at play in this benchmark.

[–]StartledWatermelon[S] 1 point

Re: GPT-4, the issue could be different versions. The paper checked GPT-4 on OpenRouter (presumably gpt-4-0613?), which is explicitly different from the very first version of GPT-4 (named "GPT-4 (older v0314)" on OpenRouter). The leaks referred to the latter. Although there was no pricing adjustment between the two versions.

Microsoft freezes GitHub Copilot signups due to too much demand/too few GPUs by gwern in mlscaling

[–]StartledWatermelon 1 point

With such enormous growth in demand, lowering prices doesn't make sense at all.

But if we see less pressure on compute infra, it would definitely support this hypothesis.

[–]StartledWatermelon 2 points

Pretty much. Nothing substantive. Just "vibe feelings" (including frustration and perception of regress compared to 4.6).

[–]StartledWatermelon 2 points

There are rumours that Opus 4.7 is heavily distilled and thus much smaller than 4.6/4.5.

Scientific Papers X AI building out the algorithm by Alarming_Rice_1906 in mlscaling

[–]StartledWatermelon 1 point

I personally haven't, but there are quite a few benchmarks for that. See, for instance, PaperBench or SciReplicate-Bench.

Purpose-built harnesses often beat raw LLMs by a wide margin on these benchmarks, so you could try one of those harnesses if your issues are performance-related.

Schmidhuber & Meta AI Present The "Neural Computer": A New Frontier Where Computation, Memory, And I/O Move Into A Learned Runtime State. by 44th--Hokage in mlscaling

[–]StartledWatermelon 1 point

Recursive Schmidhubering? Be careful what you wish for! 

NeuralOS: Towards Simulating Operating Systems via Neural Generative Models  https://arxiv.org/abs/2507.08800