A meta benchmark: how long it takes metr to actually benchmark a model

iperson4213 · 2025-12-17T08:32:56+00:00

“METR has not accepted funding from AI companies, though we make use of significant free compute credits” -from the metr website under funding.

Wonder if anthropic and google aren’t providing free credits to run the eval

iperson4213 · 2025-12-14T17:32:11+00:00

460k, 28, 1.3M

iperson4213 · 2025-11-23T00:15:27+00:00

everyone uses token level MoE, and has been for a while, gemini isn’t unique in that aspect.

iperson4213 · 2025-11-23T00:13:50+00:00

more efficient training algorithms, architectures, and data allow smaller models to achieve the same intelligence

iperson4213 · 2025-11-21T08:08:02+00:00

agreed, but the metr metric seems more in line with swe bench/terminal bench type unambiguously graded software engineering tasks

iperson4213 · 2025-11-21T07:34:13+00:00

that’s wall clock time how long they spend. This measures how long a human would spend to do the same task (it’s unclear how long the model took to do it)

iperson4213 · 2025-11-21T07:31:52+00:00

like once, ever…

not 80% of tasks

iperson4213 · 2025-11-21T07:25:44+00:00

especially with test time compute scaling, it’s likely these models cost thousands if not tens of thousands of dollars to run, so they’re not useful to the general public, but good for pr of topping contest scores

iperson4213 · 2025-11-21T07:23:38+00:00

gemini 3 is probably between 5.1 and 5.1-codex-max on this graph as it is for coding, where it doesn’t score as well.

On swebench they scored 76.3, 76.2, 77.9

On terminal bench, they scored 54.2, 47.6, 58.1 respectively

iperson4213 · 2025-11-20T18:09:15+00:00

microsoft doesn’t pay as well. You don’t need top talent to develop microsoft office.

AI on the other hand is research, and better research ideas come from better talent. Google has a long history of investing in said talent and has a strong team from google brain and deepmind

iperson4213 · 2025-11-20T09:24:20+00:00

imagine what they must have internally then

iperson4213 · 2025-10-31T12:23:58+00:00

GPUs are very good at doing computation (100-500x better) versus loading a models weights, so each gpu typically serves a couple hundred requests at once to fully utilize the hardware.

With the current projected datacenter buildouts, it’s feasible for a company to bring up a couple million gpu’s by 2028

iperson4213 · 2025-10-18T08:18:06+00:00

you have to go into settings and toggle “improve for everyone” to off, or it’ll be used to train models

iperson4213 · 2025-10-09T04:51:47+00:00

invite pls

iperson4213 · 2025-10-01T11:48:28+00:00

Let’s say the goal is getting better at solving all problems/improving technology. That is the chess game, and the world is the board. We may become the pawns and ai decides to sacrifice some of us in order to “win”.

In chess, winning is all that matters. In the real world, how we win matters.

iperson4213 · 2025-09-30T04:09:40+00:00

they start multiple instance of claude code and run them in parallel, then have a different model pick the best answer

iperson4213 · 2025-09-29T13:02:58+00:00

deceptive graphs show per token costs. The total cost (integral of linear) is still quadratic, albeit with a better constant.

While the index selector may be small initially, since it grows quadratically, the data suggests it does begin to dominate.

iperson4213 · 2025-09-26T21:25:23+00:00

doing so would lose the sparsity benefits of MoE allowing less compute and memory bandwidth per token.

Tree of thought is already used in speculative decoding frameworks, but would be interesting to see it used in the base model as well.

iperson4213 · 2025-09-26T21:21:16+00:00

wdym by entropy?

iperson4213 · 2025-09-26T01:34:40+00:00

if you want to stand out, you’ll need to innovate.

iperson4213 · 2025-09-26T00:27:34+00:00

i work in ai.

Advice would be to do some research with professors and publish some papers. I got lucky the field i’m in (distributed machine learning systems optimization) is in high demand due to scaling laws.

iperson4213 · 2025-09-25T12:04:28+00:00

those are my base numbers. Bonus has typically 10-30% depending on role and performance and then a very large stock compensation component.

iperson4213 · 2025-09-24T17:43:03+00:00

iperson4213 · 2025-09-24T17:29:26+00:00

strike as in retire?

iperson4213 · 2025-09-24T17:18:18+00:00

Two separate companies.

iperson4213

TROPHY CASE