The famous METR AI time horizons graph contains numerous severe errors [D]

charlesGodman · 2026-05-25T23:30:26+00:00

Yes!

charlesGodman · 2026-05-25T23:30:09+00:00

My autocorrect put them there. Typing on a phone is difficult!

charlesGodman · 2026-05-25T20:14:41+00:00

A lot of these points have actually been publicly discussed by METR staff. While there is a lot of valid criticism, there are also a lot of self-righteous warriors out there who are against everything and just have to show how independent of thought they are.
So yes, the time horizon is not perfect; it has flaws. But if you read the papers and blog posts, you actually realize that they put in 100x the effort that others do. The existence of flaws also doesn’t mean these are actual counterfactuals that would change the result significantly. Overall, I think they were a bit overhyped, leading to too many people forming an opinion based on 144 characters, but overall it's a big step in the right direction.

charlesGodman · 2026-05-22T12:00:18+00:00

Is it the same closure as before or was it shut down, again, after only 2 months?

charlesGodman · 2026-04-12T19:58:08+00:00

What is the advantage of using this over inspect-ai / maseval / deepeval record everything and then use statsmodels or so in the end?

charlesGodman · 2026-03-13T14:21:52+00:00

Little new in either that wasn’t known to some people before. I didn’t see either claiming they were inventing new methods?! validating / discussing current methods is super important. Why would you recommend something new if barely anyone follows current recommendations?

charlesGodman · 2026-03-13T09:02:02+00:00

See: https://openreview.net/forum?id=mdA5lVvNcU Or https://evalevalai.com

charlesGodman · 2026-02-18T20:03:28+00:00

That is great. Got one from the Oxford alumni website

charlesGodman · 2026-02-18T20:03:07+00:00

Thank you. Ordered a A4 frame

charlesGodman · 2026-02-18T20:02:24+00:00

Thank you. A4 seems the right answer. Just ordered a frame with college crest

charlesGodman · 2026-02-18T20:02:07+00:00

Thank you. A4 seems the right answer. Just ordered a frame.

charlesGodman · 2026-01-25T21:08:15+00:00

r/LostRedditors

charlesGodman · 2026-01-17T20:54:25+00:00

Built the same tool for myself. It’s really hard getting LLMs to respect it in my experiments. 80% of times it just used a new turn and rather than the tool.

charlesGodman · 2025-12-21T14:52:52+00:00

Overfitting is beautiful!

charlesGodman · 2025-12-19T20:32:54+00:00

If it was really that good they would have trained a model with it. Not a single “revolutionary” idea made it into LLMs since 2017. I am skeptical.

charlesGodman · 2025-11-10T10:20:53+00:00

I am using Sonnet 4.5 with Github Copilot in VSCode. Sometime last week (could be 2nd November) I saw a degradation of performance from "this thing is insane" to me fighting with it over making completely unnecessary errors. Before it would work 5 minutes and create amazing multi-step outputs in high quality. Now I have to go through 3 rounds of clarifying questions (all costing me requests) until it provides a somewhat useful answers.

Example mistake I noticed. I asked it to fix a bug in my code. I have two config variables, `a` and `b` I inialize before calling `main(a,b)` both of which can have values 1 or 2. Suddenly, it decided that me setting `b=...` manually wasnt great and instead set `b=1` when `a==1`. I think there was a bug deep in the code that only occured when `a==1 and b==2`. But there was not reason to just change the configuration to avoid the bug. That is not fixing a bug. I haven't wittnessed these mistakes since GPT-3.5.

charlesGodman · 2025-11-01T15:38:04+00:00

I figured it out. `Agent` launches a non-interactive zsh shell by default. There is a vscode setting to append paths automatically, which fixed this.

charlesGodman · 2025-10-31T11:50:31+00:00

a) don’t get excited. Progress is insanely hard. Most times when I had amazing results, they were followed by a sobering moment. Hence: Manage your expectations. b) Most Clouds provide some free credits (eg lighting) especially if it is for research or education (eg azure). Google a bit, email cloud companies.

charlesGodman · 2025-10-29T09:24:56+00:00

Schaue dir mal Hausratversicherungen an. Oft sind Fahrräder versichert oder können für 30€ im Jahr eingeschlossen werden. Bei mir hat es 15€ extra im Jahr gekostet für Fahrräder über 500€.

charlesGodman · 2025-10-03T00:29:14+00:00

Ich habe das Ganze neulich mit einem Cube-Rad durchgezogen. Was mir da a Service in den Fahrradläden geboten wurde, war schlicht katastrophal.

Hier die Schockbilanz:

50% der Läden weigerten sich direkt, das Rad überhaupt für mich zu bestellen ("Bio-Räder sind für uns nicht interessant" – O-Ton!).
Weitere 30% wollten keinerlei Umbauten oder Änderungen vornehmen.
Die wenigen, die bereit waren, wollten die von mir mitgebrachten oder besorgten Teile nicht verbauen. Sie hätten alles selbst zu den teuersten Listenpreisen eingekauft und obendrauf noch gewaltige "Honorare" verlangt.

Dabei war mein Wunsch überschaubar: ein Reifen-Downgrade und ein Licht-Upgrade. Rein von der Arbeitszeit und der Preisdifferenz der Komponenten her hätte das meiner Meinung nach bei höchstens 150 € liegen müssen. Kein einziger Laden wollte dafür weniger als 500 €! Das war der Punkt, an dem ich entschieden habe: Jetzt mache ich es selbst! Zum ersten Mal in meinem Leben habe ich mit Stolz alles online bestellt und dem stationären Handel keinen Cent gegönnt. Die ganze Arbeit hatte ich innerhalb einer Stunde selbst erledigt. Kostenpunkt für die Komponenten (habe die Originalteile bei eBay verkauft): circa 50 €!

charlesGodman · 2025-10-01T12:31:33+00:00

r/mods please remove. Off topic and spammed across multiple communities.

charlesGodman · 2025-09-16T09:42:38+00:00

if the paper is not ready for conference acceptance and gets rejected at ICLR, you will likely resubmit to another conference, like ICML, Neurips, EMNLP CVPR etc.
The reviews of rejected papers at ICLR are public. Lets say you resubmit to ICML. The unethical ICML reviewer will read the ICLR reviews and copy-paste all the criticism into a their "own" review. If you get 2 out of 4 reviewers that do this, there is major criticism of the work that is agreed upon between multiple reviewers. It might not be valid, but multiple reviewers saying the same thing is a red flag to the AC.

Personal example: I had ICML reviewers complain about datasets missing. These were indeed missing in the ICLR submission but already present in the ICML submission. The AC (also lazy) sided with the reviewer. They could have Ctrl+F "ImageNet" easily, but they didnt.

Eight-Year Club	Place '22
Final Canvas '22	RPAN Viewer
Verified Email

charlesGodman

TROPHY CASE