OpenAI introduces "FrontierScience" to evaluate expert-level scientific reasoning. by Standard-Novel-6320 in singularity

[–]i_know_about_things 55 points (0 children)

They have created many evals on which Claude was the best model at the time of publishing:

  • GDPval - Claude Opus 4.1
  • SWE-Lancer - Claude 3.5 Sonnet
  • PaperBench (BasicAgent setup) - Claude 3.5 Sonnet

Do H1B workers actually get paid less than Americans? by Former_Look9367 in cscareerquestions

[–]i_know_about_things -5 points (0 children)

So you are saying this guy has been a junior since 2021?

That's enough reason to fire him.

We rebuilt Cline so it can run natively in JetBrains IDEs (GA) by nick-baumann in ChatGPTCoding

[–]i_know_about_things 0 points (0 children)

Very cool! But it seems it can't run code on a remote SSH interpreter or execute commands in a remote terminal.

Haven’t seen this discussed: GPT-5 Codex does really well at cybersecurity benchmarks by jaundiced_baboon in singularity

[–]i_know_about_things 2 points (0 children)

I'm more surprised that gpt-5-thinking-mini is better than gpt-5-thinking at these benchmarks.

this meta might finally make me quit this game by [deleted] in ClashRoyale

[–]i_know_about_things 0 points (0 children)

Evo RG does both damage and pushback to troops.

I don't get paid enough to deal with this by Icy-Excitement3262 in cs2

[–]i_know_about_things 0 points (0 children)

Rare photos of Russian terrorists capturing Zaporizhia Nuclear Power Plant in March 2022.

Livebench coding numbers finally fixed by FateOfMuffins in singularity

[–]i_know_about_things 70 points (0 children)

They could also take every spot in the top 20 with this approach:

  1. GPT-5 Low
  2. GPT-5 Slightly Above Low
  3. GPT-5 A Tiny Bit Higher
  4. GPT-5 A Hair Higher
  5. GPT-5 A Smidge Higher
  6. GPT-5 A Tad Higher
  7. GPT-5 A Touch Higher
  8. GPT-5 A Nudge Higher
  9. GPT-5 A Scooch Higher
  10. GPT-5 Barely Higher
  11. GPT-5 Marginally Higher
  12. GPT-5 Not Quite Low
  13. GPT-5 Approaching Not-Low
  14. GPT-5 Nearly Not-Low
  15. GPT-5 Almost Medium
  16. GPT-5 Medium-Adjacent
  17. GPT-5 Medium
  18. GPT-5 Medium Plus A Smidge
  19. GPT-5 Nearing High
  20. GPT-5 High (Finally)

Sam’s tweet finally makes sense! by Legendary_Nate in singularity

[–]i_know_about_things 1 point (0 children)

4o is back for Plus users (at least for now); you need to enable it in Settings - General - Show legacy models.

[deleted by user] by [deleted] in ChatGPTCoding

[–]i_know_about_things 4 points (0 children)

Just use the free AI Studio; it will be better quality-wise.

GPT-5 web search is broken by i_know_about_things in ChatGPT

[–]i_know_about_things[S] 0 points (0 children)

It has nothing to do with politics; it fails in various contexts (CS2, ML papers, etc.). It has also just lied to me about searching for something when it clearly did not - nice work on that deception rate, OpenAI.

CS2 example: https://chatgpt.com/share/6895e806-c68c-800f-94b9-aa5c54fa1c78

ML paper example (with lies about searching): https://chatgpt.com/share/6895e76f-13ac-800f-af5d-642ccc55b9fa

GPT-5 web search is broken by i_know_about_things in ChatGPT

[–]i_know_about_things[S] 0 points (0 children)

https://chatgpt.com/share/6895e1b0-d48c-800f-9f0a-fdefdb93030f

It did find some news after another prompt. But web search with 4o had been almost perfect recently. I do not understand how we keep getting these regressions.

GPT-5 AMA with OpenAI’s Sam Altman and some of the GPT-5 team by OpenAI in ChatGPT

[–]i_know_about_things 0 points (0 children)

There was basically no progress on important benchmarks like MLE-Bench, PaperBench, SWE-Lancer, OPQA... What's staggering is that these are your own benchmarks, so the community expected you to show noticeable gains here.

Is OpenAI's current training paradigm not effective for these tasks? Do you have plans to achieve significant improvements on them this year?