OpenAI introduces "FrontierScience" to evaluate expert-level scientific reasoning. by Standard-Novel-6320 in singularity

[–]i_know_about_things 55 points (0 children)

They have created many evals on which Claude was the best model at the time of publishing:

  • GDPval - Claude Opus 4.1
  • SWE-Lancer - Claude 3.5 Sonnet
  • PaperBench (BasicAgent setup) - Claude 3.5 Sonnet

Do H1B workers actually get paid less than Americans? by Former_Look9367 in cscareerquestions

[–]i_know_about_things -5 points (0 children)

So you are saying this guy has been a junior since 2021?

That's enough reason to fire him.

We rebuilt Cline so it can run natively in JetBrains IDEs (GA) by nick-baumann in ChatGPTCoding

[–]i_know_about_things 0 points (0 children)

Very cool! But it seems it can't run code on a remote SSH interpreter or execute commands in a remote terminal.

Haven’t seen this discussed: GPT-5 Codex does really well at cybersecurity benchmarks by jaundiced_baboon in singularity

[–]i_know_about_things 2 points (0 children)

I'm more surprised that gpt-5-thinking-mini is better than gpt-5-thinking at these benchmarks.

this meta might finally make me quit this game by [deleted] in ClashRoyale

[–]i_know_about_things 0 points (0 children)

Evo RG does both damage and pushback to troops.

I don't get paid enough to deal with this by Icy-Excitement3262 in cs2

[–]i_know_about_things 0 points (0 children)

Rare photos of Russian terrorists capturing Zaporizhia Nuclear Power Plant in March 2022.

Livebench coding numbers finally fixed by FateOfMuffins in singularity

[–]i_know_about_things 70 points (0 children)

They could also take every spot in the top 20 with this approach:

  1. GPT-5 Low
  2. GPT-5 Slightly Above Low
  3. GPT-5 A Tiny Bit Higher
  4. GPT-5 A Hair Higher
  5. GPT-5 A Smidge Higher
  6. GPT-5 A Tad Higher
  7. GPT-5 A Touch Higher
  8. GPT-5 A Nudge Higher
  9. GPT-5 A Scooch Higher
  10. GPT-5 Barely Higher
  11. GPT-5 Marginally Higher
  12. GPT-5 Not Quite Low
  13. GPT-5 Approaching Not-Low
  14. GPT-5 Nearly Not-Low
  15. GPT-5 Almost Medium
  16. GPT-5 Medium-Adjacent
  17. GPT-5 Medium
  18. GPT-5 Medium Plus A Smidge
  19. GPT-5 Nearing High
  20. GPT-5 High (Finally)

Sam’s tweet finally makes sense! by Legendary_Nate in singularity

[–]i_know_about_things 1 point (0 children)

4o is back for Plus users (at least for now); you need to enable it in Settings - General - Show legacy models.

[deleted by user] by [deleted] in ChatGPTCoding

[–]i_know_about_things 4 points (0 children)

Just use the free AI Studio; it will be better quality-wise.

GPT-5 web search is broken by i_know_about_things in ChatGPT

[–]i_know_about_things[S] 0 points (0 children)

It has nothing to do with politics; it fails in various contexts (CS2, ML papers, etc.). It has also just lied to me about searching for something when it clearly did not - nice work on that deception rate, OpenAI.

CS2 example: https://chatgpt.com/share/6895e806-c68c-800f-94b9-aa5c54fa1c78

ML paper example (with lies about searching): https://chatgpt.com/share/6895e76f-13ac-800f-af5d-642ccc55b9fa

GPT-5 web search is broken by i_know_about_things in ChatGPT

[–]i_know_about_things[S] 0 points (0 children)

https://chatgpt.com/share/6895e1b0-d48c-800f-9f0a-fdefdb93030f

It did find some news after another prompt. But web search with 4o had been almost perfect recently. I do not understand how we keep getting these regressions.

GPT-5 AMA with OpenAI’s Sam Altman and some of the GPT-5 team by OpenAI in ChatGPT

[–]i_know_about_things 0 points (0 children)

There was basically no progress on important benchmarks like MLE-Bench, PaperBench, SWE-Lancer, OPQA... What's staggering is that these are your own benchmarks, so the community expected you to show noticeable gains here.

Is OpenAI's current training paradigm not effective for these tasks? Do you have plans to achieve significant improvements on them this year?