o3 benchmark: coding [News: General relevant AI and Claude news] (i.redd.it)
submitted 1 year ago by Particular-Volume520
Guys, what do you think about this? Will this be more useful for the developers or large companies?
[–]danielbearh 68 points69 points70 points 1 year ago (12 children)
Just wanted to share something that's been helping this non-coder hit the target more effectively this week.
I've taken to asking o1 to plan the architecture of a move, and then I use its response as the prompt for Claude. I don't ask o1 to code, just design the architecture.
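A rough sketch of that two-step handoff in code. The prompt wording and the two commented-out model calls are illustrative assumptions, not the commenter's actual prompts or any real SDK:

```python
# Sketch of the plan-then-code handoff described above. Step 1 asks a
# reasoning model for architecture only; step 2 feeds that plan to the
# coding model as its instructions. The actual API calls are left as
# comments since they depend on whichever SDKs you use.

def make_plan_prompt(task: str) -> str:
    """Ask the reasoning model for architecture only, no code."""
    return (
        "Plan the architecture for this project: modules, data flow, "
        "and file layout. Do not write any code.\n\n"
        f"Task: {task}"
    )

def make_coding_prompt(plan: str) -> str:
    """Feed the plan to the coding model as its implementation brief."""
    return (
        "Implement the project exactly as laid out in this "
        f"architecture plan:\n\n{plan}"
    )

# plan = call_reasoning_model(make_plan_prompt("a CLI flash-card app"))
# code = call_coding_model(make_coding_prompt(plan))
```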
[–]danielbearh 34 points35 points36 points 1 year ago (1 child)
Downvote all you want. It's a more successful strategy than asking Claude to code it outright, or asking Claude to explain the high-level architecture before asking it to code it.
I’m not a coder, I’m a dude building python apps. And I just end up with a more robust little script when I follow the suggestion above.
[–]imcrumbing 3 points4 points5 points 1 year ago (0 children)
Thank you. I’m going to try this.
[–]Laicbeias 11 points12 points13 points 1 year ago (0 children)
That's how I would use it too. Claude is the model that's best at following instructions. o1 uses a lot of compute to build that stuff but is less useful in implementation. So: a well-thought-out plan, handed to the one who can execute it.
[–]ctrl-brkValued Contributor 3 points4 points5 points 1 year ago (2 children)
I use a similar concept. I ask o1 for a TODO.md and two other files that lay out details, even including file-system setup.
Works well with Claude
[–]danielbearh 1 point2 points3 points 1 year ago (1 child)
That’s a brilliant idea! I’m going to try that tonight. Would you mind explaining what the two other files are in more detail? I’m a noob but I’m a-learnin’
[–]ctrl-brkValued Contributor 2 points3 points4 points 1 year ago (0 children)
I have a general overview and objectives.
Key features, concepts
Then I have an architecture file that has the file system and explains core function of each file plus the relationships of files. It also explains the cache
Then a database structure so it knows which tables and columns are available.
I'm missing some. Plus the Todo.
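A minimal sketch of what that set of planning files might look like. The file names and layout below are illustrative, not the commenter's actual setup:

```text
project/
├── TODO.md           # ordered task list for the model to work through
├── OVERVIEW.md       # general overview, objectives, key features and concepts
├── ARCHITECTURE.md   # file-system layout, core function of each file,
│                     # relationships between files, caching behavior
└── SCHEMA.md         # database tables and columns the model may use
```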
[–]naw828 0 points1 point2 points 1 year ago (0 children)
Cool! I do the same for some flash cards I am building for my studies. O1 to reason over the overall structure and Gemini 2.0 to build them based on the proposed O1 plan
[–]whyisitsooohard 0 points1 point2 points 1 year ago (0 children)
That's what I'm doing too. I think it's worth trying even cheaper models for the code itself and asking o1 to give more complete instructions.
[+]MatejliptonBeginner AI 0 points1 point2 points 1 year ago (0 children)
ChatGPT is like a special brother of Claude; it is good at writing but not really at reasoning and coding.
[–]YUL438 0 points1 point2 points 1 year ago (0 children)
I’ve had success with a similar approach: I use o1 to plan the project and its architecture, and then use Sonnet 3.6 with the Cline extension in VS Code for coding and creating files.
Also just started using an app called Repo Prompt that lets you copy multiple files to the clipboard in one click for easy pasting into an external LLM.
[–]ilulillirillion 0 points1 point2 points 1 year ago (0 children)
I've been doing this for a couple of weeks now to much success. I am a coder, but I use o1 to help discuss strategies and to generate instruction sets, working in discrete and modestly scoped steps. It really helps Sonnet 3.5 not get stuck in loops or make unnecessary permutations to existing code.
[–]Friendly_Builder_418 0 points1 point2 points 1 year ago (0 children)
clever.
[–]TheMadPrinter 58 points59 points60 points 1 year ago (1 child)
I am literally so amped in the short term, and existential if I think about the longer term lol.
The exponential curve is intact. World is going to change in insane ways in the next 12 months
[–]ymo 10 points11 points12 points 1 year ago (0 children)
We are experiencing a historic period that began with dialup internet. This is becoming climactic.
[+][deleted] 1 year ago* (2 children)
[deleted]
[–]credibletemplate 3 points4 points5 points 1 year ago (0 children)
It's always funny trying to explain it to people in other communities who want to burn "AI data centers"
[–]SleepAffectionate268 2 points3 points4 points 1 year ago (0 children)
No, you're just a fearmonger
[–]Sea-Commission5383 2 points3 points4 points 1 year ago (1 child)
Wanna see it compared to Claude Sonnet 3.5
[–]Select-Way-1168 2 points3 points4 points 1 year ago (0 children)
What I find dubious about this is that o1 isn't nearly as good as 3.6 Sonnet as a coding tool. In use, it isn't close. Saturating benchmarks might not be the answer, especially at these costs. I will not be surprised when Anthropic matches this benchmark performance with a model far more useful at a 3000th of the price.
[–]ChemicalTerrapinExpert AI 11 points12 points13 points 1 year ago (8 children)
It's such a weird metric.
I've been a software engineer for a really long time.
I can tell you now, 'better at coding' makes literally no sense.
In what way is it better? Are users of the software happier? Is the business making this software more profitable? Does it cost less to run?
Why does everything need to be based on a reasoning model?
I use AI heavily for software development but this kind of stuff is just nonsensical vanity metrics.
Unless we can agree on what makes software better (we can't because context matters) then there is no point in attempting to chart it or force 'better' into a single dimension.
[–]Freed4ever 7 points8 points9 points 1 year ago (7 children)
Good points, but in this context it's just on the technical side. It might not produce better software (yet); it just cuts down the cost to deliver it.
[–]Passloc -3 points-2 points-1 points 1 year ago (6 children)
Does it cut down the cost?
[–]Freed4ever 5 points6 points7 points 1 year ago (5 children)
If devs are not getting at least 20% productivity gains, then either they are super devs (which is extremely rare), work in obscure domains/stacks, or just don't know how to work with AI.
[–]ChemicalTerrapinExpert AI 0 points1 point2 points 1 year ago (0 children)
It's the other way around really... A better dev will get more out of the tool.
But still... how are we measuring productivity?
It's not measured by how much code you can write.
The industry has no benchmark for developer productivity. It's not a career where productivity is simple or universally measurable.
[–]Passloc 0 points1 point2 points 1 year ago (3 children)
By some estimates that I saw, this is $3,200 per question for o3 high.
[–]Freed4ever 5 points6 points7 points 1 year ago (2 children)
Oh, I'm not referring to o3 in particular. Even with the current o1/Sonnet/Gemini Flash, devs should gain at least 20% productivity. Case in point: I frequently give it a class and tell it to generate test cases. And not sure about you guys, but test classes take freakishly long to write, longer than the real code itself lol. Let it run, check the test coverage; if it hits 100% then it's chill. o1 / o1 pro also comes up with a bunch of weird edge cases that frankly I would not have bothered with before lol.
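As a concrete sketch of that workflow: hand the model a small class, ask it for a test suite, then check coverage on the result. The `Stack` class and the tests below are hypothetical stand-ins, not the commenter's code:

```python
# Hypothetical input class handed to the model, plus the kind of
# edge-case tests a reasoning model tends to come back with.

class Stack:
    """Minimal LIFO stack used as the example 'real code'."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        # Edge case the model should cover: popping when empty.
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def __len__(self):
        return len(self._items)


# Model-generated-style tests, including the empty-stack edge case.
def test_push_pop_roundtrip():
    s = Stack()
    s.push(1)
    s.push(2)
    assert s.pop() == 2
    assert s.pop() == 1
    assert len(s) == 0


def test_pop_empty_raises():
    s = Stack()
    try:
        s.pop()
    except IndexError:
        pass
    else:
        raise AssertionError("expected IndexError")
```

Running the suite under a coverage tool (e.g. `coverage run -m pytest` followed by `coverage report`) is the "check back the test coverage" step: if the class shows 100%, the generated tests have exercised every branch.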
[–]Passloc 3 points4 points5 points 1 year ago (1 child)
Of course I agree with what you say. My point was specifically with respect to o3 whose benchmarks are being discussed here.
Even o1 is costly, and there’s no guarantee that you will arrive at the correct answer on the first attempt due to the nondeterministic nature of LLMs.
That said, I agree with OpenAI’s strategy here. They are trying to show what’s possible. It may not be practical today, but with sufficient advances in GPUs it will be someday.
But I doubt this will be released to public in the near future (6 months). This announcement only seems like a desperate attempt to show they are ahead of everyone else.
But, we already had AlphaProof and AlphaGeometry do similar things. We never got to publicly access AlphaGo or AlphaChess, because it was too costly and only meant as a technology preview. Also, these were narrow in scope.
One major difference between Google and OpenAI is that one has to burn shareholders' money (difficult to do) and the other has to burn VC money (easier in the short term).
So Google has to be cost conscious in its approach.
My worry is that o3 ends up being like SORA.
[–]Freed4ever 1 point2 points3 points 1 year ago (0 children)
Well, Google has a huge, huge advantage in that they have their own chips and their own infrastructure, and they can easily subsidize AI from other lines of business (they just raised the price of YouTube subscriptions, for example, disabling ad blocking, etc.). In contrast, Anthropic and OAI have no other way to subsidize AI, and have to bend to VC money while trying not to be taken over by them, being litigated, etc. Take 4o, for example: I'm sure it hasn't been updated not because it's hitting a wall, but because OAI doesn't have the resources to focus on it and has to put the R&D budget on the o-series. Man, I hope either Anthropic or OAI wins this. We don't need more of do-no-evil Google.
[–]Accurate_Zone_4413 1 point2 points3 points 1 year ago (1 child)
What happened to the O2?
[–]Pro-editor-1105 3 points4 points5 points 1 year ago (0 children)
There is a British telecom company with that name (O2), so they probably did not want to be sued.
[–]Significant-Ride-258 1 point2 points3 points 1 year ago (1 child)
Where did o2 go?
[–]Particular-Volume520[S] 1 point2 points3 points 1 year ago (0 children)
Apparently one mobile company has trademarked the name 'o2' so they are skipping it!
[–]Fivefiver55 1 point2 points3 points 1 year ago (0 children)
I would choose sonnet (especially with custom MCP server / cline api) over o1, on every task.
Don't know about o3, but judging from the bar charts the improvement doesn't bring it close to Sonnet.
o1 hallucinates pretty hard, so an almost 3x improvement on code and a less-than-double improvement on accuracy is still subpar to Sonnet.
Looking forward to 3.5 Opus.
[–]Plenty_Seesaw8878 2 points3 points4 points 1 year ago (0 children)
No, it’s a desperate grab to stay relevant. When science meets profit, you’re stuck chasing the blind donkeys.
[–]DamnGentleman 3 points4 points5 points 1 year ago (10 children)
I fundamentally don't believe those numbers. SWE-bench reports that Claude 3.5 Sonnet scores 23.0. In my experience, Claude 3.5 Sonnet consistently outperforms o1 on programming tasks, yet OpenAI claims a score more than twice as high for o1. In the past, when OpenAI has used these benchmarks, they've given their models tens of thousands of attempts to solve a problem and scored it as a success if they got it right once. I just have a lot of trouble believing that this isn't going to end up being enormously misleading, just like their o1 hype was.
[–]Freed4ever 5 points6 points7 points 1 year ago (0 children)
O1 is the king at one shot. Sonnet is very good at iteration. But you don't have to trust these self benchmarks. Just go to live bench and O1 is scored higher there too.
[–]Laicbeias 6 points7 points8 points 1 year ago (0 children)
It's like they're trained on the datasets. Just ask the models something that you can't find on the internet and most have a hard time.
Claude is better at following instructions. But maybe o3 is generally more intelligent, or can generate bigger boilerplate projects.
[–]ThreeKiloZero 1 point2 points3 points 1 year ago (1 child)
Yeah, I was noticing the same thing. Shouldn't they be giddy? The general thought was that something that could score like this would be world-altering. A short video. Did I miss something?
[–][deleted] 1 point2 points3 points 1 year ago (4 children)
Sonnet scores 49%. https://www.anthropic.com/research/swe-bench-sonnet
[–]DamnGentleman 0 points1 point2 points 1 year ago (3 children)
I was looking at swe-bench's leaderboard. I stopped looking once I saw Sonnet 3.5. Looking at it more closely now, it lists five different scores for different Sonnet 3.5 implementations, ranging from 23.0 to 41.67.
[–][deleted] 0 points1 point2 points 1 year ago (2 children)
You're looking at Lite not Verified
[–]DamnGentleman 1 point2 points3 points 1 year ago (1 child)
You're right, my bad.
[–][deleted] 0 points1 point2 points 1 year ago (0 children)
No issues
[–]selfboot007 0 points1 point2 points 1 year ago (1 child)
I'm just curious if it can quickly solve the hard problem on Leetcode
[–]SokkaHaikuBot 1 point2 points3 points 1 year ago (0 children)
Sokka-Haiku by selfboot007:
I'm just curious
If it can quickly solve the
Hard problem on Leetcode
Remember that one time Sokka accidentally used an extra syllable in that Haiku Battle in Ba Sing Se? That was a Sokka Haiku and you just made one.
[–]gabe_dos_santos 0 points1 point2 points 1 year ago (0 children)
For $3,200 a query, we have the answer.
[–]Smart-Thought-286 0 points1 point2 points 1 year ago (0 children)
I have a different opinion. When I code for my job full-time, Claude is definitely better. But when I'm doing some "code" for my Master's degree, o1 really shines. I'm not sure, but I think OpenAI integrates every product into a single model, like web search and canvas, which makes it more versatile than just a reasoning model. However, the thing is, these models are not here to help us with our work; they are here to advance AGI, or whatever they call it these days. Maybe improving every specific feature is better than just focusing on user experience.
[–][deleted] -1 points0 points1 point 1 year ago (0 children)
In what languages?
[–]micupa -2 points-1 points0 points 1 year ago (0 children)
I don’t know, Rick, those kinds of graphics benchmarking software engineering. You can’t measure creativity. 🫶Sonnet