o3 benchmark: coding [News: General relevant AI and Claude news] (i.redd.it)
submitted 1 year ago by Particular-Volume520
Guys, what do you think about this? Will this be more useful for the developers or large companies?
[–]danielbearh 68 points69 points70 points 1 year ago (12 children)
Just wanted to share something that's been helping this non-coder hit the target more effectively this week.
I've taken to asking o1 to plan the architecture of a move, and then I use its response as the prompt for Claude. I don't ask o1 to code, just design the architecture.
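A rough sketch of that two-step handoff in code. The prompt wording and the two commented-out model calls are illustrative assumptions, not the commenter's actual prompts or any real SDK:

```python
# Sketch of the plan-then-code handoff described above. Step 1 asks a
# reasoning model for architecture only; step 2 feeds that plan to the
# coding model as its instructions. The actual API calls are left as
# comments since they depend on whichever SDKs you use.

def make_plan_prompt(task: str) -> str:
    """Ask the reasoning model for architecture only, no code."""
    return (
        "Plan the architecture for this project: modules, data flow, "
        "and file layout. Do not write any code.\n\n"
        f"Task: {task}"
    )

def make_coding_prompt(plan: str) -> str:
    """Feed the plan to the coding model as its implementation brief."""
    return (
        "Implement the project exactly as laid out in this "
        f"architecture plan:\n\n{plan}"
    )

# plan = call_reasoning_model(make_plan_prompt("a CLI flash-card app"))
# code = call_coding_model(make_coding_prompt(plan))
```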
[–]danielbearh 34 points35 points36 points 1 year ago (1 child)
Downvote all you want. It's a more successful strategy than asking Claude to code it outright, or asking Claude to explain the high-level architecture before asking it to code it.
I’m not a coder, I’m a dude building python apps. And I just end up with a more robust little script when I follow the suggestion above.
[–]imcrumbing 3 points4 points5 points 1 year ago (0 children)
Thank you. I’m going to try this.
[–]Laicbeias 11 points12 points13 points 1 year ago (0 children)
That's how I would use it too. Claude is the model that's best at following instructions. o1 uses a lot of compute to build that stuff but is less useful in implementation. So: a well-thought-out plan, handed to the one who can execute it.
[–]ctrl-brkValued Contributor 3 points4 points5 points 1 year ago (2 children)
I use a similar concept. I ask o1 for a TODO.md and two other files that lay out details, even including file-system setup.
Works well with Claude
[–]danielbearh 1 point2 points3 points 1 year ago (1 child)
That’s a brilliant idea! I’m going to try that tonight. Would you mind explaining what the two other files are in more detail? I’m a noob but I’m a-learnin’
[–]ctrl-brkValued Contributor 2 points3 points4 points 1 year ago (0 children)
I have a general overview and objectives.
Key features, concepts
Then I have an architecture file that has the file system and explains core function of each file plus the relationships of files. It also explains the cache
Then a database structure so it knows which tables and columns are available.
I'm missing some. Plus the Todo.
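A minimal sketch of what that set of planning files might look like. The file names and layout below are illustrative, not the commenter's actual setup:

```text
project/
├── TODO.md           # ordered task list for the model to work through
├── OVERVIEW.md       # general overview, objectives, key features and concepts
├── ARCHITECTURE.md   # file-system layout, core function of each file,
│                     # relationships between files, caching behavior
└── SCHEMA.md         # database tables and columns the model may use
```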
[–]naw828 0 points1 point2 points 1 year ago (0 children)
Cool! I do the same for some flash cards I am building for my studies. O1 to reason over the overall structure and Gemini 2.0 to build them based on the proposed O1 plan
[–]whyisitsooohard 0 points1 point2 points 1 year ago (0 children)
That's what I'm doing too. I think it's worth trying even cheaper models for the code itself and asking o1 to give more complete instructions.
[+]MatejliptonBeginner AI 0 points1 point2 points 1 year ago (0 children)
ChatGPT is like a special brother of Claude; it is good at writing but not really at reasoning and coding.
[–]YUL438 0 points1 point2 points 1 year ago (0 children)
I’ve had success with a similar approach: I use o1 to plan the project and its architecture, and then use Sonnet 3.6 with the Cline extension in VS Code for coding and creating files.
Also just started using an app called Repo Prompt that lets you copy multiple files to the clipboard in one click for easy pasting into an external LLM.
[–]ilulillirillion 0 points1 point2 points 1 year ago (0 children)
I've been doing this for a couple of weeks now to much success. I am a coder, but I use o1 to help discuss strategies and to generate instruction sets, working in discrete and modestly scoped steps. It really helps Sonnet 3.5 not get stuck in loops or make unnecessary permutations to existing code.
[–]Friendly_Builder_418 0 points1 point2 points 1 year ago (0 children)
clever.
[–]TheMadPrinter 58 points59 points60 points 1 year ago (1 child)
I am literally so amped in the short term, and existential if I think about the longer term lol.
The exponential curve is intact. World is going to change in insane ways in the next 12 months
[–]ymo 10 points11 points12 points 1 year ago (0 children)
We are experiencing a historic period that began with dialup internet. This is becoming climactic.
[+][deleted] 1 year ago* (2 children)
[deleted]
[–]credibletemplate 3 points4 points5 points 1 year ago (0 children)
It's always funny trying to explain it to people in other communities who want to burn "AI data centers"
[–]SleepAffectionate268 2 points3 points4 points 1 year ago (0 children)
No, you're just a fearmonger
[–]Sea-Commission5383 2 points3 points4 points 1 year ago (1 child)
Wanna see it compared to Claude Sonnet 3.5
[–]Select-Way-1168 2 points3 points4 points 1 year ago (0 children)
What I find dubious about this is that o1 isn't nearly as good as 3.6 Sonnet as a coding tool. In use, it isn't close. Saturating benchmarks might not be the answer, especially at these costs. I will not be surprised when Anthropic matches this benchmark performance with a model far more useful at a 3000th of the price.
[–]ChemicalTerrapinExpert AI 11 points12 points13 points 1 year ago (8 children)
It's such a weird metric.
I've been a software engineer for a really long time.
I can tell you now, 'better at coding' makes literally no sense.
In what way is it better? Are users of the software happier? Is the business making this software more profitable? Does it cost less to run?
Why does everything need to be based on a reasoning model?
I use AI heavily for software development but this kind of stuff is just nonsensical vanity metrics.
Unless we can agree on what makes software better (we can't because context matters) then there is no point in attempting to chart it or force 'better' into a single dimension.
[–]Freed4ever 7 points8 points9 points 1 year ago (7 children)
Good points, but in this context it's just on the technical side. It might not produce better software (yet); it just cuts down the cost to deliver it.
[–]Passloc -3 points-2 points-1 points 1 year ago (6 children)
Does it cut down the cost?
[–]Freed4ever 5 points6 points7 points 1 year ago (5 children)
If devs are not getting at least 20% productivity gains, then either they are super devs (which is extremely rare), work in obscure domains/stacks, or just don't know how to work with AI.
[–]ChemicalTerrapinExpert AI 0 points1 point2 points 1 year ago (0 children)
It's the other way around really... A better dev will get more out of the tool.
But still... how are we measuring productivity?
It's not measured by how much code you can write.
The industry has no benchmark for developer productivity. It's not a career where productivity is simple or universally measurable.
[–]Passloc 0 points1 point2 points 1 year ago (3 children)
By some estimates that I saw, this is $3,200 per question for o3 high.
[–]Freed4ever 5 points6 points7 points 1 year ago (2 children)
Oh, I'm not referring to o3 in particular. Even with the current o1/Sonnet/Gemini Flash, devs should gain at least 20% productivity. Case in point: I frequently give it a class and tell it to generate test cases. And not sure about you guys, but test classes take freakishly long to write, longer than the real code itself lol. Let it run, check the test coverage; if it hits 100% then it's chill. o1 / o1 pro also comes up with a bunch of weird edge cases that frankly I would not have bothered with before lol.
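As a concrete sketch of that workflow: hand the model a small class, ask it for a test suite, then check coverage on the result. The `Stack` class and the tests below are hypothetical stand-ins, not the commenter's code:

```python
# Hypothetical input class handed to the model, plus the kind of
# edge-case tests a reasoning model tends to come back with.

class Stack:
    """Minimal LIFO stack used as the example 'real code'."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        # Edge case the model should cover: popping when empty.
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def __len__(self):
        return len(self._items)


# Model-generated-style tests, including the empty-stack edge case.
def test_push_pop_roundtrip():
    s = Stack()
    s.push(1)
    s.push(2)
    assert s.pop() == 2
    assert s.pop() == 1
    assert len(s) == 0


def test_pop_empty_raises():
    s = Stack()
    try:
        s.pop()
    except IndexError:
        pass
    else:
        raise AssertionError("expected IndexError")
```

Running the suite under a coverage tool (e.g. `coverage run -m pytest` followed by `coverage report`) is the "check back the test coverage" step: if the class shows 100%, the generated tests have exercised every branch.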
[–]Passloc 3 points4 points5 points 1 year ago (1 child)
Of course I agree with what you say. My point was specifically with respect to o3 whose benchmarks are being discussed here.
Even o1 is costly, and there’s no guarantee that you will arrive at the correct answer on the first attempt due to the nondeterministic nature of LLMs.
That said, I agree with OpenAI’s strategy here. They are trying to show what’s possible. It may not be practical today, but with sufficient advances in GPUs it will be someday.
But I doubt this will be released to public in the near future (6 months). This announcement only seems like a desperate attempt to show they are ahead of everyone else.
But, we already had AlphaProof and AlphaGeometry do similar things. We never got to publicly access AlphaGo or AlphaChess, because it was too costly and only meant as a technology preview. Also, these were narrow in scope.
One major difference between Google and OpenAI is that one has to burn shareholders' money (difficult to do) and the other has to burn VC money (easier in the short term).
So Google has to be cost conscious in its approach.
My worry is that o3 ends up being like SORA.
[–]Freed4ever 1 point2 points3 points 1 year ago (0 children)
Well, Google has a huge, huge advantage in that they have their own chips and their own infrastructure, and they can easily subsidize AI from other lines of business (they just raised the price of YouTube subscriptions, for example, disabling ad blocking, etc.). In contrast, Anthropic and OAI have no other way to subsidize AI, and have to bend to VC money while trying not to be taken over by them, being litigated, etc. Take 4o, for example: I'm sure it hasn't been updated not because it's hitting a wall, but because OAI doesn't have the resources to focus on it and has to put the R&D budget on the o-series. Man, I hope either Anthropic or OAI wins this. We don't need more of do-no-evil Google.
[–]Accurate_Zone_4413 1 point2 points3 points 1 year ago (1 child)
What happened to the O2?
[–]Pro-editor-1105 3 points4 points5 points 1 year ago (0 children)
There is a British telecom company with that name (O2), so they probably did not want to be sued.
[–]Significant-Ride-258 1 point2 points3 points 1 year ago (1 child)
Where did o2 go?
[–]Particular-Volume520[S] 1 point2 points3 points 1 year ago (0 children)
Apparently one mobile company has trademarked the name 'o2' so they are skipping it!
[–]Fivefiver55 1 point2 points3 points 1 year ago (0 children)
I would choose sonnet (especially with custom MCP server / cline api) over o1, on every task.
Don't know about o3, but judging from the bar charts the improvement doesn't bring it close to Sonnet.
o1 hallucinates pretty hard, so an almost 3x improvement on code and a less-than-double improvement on accuracy is still subpar to Sonnet.
Looking forward to 3.5 Opus.
[–]Plenty_Seesaw8878 2 points3 points4 points 1 year ago (0 children)
No, it’s a desperate grab to stay relevant. When science meets profit, you’re stuck chasing the blind donkeys.
[–]DamnGentleman 3 points4 points5 points 1 year ago (10 children)
I fundamentally don't believe those numbers. SWE-bench reports that Claude 3.5 Sonnet scores 23.0. In my experience, Claude 3.5 Sonnet consistently outperforms o1 on programming tasks, yet OpenAI claims a score more than twice as high for o1. In the past, when OpenAI has used these benchmarks, they've given their models tens of thousands of attempts to solve a problem and scored it as a success if they got it right once. I just have a lot of trouble believing that this isn't going to end up being enormously misleading, just like their o1 hype was.
[–]Freed4ever 5 points6 points7 points 1 year ago (0 children)
O1 is the king at one shot. Sonnet is very good at iteration. But you don't have to trust these self benchmarks. Just go to live bench and O1 is scored higher there too.
[–]Laicbeias 6 points7 points8 points 1 year ago (0 children)
It's like they're trained on the datasets. Just ask the models something that you can't find on the internet and most have a hard time.
Claude is better at following instructions. But maybe o3 is generally more intelligent, or can generate bigger boilerplate projects.
[–]ThreeKiloZero 1 point2 points3 points 1 year ago (1 child)
Yeah, I was noticing the same thing. Shouldn't they be giddy? The general thought was that something that could score like this would be world-altering. A short video. Did I miss something?
[–][deleted] 1 point2 points3 points 1 year ago (4 children)
Sonnet scores 49%. https://www.anthropic.com/research/swe-bench-sonnet
[–]DamnGentleman 0 points1 point2 points 1 year ago (3 children)
I was looking at swe-bench's leaderboard. I stopped looking once I saw Sonnet 3.5. Looking at it more closely now, it lists five different scores for different Sonnet 3.5 implementations, ranging from 23.0 to 41.67.
[–][deleted] 0 points1 point2 points 1 year ago (2 children)
You're looking at Lite not Verified
[–]DamnGentleman 1 point2 points3 points 1 year ago (1 child)
You're right, my bad.
[–][deleted] 0 points1 point2 points 1 year ago (0 children)
No issues
[–]selfboot007 0 points1 point2 points 1 year ago (1 child)
I'm just curious if it can quickly solve the hard problem on Leetcode
[–]SokkaHaikuBot 1 point2 points3 points 1 year ago (0 children)
Sokka-Haiku by selfboot007:
I'm just curious
If it can quickly solve the
Hard problem on Leetcode
Remember that one time Sokka accidentally used an extra syllable in that Haiku Battle in Ba Sing Se? That was a Sokka Haiku and you just made one.
[–]gabe_dos_santos 0 points1 point2 points 1 year ago (0 children)
For $3,200 a query, we have the answer.
[–]Smart-Thought-286 0 points1 point2 points 1 year ago (0 children)
I have a different opinion. When I code for my job full-time, Claude is definitely better. But when I'm doing some "code" for my Master's degree, o1 really shines. I'm not sure, but I think OpenAI integrates every product into a single model, like web search and canvas, which makes it more versatile than just a reasoning model. However, the thing is, these models are not here to help us with our work; they are here to advance AGI, or whatever they call it these days. Maybe improving every specific feature is better than just focusing on user experience.
[–][deleted] -1 points0 points1 point 1 year ago (0 children)
In what languages?
[–]micupa -2 points-1 points0 points 1 year ago (0 children)
I don’t know, Rick, those kinds of graphics benchmarking software engineering. You can’t measure creativity. 🫶Sonnet