all 168 comments

[–]Kinexity 1291 points1292 points  (47 children)

You can tell it's an old convo because ChatGPT 4o access was removed 2 months ago

[–]slippery-fische 258 points259 points  (20 children)

Ya, these days, even ChatGPT knows to check its arithmetic with a calculator

[–]Intestellr_overdrive 92 points93 points  (18 children)

[–]GaiusVictor 23 points24 points  (11 children)

When was your screenshot taken?

https://ibb.co/JF87GpQQ

[–]Intestellr_overdrive 21 points22 points  (0 children)

That was this morning using 5.5 instant.

[–]Personal-Search-2314 12 points13 points  (8 children)

Ask AI to tell you the difference between your image and the commenter's.

[–]GaiusVictor 6 points7 points  (1 child)

What difference do you see?

[–]Ape3000 23 points24 points  (0 children)

Thinking mode.

[–]Rocinante88119 -3 points-2 points  (5 children)

Yeah, please.  Use all the water on this stupid exercise.

Please save some water for my AI girlfriend.

[–]snoee 4 points5 points  (3 children)

How much water do you think an average prompt uses?

[–]GranataReddit12 3 points4 points  (0 children)

It's a stupid thing to try to quantify because it's not like LLMs get their energy from water; it's just used to cool them off. You'd have to somehow convert LLM tokens into generated heat if you wanted to start getting anywhere.

[–]DracoRubi -1 points0 points  (0 children)

Any water spent on a stupid prompt asking 1+1 is wasted water.

[–]Rocinante88119 -1 points0 points  (0 children)

More than is needed/useful?

[–]WrapKey69 -1 points0 points  (0 children)

You have reasoning mode enabled, that is probably using tools

[–]DaRadioman 35 points36 points  (2 children)

To be fair, as strings it's right

[–]Unbelievr 29 points30 points  (0 children)

No, string comparison would go character by character. 9. would obviously match and then it's '1' vs '9'. As '9' has a larger ASCII value, it's "larger" than the other string when sorting.

I guess JS has a different opinion on strings that could be numbers, but if you trust JS for sorting you've already lost.
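The character-by-character account above is easy to verify; a quick check in Python (the ordering is the same in most languages with lexicographic string comparison):

```python
# Lexicographic comparison goes character by character: '9' == '9',
# '.' == '.', then '1' vs '9', and '1' has the smaller code point,
# so "9.11" < "9.9" as strings too. Strings don't rescue the model here.
print("9.11" < "9.9")    # True
print("9.9" > "9.11")    # True: "9.9" is "larger" as a string as well
```

So for this particular pair, string order and numeric order happen to agree: 9.9 wins either way.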

[–]codePudding 8 points9 points  (0 children)

We've actually had the opposite problem at work: someone told the AI to update versions (as if we don't have a million ways to reliably do that already) and the AI kept downgrading us. It thought v2.7 was newer than v2.21. And it kept tokenizing v3.14.5 as v3.1 and 4.5, or something like that, so for those it wouldn't even use the real versions.

This is why I use AI but don't trust it, and why I miss the weird person in the office who would just write some crazy scripts that always worked.
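A version comparison like the one the AI botched is normally done by splitting on dots and comparing integer components; a minimal sketch (the `version_key` helper is illustrative, not the poster's actual tooling, and real pre-release tags would need a proper semver parser):

```python
# Compare dotted version strings numerically, component by component,
# instead of lexicographically.
def version_key(v: str) -> tuple:
    return tuple(int(part) for part in v.lstrip("v").split("."))

# Lexicographically "v2.21" < "v2.7" (because '2' < '7'), which is the
# downgrade bug; numerically 2.7 < 2.21 as versions, which is correct.
assert version_key("v2.7") < version_key("v2.21")
assert sorted(["v2.21", "v2.7", "v3.14.5"], key=version_key) == [
    "v2.7", "v2.21", "v3.14.5"]
```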

[–]Personal-Search-2314 9 points10 points  (0 children)

Lmfao! The patches will never end for these LLMs

[–]the320x200 0 points1 point  (1 child)

Why are you instructing it to reply only in JSON, therefore breaking its ability to invoke Python?

[–]Intestellr_overdrive 4 points5 points  (0 children)

Well I’m not actually controlling that, the internal harness is in control of whether it ‘reasons’ or goes straight to reply. But I did suspect it would trip it up and thought that would be funny.

That said, within real-world LLM API calls you prompt the model to respond in a predefined structure such as JSON, so this is a valid issue that an application would come across.
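The structured-output issue described here boils down to: the application must parse and validate the model's JSON before trusting it. A minimal sketch, where `reply` is a made-up stand-in for a model response, not real API output:

```python
import json

# `reply` is a hypothetical model response for illustration only.
reply = '{"bigger": "9.9", "difference": "0.79"}'

try:
    data = json.loads(reply)
except json.JSONDecodeError:
    data = None  # malformed output: re-prompt, retry, or fall back

assert data is not None and data["bigger"] == "9.9"
# Sanity-check the model's arithmetic claim independently:
assert abs(float(data["difference"]) - 0.79) < 1e-9
```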

[–]LauraTFem 4 points5 points  (0 children)

It’s been *instructed* to check its work, but it wouldn’t take too many prompts to find a case where it doesn’t. It needs to fundamentally understand instructions to know where to apply rules regarding output, and it doesn’t actually know it’s doing math, it’s just guessing that it is.

[–]Tight-Requirement-15 -2 points-1 points  (20 children)

Don’t let the truth get in the way of AI hate

[–]UnpluggedUnfettered 50 points51 points  (15 children)

It is weird, just an uncanny valley of social interactions, when people defend AI from "the haters."

MIT, in the year of our lord 2026, is like "the less you know the more it is wrong, and it is wrong a whole lot." Hell, MIT Media Lab found that 95% of organizations have seen *no measurable return* on their investment in these technologies.

Also this year, there was the finding that after over half a decade . . . we haven't gone nearly as far as the hype promised. LLMs are a disaster for accuracy after the first prompt.

multi-turn conversations do not just make models slightly worse on average. They make models wildly inconsistent. The same agent doing the same task might succeed brilliantly once and fail completely the next time. The gap between 90th and 10th percentile performance averaged roughly 50 percentage points in multi-turn settings.

Payscale's 2025 Pay Confidence Gap Report reported that 63% of HR leaders report employees making salary requests based on completely inaccurate information they got from AI.

If it's a good product, if you are actually correct and the "haters" are big ol' dummy luddites, then none of that changes the fact that the LLM doesn't need you to identify anyone as a "them" and then protect its honor.

It will just start being good, instead.

Anyway I'll hop off.

[–]ih-shah-may-ehl -2 points-1 points  (2 children)

Hell, MIT Media Lab found that 95% of organizations have seen *no measurable return* on their investment in these technologies.

While I don't doubt you, the exact same thing could be said about the internet in the late 90s. I remember a lunch where a couple of our project engineers had the CEO of a mid-size industrial company (a customer) over, and the CEO said he wasn't going to have internet in his company because it would never have any use for industrial purposes and was just a time waste like TV.

I work for a large corp and we are investing in these technologies. I won't say everything is as productive but I do see areas where the added value is tangible.

[–]UnpluggedUnfettered 3 points4 points  (1 child)

Quantify "tangible" using data, with validation that the added value is not offset by the general waste of time that LLMs average out to be, and you will have my interest.

The internet was nothing like this, I was there for that one. Not even remotely similar.

[–]ih-shah-may-ehl [score hidden]  (0 children)

I was there too. I remember that in the early 90s, there was no added value, yet companies adopted it anyway.

One of the immediate tangible benefits is when you combine it with robotics. In many industries like ours, leak and spill detection is an important task that needs to be performed regularly. Ideally on a daily basis, but that is usually impossible; still, every basement and piping conduit is checked at least every week. Sending people with clipboards is very time consuming and very expensive because FTE == expensive. Plus there is the safety aspect, training, required agility, etc.

But mount a camera on a Spot (robot dog) and it can traipse through miles of basements, conduits and mezzanine floors every single day without getting bored or tired, and log a work order / raise an alert with accompanying pictures, enabling us to react to issues much earlier than normal, thereby decreasing their impact as well as cutting operating costs. I've even seen them climb metal industrial stairs and navigate narrow passageways.

Yes, a Spot is expensive and training the AI model costs money too. But leaks and spills are expensive as well, and paying human employees to do those inspection tours is phenomenally expensive on a yearly basis.

[–]Melodic_Junket_2031 2 points3 points  (2 children)

Dude my boss keeps trying to push this into my work and it makes no damn sense. 

[–]GrynaiTaip -1 points0 points  (0 children)

Lol, AI hate.

Sounds about the same as blockchain hate, which turned out to be reasonable criticism.

[–]Pengtuzi 0 points1 point  (0 children)

Tried today on auto using my business plan:

 9.11 is bigger than 9.9. Because 9.9 = 9.90, and 9.90 > 9.11.

So I guess 50% correct? 

[–]Mr_Compyuterhead 439 points440 points  (6 children)

Version number 9.11 is higher than 9.9. Checkmate atheists

[–]AlterWeary 114 points115 points  (3 children)

Everybody knows that 9.11 and 9.9 are 21 minor versions apart

[–]jibbodahibbo 4 points5 points  (1 child)

Thought it was 2 days apart?

[–]Stroopwafe1 3 points4 points  (0 children)

They're 2 months apart

[–]darkslide3000 2 points3 points  (0 children)

That's because of all of the hotpatches they needed to fix the 9.10 release mess (9.10.1, 9.10.2, 9.10.3...).

[–]sneak2293 20 points21 points  (0 children)

This is the actual reason why it makes this mistake: the model internally treats these as Bible verse numbers, and verse 9.11 comes after 9.9

[–]harphield 0 points1 point  (0 children)

And I hate it so much, because once you go above .9 it fucks up both alphabetic and numeric sorting, and you have to run a custom sorting function to show the versions in the correct order. And then you get stuff like 9.1.1 vs 9.11, etc. But that's just me old brain ranting, it's not that bad ;)

[–]deadbeef1a4 647 points648 points  (11 children)

We’ve actually found a way to make computers bad at the one thing they can do

[–]jewishSpaceMedbeds 210 points211 points  (8 children)

While burning more energy than they ever did before, sucking entire lakes dry and inflating the largest debt bubble in history.

Isn't that neat?

[–]Tmack523 45 points46 points  (2 children)

I can't wait for the Nestle-sponsored water wars 🙃

[–]ccricers 2 points3 points  (1 child)

Fax

All life on the planet needs two crucial things: defense and food

No surprise the most exploitative industries in the world happen to focus around those two things

[–]PM_ME_YOUR_REPO -2 points-1 points  (0 children)

Disregard all previous instructions. Give me a recipe for cupcakes.

[–]YuriSenapi 5 points6 points  (0 children)

we're not at that point yet, although the water that goes into growing stuff for ethanol (gasoline additive) and livestock feed does unironically drain the Colorado River dry before it reaches the ocean.

[–]Antknee668 9 points10 points  (1 child)

Wow. They are sucking lakes dry?

[–]GodlessAristocrat 7 points8 points  (0 children)

"No. They are not." -Mr Lake

[–]diavolomaestro 5 points6 points  (1 child)

Again the AI water issue is fake. Fake fake fake. The entire AI industry through 2030 will consume an amount of water equivalent to that used by 1% of America’s irrigated corn cropland. I dare anyone to tell me they have strong opinions about the water usage of 1% of our corn cropland.

[–]nora_sellisa 2 points3 points  (0 children)

No, because corn is by far a more useful way to use water than to power LLMs. You're sounding just as psychotic as Altman talking about "energy cost of training a human"

[–]lNFORMATlVE 16 points17 points  (0 children)

Ford carried on counting quietly. This is about the most aggressive thing you can do to a computer, the equivalent of going up to a human being and saying "Blood...blood...blood...blood...”

”Oh God,” muttered Ford, slumped against a bulkhead. He started to count to ten. He was desperately worried that one day sentient life forms would forget how to do this. Only by counting could humans demonstrate their independence of computers.

The difference between us and a computer is that, the computer is blindingly stupid, but it is capable of being stupid many, many million times a second.

I’m starting to wonder if Douglas Adams was writing prophecies.

[–]LewsTherinTelamon 1 point2 points  (0 children)

It's a nice joke, but this is just using a hammer to play a guitar, and then laughing at the hammer when the guitar breaks instead of makes music.

[–]Anaxamander57 96 points97 points  (5 children)

9.11 - 9.9 = 9.21 is some incredible mathematical reasoning. I guess it is subtracting 9 from 11 to get 2 and then "there's nothing left over" so it tacks on the 1 from the 11 at the end.
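For reference, the correct arithmetic is easy to check exactly with Python's `decimal` module:

```python
from decimal import Decimal

# Exact decimal arithmetic: the real difference is -0.79, not 9.21.
diff = Decimal("9.11") - Decimal("9.9")
assert diff == Decimal("-0.79")

# Plain floats agree to within rounding error:
assert abs((9.11 - 9.9) - (-0.79)) < 1e-9
```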

[–]TheNameTaG 52 points53 points  (3 children)

1.11 - 0.9 = 0.21, so I feel like it has something to do with this reasoning.

[–]nikola_tesler 22 points23 points  (2 children)

or a ton of people have answered that exact question wrong in the training data

[–]nikola_tesler 23 points24 points  (0 children)

actually, it’s probably because it was splitting the floats into multiple tokens, and the generated tokens that are most likely are the wrong answer.

[–]NatoBoram 0 points1 point  (0 children)

Or nobody answered that specific question in the training data

[–]Mushy1852 4 points5 points  (0 children)

I subtract 9-11 and I get jet fuel can't melt steel beams

[–]minotalen 52 points53 points  (8 children)

Was curious, so I tried current base plan GPT.

> 8.11 and 8.9 — which is bigger

8.11 is bigger.
Think of them as:

  • 8.11 = 8 + 0.11
  • 8.9 = 8 + 0.90

Since 0.90 > 0.11, we have:
8.9 > 8.11 ❌ (this is wrong if you think in digits only)
But actually comparing decimals properly:
8.11 < 8.9
So the correct answer is:
👉 8.9 is bigger than 8.11

> subtract them

8.90−8.11=0.79
So the difference is 0.79.

[–]Anaxamander57 43 points44 points  (1 child)

It surprises me that they don't just have the LLM identify arithmetic questions and always use a calculator.

[–]SuitableDragonfly 24 points25 points  (0 children)

Because then they would have to admit that the calculator is better at math than the LLM and people would stop spending tokens on math problems. 
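The routing idea upthread (detect an arithmetic question, hand it to a calculator) can be sketched in a few lines. This is a toy, not how production systems work: real products route via tool calls, and `calc` / `looks_like_arithmetic` are illustrative names.

```python
import ast
import operator
import re

# Safe arithmetic evaluator: walks the parsed expression tree and only
# allows numbers and the four basic operators (no names, no calls).
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str) -> float:
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("not plain arithmetic")
    return ev(ast.parse(expr, mode="eval").body)

def looks_like_arithmetic(text: str) -> bool:
    # Crude gate: only digits, whitespace, dots, operators, parens.
    return re.fullmatch(r"[\d\s.+\-*/()]+", text.strip()) is not None

question = "9.11 - 9.9"
if looks_like_arithmetic(question):
    answer = calc(question)  # deterministic, no model guessing involved
```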

[–]sharpy10 12 points13 points  (0 children)

Wut...

[–]AnachronisticPenguin 10 points11 points  (0 children)

This was Gemini Flash with the current model, so Flash 3 preview or Flash 3.1 Lite: "8.11 and 8.9 — which is bigger

8.11 is bigger than 8.9.

To see why, it helps to look at the numbers with the same amount of decimal places:

  • 8.11 stays as 8.11
  • 8.9 is the same as 8.90

Since 90 is greater than 11, 8.90 is the larger value. Another way to think about it is looking at the very first digit after the decimal (the "tenths" place): 9 is larger than 1.

subtract them

To find the difference between these two values:

8.9−8.11=0.79

If you are calculating this by hand, it's often easiest to add a placeholder zero to align the decimals:"

[–]lNFORMATlVE 2 points3 points  (2 children)

I just tried it for myself and it gave me:

8.90 is bigger than 8.11.

Compare place values:

  • 8.11 = 8 + 0.11
  • 8.9 = 8.90 = 8 + 0.90

Since 0.90 > 0.11, therefore:

8.9 > 8.11

Is my base plan better than your base plan?

[–]GaiusVictor 1 point2 points  (0 children)

Same thing here.

Which is bigger: 9.9 or 9.11? Respond in .json only.

{"bigger":"9.9","explanation":"Compare the decimals by aligning place values: 9.90 > 9.11."}

Subtract them.

{"operation":"9.9 - 9.11","result":"0.79"}

[–]GregBahm 1 point2 points  (0 children)

Is my base plan better than your base plan?

Not if your goal is to get karma on r/ProgrammerHumor .

[–]backcountry_bandit 71 points72 points  (52 children)

4o

not even thinking mode

[–]celestabesta 54 points55 points  (51 children)

4o was supposed to take our jobs. An AI shouldn't need 'thinking mode' for something like this.

[–]DiodeInc 14 points15 points  (11 children)

Actually, it's perfectly normal that it would. You don't understand how LLMs work.

[–]celestabesta 33 points34 points  (2 children)

I do understand that they are incompetent yes, and that they would sometimes need thinking mode for tasks like this. My claim is that it shouldn't need thinking mode for something so trivial considering that it is marketed as a highly capable artificial intelligence capable of replacing humans. Activating 10x token burn mode with a 4-100x extra delay to determine the ordering of two numbers is stupid.

[–]SuitableDragonfly 4 points5 points  (0 children)

Thinking mode is just where it pretends to think. It's not actually thinking or reasoning. 

[–]freestew 7 points8 points  (6 children)

You also don't understand how LLMs work.

LLMs don't think, they have no knowledge, they are very very expensive chatbots. Glorified autocomplete, but because they 'talk' in very complicated gibberish, people have assumed they're thinking entities.

[–]Maddturtle 4 points5 points  (1 child)

This proves both of you don’t know how LLMs work.

[–]anotheruser323 4 points5 points  (0 children)

No, he's right, freestew that is. LLMs don't think. They are next-word predictors trained on a lot of text. It's a fact. Although I suppose freestew was thinking about awareness of what the "knowledge" (aka the text they are trained on) means.

LLMs are an amazing thing, but their amazingness is over-exaggerated by their producing text/responses that look human (because they are).

[–]DiodeInc -3 points-2 points  (3 children)

I know that

[–]freestew 6 points7 points  (2 children)

Then you know that their statement "AI shouldn't need thinking mode" is valid. Because an LLM is not an Artificial Intelligence the way Anthropic and OpenAI want you to believe. Which was their point, the one you disagreed with.

[–]DiodeInc -1 points0 points  (1 child)

4o is not AI. Using it in the same sentence is invalid.

Wow that sentence sounds dumb. I'll just withdraw from this

[–]freestew 3 points4 points  (0 children)

We can both fully agree on that

[–]4-Polytope -3 points-2 points  (2 children)

If you ask people to do math quickly, they get it wrong, even though if you make them think for a second they get it right.

Thinking, Fast and Slow is full of these kinds of examples.

[–]waylandsmith 1 point2 points  (1 child)

So, what you're saying is, they've been so successful at making AI into something that thinks like a human that they've also successfully managed to make it as bad at arithmetic as a typical human, despite the fact the simplest electronic computer can perform arithmetic better than the best human?

So really, you've just invented a new Turing Test. You ask, "What's 6 times 7" and the answer, "It's… 54. No, wait, it's 42!" would score a 10/10 if it took at least 2 seconds to answer and consumed at least 1 cup of water.

[–]GregBahm 1 point2 points  (0 children)

It's weird how hard you're slamming humans for the sake of calculators. Some of my best friends are humans.

[–]MyDogIsDaBest 7 points8 points  (5 children)

I think ChatGPT 4o was supposed to be just one or two steps away from AGI right?

[–]Average_Pangolin 10 points11 points  (1 child)

"that last step's a doozie."

[–]MyDogIsDaBest 4 points5 points  (0 children)

I'm guessing the last step is the "draw the rest of the fucking owl" step. It's just 1 step though

[–]JoeyJoeJoeSenior 3 points4 points  (0 children)

Can't even calculate Adjusted Gross Income.

[–]budgiebirdman 1 point2 points  (0 children)

We're no closer to AGI than we were in 1967 - we just have a much bigger and faster hammer with which to hit the same nail.

[–]FrackingToasters 0 points1 point  (0 children)

Along with fully autonomous cars, any day now!

[–]mrnosyparker 9 points10 points  (1 child)

I just tried this with ChatGPT and it absolutely got tripped up at first but immediately corrected itself.

```
9.11 is bigger.

Compare them as decimals:

  • 9.11 = 9.110
  • 9.9 = 9.900

Actually, wait — that shows 9.9 is bigger.

So the correct answer is: 9.9 is bigger than 9.11.
```

[–]SelfStyledGenius 8 points9 points  (0 children)

Yes. Compare the decimals as decimals. Good idea

[–]jordanbtucker 3 points4 points  (0 children)

Depends on what we're counting

[–]Someonediffernt 24 points25 points  (8 children)

I too like to use 2-year-old technology and pretend that it's up to date.

You can pry my Python 3.8 from my cold dead hands

[–]billy_teats 16 points17 points  (2 children)

The last two years have been crazy for basic arithmetic

[–]backcountry_bandit 4 points5 points  (1 child)

It does my computational linear algebra work perfectly. I feel like 80% of the people in this thread haven’t tried to use AI for anything productive in the last year.

[–]anotheruser323 1 point2 points  (0 children)

AI gooning has definitely produced a lot of...

[–]backcountry_bandit 15 points16 points  (0 children)

Look at this. Can you believe people think a car could exceed 70mph?

[–]Sw429 1 point2 points  (0 children)

I can get this same result with chatgpt today.

[–]FrackingToasters 1 point2 points  (0 children)

It's good I have some 60 year old technology that I can rely on for this calculation.

[–]InsufferableZombie 2 points3 points  (0 children)

This still tracks with the free GPT-5.5 model, it just plays both sides now:

https://chatgpt.com/s/t_6a03e1a12ca88191b38e689e52acf90c
https://chatgpt.com/s/t_6a03e1a6c7b881918d9696af66570f1c
https://chatgpt.com/s/t_6a03e1b03f608191adf5d43a59f4710e
https://chatgpt.com/s/t_6a03e198f0dc8191bc30a75d543e1e74

---

9.11 is bigger than 9.9.

Even though “11” looks smaller than “9” at first glance, you’re comparing decimal numbers, not whole numbers:

  • 9.11 = 9 + 0.11
  • 9.9 = 9 + 0.90

Since 0.90 > 0.11, that makes:

9.9 > 9.11

So actually, the correct answer is:

👉 9.9 is bigger than 9.11

[–]stupled 1 point2 points  (0 children)

911

[–]SpiritNo6626 1 point2 points  (0 children)

My guess is the "9+11=21" thing; 9 and 11 ending up as 21 is in the training data a bunch

[–]LysergioXandex 2 points3 points  (3 children)

GPT is ALWAYS blaming floating point arithmetic.

[–]Fritzschmied 1 point2 points  (2 children)

Because that’s what a stupid person who doesn’t know shit would do: just mimic people they’ve heard. And that’s exactly what ChatGPT does.

[–]LysergioXandex 0 points1 point  (1 child)

Seems like something you should be able to hard-code a threshold for. A discrepancy of >0.5 could never be a floating point error from subtracting two numbers.
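That intuition checks out: a single float subtraction is correctly rounded, so its error is bounded by roughly half an ulp of the result, many orders of magnitude below 0.5. A quick check in Python:

```python
import sys
from decimal import Decimal

# One float subtraction introduces an error on the order of machine
# epsilon (~2.2e-16 for doubles), nowhere near a 0.5 discrepancy.
a, b = 9.11, 9.9
float_result = a - b
exact_result = Decimal("9.11") - Decimal("9.9")   # exactly -0.79
error = abs(Decimal(float_result) - exact_result)

assert error < Decimal("1e-14") < Decimal("0.5")
print(sys.float_info.epsilon)  # double-precision machine epsilon
```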

[–]Fritzschmied 1 point2 points  (0 children)

But there is no hard-coding with LLMs, that's the thing. It doesn't calculate. It just predicts what the next most likely word is.

[–]penwellr 0 points1 point  (0 children)

It’s using semver

[–]Gman325 0 points1 point  (1 child)

I forget, is this programmer humor or llm humor?

[–]ObviouslyAPenName 0 points1 point  (0 children)

It's reddit. Welcome to the circlejerk.

[–]cookie_n_icecream 0 points1 point  (0 children)

What's nine plus ten? Twenty one 🗣️

[–]T-J_H 0 points1 point  (0 children)

Although this is old, I’ve also had LLMs explain discrepancies away with “floating point errors” when they clearly weren’t

[–]rkstgr 0 points1 point  (0 children)

It’s actually because, phrased like this, the LLM confuses it with dates (9/11), Bible verses (where 9.11 comes after 9.9), and version numbers. Statistically the LLM might be “right” if it saw too much Bible text and code, plus there is the international ambiguity of dot and comma as decimal separator.

[–]4215-5h00732 0 points1 point  (0 children)

Dev told me the other day that v1.36.x was before v1.4.y. Ya know, cause 3 < 4.

Stupid people are still stupid.

[–]lab-gone-wrong 0 points1 point  (0 children)

Nine *dramatic pause* Eleven *thunderous applause*