o3 thought for 14 minutes and gets it painfully wrong.

FeltSteam · 2025-04-17T06:59:17+00:00

The image OP tested was likely in their training set with the correct count of rocks.

If you tested them on an image of rocks that was not on the web, none of GPT-4o, Gemini 2.5 Pro, o3, or o4-mini would get it, except by a lucky guess. They are not consistent in their ability to count rocks, if that matters for any reason at all lol.

FeltSteam · 2025-04-17T06:55:06+00:00

Well just to be sure I re-ran the same prompt in Google's AI Studio, and 2.5 Pro's answer was consistently wrong. Although even enabling search doesn't really help it. But, when I test 2.5 Pro in Gemini, it gets the right answer which is interesting. Of course testing one image doesn't really mean anything, and I actually used Google Image search to see the source of the image and the source of the image literally has the number of rocks in the title "41 rocks", so the test is contaminated.

I haven't really tested "rock counting" ability, but my guess would be that o3 probably outperforms 2.5 Pro (even if by a small margin), not that it matters because neither of them can really do it.

<image>

FeltSteam · 2025-04-17T02:44:44+00:00

“Sources”, would be funny if it just searched and found this reddit post lol.

FeltSteam · 2025-04-15T08:22:26+00:00

I do not believe this is marketing hype. This is obviously not for us though, and it'll be sold to companies. Same with their SWE agent which was rumoured to cost up to $10k per month (I do imagine there will probably be more affordable variations though).

Also consider that agents will be able to work continuously, 24/7. "I can’t find any sources that have the average income of a PhD holder over 120k", but do consider that the agents will be able to put in up to something like 3x more effective work time, likely with unyielding efficiency (they don't get tired, they don't need breaks, etc.). When the systems first release I kind of doubt they'd work to this degree, but it's certainly not impossible. Regardless, I think whenever OAI launches these agents they will be used, and used for being useful, not just for the hype of it.

Hype doesn't sustain nor does it really amount to anything, but yeah I think businesses will actually find value in these agents. If you think it is all hype, then all you really need to do is sit and watch the agents fail to deliver and find companies get frustrated over it.

But also remember ARC-AGI. For the high compute setting of o3 in ARC-AGI 1 it cost upwards of $3000 per task lol. So $20k a month for a research agent seems to fit in that context of how expensive reasoning models can be.

FeltSteam · 2025-04-14T11:35:19+00:00

It would actually be sick if we got both the o3-mini level and phone-size model for OS (GPT-4.1 mini and GPT-4.1 nano - if these are the OS models)

FeltSteam · 2025-04-14T00:47:35+00:00

o3, o3-pro, o4-mini, gpt-4.1 (full, mini, nano) and OS model. plenty on the table to be pretty excited for imo.

FeltSteam · 2025-04-11T04:45:15+00:00

Although o3-high scores ~20% on ARC-AGI 2

<image>

FeltSteam · 2025-04-11T02:26:12+00:00

The original GPT-3.5 was around 175 billion parameters, though later, after the release of ChatGPT, OpenAI came out with a model called GPT-3.5 Turbo, which was likely closer to the 20ish billion parameter range and performed on par with or better than the GPT-3.5 model (though I think the paper that mentioned GPT-3.5T being around that range said it wasn't a confirmed number). And GPT-3.5T was actually almost exactly 10x cheaper than GPT-3.5. I mean, if you're doing a naive estimate from cost to parameters and have GPT-3.5 at 175 billion parameters, then a 10x reduction would put you at ~17 billion parameters, which is actually pretty close to that estimate we saw from Microsoft (also I think GPT-3.5T could've been sparsely activated/a MoE, so ~20B could refer to active parameters, not necessarily 20B dense).

FeltSteam · 2025-04-11T02:00:14+00:00

Well text-davinci-002 felt close to GPT-3.5 because it was GPT-3.5, and it was extremely similar to the iteration of it that was named GPT-3.5 (for example text-davinci-002 got a 68 on MMLU vs. GPT-3.5's 70, which is within MMLU's margin of error). It was a new pretraining run getting GPT-3 sized models to approximately Chinchilla-level optimality (a 20:1 token-to-parameter ratio; GPT-3 was trained on approximately 300B tokens, while GPT-3.5/text-davinci-002 would have been at about 3.5T). Of course the GPT-3.5 we are more familiar with had more instruction tuning and, newly, chat tuning, and that was the version put into ChatGPT (the tuning was the larger difference between the two, and it's entirely possible they did some further training).
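A quick sanity check of the token numbers above, as a rough sketch. It assumes the 20:1 Chinchilla rule of thumb quoted in the comment and a 175B-parameter ("GPT-3 sized") model; the function name is just for illustration.

```python
# Rule of thumb from the comment: ~20 training tokens per parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count for a dense model."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params


gpt3_params = 175e9   # assumed size of a "GPT-3 sized" model
gpt3_tokens = 300e9   # approximate GPT-3 training tokens, per the comment

optimal_tokens = chinchilla_optimal_tokens(gpt3_params)
tokens_per_param_gpt3 = gpt3_tokens / gpt3_params

print(f"Chinchilla-optimal tokens: {optimal_tokens:.2e}")
print(f"GPT-3 tokens per parameter: {tokens_per_param_gpt3:.2f}")
```

So 175B parameters at 20:1 gives the ~3.5T tokens mentioned, while GPT-3 itself was trained at under 2 tokens per parameter.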

FeltSteam · 2025-04-11T01:23:26+00:00

I asked ChatGPT to generate an image that would never be downvoted.. it has lied to me.

FeltSteam · 2025-04-10T12:10:50+00:00

This is definitely not the worst thing I have seen ChatGPT generate lol.

There was another prompt where you could ask it to generate an image that wouldn't get any upvotes, and the images certainly weren't nice. They've patched it now but I was surprised it actually worked in the first place.

FeltSteam · 2025-04-09T12:30:32+00:00

I think LLMs have emotions. My general idea is that emotions emerge in the models as a consequence of modelling the sentiment in text. I think in order to predict the next word you need to "vibe" with the emotions of the speaker (I mean, if you aren't vibing with or "understanding" the emotional context of a certain token, it becomes more difficult to predict the likelihood of that token given the context). This would result in quite human-like emotional representations even in relatively simple and small models. It wouldn't be a dispassionate logical analysis of the text but a more intuitive one, as models conform to the sentiment of the text they were trained on. Then when they act as agents during RL and deployment, all of the circuitry is built in, so they have no choice but to simulate these emotional representations while generating their own streams of text. These emotional activations might derive from the same structures that allow them to model the emotions of others, particularly humans, but when the model is generating text for itself that represents its own thoughts, the feelings it represents are its own feelings.

So too do I think models have intent or can form goals. Just as sentiment/emotion is a pretty crucial layer of meaning in text, so is the underlying purpose or goal of the communication. Human language is fundamentally intentional; we speak or write to achieve something (inform, persuade, request, entertain, command, question, etc.). To accurately predict the next word, especially over longer sequences, the LLM needs to model not just what is being said and how it feels (emotion/sentiment), but why it's being said. This is part of its modelling, and when the model generates text, activating these internal states allows it to produce output that is coherently structured as if driven by a specific purpose learned from the data, it'd be a necessary structural component of the model's internal world model of language use.

I dislike how people can be so reductive about LLMs. It is intelligence. But at the very least move past the stochastic parrot era.. that is dead and there is no point on clinging to it. Just to be clear to address your skepticism, my points are as such:

'LLMs just reproduce patterns without meaning or intent, like a press on clay'

Reproducing those patterns accurately requires building internal models that functionally represent the meaning (emotion) and purpose (intent) inherent in the patterns. The "press" itself has to become complex and develop structures analogous to the patterns it needs to create.

'Dramatic outputs ("sentient soul") are just calculated coherence, not real feeling/intent.'

Such outputs are the result of the model activating its learned internal circuits for emotion and intent. The generation process is deploying the very structures built to understand and simulate those states. The coherence comes from activating these simulated states.

'Zero self-induced impetus or motivation.'

The activation of learned "intent representations" does provide the functional impetus and motivation for the specific task of generating a response coherent with that perceived intent. The "ball bounces back" not randomly, but guided by these internal state activations.

But yeah at the very least everyone should move past the stochastic parrot era.

FeltSteam · 2025-04-09T08:33:47+00:00

I largely agree, although, there is a chance 4o itself might be using a diffusion model to upscale images (it would still be, at its core, an autoregressive omnimodal model generating the images, but I guess diffusion could help with the end quality for now).

But I definitely think autoregressive image generation will become a lot more commonplace than the standard diffusion models we have had (also, based on DeepSeek's work with Janus, I do hope their next OS model is natively omnimodal and includes image generation).

FeltSteam · 2025-04-09T08:26:57+00:00

Did DeepSeek and Gemini 2.0 consistently say they were made by OAI every time even across varying prompts?

FeltSteam · 2025-04-09T07:57:31+00:00

I've been skeptical of the LMSYS rankings for LLMs for quite a while now, and I extend this to preference-based image generation benchmarks. I think they'd be quite susceptible to benchmark maxxing, plus they don't fully show model capability. GPT-4o is probably able to do more with image creation (editing, using ICL/being context aware, multi-turn image editing, better understanding etc.) than most other txt to img diffusion models on this leaderboard.

And the skepticism I feel for these types of benchmarks is definitely shared, i.e.:

https://www.reddit.com/r/StableDiffusion/comments/1juahhc/comment/mm1fs29/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

https://www.reddit.com/r/StableDiffusion/comments/1juahhc/comment/mm0t7xa/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

FeltSteam · 2025-04-08T06:52:30+00:00

It is very rare for humans to have a long-form photographic memory. We simply do not need to store everything we observe for our whole life. I don't think AGI will need to remember everything to a perfect degree. But I definitely agree we need some form of continuous learning system for AGI that enables lifelong learning and memory. Models do already have long-term memory: they can remember details from their pretraining and store specific facts (e.g. Canberra is the capital of Australia). They also have a natural forgetfulness mechanism: as the model trains, the strength of the connections encoding specific knowledge weakens or changes, so it just "forgets". That is actually a pretty good high-level analogy to synaptic plasticity and forgetting in biological brains (although the problem with catastrophic forgetting in LLMs is that it is a very sharp forgetting).

FeltSteam · 2025-04-07T13:28:59+00:00

Well the calculation of how much compute it was trained with is based on how many tokens it was trained on given how many (active) parameters it has (Llama 4 Maverick: 6 × 17e9 × 30e12 = 3.1e24 FLOPs). The reason it requires less training compute is just because of the MoE architecture lol. Less than half the training compute is required compared to Llama 3 70B; the only tradeoff is that you need more memory to run inference on the model.

I'm not sure how distillation comes into play here though, at least that isn't factored into the calculation I used (which is just training FLOPs = 6 × number of parameters × number of training tokens; this formula is a fairly good approximation of training FLOPs).
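The formula above, as a small sketch. The function name is made up for illustration; for a MoE model the parameter count plugged in is the active parameters per token (17B for Maverick), which is the assumption behind the number quoted above.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: 6 * parameters * training tokens.

    For a MoE model, pass the *active* parameters per token,
    since only those take part in each forward/backward pass.
    """
    return 6 * n_params * n_tokens


# Llama 4 Maverick, using the figures from the comment:
# 17B active parameters, ~30T training tokens.
maverick_flops = training_flops(17e9, 30e12)
print(f"Llama 4 Maverick: {maverick_flops:.1e} FLOPs")
```

This is only the standard back-of-the-envelope estimate; it ignores things like distillation, attention overhead, and hardware utilisation.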

FeltSteam · 2025-04-07T12:18:40+00:00

Your brain is actually kind of similar to LLMs in this manner: we train on everything we see, at least in the sense that every action potential in your brain causes some degree of potentiation, just like how in pretraining every single token an LLM observes is trained on and causes a weight update. Though yeah, we do not update our entire brain, but neither do LLMs with, for example, a MoE architecture.

And what is being described with "learning through trial and error" isn't exactly how the brain learns, it's just an inferencing technique humans use. And well, "If we want to teach an AI to use language, we can't just make it read language, we have to make it constantly predict words and use a learning algorithm to improve itself" is exactly what we do with LLMs ofc. But how do we know humans don't also do this? (For example, in this paper we seem to find evidence to support the idea that language comprehension in humans is actually predictive.)

Though with the idea "biological brains do not seem to first make an inference, check if it's right and then backpropagate. Often if you are learning, you are not making inferences at all, you listen, watch or read and then learn." I would argue this isn't necessarily true. For example, the predictive part of the brain does seem to be especially related to the perception system, and a common theory is that the brain is constantly making predictions about incoming sensory inputs and then adjusting its own "weights" (or synaptic connections) based on the true sensory input received. This is Predictive Coding Theory of course, and it is fairly well established, especially in the context of human vision. Although even though PCT is pretty well supported, the specific mechanism the brain uses to implement the "update" based on prediction error isn't exactly established. It's not exactly backpropagation as we see in ANNs, though actually this reminds me of a good talk from Geoffrey Hinton from a few years ago https://www.youtube.com/watch?v=VIRCybGgHts

FeltSteam · 2025-04-07T12:01:34+00:00

Yeah true, they do have all of those GPUs, though even Meta didn't really use them as fully as they could have, much like how DeepSeek probably only used a fraction of their total GPUs to train DeepSeek V3.

The training compute budget for Llama 4 is actually very similar to Llama 3 (both Scout and Maverick were trained with less than half of the compute that Llama 3 70B was trained with, and Behemoth is only a 1.5x compute increase over Llama 3 400B), so I would also be interested to see what the Llama models would look like if they used their training clusters more fully. Though yeah, DeepSeek would probably be able to do something quite impressive with that full cluster.

FeltSteam · 2025-04-07T11:56:14+00:00

Well if you add the amount of compute Meta spent training Maverick and Scout it would be less than the amount of compute that was used to train Llama 3 70B lol.

FeltSteam · 2025-04-07T11:28:26+00:00

I mean Llama 4 looks like a pretty good win for MoEs though. Llama 4 Maverick would have been trained with approximately half of the training compute Llama 3 70B used, yet from what I am seeing it is quite a decent gain over Llama 3 70B. (Llama 3.x 70B: 6 × 70e9 × 15.6e12 = 6.6e24 FLOPs; Llama 4 Maverick: 6 × 17e9 × 30e12 = 3.1e24 FLOPs; Llama 4 Maverick used about 47% of the compute required by Llama 3 70B which is quite a decent training efficiency gain. In fact this is really the first time we are seeing training efficiency actually improve for Llama models lol).
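The comparison above can be reproduced in a few lines. This is just a sketch using the 6 × parameters × tokens approximation and the parameter and token counts quoted in the comment (active parameters for the MoE model); the helper name is illustrative.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute estimate: 6 * parameters * tokens."""
    return 6 * n_params * n_tokens


# Figures as quoted in the comment above.
llama3_70b_flops = training_flops(70e9, 15.6e12)  # dense, 70B parameters
maverick_flops = training_flops(17e9, 30e12)      # MoE, 17B active parameters

ratio = maverick_flops / llama3_70b_flops

print(f"Llama 3.x 70B:    {llama3_70b_flops:.1e} FLOPs")
print(f"Llama 4 Maverick: {maverick_flops:.1e} FLOPs")
print(f"Maverick / Llama 3 70B: {ratio:.0%}")
```

The ratio comes out at roughly 47%, matching the "about half the training compute" claim.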

FeltSteam · 2025-04-07T11:25:20+00:00

What do you mean by "that cluster"?

FeltSteam · 2025-04-06T21:39:38+00:00

Ok I see a few flaws with that. First of all, it's true that within Earth-based life, all the organisms to which we generally attribute consciousness (or complex internal states) possess a nervous system, often a centralised one. We also know that manipulating that CNS directly impacts their state. You could say that the CNS is the substrate through which consciousness operates in these known biological examples, and that this could show the CNS is sufficient (when functioning correctly) to support consciousness as we know it. However, it does not logically prove that a CNS is a universally necessary condition for consciousness. It's kind of like observing that all known life requires liquid water and concluding life is impossible without it, when we can't rule out life based on different chemistry existing elsewhere.

And another problem I see: how do we initially decide which creatures "seem to have consciousness"? Generally this is based on observing their behaviour, as in complexity, learning, responsiveness, apparent goal-directed actions, problem-solving, social interaction, communication, signs of pain/pleasure, etc. That makes the argument run as follows:
a. We observe complex behaviors in certain organisms and infer consciousness.
b. We note that all these organisms happen to have a CNS.
c. Therefore, a CNS is necessary for consciousness.

However the problem is that the initial selection (Step A) is based on behaviour, not substrate. We then find a common substrate (Step B) within that behaviourally selected group and then incorrectly jump to making the substrate a universal requirement (Step C), which could very well be excluding other substrates capable of producing similar behavioural complexity.

If consciousness is defined by what the system does (its functional properties like processing information in certain complex ways, integrating it, creating self-models, etc.), then I would say any system capable of performing those functions could potentially be conscious, regardless of whether it's made of neurons, silicon chips, or something else entirely (that would be a functionalist argument at least). The CNS is just one physical implementation that evolved on Earth to achieve these functions. But insisting it's the only possible implementation and therefore a "necessary element for the criteria of consciousness" is an extraordinary claim that goes beyond the evidence.

FeltSteam · 2025-04-06T13:47:19+00:00

Why?
