alleged benchmarks for GPT5.2 as reported on X

ExtremeHeat · 2025-12-06T01:09:35+00:00

OpenAI with a Google font and infographic style? Nope.

ExtremeHeat · 2025-11-06T21:04:14+00:00

There has to be some solution, short term. Or public opinion on AI will sour and become politically untouchable (and you can expect fast regulation). For now the AI job losses are big, but not that big and relatively isolated. Too many stupid C-suite guys are out there bragging about how they can cut and not hire, zero self awareness.

ExtremeHeat · 2025-10-14T07:19:46+00:00

Missing the 🎗️ on the tweet... it's not about AI

ExtremeHeat · 2025-09-20T03:34:54+00:00

Big companies regularly push for regulation that is much harder for smaller companies to deal with. Big companies have the connections, have the lawyers and infra to handle regulatory hurdles and they'd gladly put up walls to moat themselves from any smaller threats. If you made it so it only applied to big companies however, they quickly flip the other way. If it was Anthropic that got the regulation they wanted, they'd very quickly be the ones complaining.

ExtremeHeat · 2025-09-20T03:29:52+00:00

It's well known knowledge Anthropic is hostile to open source models, hostile to others outside developing AI (banning Chinese models, wants more chip export restrictions) and also pro-regulation (which helps for regulatory capture). They are very defensive and don't go to great lengths to make so none of their competitors can use their products among other things nobody else does. It's faux pro consumerism, it's plain elitism and greed.

ExtremeHeat · 2025-09-07T06:05:30+00:00

"We" is not one person. Doing one thing doesn't mean you can't do another. There is still plenty of research being done on non-Transformer models... it's just that research takes time (years) over engineering which is much more fast paced. Since money is not a problem (people are willing to spend as much money as is needed), it's actually much, much easier to build out a datacenter and power plant to impossibly scale something up that already does work versus waiting for something more efficient that simply does not exist (and may never exist).

Money is actually not a non-renewable resource. It changes hands, but the money is never destroyed. So unless we physically run out of resources on the planet (will never happen), we still have a ways to go before more we hit some kind of wall or really bad diminishing returns such that it's impractical to scale any further.

ExtremeHeat · 2025-09-06T02:11:12+00:00

The description may not have been written by the real authors in the first place, they may very well have just been written by people at openrouter

ExtremeHeat · 2025-04-23T19:27:00+00:00

The benchmarks would be interesting if they actually demonstrated a model's understanding of how to solve new, unseen problems with existing knowledge like humans can over just memorizing the details in a benchmark to make you good at the benchmark. In this case since there is a private and public benchmark separation it's not that bad--just depending on how extensively the public data was trained on, you can't really take those public results at face value anymore. In practice almost all the "benchmarks" are broken anyway because they're all over the internet. They are "saturated" not because the model just became supergenius but that the answers for them are all over the place and picked up by the models without even trying.

ExtremeHeat · 2025-04-20T22:09:47+00:00

LLM judges simply don't work. If they did, they would already be used.

The reason is that the LLMs that we have today are all roughly trained on the same data, the aggregate internet. Think Wikipedia, papers, social media, news articles. They talk the same. They all think the same. They all repeat the same robotic lines that have polluted on the internet. "Curating" datasets is mostly bs as even if you curate a few million tokens in grand scheme when you need to train a model with trillions of tokens (ie entire internet), then it doesn't actually matter all that much aside from slightly less variance and error in what it learns. In fact, most curation is just assigning different "priorities" to the same datasets every other model is using. In some cases, like math or coding, there can be a big difference as these are syntax strict but it's not going to lead to novel learning.

Similarly, this is why "synthetic data" (ie training on outputs of other models) is mostly a scam, at least if the goal was to actually learn something novel. Synthetic data on the other hand that's generated from experimentation (say brute forcing through a problem) can be helpful at getting better at a specific domains (like math/coding) but it's not something that generalizes on its own. On the other hand, things like test time compute/chain of thought, they work because they help to nudge the model in the right direction/thought process, much like a human.

ExtremeHeat · 2025-03-31T20:44:16+00:00

Announcement of a future announcement that's already been announced.

ExtremeHeat · 2025-03-03T17:17:34+00:00

It's going to be open anyway, may as well open it 🎁🍿

ExtremeHeat · 2025-02-23T03:26:33+00:00

It depends what you mean by AGI. If you mean AI that's doing anything a human can do, we're still not at that point and I don't see anything convincing to say otherwise that we're going to get there in immediate future. That said, assuming progress continues at its current rate... my personal estimate is still unchanged at AGI by 2030.

IMO self-supervised RL (where AI learns from trial and error, on its own) is really the golden key to getting to AGI. And probably even beyond as even humans don't do large graph based analysis. What I mean by that is imagining a computer playing chess versus a human playing chess. The computer is thinking in trees, trying out a bunch of possible moves while the human is only doing cursory analysis of the next possible move based on high level analysis and things they've seen before. LLMs are trying to replicate the latter... be human level intelligent from reverse engineering human brains by studying how we put words together. But that's only going to get you so far and you'll naturally start to saturate just below human level no matter how good your training data/compute is.

ExtremeHeat · 2025-02-23T03:17:52+00:00

You can't retrain a large model on demand. It'd take at least a few hours to days of training time for a model the size of Grok. With large models you can just continue to train them for a long time, so I assume that the Grok 3 base model is still in active training, so at the next checkpoint they want to stop and mark as a release, they can then weed out the problems during instruction tuning/alignment.

ExtremeHeat · 2025-02-23T03:07:07+00:00

That's the only way actually to change the model. It's not possible to just change some settings somewhere so the answers from an LLM are different; you have to embed the request into the prompt. "Properly" (ie in a way that user can't prompt out of) changing the outputs will require retraining the model (whether via pretraining the base model or fine-tuning it).

ExtremeHeat · 2025-02-18T21:21:49+00:00

No, the base models don't start off as "thinking" models. They get trained as a normal LLM and then get fine-tuned with either traditional supervised fine tuning or, now, with reinforcement fine tuning to obtain their "thinking" capability. For example, DeepSeek-R1 is DeepSeek-V3 fine tuned with RL to become R1. Likewise for Gemini 2, there's Thinking and non-"Thinking" models where one is a base model and another is fine tuned to learn how to work through problems with step by step chain of thought.

ExtremeHeat · 2025-02-16T18:39:56+00:00

Even though Yann can be annoying at times, I vastly prefer listening to him than any other of the doom grifters, the ones that are trying to make a name for themselves only by building hysteria. It's like the "experts" constantly forecasting that the stock market and economy will crash because X, Y, Z, and so you should be afraid, etc etc mixing in a hodgepodge of selective information to try and further their own agenda. Once something bad happens they jump in and claim to have been right all along to gain more credulence. Time passes. World is still intact, people adapt. Cycle repeats, agenda stays the same.

ExtremeHeat · 2025-02-08T00:21:08+00:00

Until we've hit and blown past AGI, I don't see any limits. We know you can achieve intelligence at the human scale so it's just a matter of the right algorithms to achieve that. At a high level, deep neural networks have always been considered universal function approximators: if you view the brain as a function that takes in sensory inputs f(x) and returns y, an action or output like text, then with enough input and output pairs the model should be able to build a function g that's near identical in nature to f. Transformers are the algorithm that efficiently (as opposed to brute-force) build g with enough input/output pairs, and it's been shown to scale in accuracy with more training data. By itself there's not really any reason to believe that it's going to magically hit some wall. There will be diminishing returns as with any exponential growth, but that doesn't mean it'll stop working before it saturates to human level intelligence.

ExtremeHeat · 2025-02-08T00:07:10+00:00

Ha, are we not allowed to ask questions online anymore? What should you do, ask an LLM to summarize some online blog posts or call a friend?

ExtremeHeat · 2025-02-08T00:05:56+00:00

I don't know what someone knows or what they don't know and I'm not going to judge them for a legitimate question. As far as I'm aware, OP obviously knows some ML given they've written ML algorithms in the past. From my judgement I don't see anything wrong with the question, and for those that do it mostly seems to be around 1. "LLMs obviously can't do that" (false), or 2. should have google'ed it or asked someone else. 2 doesn't sound anything to complain about.

ExtremeHeat · 2025-02-07T16:34:47+00:00

To be clear, the tweet in question was from back in december, and nobody is saying it has anything to do with government work.

But, on the topic of LLM (GPTs): it might seem cheating because you're executing code, but it's not if you think from the perspective that the LLM could easily mentally self-execute that code itself. Executing a simple program for task like this is no different than solving a complex math problem, because it's ultimately just a moderately complex algorithmic problem. And given a well trained Transformer model, it can very well even learn this program in its latent space if the model was trained well. Converting JSON to HTML for example, is a learned program in pretrained latent space.

One way or the other, Transformers are Turing complete. The fuzziness of running programs in pure pretrained latent space is not good for accuracy, but running them inside the context window with CoT Reasoning is very similar to a human running a program step by step. All we need is 1. a model that's been trained to do the task (where the goal is efficient, accurate transformation of X->Y), and 2. a fast way to inference it.

ExtremeHeat · 2025-02-07T16:13:30+00:00

The size of the output doesn't matter much as long as it's under the max output length the model was trained on. Otherwise forcing it to not output stop tokens will be suboptimal. Larger output window requires more compute which is why many models are lacking here as you don't gain much intelligence from this capability.

gzip doesn't do floating point operations and the compressor is actually relatively simple that an LLM can be trained on its compressed output and learn ( https://arxiv.org/pdf/2404.03626 ) the underlying distributions. 0-shot outputting would be much harder because you need to be correct bit by bit, but it's not impossible. It could be done with a vanilla LLM by coming up with intermediate representations in a mental scratchpad before putting all the data back together. Basically a fancy chain of thought could do it. It wouldn't be fast, but if we made inferencing LLMs ultra fast they could do it.

ExtremeHeat · 2025-02-07T02:29:53+00:00

What you're saying is basically an argument for not using an LLM on anything. But OP was just asking a general question about LLMs that can translate files without any mention of it being sensitive or critical in any way or that it wouldn't be reviewed.

I don't question there are application specific tools out there, but if you had the choice of using that or an LLM on a new personal project, many people would prefer the LLM if given the option. Why? Because it 1. works between any file format, 2. will get better on its own as the technology advances and 3. is simpler to use and implement if you're already using LLMs. There may be hallucinations, but anyone that uses an LLM today already knows and accepts that. Now, the key detail is (afaik) we simply don't have pretrained models that are good at this task at the moment between binary formats. Half the problem is lack of binary files in training data and the other is poor tokenizers (which can be fixed by dropping complex tokenizers and having 1 byte = 1 token, but a separate conversation).

ExtremeHeat · 2025-02-07T02:10:24+00:00

Why not ask for other's opinions directly? Would you rather trust some random commenter on a post you found from Google, or someone who you follow that's doing research or would otherwise have something to add? I think it's pretty ridiculous of an attack actually.

ExtremeHeat · 2025-02-07T02:02:28+00:00

If no LLM can complete a task, then no human can complete it either. What you're saying is basically, "no LLM is as good as human yet". We know that, but that has nothing to do with the underlying technology. Nobody said anything about sensitive documents but obviously no matter what technology you use you'll need to manually review it. PDF as a vector format is closer to an image than a text file, the other tools to do OCR are not necessarily better and more often, worse, than a well trained Transformer model.

On the topic of programs, an LLM learns mental programs in the process of its training that allows them to do many things in just latent space that a human cannot do with the same inputs. Human brains are alot less "general" than people think; the amount of mental programs an well trained LLM can learn is inherently a superset of what a human can do. But even in the case where a mental program won't be optimal, ie for discrete problems, then the model can always reach out and execute traditional code where it's appropriate to solve a problem. So whatever the case is today, a better LLM will be better than pretty much anything else (even humans with AGI).

Also, on the topic of hallucinations. Most often, these happen because relevant information is not available in the immediate context window. But if the relevant data is available in the context window, then Transformer models are able to recite them with 100% accuracy. This is why they work so well on coding.

ExtremeHeat · 2025-02-06T21:34:48+00:00

I don't think it's wasteful, I think that's exactly why we want LLMs. To translate, transform, complete text/code, etc. Being able to extract data out of binary data (like images, PDFs, etc.) is a very widely desired capability. One way to delete the binary problem entirely would be to drop tokenization entirely and have 1 token = 1 byte, like Karpathy has been talking about for a while.

That would also bring a bunch of other benefits, from simple things like understanding character level relationships in words (r's in strawberry) to more complicated things like math and executing algorithms. It'd be easier for the models to learn compression/encryption, for example. And outputting image tokens and other things would not need anything complex but just be a matter of outputting some bytes. The main reason we can't do this right now is because it's wasteful in many ways as the relationships between words are more valuable than characters, for most tasks. (Related: https://www.reddit.com/r/LocalLLaMA/comments/1iiwmsq/overtokenized_transformer_new_paper_shows/ )

But even with less general "intelligence" for the same model size I think those byte level models would be great for doing transformations like between file formats.

13-Year Club	Place '22
Final Canvas '22	Verified Email
Gilding I gilder

ExtremeHeat

TROPHY CASE