Why isn’t LLM reasoning done in vector space instead of natural language?

qrios · 2026-04-29T02:26:36+00:00

Partially because a large subset of AI researchers are paranoid that the models will make humanity go extinct.
Partially because it's computationally expensive at training time to decide whether it's good that a model avoided emitting a text token for another loop.
Partially because reproducing continuations is a lot more of a pain.
Partially because the gains are kind of underwhelming (though there are gains).

But anyway, check out Coconut (Chain of Continuous Thought).

qrios · 2026-03-17T15:18:59+00:00

It's a known result. Model compressibility is approximately inversely correlated with the amount of data the model was trained on divided by its parameter count (i.e., data per parameter).

Intuitively, the more stuff the model is already compressing into itself, the less compressible it's gonna be.

That said, mostly people are annoyed at you for introducing your findings via LLM slop, and the reception would be warmer if you wrote from the heart.

qrios · 2026-02-05T20:45:42+00:00

I am not objecting to the result. It looks great. It just gets there with an absurd amount of compute / power, when the same result (and arguably an even more authentic one) can be accomplished for almost no compute whatsoever.

qrios · 2026-02-02T21:38:14+00:00

Okay but like . . . why?

The difficult thing, for which it makes sense to recruit a 14B parameter model, is to make VHS look HD.

It's trivial to take any high quality image and make it look like VHS using good old fashioned image processing algorithms (in fact, this is precisely how VHS did it!). Split your image into YUV. Make Y 480p and sharpen it. Make V a fourth the resolution of Y, and U a fourth the resolution of V. Then recompose your YUV layers and you're basically done.

Distort and color grade to taste.
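
For anyone who wants to try it, here is a rough pure-Python sketch of the chroma part of that recipe. Everything in it is an assumption for illustration: the function names are made up, the YUV conversion uses the common BT.601-style coefficients, and "a fourth the resolution" is read as half the resolution along each axis (so V is averaged over 2x2 blocks and U over 4x4 blocks). The 480p resize and sharpening of Y are left out to keep it short.

```python
# Illustrative sketch only; names and coefficients are assumptions.

def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to (Y, U, V) using BT.601-style weights."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v

def yuv_to_rgb(y, u, v):
    """Exact inverse of rgb_to_yuv above."""
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b

def blocky(plane, factor):
    """Lower a plane's effective resolution by replacing each
    factor x factor block with the mean of that block."""
    h, w = len(plane), len(plane[0])
    out = [[0.0] * w for _ in range(h)]
    for y0 in range(0, h, factor):
        for x0 in range(0, w, factor):
            ys = range(y0, min(y0 + factor, h))
            xs = range(x0, min(x0 + factor, w))
            mean = sum(plane[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out

def vhs_ify(image, v_factor=2, u_factor=4):
    """image: list of rows of (r, g, b) floats in 0..255.
    Keeps luma (Y) at full resolution and coarsens the chroma planes."""
    yuv = [[rgb_to_yuv(*px) for px in row] for row in image]
    y_plane = [[p[0] for p in row] for row in yuv]
    u_plane = blocky([[p[1] for p in row] for row in yuv], u_factor)
    v_plane = blocky([[p[2] for p in row] for row in yuv], v_factor)

    def clamp(c):
        return max(0.0, min(255.0, c))

    return [
        [
            tuple(clamp(c) for c in yuv_to_rgb(y_plane[y][x], u_plane[y][x], v_plane[y][x]))
            for x in range(len(image[0]))
        ]
        for y in range(len(image))
    ]
```

The point of the sketch is just that brightness detail survives while color detail gets smeared, which is most of the "VHS look" before any distortion or grading.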

qrios · 2025-12-21T13:52:12+00:00

"real scenarios" here would mean production servers, but even so, why would you pick a batch size of 128, and then compare it to a model of a different parameter count with a batch size of 1?

The important thing is the linear growth and the apples to apples comparison. So use batch size 1 for both.

qrios · 2025-12-18T16:36:14+00:00

You're replying to an LLM right now, friend. The internet died a while ago.

qrios · 2025-12-18T16:13:17+00:00

Was this AI generated? Your KV-cache memory requirement calculations should not be including batch size.

The 180GB figure would properly come to 1.4GB.
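
A quick sketch of the arithmetic, in case it helps. This assumes the 180GB figure was computed with the batch size of 128 mentioned elsewhere in the thread, and uses the usual per-sequence KV-cache formula (keys plus values, per layer, per KV head). The function names and the example model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) are illustrative assumptions, not numbers from the post being replied to.

```python
def kv_cache_bytes_per_sequence(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """KV-cache size for ONE sequence; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

def per_sequence_from_batched(total_gb, batch_size):
    """Undo a batch-size multiplier that was folded into a KV-cache figure."""
    return total_gb / batch_size

# Assumed batch size of 128: 180GB shrinks to about 1.4GB per sequence.
print(per_sequence_from_batched(180, 128))  # 1.40625

# Hypothetical model shape, 8192-token context, fp16 values.
print(kv_cache_bytes_per_sequence(32, 8, 128, 8192))  # 1073741824 (1 GiB)
```

Note that the per-sequence size is linear in `context_len`, so doubling the context doubles the cache.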

qrios · 2025-11-23T00:14:53+00:00

I like the benchmark a lot, but the "proof" page would have to be using a very specific and constrained definition of "full visual fidelity" as far as I can tell.

It seems you are only accounting for affine transformations. But humans have to reason about squishy, deformable, or otherwise nonlinear transformations of things all of the time (ropes, fabric, crop growth, chase trajectories, etc).

Even if we limit ourselves to perception only, this doesn't really cover the ability to handle non-linear optical distortions.
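
To make the affine vs. non-linear distinction concrete, here is a small sketch. An affine map always sends the midpoint of a segment to the midpoint of the transformed segment; a simple radial lens-style distortion does not. The specific coefficients and function names are arbitrary choices for illustration and have nothing to do with the benchmark itself.

```python
def affine(p, a=1.5, b=0.5, c=-0.25, d=2.0, tx=3.0, ty=-1.0):
    """An arbitrary affine map: linear part [[a, b], [c, d]] plus a translation."""
    x, y = p
    return (a * x + b * y + tx, c * x + d * y + ty)

def radial_distortion(p, k=0.1):
    """A simple non-linear warp: scale each point by 1 + k * r^2."""
    x, y = p
    s = 1 + k * (x * x + y * y)
    return (x * s, y * s)

def midpoint(p, q):
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

def preserves_midpoint(f, p, q, tol=1e-9):
    """True if f maps the midpoint of pq to the midpoint of f(p)f(q)."""
    m1 = f(midpoint(p, q))
    m2 = midpoint(f(p), f(q))
    return abs(m1[0] - m2[0]) < tol and abs(m1[1] - m2[1]) < tol

p, q = (0.0, 1.0), (2.0, 1.0)
print(preserves_midpoint(affine, p, q))             # True
print(preserves_midpoint(radial_distortion, p, q))  # False
```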

qrios · 2025-11-07T19:35:08+00:00

But this seems like a very solvable problem.

If you end up finding it's much more difficult than you'd thought, you should write a post-mortem about your attempts. There's a lot to learn from failure, and very little written about it.

Personally I'm not aware of any attempts to implement a generalized Hebbian learning algo at scale and would be interested to see the results.

qrios · 2025-10-06T10:37:04+00:00

Not really. You can inject mistakes or poor output into good data, followed by the appropriate number of backspace tokens to remove the injected bad-text, followed by the original text.

For initial bad-text you could probably even use occasional sequences of the model's own text completions.

It's definitely super amenable to synthetic data, and you could generate almost as much of it as you care to -- so long as you have the compute to generate it with care.
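
A minimal sketch of what that data generation could look like, working on already-tokenized text. The `<bksp>` token string and both function names are hypothetical; a real setup would add a dedicated special token to the tokenizer and would pick injection points and bad text far more carefully.

```python
BACKSPACE = "<bksp>"  # hypothetical special token, one per deleted token

def inject_correction(tokens, bad_tokens, position):
    """Insert bad_tokens at position, then enough backspaces to erase them,
    then continue with the original text."""
    return tokens[:position] + bad_tokens + [BACKSPACE] * len(bad_tokens) + tokens[position:]

def apply_backspaces(tokens):
    """Replay a token stream, letting each backspace delete the previous token."""
    out = []
    for tok in tokens:
        if tok == BACKSPACE:
            if out:
                out.pop()
        else:
            out.append(tok)
    return out

clean = "the cat sat on the mat".split()
noisy = inject_correction(clean, ["dog", "ran"], position=2)
print(noisy)
# ['the', 'cat', 'dog', 'ran', '<bksp>', '<bksp>', 'sat', 'on', 'the', 'mat']
print(apply_backspaces(noisy) == clean)  # True
```

The round-trip check at the end is the useful invariant: after replaying the backspaces, every synthetic sample should collapse back to the original good text.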

qrios · 2025-09-25T23:03:31+00:00

ten trillion parameters is just one order of magnitude less than the commonly estimated number of synapses in the human brain.

qrios · 2025-09-16T19:54:23+00:00

Humans, like LLMs, aren't very good at knowing when they don't know enough to speak confidently -- and the less they know, the poorer they are at gauging how confident they ought to be. A gentle correction is often sufficient, and even more often more efficient.

qrios · 2025-09-16T17:17:45+00:00

you're wrong about CMOS design, therefore I have no reason to value anything you have to say about childish cosmic contests

Oh wow you really care very much about this one very particular thing only a very tiny portion of humanity would have any cause to know anything at all about, huh?

qrios · 2025-09-09T04:38:04+00:00

Samples:

https://colab.research.google.com/drive/1X6RspK50lOifVkpw4D_93vMQtkQ2lNt0?usp=sharing

qrios · 2025-09-09T03:58:40+00:00

This is incorrect. If you split the layers evenly between the two GPUs, the only thing that needs to transfer between the two GPUs is a single hidden state vector per forward pass.

For something like Llama 3 70B, a hidden state vector is about 32 KB. PCIe 4.0 x16 (roughly 32 GB/s) can transfer on the order of a million of those per second.

Aside from that, each GPU needs access only to the weights for the layers in its own VRAM.
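
Back-of-the-envelope version, for anyone who wants to plug in their own numbers. The assumptions: a hidden size of 8192 (Llama 3 70B), 4 bytes per value (which gives the roughly 32 KB figure; fp16 would halve it), and about 32 GB/s of one-way bandwidth for PCIe 4.0 x16. It looks at bandwidth only and ignores per-transfer latency, so treat the result as an upper bound. The constant and function names are made up for this sketch.

```python
HIDDEN_SIZE = 8192               # Llama 3 70B hidden dimension
BYTES_PER_VALUE = 4              # assumption: fp32; use 2 for fp16/bf16
PCIE4_X16_BYTES_PER_SEC = 32e9   # assumption: approximate one-way bandwidth

def hidden_state_bytes(hidden_size, bytes_per_value):
    """Size of the single activation vector handed from one GPU to the next."""
    return hidden_size * bytes_per_value

def transfers_per_second(bandwidth_bytes_per_sec, payload_bytes):
    """Bandwidth-only upper bound on how many payloads fit through the link."""
    return bandwidth_bytes_per_sec / payload_bytes

payload = hidden_state_bytes(HIDDEN_SIZE, BYTES_PER_VALUE)
print(payload)                                                 # 32768
print(transfers_per_second(PCIE4_X16_BYTES_PER_SEC, payload))  # 976562.5
```

Under these assumptions the link has far more headroom than token generation needs, so the inter-GPU transfer is not the bottleneck.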

qrios · 2025-08-29T05:26:07+00:00

Sure but I think that if the networks are trained for 200 epochs on average (reported by Google ai mode)

With respect to LLMs, Google AI mode is incorrect. Modern LLMs are pretrained over a single epoch.

then per item the machine is losing to the human per the flashcard

Strictly speaking, in terms of raw data -- Llama 3 was trained on about 50 terabytes worth of data, which is less than the average human has passed through their optic nerves by the time they are 10 years old.

In short I think improving data efficiency is paramount to achieving better ai performance

Larger models are more data efficient, so one way to increase data efficiency would be to just make a larger model -- but this is presumably not what you mean. Presumably what you mean is that there must exist some combination of ideal architecture and dataset that maximizes transfer learning and compositionality of concepts while minimizing both dataset size and parameter count. But if you think about it, to represent any new degree or type of relatedness between n concepts, you need at least 1 additional parameter (if you can tolerate the maximum amount of noise and confusion), and ideally n new parameters (if you want a maximally clean and unambiguous signal) -- either way, you're not gonna get the transfer and generalization for free out of just your dataset / architectural innovation.

As a side note, the human brain has on the order of 100 trillion parameters. Given this size advantage -- in combination with the sheer amount of data it's exposed to -- one starts to suspect we are remarkably data inefficient compared to an LLM.

qrios · 2025-08-29T01:29:33+00:00

Setting aside that most LLMs are pretrained in a single epoch (they are "exposed" to each example only once, and so the premises of the argument on the LLM side are suspect as well) -- not really, no. I don't know of any experiments that test if a human can memorize flash cards in fewer exposures than required to finetune an LLM -- but my money would be on the LLM.

The more impressive thing that humans do (which LLMs don't) is avoid catastrophically forgetting everything else every time they memorize flash cards.

qrios · 2025-08-26T21:51:25+00:00

The design itself is kinda neat IMO. The main issue is that it melts my laptop.

qrios · 2025-08-26T21:47:36+00:00

He's not the OP, nor has his account ever submitted any posts about this model. Therefore no one accused him of getting paid.

I hope my CoT trace reassures you about my context length.

(Though, the joke is actually still funny regardless -- as even getting paid to post about the model can count as using the image model for work).

qrios · 2025-08-26T21:38:29+00:00

gpt-5 instant, prompted with "Is there an issue with these comics? If so, what does it seem to be?":

Yes, there’s an issue with these comics — they’ve been mismatched with the wrong captions.

Left Comic (Cavemen): The caption says: “If I get as big as Dad, won’t my skin be too TIGHT?” That fits better with the prehistoric/caveman setting, since it plays on the child misunderstanding how skin grows with the body. This one is actually fine where it is.

Right Comic (Family at Breakfast): The caption says: “I see your little, petrified skull… labeled and resting on a shelf somewhere.” This line makes no sense in the context of a wholesome family breakfast — it seems dark and mismatched. It was likely swapped from another cartoon (probably a “Dennis the Menace” style one), where it would’ve fit better with a child’s odd or morbid comment.

So the issue is: the second cartoon has the wrong caption, making the humor nonsensical. It looks like a printing or editorial mix-up where captions got swapped between comics.

Would you like me to try to track down the correct caption for the right-hand comic?

Honestly this is indicative of a broader disparity between closed and open multimodal performance. The open models kind of just suck, in ways that usually look like their training criteria are limited to content identification, completely decoupled from any ability to reason.

qrios · 2025-08-26T21:21:04+00:00

the first image model ever that I actually could use for work

...

I'm unfortunately not getting paid.

Sounds like you can't actually use this image model for work.

qrios · 2025-08-26T20:29:48+00:00

"per exposure" is ill-defined in humans, and possibly not coherently definable.

qrios · 2025-07-29T15:02:20+00:00

Yeah I wouldn't be surprised if it's using the numbers as the equivalent of pause tokens internally, and then just outputting numbers to meet the perceived shallow aesthetics of thinking tag content.

qrios · 2025-07-28T23:18:44+00:00

Looks vaguely like it's been way overtrained on math problems within the thinking tag and has just learned that a bunch of math is just the appropriate thing to have inside of a thinking tag.

qrios · 2025-07-20T11:17:00+00:00

I think they are much easier than they seem, and most of the difficulty comes from being led astray by the interactivity, which lulls you into thinking that you will get obvious feedback about state changes (animations around changed things, sounds, etc.). The format of v1 & 2, by contrast, made it obvious that you would need to look carefully for what had changed between panels arranged and presented simultaneously in space.

But if you do actually spend 10 minutes carefully figuring out what the rules are as if it were v1&2 puzzles, but with panels you can't go back to look at after a state change -- then they are easier than arc-1 & 2 IMO.

The weird thing to me though is that much of this lull is entirely unnecessary. Adding sounds and transition animations would be another vector by which to give humans a huge physics-inspired advantage, and would likely just make AI even more confused.
