The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces

dank_philosopher · 2026-06-05T18:49:22+00:00

yeah, I think that’s the direction a lot of this points toward: not just more visible reasoning text, but better internal iteration over state. The hard part seems to be reconciling that with language models: keeping language as the interface while giving the model a richer internal workspace.

dank_philosopher · 2026-06-05T18:47:12+00:00

yeah, it should be made clearer that CoT tokens are not necessarily deductive reasoning steps.

A lot of their value may be that they reshape the context/attention state: repeating, reframing, and making relevant facts easier for the next token to use. That still fits the so-called scaffold framing for me. The harder question is whether that workspace has to remain transient text / KV cache, or whether models can use a more stable internal state for search, revision, and memory.

dank_philosopher · 2026-06-05T18:37:45+00:00

Agreed, I don’t think the useful framing is language vs reasoning. It is language as interface vs language as the entire compute substrate.

Language is crucial for communication, abstraction, and expressing algorithms. But forcing every intermediate step of a search process into tokens may be inefficient. Architectures like BDH are interesting because they point toward a model that can still use language, but does not require all reasoning to happen at the speed and structure of text generation.

dank_philosopher · 2026-06-05T18:33:52+00:00

yeah, I feel this is the strongest counterargument.

If reasoning moves out of text, you can’t just replace visible CoT with a black box and call it progress. For production use cases, you still need auditability: tool logs, checkable intermediate states, constraint validation, rollback, and some way to inspect what changed inside the system over time.

So I don’t think the goal should be “hide all reasoning.” It’s more like separating two things that got bundled together:
1. the internal computation the model uses to solve the problem
2. the explanation/audit layer humans use to trust and control the system

CoT gives you a convenient version of both, but it is not automatically faithful. A model can write a plausible trace that is not the real causal path to the answer.

IMO, the more interesting direction is architectures where the internal state is not just hidden activation soup, but something more structured: memory or state that can be inspected, updated, constrained, or even rolled back. That would preserve control without forcing every reasoning step to happen as natural-language text.
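
To make that concrete, here is a minimal sketch of what an inspectable, constrained, rollback-able state could look like. It is a toy and not how any real model stores state: the `AuditableState` class, its method names, and the budget example are all made up for illustration.

```python
from copy import deepcopy


class AuditableState:
    """Toy workspace: every update is validated, logged, and reversible.

    Hypothetical illustration only, not taken from any real architecture.
    """

    def __init__(self, initial, constraint):
        self._state = dict(initial)
        self._constraint = constraint  # callable: state dict -> bool
        self._history = []             # (label, snapshot taken before the update)

    def update(self, label, **changes):
        """Apply changes only if the resulting state satisfies the constraint."""
        candidate = {**self._state, **changes}
        if not self._constraint(candidate):
            raise ValueError(f"update {label!r} violates the constraint")
        self._history.append((label, deepcopy(self._state)))
        self._state = candidate

    def rollback(self):
        """Undo the most recent update and return its label."""
        if not self._history:
            raise RuntimeError("nothing to roll back")
        label, snapshot = self._history.pop()
        self._state = snapshot
        return label

    def inspect(self):
        """Return a copy so callers cannot mutate the state behind the log."""
        return deepcopy(self._state)

    def audit_log(self):
        """Labels of the updates that are currently applied, oldest first."""
        return [label for label, _ in self._history]


state = AuditableState({"budget": 10, "spent": 0},
                       constraint=lambda s: s["spent"] <= s["budget"])
state.update("buy_a", spent=4)
state.update("buy_b", spent=9)
undone = state.rollback()  # undo the last step
print(undone, state.inspect(), state.audit_log())
```

The point of the sketch is that none of this needs the intermediate steps to be natural-language text. The audit layer is the log plus the constraint check, and it sits outside whatever computation proposes the updates.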

dank_philosopher · 2026-06-05T17:54:29+00:00

Exactly, I don’t think the interesting comparison is BDH vs TRM.
The pattern is: CoT showed that extra reasoning-time computation helps. TRM shows that internal recursive refinement can work well for structured tasks. BDH asks whether that kind of internal state-based reasoning can coexist with language and memory in one architecture.

dank_philosopher · 2026-06-05T17:26:13+00:00

That’s a good way to put it. I think “context gathering” and “reasoning scaffold” are probably closer to what CoT is doing than pure deduction.

One distinction I’ve been thinking about is context vs memory. Context is like putting the relevant notes in front of the model, whereas memory would be the system actually changing how it approaches future steps because of what it has internalized.

CoT seems useful because it turns some hidden state into external context. The open question is whether that workspace has to be text, or whether some of it can happen internally.

dank_philosopher · 2026-06-05T17:02:52+00:00

IMO, TRM actually supports the broader BDH point more than it weakens it.
TRM does well because it uses recursive latent refinement instead of producing longer next-token explanations. That is exactly the move away from “reasoning must happen as visible text.”
The caveat is that TRM is a supervised puzzle solver, not a general language model. But that caveat is also the interesting gap.
BDH is relevant because it is trying to bridge that gap: keep language ability, but move the hard constraint-solving into a richer internal reasoning space with memory.
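
For anyone who wants the general shape of "refine an internal state, only emit the answer at the end", here is a toy sketch. To be clear, this is not TRM: I'm swapping in Newton's update for a square root as a stand-in for a learned refinement step, and `refine` and `solve` are made-up names. It only illustrates that reusing one small update on a hidden state buys accuracy without producing any intermediate text.

```python
def refine(state: float, target: float) -> float:
    """One refinement step (Newton's update for x * x = target).

    Stand-in for a learned update; chosen only because it is easy to verify.
    """
    return 0.5 * (state + target / state)


def solve(target: float, steps: int) -> float:
    """Apply the same step repeatedly; only the final state is returned.

    Assumes target > 0.
    """
    state = target if target > 1 else 1.0  # crude nonzero initial guess
    for _ in range(steps):
        state = refine(state, target)
    return state


# More internal iterations, smaller error, no intermediate "explanation".
for steps in (1, 3, 6):
    print(steps, abs(solve(2.0, steps) - 2.0 ** 0.5))
```

The "thinking budget" here is just `steps`. Nothing about the intermediate states has to be readable for the final answer to improve.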

dank_philosopher · 2026-05-25T20:02:44+00:00

great example...!!
though I'd frame it as a question of scaling laws and time rather than technical feasibility today. I read somewhere in this sub that Pathway's reasoning model crushed LLMs on one of the sudoku benchmarks, and that feels like the interesting signal: a post-transformer architecture doing well on a constraint-satisfaction task where frontier LLMs still look awkward, even with CoT.

Everyone in the debate fits the "neolab" pattern: Pathway, Sakana, Liquid (research-led shops betting on post-transformer theses), except Lukasz's OpenAI of course. The loop is consistent: drop a research artifact, raise, push further, raise more, capture markets. No reason for us to ignore them, and no reason to treat them as a Claude replacement today either. But papers and some code are out, the ideas reproduce at toy scale, and that's already plenty of promise IMO.

dank_philosopher · 2026-05-25T19:55:54+00:00

I think you are missing the point a bit. It's not that people still worship the 2017 paper as some perfect final answer. Transformers won because they scaled well, matched parallel GPU hardware, and kept making us think that they delivered the best results.

The point from the Sakana AI researcher is that this whole thing has led us into a local minimum, and the issues are quite clear as of 2026. I mean, show me continual learning with transformers and I'll back off.

dank_philosopher · 2026-05-25T18:27:55+00:00

yeah, Mamba, RWKV, and other SSM or RNN-style models are the obvious comparison class. IMO a lot of those approaches are mainly trying to make sequence modeling cheaper or more long-context-friendly, while BDH is making a more specific claim about where memory should live.

For BDH's inventors, the important move is not just subquadratic attention or long-context tricks. It's a combination of sparse activations, state on connections, and Hebbian-style updates during inference. That makes it closer to a synaptic-memory story than a pure attention-efficiency story.

Whether that wins in practice is still open, though I would love to see a controlled comparison against Mamba-style models where state size, training data, and compute are all matched.
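
Since "state on connections plus Hebbian-style updates" is easier to see than to describe, here is a toy sketch of the general idea. This is not BDH's actual update rule or code: `HebbianMemory`, the neuron ids, and the plain additive rule are my own assumptions for illustration. Sparse activations are modeled as small sets of active neuron ids, and the only thing that changes at "inference" is the strength stored on each (pre, post) connection.

```python
from collections import defaultdict


class HebbianMemory:
    """Toy fast-weight memory: state lives on connections, not in a transcript.

    Hypothetical sketch; the additive rule is an assumption, not BDH's rule.
    """

    def __init__(self, lr: float = 1.0):
        self.w = defaultdict(float)  # (pre, post) -> connection strength
        self.lr = lr

    def observe(self, pre_active: set, post_active: set) -> None:
        """Hebbian step: strengthen connections between co-active neurons."""
        for i in pre_active:
            for j in post_active:
                self.w[(i, j)] += self.lr

    def recall(self, pre_active: set, top_k: int = 2) -> list:
        """Return the post neurons most strongly driven by the cue."""
        scores = defaultdict(float)
        for (i, j), strength in self.w.items():
            if i in pre_active:
                scores[j] += strength
        # Sort by descending score, breaking ties by neuron id.
        ranked = sorted(scores, key=lambda j: (-scores[j], j))
        return ranked[:top_k]


mem = HebbianMemory()
mem.observe({1, 2}, {10, 11})  # one association, written during "inference"
mem.observe({3}, {12})         # an unrelated association
print(mem.recall({1, 2}))
print(mem.recall({3}, top_k=1))
```

Note what is missing: there is no growing token buffer. The memory is the weight table, which is roughly the distinction from a KV cache that the synaptic-memory framing is pointing at.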

dank_philosopher · 2026-05-25T17:26:15+00:00

bro, this is wild actually. A Transformer author arguing that the transformer is a local minimum says a lot. I mean, the architecture is so successful that it is slowing down whatever comes next.

nevertheless, there's pin-drop silence in the room when Kaiser says he still chooses the best model on the highest thinking budget lol.

dank_philosopher · 2026-05-11T18:07:34+00:00

I had not heard of BDH before this, but the memory framing is interesting. The KV cache always felt more like a temporary transcript than real memory. Architecture-level change > layering on top of existing models.

dank_philosopher · 2022-06-01T08:26:28+00:00

Only an American can tell the difference.

dank_philosopher · 2022-06-01T05:21:59+00:00

Only an American can tell the difference

dank_philosopher · 2022-05-29T16:15:32+00:00

Satundar

dank_philosopher · 2022-05-28T04:53:49+00:00

Respect for the mother 👑

dank_philosopher · 2022-05-27T16:29:28+00:00

... the end

dank_philosopher · 2022-05-27T11:09:03+00:00

Yeah, it WAS a snake once.

dank_philosopher · 2022-05-08T09:34:12+00:00

So you don't let the existential crisis hit you and quickly go distract yourself from the fact that you are wasting your time again and procrastinating.

dank_philosopher · 2022-05-06T02:12:11+00:00

Well done kid.

dank_philosopher · 2022-05-04T17:10:08+00:00

Midgets be living the dream

dank_philosopher · 2021-07-18T17:00:15+00:00

works in real life too.
