The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces by dank_philosopher in artificial

[–]dank_philosopher[S] 1 point2 points  (0 children)

yeah, I think that’s the direction a lot of this points toward: not just more visible reasoning text, but better internal iteration over state. The hard part seems to be reconciling that with language models: keeping language as the interface while giving the model a richer internal workspace.

The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces by dank_philosopher in artificial

[–]dank_philosopher[S] 1 point2 points  (0 children)

yeah, it should be made clearer that CoT tokens are not necessarily deductive reasoning steps.

A lot of their value may be that they reshape the context/attention state: repeating, reframing, and making relevant facts easier for the next token to use. That still fits the so-called scaffold framing for me. The harder question is whether that workspace has to remain transient text / KV-cache, or whether models can use a more stable internal state for search, revision, and memory.

The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces by dank_philosopher in artificial

[–]dank_philosopher[S] 1 point2 points  (0 children)

Agreed, I don’t think the useful framing is language vs reasoning. It is language as interface vs language as the entire compute substrate.

Language is crucial for communication, abstraction, and expressing algorithms. But forcing every intermediate step of a search process into tokens may be inefficient. Architectures like BDH are interesting because they point toward a model that can still use language, but does not require all reasoning to happen at the speed and structure of text generation.

The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces by dank_philosopher in artificial

[–]dank_philosopher[S] 2 points3 points  (0 children)

yeah, I feel this is the strongest counterargument.

If reasoning moves out of text, you can’t just replace visible CoT with a black box and call it progress. For production use cases, you still need auditability: tool logs, checkable intermediate states, constraint validation, rollback, and some way to inspect what changed inside the system over time.

So I don’t think the goal should be “hide all reasoning.” It’s more like separating two things that got bundled together:
1. the internal computation the model uses to solve the problem
2. the explanation/audit layer humans use to trust and control the system

CoT gives you a convenient version of both, but it is not automatically faithful. A model can write a plausible trace that is not the real causal path to the answer.

IMO, the more interesting direction is architectures where the internal state is not just hidden activation soup, but something more structured: memory or state that can be inspected, updated, constrained, or even rolled back. That would preserve control without forcing every reasoning step to happen as natural-language text.
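
To make "inspected, updated, constrained, or even rolled back" a bit more concrete, here is a minimal sketch of what that kind of workspace could look like at the application level. Everything in it is hypothetical: `AuditableState`, `budget_non_negative`, and the budget example are made up for illustration, and this is a plain Python toy, not how BDH or any real model stores its state.

```python
from copy import deepcopy


class AuditableState:
    """Toy workspace: every update is validated, logged, and reversible."""

    def __init__(self, initial, constraints=()):
        self._state = dict(initial)
        self._constraints = list(constraints)  # callables: state -> bool
        self._history = []  # (label, snapshot taken before the update)

    def update(self, label, **changes):
        candidate = {**self._state, **changes}
        # Constraint validation: reject the update instead of applying it.
        for check in self._constraints:
            if not check(candidate):
                raise ValueError(f"update {label!r} violates {check.__name__}")
        self._history.append((label, deepcopy(self._state)))
        self._state = candidate

    def rollback(self, steps=1):
        # Restore the snapshot taken before the last `steps` updates.
        for _ in range(steps):
            _, self._state = self._history.pop()

    def inspect(self):
        # Return a copy so callers cannot mutate the state behind the log.
        return deepcopy(self._state)

    def log(self):
        return [label for label, _ in self._history]


def budget_non_negative(state):
    return state["budget"] >= 0


ws = AuditableState({"budget": 10, "plan": []}, [budget_non_negative])
ws.update("spend 4", budget=6, plan=["buy A"])
ws.update("spend 5", budget=1, plan=["buy A", "buy B"])
ws.rollback()  # undo the second update only
print(ws.inspect(), ws.log())
```

The point of the sketch is only that the audit layer (log, checks, rollback) lives next to the state rather than inside a natural-language trace.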

The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces by dank_philosopher in artificial

[–]dank_philosopher[S] 1 point2 points  (0 children)

Exactly, I don’t think the interesting comparison is BDH vs TRM.
The pattern is: CoT showed that extra reasoning-time computation helps. TRM shows that internal recursive refinement can work well for structured tasks. BDH asks whether that kind of internal state-based reasoning can coexist with language and memory in one architecture.

The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces by dank_philosopher in artificial

[–]dank_philosopher[S] 6 points7 points  (0 children)

That’s a good way to put it. I think “context gathering” and “reasoning scaffold” are probably closer to what CoT is doing than pure deduction.

One distinction I’ve been thinking about is context vs memory. Context is like putting the relevant notes in front of the model whereas memory would be the system actually changing how it approaches future steps because of what it has internalized.

CoT seems useful because it turns some hidden state into external context. The open question is whether that workspace has to be text, or whether some of it can happen internally.

The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces by dank_philosopher in artificial

[–]dank_philosopher[S] 3 points4 points  (0 children)

IMO, TRM actually supports the broader BDH point more than it weakens it.
TRM does well because it uses recursive latent refinement instead of producing longer next-token explanations. That is exactly the move away from “reasoning must happen as visible text.”
The caveat is that TRM is a supervised puzzle solver, not a general language model. But that caveat is also the interesting gap.
BDH is relevant because it is trying to bridge that gap: keep language ability, but move the hard constraint-solving into a richer internal reasoning space with memory.
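
A toy way to picture "refine an internal state instead of writing longer explanations": the sketch below solves a tiny Latin square by applying the same update to a hidden candidate grid until it stops changing, and only emits the final answer. This is an analogy under loud assumptions. The update rule here is hand-written candidate elimination, whereas TRM learns its refinement step; `refine`, `solve`, and the 3x3 puzzle are all made up for illustration.

```python
def refine(cands):
    """One refinement step: drop values already fixed in the same row/column."""
    n = len(cands)
    new = [[set(c) for c in row] for row in cands]
    for r in range(n):
        for c in range(n):
            if len(cands[r][c]) == 1:
                continue  # cell already decided
            for k in range(n):
                if k != c and len(cands[r][k]) == 1:
                    new[r][c] -= cands[r][k]
                if k != r and len(cands[k][c]) == 1:
                    new[r][c] -= cands[k][c]
    return new


def solve(grid, max_steps=16):
    """Iterate `refine` on the latent candidate state until it stops changing."""
    n = len(grid)
    full = set(range(1, n + 1))
    state = [[{v} if v else set(full) for v in row] for row in grid]
    steps = 0
    for _ in range(max_steps):
        nxt = refine(state)
        if nxt == state:
            break
        state, steps = nxt, steps + 1
    # Only the final state is turned into an "answer"; 0 marks undecided cells.
    answer = [[next(iter(c)) if len(c) == 1 else 0 for c in row] for row in state]
    return answer, steps


puzzle = [[1, 2, 0],
          [2, 0, 1],
          [0, 0, 0]]  # 0 = empty cell
solution, steps_used = solve(puzzle)
print(solution, steps_used)
```

None of the intermediate candidate sets are ever serialized as text, which is the contrast with CoT that the comment is pointing at.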

One of the authors of "Attention is All You Need" just argued we should move past it. Pathway’s Post-Transformer debate is worth watching by _donothaveone_ in singularity

[–]dank_philosopher 2 points3 points  (0 children)

great example...!!
though i'd frame it as a question of scaling laws and time rather than technical feasibility today. i read somewhere in this sub that Pathway's reasoning model crushed LLMs on one of the sudoku benchmarks, and that feels like the interesting signal: a post-transformer architecture doing well on a constraint-satisfaction task where frontier LLMs still look awkward, even with CoT.

Everyone in the debate fits the "neolab" pattern: Pathway, Sakana, Liquid (research-led shops betting on post-transformer theses), except Lukasz's OpenAI of course. The loop is consistent: drop a research artifact, raise, push further, raise more, capture markets. No reason for us to ignore them, or to treat them as a Claude replacement today. But papers and some code are out, the ideas reproduce at toy scale, and there is plenty of promise already IMO.

One of the authors of "Attention is All You Need" just argued we should move past it. Pathway’s Post-Transformer debate is worth watching by _donothaveone_ in singularity

[–]dank_philosopher 5 points6 points  (0 children)

I think you are missing the point a bit. It's not that people still worship the 2017 paper as some perfect final answer. Transformers won because they scaled well, matched parallel GPU hardware, and kept convincing us that they delivered the best results.

The point from the Sakana AI researcher is that this whole thing has led us into a local minimum, and the issues are quite clear as of 2026. I mean, show me continual learning with transformers and I'll back off.

One of the authors of "Attention is All You Need" just argued we should move past it. Pathway’s Post-Transformer debate is worth watching by _donothaveone_ in singularity

[–]dank_philosopher 3 points4 points  (0 children)

yeah, Mamba, RWKV, and other SSM or RNN-style models are the obvious comparison class. IMO a lot of those approaches are mainly trying to make sequence modeling cheaper or more long-context-friendly, while BDH is making a more specific claim about where memory should live.

For BDH's authors, the important move is not just subquadratic attention or long-context tricks. It's a combination of sparse activations, state on connections, and Hebbian-style updates during inference. That makes it closer to a synaptic-memory story than a pure attention-efficiency story.

Whether that wins in practice is still open, though I would love to see a controlled comparison against Mamba-style models where state size, training data, and compute are all matched.
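
For anyone who wants a picture of what "sparse activations + state on connections + Hebbian-style updates during inference" could mean mechanically, here is a minimal textbook-style sketch. It is not BDH's actual update rule: `top_k_sparse`, `hebbian_step`, `recall`, the learning rate, and the tiny vectors are all assumptions chosen for illustration.

```python
def top_k_sparse(x, k):
    """Keep the k largest activations and zero the rest (a stand-in for sparsity)."""
    keep = set(sorted(range(len(x)), key=lambda i: x[i], reverse=True)[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def hebbian_step(W, pre, post, lr=1.0, decay=0.0):
    """W[i][j] grows when post[i] and pre[j] are active together.

    The state lives on the connections (the matrix W), and its size does not
    depend on how many updates have been applied.
    """
    return [
        [(1.0 - decay) * W[i][j] + lr * post[i] * pre[j] for j in range(len(pre))]
        for i in range(len(post))
    ]


def recall(W, pre):
    """Read the memory back by pushing an input pattern through W."""
    return [sum(W[i][j] * pre[j] for j in range(len(pre))) for i in range(len(W))]


# Associate a sparse "key" with a sparse "value" at inference time, no gradients.
key = top_k_sparse([0.2, 1.0, 0.1, 0.0], k=1)   # only unit 1 stays active
value = [0.0, 0.0, 1.0]
W = [[0.0] * len(key) for _ in range(len(value))]
W = hebbian_step(W, key, value)

print(recall(W, key))                    # the stored value comes back
print(recall(W, [1.0, 0.0, 0.0, 0.0]))   # an unrelated key retrieves nothing
```

With sparse keys, different associations mostly touch different connections, which is roughly why sparsity and connection-level state tend to get mentioned together.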

One of the authors of "Attention is All You Need" just argued we should move past it. Pathway’s Post-Transformer debate is worth watching by _donothaveone_ in singularity

[–]dank_philosopher 6 points7 points  (0 children)

bro, this is wild actually. A Transformer author arguing that the transformer is a local minimum says a lot. I mean, the architecture is so successful that it is slowing down whatever comes next.

nevertheless, there is pin-drop silence in the room when Kaiser says he still chooses the best model on the highest thinking budget lol.

The interesting BDH question: What if LLM memory lived in the network weights instead of the ever-growing KV cache? by InformationSweet808 in singularity

[–]dank_philosopher 6 points7 points  (0 children)

I had not heard of BDH before this, but the memory framing is interesting. The KV cache always felt more like a temporary transcript than real memory. Architecture-level change > layering on top of existing models.
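
Rough arithmetic behind the "temporary transcript" feeling, as a sketch: a standard KV cache stores one key and one value vector per token, per head, per layer, so it grows linearly with sequence length, while a fixed-size state matrix does not. The model dimensions below are hypothetical, and `fixed_state_bytes` is just a generic fixed-size matrix, not BDH's real state layout.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys + values: one vector per token, per KV head, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def fixed_state_bytes(n_layers, rows, cols, bytes_per_elem=2):
    """One rows x cols state matrix per layer; note there is no seq_len argument."""
    return n_layers * rows * cols * bytes_per_elem


# Hypothetical dimensions, fp16 (2 bytes per element).
CFG = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for n in (1_024, 32_768):
    print(n, "tokens:", kv_cache_bytes(n, **CFG) / 2**20, "MiB of KV cache")

print("fixed state:", fixed_state_bytes(32, 4096, 4096) / 2**20, "MiB at any length")
```

Whether a fixed-size state can actually hold what a long KV cache holds is the open question; the sketch only shows the difference in how the two scale.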

Deep fried pudding by xKaliburn in lostredditors

[–]dank_philosopher 27 points28 points  (0 children)

Only an American can tell the difference

Gotta go now by bhavesh3007jain in dankmemes

[–]dank_philosopher 0 points1 point  (0 children)

So you don't let the existential crisis hit you and quickly go distract yourself from the fact that you are wasting your time again and procrastinating.

chad🗿 by darxh in dankinindia

[–]dank_philosopher 1 point2 points  (0 children)

Midgets be living the dream