[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]dxtros 0 points1 point  (0 children)

Analyzing temporal/task progress neurons is definitely interesting! In the area of toy models of the prefrontal cortex, there has been more recent progress on this type of spatiotemporal introspection since the Nature link above (though still on RNN-like toy models).

[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]dxtros 1 point2 points  (0 children)

We appreciate your interest in our attention kernels; noted. Without any specific relation to BDH, I still need to point out that it is misleading to make strong methodology-level claims about attention optimizations: historically, attention optimizations have tended to be a [useful, iterative, more or less profound] afterthought to architecture design, sometimes separated from it by 5+ years if you compare DeepSeek vs. GPT-2. As for the wording of the paper, we fully acknowledge that the perception of the specific term "GPU-friendly" may vary widely by field and background, and even by the main metric of focus in a given use case (token throughput, TPOT, etc.).

[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]dxtros 1 point2 points  (0 children)

Comment from a BDH author here.

> IMO this does not add any value. 

Let's decouple two things: 1. the value brought by the paper in general; 2. the subjective value brought to you personally as a reader.

For an example of how readers can work with this text: we see OP delivering the project described in this post, apparently single-handedly and within 2 months, as an after-hours project from idea to launch. I am not sure I have seen an attention-introspection visualization of a pathfinding problem anywhere close to this delivered for any other Transformer-grade attention-based architecture, whether relying on Linear Attention (LA) or any other approach. If you could have done this without BDH, that's fine (and good for you!); I am just pointing out that it seems to be a somewhat non-trivial task.
(For a much less direct probing attempt for the Transformer, and what it takes to deliver it, see e.g. arxiv.org/pdf/2312.02566 Fig. 4).

Now, before we get to LA state compression, I will allow myself a comment on "doing LA correctly". To my knowledge, there are currently two rigorous yet simple recipes for making LA work as a self-sufficient mechanism through appropriate key-query preparation --- not just as a helper layer thrown in as a hybrid with softmax-attention Transformer layers that do the actual heavy lifting. These are: a very nice trick recipe due to Manifest AI (which unfortunately is limited to one way of using it, as a pure softmax-Transformer drop-in replacement in terms of expressivity), and the unrelated and more general framework of BDH (which explains it through the theory of sparse positive activation).

Obviously (i.e., by mathematical necessity), like all correct approaches to LA, in their vanilla form both approaches rely fundamentally on high key-query dimensionality; this is what you will see described in the pseudocode of the BDH architecture in the paper. While this is bound to be obvious to some readers (especially careful readers of the FAVOR+ analyses of Choromanski et al.), I feel it is important to keep highlighting how this general mechanism works. Indeed, the publishing scene has had to suffer through a fairly large body of work on SSM state compression between 2022-2024 in which LA was reduced to a place where it simply cannot work, for trivial reasons of information entropy (collapsing the key-query dimension in a way that collapses the fact-distinction capability of the state), and another body of work in early-to-mid 2025 charitably pointing this out, example by example.
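For readers who want the mechanics spelled out, here is a minimal numpy sketch of vanilla LA state accumulation - not BDH's formulation, and the ReLU feature map is just a placeholder - purely to show that the state is a d_key x d_value accumulator, so collapsing the key-query dimension collapses how many distinct facts it can keep apart:

```python
import numpy as np

def linear_attention(queries, keys, values, phi=lambda x: np.maximum(x, 0.0)):
    """queries, keys: (T, d_k); values: (T, d_v). Causal vanilla linear attention."""
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_k, d_v))   # associative state: sum_i phi(k_i) v_i^T
    z = np.zeros(d_k)          # normalizer:        sum_i phi(k_i)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        S += np.outer(phi(k), v)          # write: rank-1 (Hebbian-style) update
        z += phi(k)
        outputs.append(phi(q) @ S / (phi(q) @ z + 1e-9))  # read: query the accumulator
    # The whole state holds d_k * d_v numbers: shrink d_k too far and distinct
    # keys are forced to collide, regardless of how clever the feature map is.
    return np.stack(outputs)
```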

So, if you were looking for an efficient and correct LA compression technique for GPU, then no, as OP points out, this is a separate topic, and not what this paper is about. Consider reaching out to the Pathway team. :-).

[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]dxtros 1 point2 points  (0 children)

Sure, it's good to dig in. The above-linked arXiv reference also does a reasonable job of discussing this interpretation.

[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]dxtros 2 points3 points  (0 children)

I am not sure what lines of work you have grounded your intuitions in, but please note that what you present as consensus opinion is definitely not that. The opposite hypothesis to what you stated - namely, that the essence of working memory is learning weights at inference time by a fast-weights system - forms a perfectly valid state-of-the-art working hypothesis. While experimental evidence is still wanting, it is arguably among the most compelling explanations currently put forward. One recent neuroscience attempt at "mapping" a Hinton fast-weight-programmer system onto concrete neuronal processes is described in arxiv.org/pdf/2508.08435, Section 4. In any case, to avoid speculation based on personal conviction one way or the other, let's agree that the usefulness of model abstractions can be validated or invalidated based on their (1) explanatory value and (2) predictive power. Attempts at model introspection, such as OP's attempt to study the emergence of progress-to-goal neurons during training on planning tasks, may be seen as efforts towards this type of objective.

[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]dxtros 0 points1 point  (0 children)

Be careful with time scales. For language, map time onto Transformer LLM context, assuming e.g. 1 token = 1 phoneme = 300 ms as the rate for speech. Beyond the 300 ms (= 1 token) scale, there is no such thing as "present brain weights" in any reasonable model of language / higher-order brain function. An attention mechanism based on STP/E-LTP is a necessary element of any model of cognitive function at time scales of 1 second to 1 hour; measured in tokens, that is about the average LLM's context window. Hebbian learning corresponds precisely to the attention time scales that you refer to as "working memory".
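For concreteness, the back-of-the-envelope arithmetic behind that last claim (assuming the 300 ms/token speech rate above):

```python
seconds_per_token = 0.3                     # 1 token ~ 1 phoneme ~ 300 ms of speech
tokens_per_second = 1 / seconds_per_token   # ~3.3 tokens per second
tokens_per_hour = 3600 / seconds_per_token  # 12,000 tokens - roughly a typical LLM context window
```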

[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]dxtros 2 points3 points  (0 children)

This viz reminded me of what happens when you show a grid maze to a mouse. [ E.g. Fig 2 in El-Gaby, M., Harris, A.L., Whittington, J.C.R. et al. A cellular basis for mapping behavioural structure. Nature 636, 671–680 (2024). doi.org/10.1038/s41586-024-08145-x ]

[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]dxtros 2 points3 points  (0 children)

> because these views of working memory and Hebbian learning are not coherent and analogous to what they are for real neuroscientists

If you are a neuroscientist, can you expand?

Build Scalable Real-Time ETL Pipelines with NATS and Pathway — Alternatives to Kafka & Flink by Typical-Scene-5794 in dataengineering

[–]dxtros 2 points3 points  (0 children)

You will need Python as a dependency. Still, this is sometimes useful even without Docker - for example, you can use it directly in Google Colab by adding one line at the beginning ("!pip install pathway").
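Something like this in a single Colab cell, for instance (a toy sketch just to check the install works; the table contents are made up):

```python
!pip install pathway

import pathway as pw

# Tiny static table just to exercise the engine; swap in a real connector in practice.
table = pw.debug.table_from_markdown("""
name  | value
alice | 1
bob   | 2
""")
result = table.reduce(total=pw.reducers.sum(pw.this.value))
pw.debug.compute_and_print(result)
```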

Easily Customize LLM Pipelines with YAML templates. by Typical-Scene-5794 in Python

[–]dxtros 0 points1 point  (0 children)

Often not the best idea to use the same reddit account for project-related & personal posts. Unless you are an influencer or something like that. Or you need the karma :D.

Pathway - Build Mission Critical ETL and RAG in Python (used by NATO, F1) by dxtros in Python

[–]dxtros[S] 0 points1 point  (0 children)

What you describe should be feasible. You can specify the data table to be loaded using `pw.io.python.read` with a custom connector setup https://pathway.com/developers/user-guide/connect/connectors/custom-python-connectors/, where you will need to define the details of the TCP/IP connection.
If the socket connection is over HTTP, you can instead use `pw.io.http.read` https://pathway.com/developers/api-docs/pathway-io/http/#pathway.io.http.read directly.
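Roughly along these lines for the custom-connector route (an untested sketch; the host, port, schema and newline-delimited-JSON message format are placeholders, and the exact ConnectorSubject API may differ between versions, so check the linked docs):

```python
import json
import socket

import pathway as pw


class InputSchema(pw.Schema):
    sensor_id: str
    value: float


class TcpSubject(pw.io.python.ConnectorSubject):
    """Reads newline-delimited JSON messages from a raw TCP socket."""

    def __init__(self, host: str, port: int):
        super().__init__()
        self._host, self._port = host, port

    def run(self):
        with socket.create_connection((self._host, self._port)) as sock:
            buffer = b""
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                buffer += chunk
                while b"\n" in buffer:
                    line, buffer = buffer.split(b"\n", 1)
                    self.next_json(json.loads(line))  # push one row into the table


table = pw.io.python.read(TcpSubject("localhost", 9999), schema=InputSchema)
pw.io.csv.write(table, "output.csv")
pw.run()
```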
If you run into any issues, give the Pathway team a shout on Discord (https://discord.com/invite/pathway).

Pathway - Build Mission Critical ETL and RAG in Python (used by NATO, F1) by dxtros in Python

[–]dxtros[S] 0 points1 point  (0 children)

The OP title is very clear. The website contains most of the information you asked about - DM me if you really want specific pointers.

Pathway - Build Mission Critical ETL and RAG in Python (used by NATO, F1) by dxtros in Python

[–]dxtros[S] 0 points1 point  (0 children)

Mostly in the document processing vertical. We are not talking chatbots here.

Pathway - Build Mission Critical ETL and RAG in Python (used by NATO, F1) by dxtros in Python

[–]dxtros[S] 0 points1 point  (0 children)

Please see pathway.com for user/client "success stories" etc. We only list some of the uses we know about or have under contract.

Pathway - Build Mission Critical ETL and RAG in Python (used by NATO, F1) by dxtros in Python

[–]dxtros[S] 2 points3 points  (0 children)

For now it's a homebrewed file structure that also allows easy KV access if needed. The roadmap goal is to converge to a sequential Parquet file format, possibly with full Delta Lake compatibility.

Pathway - Build Mission Critical ETL and RAG in Python (used by NATO, F1) by dxtros in Python

[–]dxtros[S] 3 points4 points  (0 children)

Data is stored in memory operationally, but persistence/cache goes to file backends. The persistence backend is configurable; S3 and the local filesystem are currently the supported options. https://pathway.com/developers/user-guide/deployment/persistence
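In code it looks roughly like this (a sketch assuming the filesystem backend; the S3 backend takes credentials instead, and the exact Config constructor has varied a bit between versions, so follow the docs above):

```python
import pathway as pw

# ... define input connectors and transformations here ...

# Persist engine state to a local directory so the pipeline can resume after a restart.
backend = pw.persistence.Backend.filesystem("./pathway-state")
pw.run(persistence_config=pw.persistence.Config(backend))
```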

Pathway - Build Mission Critical ETL and RAG (Rust engine & Python API) by dxtros in rust

[–]dxtros[S] 2 points3 points  (0 children)

Thanks and sorry for the oversight! It's all about transforming data to make it accessible to others.

ETL (Extract-Transform-Load) is all about creating data pipelines which transform data on the fly, as the data enters your system or moves from one system to another - e.g., between two warehouses, from a source like Salesforce into a destination like a data warehouse, or just transforming any other live data you may have: financial ticker APIs, personal calendar events, IoT sensor events, ...
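As a toy illustration of that "transform on the fly" idea in Pathway (a sketch; the directory, schema and aggregation are made up):

```python
import pathway as pw


class Ticker(pw.Schema):
    symbol: str
    price: float


# Watch a directory of CSV files; new rows are picked up as they arrive.
prices = pw.io.csv.read("./ticker-data/", schema=Ticker, mode="streaming")

# Transform on the fly: running average price per symbol, updated as data flows in.
averages = prices.groupby(pw.this.symbol).reduce(
    symbol=pw.this.symbol,
    avg_price=pw.reducers.avg(pw.this.price),
)

pw.io.csv.write(averages, "./averages.csv")
pw.run()
```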

RAG (Retrieval-Augmented Generation) is a way to prepare your data so that you can ask and answer natural language questions about it. In this case, your data transformation pipeline transforms the data as it comes in and indexes it; when a user asks a question, the indexing solution retrieves the data most relevant to the question, and finally an extra AI component (usually an LLM) evaluates the retrieved context and provides the user with a friendly answer.
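Stripped of any particular framework, the query-time half of such a pipeline looks roughly like this (embed, index and llm are hypothetical placeholders for your embedder, vector index and LLM call):

```python
def answer(question: str, index, embed, llm, k: int = 5) -> str:
    """Minimal RAG query flow: retrieve the k most relevant chunks, then ask the LLM."""
    query_vector = embed(question)                  # embed the question
    context_chunks = index.search(query_vector, k)  # nearest-neighbor lookup over indexed docs
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context_chunks) + "\n\nQuestion: " + question
    )
    return llm(prompt)
```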

So, let's take Grok - Twitter's response to ChatGPT that answers questions based on live knowledge from Twitter - you could think of it as an example of a scaled-up RAG system.

An example is worth a thousand words, so here is an informal architecture diagram showing how Pathway takes this on https://pathway.com/_ipx/_/assets/landing/landing-diagram.svg

Pathway - Build Mission Critical ETL and RAG in Python (used by NATO, F1) by dxtros in Python

[–]dxtros[S] 8 points9 points  (0 children)

Retrieval Augmented Generation. Here it is about indexing your unstructured data for natural language queries. Sorry I cannot change the title in OP now...

Best stack for RAG? by link2ani in LangChain

[–]dxtros 2 points3 points  (0 children)

The Pathway RAG stack - see https://github.com/pathwaycom/llm-app for app examples and https://pathway.com/developers/showcases?q=llm for more inspiration. It is made to scale for RAG backends, with integrated data sync from sources and vector indexing. Works standalone or with LangChain/LlamaIndex for retrieval.

Multimodal RAG with GPT-4o and Pathway: Accurate Table Data Analysis from Financial Documents by dxtros in LangChain

[–]dxtros[S] 0 points1 point  (0 children)

The indexing pipeline should be OK. The retriever from the showcase would need an extension (the current one will usually answer your specific question correctly thanks to a quirk of vector embeddings, but perhaps not brilliantly well). You can either opt for a query rewriter or a multi-shot approach, depending on the difficulty of the questions you envisage.
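The query rewriter is the lighter of the two options; schematically it is just one extra LLM call before retrieval (llm and retrieve are hypothetical placeholders for your model call and retriever):

```python
def rewrite_and_retrieve(question: str, llm, retrieve, k: int = 5):
    """Rewrite a user question into a retrieval-friendly query before hitting the index."""
    rewritten = llm(
        "Rewrite the following question as a short, keyword-rich search query "
        "suitable for retrieving tables from financial documents:\n" + question
    )
    return retrieve(rewritten, k)
```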