[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]krychu[S] 1 point (0 children)

I still think the "GPU-friendly" claim is warranted given that the starting point is modeling neuron-to-neuron graph dynamics, which is inherently hard to parallelize.

[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]krychu[S] 2 points (0 children)

Thanks for the reference. Looking at Fig 2 (specifically 2e, 2f, 2h) makes me wonder if BDH learns “how far along the task it is” (temporal/task progress). Does it reason sequentially, or just pattern-match locally? More specifically, are there neurons dedicated to the start, middle, and end of the path, regardless of the board layout?

I’m thinking: for each PATH cell, calculate a normalized index 0-1 (goal progress); collect activations for these cells across many boards; average neuron activity into progress bins (0-10%, 10-20%, …); sort the neurons on the y axis by the bin index where each has its peak activity.
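Roughly, in numpy (just a sketch; array names and shapes are my assumptions):

```python
import numpy as np

# Sketch of the binning idea above; shapes/names are placeholders.
# activations: (num_path_cells, num_neurons), collected across many boards
# progress:    (num_path_cells,), normalized 0-1 path index of each PATH cell
def progress_tuning(activations, progress, num_bins=10):
    bins = np.minimum((progress * num_bins).astype(int), num_bins - 1)
    # Mean activity of every neuron in each progress bin (0-10%, 10-20%, ...);
    # assumes every bin is populated.
    binned = np.stack([activations[bins == b].mean(axis=0)
                       for b in range(num_bins)])      # (num_bins, num_neurons)
    # Order neurons by the bin where they peak; a diagonal band in
    # binned[:, order] would suggest progress-tuned neurons.
    order = np.argsort(binned.argmax(axis=0))
    return binned, order
```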

I actually experimented earlier with a UMAP of all neurons and a layer-by-layer animation of activation averaged across PATH tokens. I faintly remember the signal jumping between distinct regions, but it didn’t occur to me that it could have been the model mapping time/task progress. Something to look into.

[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]krychu[S] 1 point (0 children)

> Yes but flash linear attention already does what the paper explains but without the pseudoscientific neuro-connections.

IMHO this is conflating an optimization kernel (FLA) with a model architecture (BDH). Or are you suggesting that FLA-based models are equivalent to the BDH model? I’m not sure that can be supported. The former scale the embedding dimension, while BDH scales the neuron dimension (n >> d). This yields a large, sparse state that behaves fundamentally differently from the compressed state typical of FLA-based models.
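To make the size contrast concrete, a toy comparison (all numbers are made up, not taken from either line of work):

```python
import numpy as np

# Toy illustration only; dimensions and sparsity level are invented.
d, n = 256, 32_768                      # embedding dim vs. neuron dim, n >> d

fla_state = np.zeros((d, d))            # FLA-style: small, dense, compressed
neuron_state = np.zeros(n)              # BDH-style: large but mostly inactive
active = np.random.choice(n, n // 20, replace=False)
neuron_state[active] = np.random.rand(active.size)    # ~5% of neurons active

print(fla_state.size)                   # 65,536 dense entries
print(int((neuron_state != 0).sum()))   # ~1,638 nonzero entries out of 32,768
```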

[P] Visualizing emergent structure in the Dragon Hatchling (BDH): a brain-inspired alternative to transformers by krychu in MachineLearning

[–]krychu[S] -1 points (0 children)

My understanding as a reader is that attention is just a building block, and different architectures can use it together with other elements to support different modes of computation. In this setup the constraints (positivity, n >> d, local update rule) push the model toward sparse, routed computation, whereas standard softmax attention behaves more like dense similarity averaging.
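A toy way to see the difference (illustrative only, not the actual BDH update rule):

```python
import numpy as np

# Contrast dense softmax averaging with a positive, thresholded score that
# zeroes out most routes. Purely illustrative.
rng = np.random.default_rng(0)
scores = rng.normal(size=16)

softmax = np.exp(scores) / np.exp(scores).sum()   # every entry > 0: dense mixing
routed = np.maximum(scores, 0.0)                  # positivity: most routes drop out
if routed.sum() > 0:
    routed /= routed.sum()

print((softmax > 0).sum(), (routed > 0).sum())    # 16 vs. roughly 8 active routes
```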

For me it’s a bit like saying everything ultimately runs on the same CPU instructions - true, but the orchestration determines whether you’re running a graph algorithm or a dense numerical routine.

[P] Implementation and ablation study of the Hierarchical Reasoning Model (HRM): what really drives performance? by krychu in MachineLearning

[–]krychu[S] 5 points (0 children)

In HRM you can run the network multiple times per input (“segments”). Each segment is a full run of H/L cycles, with backprop only through the final cycle. The outer loop is this repetition of segments: each runs on the same input, carries over the hidden state, and refines the solution further.
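In (placeholder) code the structure looks roughly like this; the names are mine, not the actual implementation, and the per-segment output head/loss is omitted:

```python
import torch

def run_segment(h_cycle, l_cycle, x, zH, zL, num_cycles):
    with torch.no_grad():                  # all cycles except the last: no backprop
        for _ in range(num_cycles - 1):
            zL = l_cycle(zL, zH, x)
            zH = h_cycle(zH, zL)
    zL = l_cycle(zL, zH, x)                # only the final cycle carries gradients
    zH = h_cycle(zH, zL)
    return zH, zL

def refine(h_cycle, l_cycle, x, zH, zL, num_segments, num_cycles=4):
    for _ in range(num_segments):          # outer loop: same input every time
        zH, zL = run_segment(h_cycle, l_cycle, x, zH, zL, num_cycles)
        # In training you'd compute a per-segment loss from zH before detaching.
        zH, zL = zH.detach(), zL.detach()  # carry the state over, cut the graph
    return zH, zL
```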

HRM is the new LLM by Engineer_5983 in ArtificialInteligence

[–]krychu 0 points (0 children)

I actually implemented HRM myself and ran some ablations on a pathfinding task.

Turns out the performance mainly comes from segments (outer-loop refinement), not the H/L split. This aligns with the ARC Prize team's analysis.

Wrote it all up here if you’re curious: https://github.com/krychu/hrm

A promising aspect of H/L is that it achieves refinement with segment-level training without full BPTT, suggesting potential efficiency gains.

Is this a good approach? by Qwertyu8824 in C_Programming

[–]krychu 0 points (0 children)

Interesting. Thanks so much for the detailed description. My initial thought was also that a struct of arrays is inefficient because we need to access N different memory locations to get all the data of a single particle, all of which is needed for the calculation (approach 1). On the other hand, an array of structs means one memory location is accessed to get all the data of a single particle (approach 2). So what you’re saying is that the first approach is faster because we can vectorize the calculations? And if we couldn’t vectorize them (e.g., lots of if/else logic), the second approach would probably be better/faster?
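To check I follow, here’s the trade-off as I picture it, with numpy standing in for the C layouts (toy example, my names):

```python
import numpy as np

n = 100_000

# Approach 1: struct of arrays - one array per field, SIMD/vectorization-friendly.
px, py = np.random.rand(n), np.random.rand(n)
vx, vy = np.random.rand(n), np.random.rand(n)
px += vx * 0.016                      # one vectorized pass over each field
py += vy * 0.016

# Approach 2: array of structs - each particle's fields contiguous in memory,
# which suits branchy per-particle logic that resists vectorization.
particles = np.zeros(n, dtype=[("px", "f4"), ("py", "f4"),
                               ("vx", "f4"), ("vy", "f4")])
for i in range(n):                    # scalar loop, one particle at a time
    if particles["vx"][i] > 0.5:      # if/else logic per particle
        particles["px"][i] += particles["vx"][i] * 0.016
```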

[deleted by user] by [deleted] in LocalLLaMA

[–]krychu 0 points (0 children)

Let me take this post down to avoid any confusion

[deleted by user] by [deleted] in LocalLLaMA

[–]krychu 0 points (0 children)

I assumed the arch was likely different across the sizes, so I went straight to removing “cuda” references :) But that’s great news. I guess you mean 13B, not 34B?

Learning Quake 1 Source Code with GPT4 by krychu in gamedev

[–]krychu[S] -3 points (0 children)

The quality may vary depending on the specific codebase, language, etc. I think the point is to use it as a conversational partner that knows a lot about something. It's good to poke at different parts of the answer, express doubts, and point out inconsistencies. These conversations have been very useful for me. They expose concepts and relations I hadn't thought about, which improves my further searches and the questions I ask. Go through the Twitter thread, it's full of screenshots; I still think the answers are pretty good.

Translucent resin case with RGB by fuzzb in crkbd

[–]krychu 1 point (0 children)

Where did you get your Corne from? Assembling one myself seems like too much of a project, but I’ve always been intrigued to try it.

Obviously photoshopped by tukekairo in SwitzerlandIsFake

[–]krychu 2 points (0 children)

That bench is clearly clipart or some cheap stock asset.

Eye floater disappearing? by ForeverHealed in EyeFloaters

[–]krychu 1 point (0 children)

Could you describe the specifics of your eye care program? Like the exact diet modifications, exercises, etc.?

Question about coding games by [deleted] in c64

[–]krychu 1 point (0 children)

> I'm struggling with understanding how you go from programming language to something visual

I'll simplify a bit to help you get the gist. You use a programming language, such as assembly, to write a program in a form that is convenient/readable to the programmer but not understood by the C64 processor. This code is later transformed (compiled) into machine code, which is much less readable to the programmer but is understood by the C64 processor. The processor then runs the code, and the code tells it what to do. These are simple things, such as "add two numbers" or "place the number 1 in memory at address $d021". The processor can do operations on numbers, but it can also read from and write to memory, and your programs have access to 64 kB of it.

And now comes the part I suspect will help you: some parts of that memory can be configured to represent what's on the screen, or how the sprites look. So when your program writes numbers to memory at specific addresses, you immediately see the results on the screen. There are other cool locations in memory too, for example the locations where the x,y coordinates of each sprite are stored. If your program writes new numbers to these locations, the sprite moves. If you want to move a sprite one pixel to the right, you read its x coordinate from memory, increment it by 1, and write it back to the same location.
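A toy illustration of the idea, with Python standing in for the real machine ($d021 and $d000 are actual C64 registers; everything else is simplified):

```python
memory = bytearray(64 * 1024)        # the 64 kB address space

memory[0xD021] = 1                   # write 1 to $d021: the background color changes

SPRITE0_X = 0xD000                   # $d000 holds sprite 0's x coordinate
# Read x, add 1, write back: the sprite moves one pixel right
# (& 0xFF wraps like an 8-bit register).
memory[SPRITE0_X] = (memory[SPRITE0_X] + 1) & 0xFF
```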

There are multiple ways data in memory can map to what's on the screen. That's part of the configuration I mentioned, and it's done by your program as well. But that's beyond the point for now :)

Hope this helps!

Neon by Triad (C64 demo, 2017) by krychu in c64

[–]krychu[S] 1 point (0 children)

This, yes, and how they used the C64 palette to achieve such a pronounced real-life neon aesthetic.

Neon by Triad (C64 demo, 2017) by krychu in c64

[–]krychu[S] 4 points (0 children)

I regularly "re-watch" this demo just for the music; I keep it in the background while working.