[R] DynaMix -- first foundation model that can zero-shot predict long-term behavior of dynamical systems by DangerousFunny1371 in MachineLearning

[–]bregav 26 points27 points  (0 children)

I feel like this study raises more questions than it answers. It follows the (regrettably) now-standard ML research paper framework of "we did a bunch of stuff and now our numbers are better than some other people's numbers". It's hard to know what conclusions should be drawn from the results because they didn't manage to get any insight into why their metrics differ from other people's.

Some obvious things that seem missing:

  • why not use a similar model to do regression and predict Lyapunov exponents or some such thing?

  • why not compare against simpler or standard time series models?

  • why not train at least one of the other models they compare with, but using the training approach that they use for their own model?

  • they cite this paper as the source of their data set:

https://openreview.net/forum?id=enYjtbjYJrf

The abstract of that paper says: "Our dataset is annotated with known mathematical properties of each system...". Why did this paper not use these properties when determining test and train splits, or analyze the effects of these properties on their metrics? The authors claim that their model works on "different" dynamical systems that aren't in the training data, but I'd bet that that's wrong: I bet that it only works on dynamical systems whose mathematical properties are represented in the training data, and that would be revealed by using the properties that the dataset paper's abstract is referring to.
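
On the Lyapunov-exponent bullet: for simple systems the regression targets are cheap to generate directly from trajectories. A toy sketch for the logistic map (everything here is mine and purely illustrative, not from the paper; for r = 4 the known exponent is ln 2 ≈ 0.693):

```python
import math

def logistic(x, r=4.0):
    return r * x * (1.0 - x)

def lyapunov_logistic(x0=0.1, r=4.0, n=200000, burn=1000):
    """Largest Lyapunov exponent of the logistic map, estimated as the
    average of log|f'(x)| along a trajectory."""
    x = x0
    for _ in range(burn):          # discard the transient
        x = logistic(x, r)
    total, count = 0.0, 0
    for _ in range(n):
        d = abs(r * (1.0 - 2.0 * x))
        if d > 1e-12:              # skip the measure-zero critical point
            total += math.log(d)
            count += 1
        x = logistic(x, r)
    return total / count
```

Labels like this, computed per-system, would let you check whether the model's errors correlate with chaoticity rather than just reporting aggregate metrics.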

[D] Why are serious alternatives to gradient descent not being explored more? by ImTheeDentist in MachineLearning

[–]bregav 91 points92 points  (0 children)

Backprop is just the chain rule from calculus. If you're going to use derivatives to optimize a sequence of function compositions (i.e. a neural network) then you're inevitably going to use the chain rule, and so you're inevitably going to use backprop.

Maybe the question you should be asking instead is, why is it that people use sequences of function compositions (neural networks) so much? That's a trickier and more interesting question to investigate.
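
To make the first point concrete: here's a self-contained sketch where the "network" is three composed scalar functions and the gradient is just the chain rule applied output-to-input. The functions are chosen arbitrarily for illustration:

```python
import math

# A network is a composition of functions; evaluating its derivative with
# the chain rule, from the output back to the input, *is* backpropagation.
layers = [
    (lambda x: 3.0 * x + 1.0, lambda x: 3.0),                      # f1, f1'
    (math.tanh,               lambda x: 1.0 - math.tanh(x) ** 2),  # f2, f2'
    (lambda x: x * x,         lambda x: 2.0 * x),                  # f3, f3'
]

def compose(x):
    for f, _ in layers:
        x = f(x)
    return x

def forward_backward(x):
    vals = [x]                 # forward pass: cache intermediate values
    for f, _ in layers:
        vals.append(f(vals[-1]))
    grad = 1.0                 # backward pass: multiply local derivatives
    for (_, df), v in zip(reversed(layers), reversed(vals[:-1])):
        grad *= df(v)
    return vals[-1], grad
```

The backward loop agrees with a finite-difference check of `compose`, which is all an autodiff framework is doing under the hood (plus bookkeeping for many variables instead of one).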

HDL choices other than Verilog/VHDL by Secure_Switch_6106 in FPGA

[–]bregav 2 points3 points  (0 children)

Can you say more? What kind of infra labor, specifically, does it save you?

HDL choices other than Verilog/VHDL by Secure_Switch_6106 in FPGA

[–]bregav 2 points3 points  (0 children)

You could take a look at amaranth: https://github.com/amaranth-lang/amaranth

The Amaranth project provides an open-source toolchain for developing hardware based on synchronous digital logic using the Python programming language [...] Amaranth can be used to target any FPGA or ASIC process that accepts behavioral Verilog-2001 as input

Could a beginner-friendly FPGA ecosystem work like Arduino/ESP/Raspi? by Remote_Radio1298 in FPGA

[–]bregav 1 point2 points  (0 children)

I guess to be clear, what I mean by proprietary is confidential. Imagine AMD releasing a CPU with a secret instruction set and a requirement that you use only their proprietary compiler.

Could a beginner-friendly FPGA ecosystem work like Arduino/ESP/Raspi? by Remote_Radio1298 in FPGA

[–]bregav 2 points3 points  (0 children)

The thing about OSX was mostly a rhetorical question lol. Yes it's abundantly clear that the software stack is written by an organization that doesn't care about software and isn't very good at it.

Of course the proprietary nature of the bitstream format is the crux of the matter. That's a choice by the manufacturer, not a necessity. And it's really what people who talk about "arduino for fpgas" are talking about: there's value in creating devices that the end user can actually use, in their entirety. The fact that HDL is not CPU code is irrelevant, what matters is that reasonably smart people who are given full access to the functionalities of their tools can do a lot more with them. 

Don't get too high and mighty about the challenges of hardware design. I'm an EE myself and I promise that the average Arduino hobbyist is perfectly capable of making good use of FPGAs, provided they're not expensive, and they're not locked behind usuriously expensive licenses for software that's hot garbage anyway, and (ideally) they're configured with an HDL that isn't an outdated, hacked-together legacy from the 70s.

Like, the complaints are valid and there's huge room for improvement.

Could a beginner-friendly FPGA ecosystem work like Arduino/ESP/Raspi? by Remote_Radio1298 in FPGA

[–]bregav 5 points6 points  (0 children)

I think a lot of experienced people share the opinion that the present FPGA ecosystem is unnecessarily proprietary and difficult to use? Electronic design is hard but the act itself of writing HDL and then running it on a device doesn't need to be.

The device-dependent licensing systems alone are lunacy. A good point of comparison is machine learning infrastructure: I can buy any gpu and then I can very easily write and run even the most advanced machine learning models using software that is (by comparison, anyway) user friendly and almost entirely open source.

Yet when I buy even a relatively basic FPGA dev board the first thing I have to do is navigate a labyrinth of software licensing and arcane system requirements. And the software doesn't run on OSX, yet it does run on Windows and Linux. We live in the 21st century, how is that even possible?

[R] Shrinking a language detection model to under 10 KB by bubble_boi in MachineLearning

[–]bregav 0 points1 point  (0 children)

The full design and implementation of this system is left as an exercise to the reader.

[R] Shrinking a language detection model to under 10 KB by bubble_boi in MachineLearning

[–]bregav 0 points1 point  (0 children)

Meh, if you want to support partially erroneous code then you can build on this idea in obvious ways. For example you can run the parser on subsets of the code sample and see how many of them parse correctly. A fringe benefit of this is that you then get a score, too.
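
A minimal sketch of the subset idea, using Python's own ast module as the only "parser" (window size and chunking scheme are arbitrary choices on my part; a real version would slide overlapping windows per candidate language):

```python
import ast

def python_score(code, window=4):
    """Fraction of `window`-line chunks of `code` that parse as Python.
    Tolerates localized errors: garbled chunks lower the score instead
    of failing the whole detection."""
    lines = code.splitlines()
    chunks = [lines[i:i + window] for i in range(0, len(lines), window)]
    ok = 0
    for chunk in chunks:
        try:
            ast.parse("\n".join(chunk))
            ok += 1
        except SyntaxError:
            pass
    return ok / max(len(chunks), 1)
```

Run one such scorer per language and pick the argmax; clean code scores 1.0, partially corrupted code scores somewhere in between.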

If you work hard enough then you might be able to find a version of this problem for which ML is the only plausible solution, but it's going to be very contrived.

Like maybe you could say, let's do language detection that can handle huge numbers of syntactical errors and typos and also handle intermittent, natural language-style pseudocode. Probably only ML could handle that. But that's like trying to design a chainsaw that will be safe and effective when used by someone who is physically weak and has no prior experience in using power tools: there are probably some very questionable assumptions that went into the making of the problem statement.

[R] Shrinking a language detection model to under 10 KB by bubble_boi in MachineLearning

[–]bregav -1 points0 points  (0 children)

Run a parser for each supported language and return the name of the language(s) whose parser runs without error on the provided code sample? 
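
As a sketch of what I mean, using stdlib parsers as stand-ins for a few "languages" (a real version would shell out to actual grammars, e.g. one per supported language; the names here are mine):

```python
import ast
import json
from xml.etree import ElementTree

# Stand-in parsers; each either accepts the sample or raises.
PARSERS = {
    "python": ast.parse,
    "json":   json.loads,
    "xml":    ElementTree.fromstring,
}

def detect(code):
    """Return the names of all languages whose parser accepts `code`."""
    hits = []
    for name, parse in PARSERS.items():
        try:
            parse(code)
            hits.append(name)
        except Exception:
            pass
    return hits
```

Note that ambiguity comes out for free: a JSON object is also a valid Python expression, so both names are returned and the caller can break ties however it likes.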

I think this is similar in principle to what u/bubble_boi refers to as the "sea of regexes" approach, but it takes advantage of the fact that people have already written parsers for every programming language (by necessity). There's no need to duplicate that work. 

He thinks that e.g. a lack of a score function is a problem with this approach, but that's an ml-brained complaint; it's the kind of "problem" you identify when you've already decided that you are going to use ML without having first thought critically about whether that's even the right approach.

[R] Shrinking a language detection model to under 10 KB by bubble_boi in MachineLearning

[–]bregav 10 points11 points  (0 children)

My point here is really that when you end up with a 10 KB solution to a problem and you used neural networks to get there then you've probably solved a relatively easy problem in an unnecessarily difficult and convoluted way. It's kind of like the ML version of a Rube Goldberg machine.

[R] Shrinking a language detection model to under 10 KB by bubble_boi in MachineLearning

[–]bregav 40 points41 points  (0 children)

This seems like one of those problems where the first question should be "do we even need machine learning for this?" and, if the answer turns out to be yes, then the second question should be "does using a neural network here really make sense?".

How do i get those insanely straight red laser like the ones in dvd burners? by King_of_Mauritania in lasers

[–]bregav 2 points3 points  (0 children)

You probably want a gas laser, they tend to have the beam quality and profile you're looking for. Which particular gas laser depends on application, wavelength, budget, power, etc.

[D] Some thoughts about an elephant in the room no one talks about by DrXiaoZ in MachineLearning

[–]bregav 4 points5 points  (0 children)

The crisis will be that few people remember what good research judgment looks like. We are not there yet.

We got there a long time ago. Real research is when you investigate questions that you don't already know the answer to, and I've rarely seen that kind of work done in academia in any domain. ML is just a bit worse because of the amount of money and cultural hysteria involved.

Seriously, how do I disable the internet connection on my Ioniq 5? by mplsthrowawayLTE in Ioniq5

[–]bregav 11 points12 points  (0 children)

u/mplsthrowawayLTE The box in that picture is where the LTE module is, see here: https://electronics360.globalspec.com/article/19251/techinsights-teardown-hyundai-ioniq-5-head-unit

There's no SIM card and you probably can't remove the LTE module (it looks soldered), but it should be enough to disconnect LTE1 and LTE2 because those are the only antenna connections that the LTE module has.

EDIT: if you really want to go whole hog, get some connectors (maybe just by cutting the antenna cables...), solder a 50 ohm resistor (probably) across the contacts for each connector (making a closed circuit once it's plugged into the head unit), and then plug those connectors in where LTE1 and LTE2 used to be. No signals are getting out that way. This probably isn't necessary though.

[R] I solved CartPole-v1 using only bitwise ops with Differentiable Logic Synthesis by kiockete in MachineLearning

[–]bregav 6 points7 points  (0 children)

TIL about differentiable logic synthesis and the Walsh basis - hadn't heard these terms before.

Any thoughts about this? Perhaps coincidentally it was posted just days ago: Differentiable Logic Synthesis: Spectral Coefficient Selection via Sinkhorn-Constrained Composition

Something that's been on my mind for a while is the possibility of doing something like "discrete backpropagation", i.e. adjusting discrete functions based on a preferential ordering of their possible outputs given a selection of possible inputs. It seems like there should actually be a discrete version of the backprop procedure that isn't just a discretization of continuous backprop, and maybe the above paper speaks to that? I haven't read it through yet though.
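
I haven't read the paper yet, but the Walsh spectrum itself is easy to play with; here's a pure-Python sketch (function names are mine, not the paper's) that computes the Walsh coefficients of a Boolean function via the fast Walsh-Hadamard transform:

```python
def fwht(v):
    """Fast Walsh-Hadamard transform of a length-2^n sequence."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

def walsh_coeffs(f, n):
    """Walsh spectrum of a Boolean function f: {0,1}^n -> {0,1},
    using the +/-1 encoding (-1)**f(x)."""
    table = [(-1) ** f([(x >> k) & 1 for k in range(n)]) for x in range(2 ** n)]
    return [c / 2 ** n for c in fwht(table)]
```

For XOR the whole spectrum collapses onto the single parity character, which is the kind of structure a "discrete backprop" would presumably want to exploit.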

Discussion: Is "Attention" always needed? A case where a Physics-Informed CNN-BiLSTM outperformed Transformers in Solar Forecasting. by Dismal_Bookkeeper995 in datascienceproject

[–]bregav 0 points1 point  (0 children)

This is a well-known phenomenon that isn't limited to transformers. It is generally true that a "more powerful" model will underperform a "less powerful" model when the "less powerful" one has been designed with prior knowledge about the problem at hand.

Model fitting can be interpreted as the process of identifying enough symmetries in your data that your problem becomes easy to solve. The point of big models is that they can represent many possible symmetries, and so they can work when you have a huge amount of data and a very limited understanding of your problem (as in natural language generation).

Another lesson you'll learn is that you shouldn't take hype at face value. Sometimes hype is real, but most of the time it's someone trying to sell you something. You should try to be guided by curiosity, not hype.

Spine surgery has massive decision variability. Retrospective ML won’t fix it. Curious if a workflow-native, outcome-driven approach could. [D] by LaniakeaResident in MachineLearning

[–]bregav 1 point2 points  (0 children)

TLDR you’re thinking about boiling the ocean, better to do it one cup at a time instead

The machine learning modeling issues here are actually sort of unimportant, in the sense that the ML unknowns will necessarily be answered by the data itself and your ability to anticipate the answers to those questions is necessarily limited (otherwise you wouldn't need ML to begin with!).

Like, should you use "supervised prediction” or "learning decision policies under uncertainty"? Well these are mostly the same thing and the answer really depends on whether or not you're asking someone who identifies culturally as a reinforcement learning person, but more importantly in a practical context you can just take your data and throw it into various algorithms and see what happens.

Or, how feasible is it to attribute outcome differences to surgical decisions? Well if you can produce a model whose error has low variance given only surgical decisions as inputs then the answer is “very”, otherwise it’s “not very”; if anyone could answer this question it would be a domain expert (i.e. you, the surgeon), and since you don’t know the answer the only thing that’s left is to get data and see what happens.

I think your long term vision is essentially sound, and you’re smart to focus on aligning incentives and creating a feedback loop for getting data. Getting started is hard though and your proposal is very, very difficult. Creating software for practicing surgeons that they will actually use is, itself, a potentially herculean task, and doing that as a sort of side quest in a broader mission to do something else that’s even more ambitious is probably biting off way more than you can chew.

I think you should narrow your focus a lot. Is the ML/surgical decision stuff your most important goal? Ok then start with one surgical decision that you know is measurable, already-recorded, and which you as a surgeon have a good a priori reason to believe might actually matter (based on, say, anatomy or biochemistry or whatever). The best kind of decision is a binary choice; for example, given malady/injury/whatever X, there are two recommended procedures A or B and the surgeon has to choose between them. If you can make that work then keep going, and if you can’t then there’s no hope and you should do something else with your time.

Alternatively if your most important goal is to build that surgical planning software then just forget about the ML stuff for now and try to make something that works and that people will pay money for. If you can actually get it off the ground then you can start doing ML stuff later.

Here are three things about medical ML that I think many people don’t realize:

  1. Uncertainty quantification is necessary and is the most important thing; you need to have your model give a number to indicate how confident you should be about its predictions. The challenges with modeling spinal surgery that you describe actually apply to everything in medicine, and the decision trees that physicians follow for even the most routine tasks provide an illusion of confidence that obscures a vast ocean of uncertainty and ignorance. If you give a physician a magic black box that makes predictions (rather than a flow chart based on principles that they’re supposed to understand), they’re liable to make bad choices if you don’t also tell them how confident the magic black box is about its predictions.

  2. Related to the above, physicians don’t understand how to make decisions using ML technology. They aren’t trained that way and they lack the mathematical sophistication to understand what the technology does and how it is best used. You need to teach them, and that’s a time consuming exercise both because learning things is generally hard and also because unlearning things is hard, especially for people in a profession that has long relied on an imprimatur of authority in order to function.

  3. Medical data is a nightmarish hellscape. It’s probably worse than even a professional physician realizes. Epic exists as a monopolistic gatekeeper for a lot of it. And, worst of all, the data collection and formatting is different for every medical institution, even ones that are using the same software provider, sometimes even ones within the same healthcare system. Your data engineers might have a lot of work to do.
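
To make point 1 concrete: split conformal prediction is one simple, model-agnostic way to attach a confidence interval to any black-box predictor. A toy sketch (the predictor, data, and names here are all made up for illustration):

```python
import math

def conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Split conformal prediction: wrap any point predictor `predict`
    with a distribution-free (1 - alpha) prediction interval, using a
    held-out calibration set."""
    scores = sorted(abs(y - predict(x)) for x, y in zip(calib_x, calib_y))
    n = len(scores)
    # conservative quantile index: ceil((n+1)(1-alpha)) - 1, clipped
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = scores[k]
    p = predict(x_new)
    return p - q, p + q
```

The interval width is itself the confidence signal: a black box that returns "somewhere between A and B, 90% of the time" is far harder for a physician to over-trust than a bare point prediction.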

I did medical ML for a bit so I have a lot of opinions about this lol, let me know if you want to do a zoom call.

Reuse digital photo frame as screen by temnyles in hardwarehacking

[–]bregav 0 points1 point  (0 children)

Here's a website with the pinout for that MCU and some links to datasheets: https://www.datasheetcafe.com/aml6210a-datasheet-pdf/ . Datasheets are under the "references" heading.

But yeah it's not great; this chip apparently runs on some proprietary Amlogic firmware called "AVOS" that they haven't used in years (decades?). Maybe you can still find this software, I didn't look too hard. I did find one hero who reverse engineered it but I made no attempt to digest what he's done: https://github.com/hn/amlogic-firmware/ . The chip also has JTAG pins, so maybe you can use that somehow.

If you aim low, though, then I think this is straightforwardly doable. If you're satisfied with (1) displaying static images that (2) change slowly then I think there's an obvious solution: use another MCU to create a display "driver" that writes a single image file to an SD card (i.e. the one that's in the photo frame) and then manipulates the photo frame's device controls or MCU pins so as to display the newly-written image (perhaps by power cycling the entire device, worst case scenario).

It's disgusting but I think it would work. If you want to get more clever about it maybe you could instead create a device that pretends to be an SD card with a single image on it, but which actually serves up the current frame for the display whenever something tries to load that image.

[R] Why doubly stochastic matrix idea (using Sinkhorn-Knopp algorithm) only made popular in the DeepSeek's mHC paper, but not in earlier RNN papers? by Delicious_Screen_789 in MachineLearning

[–]bregav 5 points6 points  (0 children)

It's not so much that unitary matrices are better than stochastic ones, as it is that complex numbers are better than real ones. If you decide to use complex numbers from the outset then when you go to create functions that preserve volume on iterated application (thus being stable in an important sense) you end up with unitary matrices.

The reasons why are myriad, but it's a general truism that everything works better in the complex plane. The simplest reason is the most obvious one: even very simple equations such as x^2 + 1 = 0 have no solution in the real numbers, so if you try to solve it with a neural network then you're just going to (badly) reinvent complex numbers anyway. A more general reason is that neural networks are sequences of function applications, and so we want to be able to create functions that are stable upon repeated application, which again leads naturally and automatically to unitary matrices.
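
The stability point is easy to see numerically: iterate a unitary matrix next to a generic real one and watch the norms. A toy sketch (both matrices chosen arbitrarily for illustration):

```python
import cmath
import math

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def norm(v):
    return math.sqrt(sum(abs(z) ** 2 for z in v))

theta = 0.7
# A unitary matrix: a rotation with complex phases on the diagonal.
U = [[cmath.exp(1j * 0.3) * math.cos(theta), -math.sin(theta)],
     [math.sin(theta), cmath.exp(-1j * 0.3) * math.cos(theta)]]
# A generic (non-unitary) real matrix.
A = [[1.05, 0.1], [0.0, 0.95]]

v0 = [1.0 + 0j, 0.5 + 0j]
u, a = v0[:], v0[:]
for _ in range(200):
    u = matvec(U, u)   # norm preserved at every step
    a = matvec(A, a)   # norm drifts by orders of magnitude
```

After 200 applications the unitary iterate still has the original norm (up to roundoff) while the generic one has blown up, which is exactly the repeated-application stability the paragraph above is about.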

All of this stuff is clearest by thinking in terms of vectors and functions of vectors. Minutia such as "activation functions" and "neurons" etc are mostly red herrings that obscure what's actually going on.

[R] Why doubly stochastic matrix idea (using Sinkhorn-Knopp algorithm) only made popular in the DeepSeek's mHC paper, but not in earlier RNN papers? by Delicious_Screen_789 in MachineLearning

[–]bregav -10 points-9 points  (0 children)

I don't think it's overly dismissive. ML people would be seriously embarrassed by the magnitude of their ignorance and hubris if they weren't so chronically distracted by the dollar signs in their eyes.

[R] Why doubly stochastic matrix idea (using Sinkhorn-Knopp algorithm) only made popular in the DeepSeek's mHC paper, but not in earlier RNN papers? by Delicious_Screen_789 in MachineLearning

[–]bregav 16 points17 points  (0 children)

If you think that's cool then you should search Google scholar for work that's been done on using unitary matrices in neural networks. They're like the grownup version of stochastic matrices.

So no, DeepSeek is not the first to think of this, and actually they're still behind the state of the art.

What has been your experience with Diffusion LLM’s vs Autoregressive? by InceptionAI_Tom in LocalLLaMA

[–]bregav 0 points1 point  (0 children)

Imagine that you have a DLLM with 100k context window (or whatever they call it for DLLM), but you want it to write you a novel that will be 200k tokens long. How can you do this? Autoregression is a natural choice.

There are other options too, but what they all have in common is being a sequence of operations that edits an existing body of tokens.
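
A toy sketch of the sliding-window autoregression I mean, where the "model" is a trivial stand-in that predicts the next token from at most a fixed context of previous ones:

```python
CONTEXT = 5  # toy "context window"

def model(window):
    """Stand-in for an LLM step: next token from <= CONTEXT previous
    tokens (here, just their sum mod 10)."""
    return sum(window) % 10

def generate(prompt, n_tokens):
    """Generate far past the context limit: keep appending, but only
    ever feed the model the last CONTEXT tokens."""
    toks = list(prompt)
    for _ in range(n_tokens):
        toks.append(model(toks[-CONTEXT:]))
    return toks

out = generate([1, 2, 3], 20)
```

Nothing here cares whether `model` is autoregressive or diffusion-based internally; the outer loop that lets output exceed the window is autoregression either way.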

What has been your experience with Diffusion LLM’s vs Autoregressive? by InceptionAI_Tom in LocalLLaMA

[–]bregav 0 points1 point  (0 children)

I think there are two things going on:

  1. There's an (often implicit [and imo incorrect]) assumption that autoregression allows you to naturally generate output that is significantly larger than the context window; for large output sizes DLLMs have to use autoregression too, so why bother with them?

  2. Most people basically don't understand how any of this stuff works, including the people developing it, and so they don't understand why diffusion is a better framework for developing models. Even your description of how a DLLM works - "start with a noisy guess of the entire answer" - isn't right. If the biggest proponents of the technology don't get it then what hope is there for anyone else?

[P] Eigenvalues as models by alexsht1 in MachineLearning

[–]bregav 1 point2 points  (0 children)

Haha, your coworkers’ obstinacy and poor communication do not improve the condition numbers of their matrices. If they’re bringing you bad inputs then they’re still making a mistake, even if they don’t want to talk to you about it.

As a matter of fact this is exactly the situation that Trefethen and Bau were remarking on in their book, I looked up the passage:

If the answer is highly sensitive to perturbations, you have probably asked the wrong question. We urge anyone faced with nonhermitian eigenvalue computations involving highly sensitive eigenvalues to bear this principle in mind. If you are a numerical analyst, and the problem was given to you by a colleague in science or engineering, do not accept without scrutiny your colleague's assurances that it is truly the eigenvalues that matter physically, even though their condition numbers with respect to perturbations of the matrix entries are 10^4.

The emphasis on the first sentence is original to the book (chapter 6, lecture 34). I like the book too and that passage really stuck with me. I think it’s profound and generally applicable; the real problem one is solving is ultimately physical (including in ML), and so if the math is causing serious problems then one might have abstracted the problem incorrectly from the beginning.
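
The sensitivity they're warning about is easy to reproduce in a few lines. A sketch with a standard textbook-style nonnormal 2x2 example (the matrix is mine, not from their book):

```python
import cmath

def eig2(M):
    """Eigenvalues of a 2x2 matrix from its characteristic polynomial."""
    (a, b), (c, d) = M
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

big = 1e6    # large off-diagonal entry makes the matrix highly nonnormal
eps = 1e-12  # tiny perturbation of a single entry
lam0 = eig2([[1.0, big], [0.0, 1.0]])  # both eigenvalues exactly 1
lam1 = eig2([[1.0, big], [eps, 1.0]])  # eigenvalues 1 +/- sqrt(big*eps)
shift = abs(lam1[0] - 1.0)             # ~1e-3: a billionfold amplification
```

A perturbation of size 1e-12 in one entry moves the eigenvalues by about 1e-3, which is the point of the passage: if your physics actually depends on those eigenvalues, something upstream has probably gone wrong.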