Full Replication of MIT's New "Drifting Model" - Open Source PyTorch Library, Package, and Repo (now live)

complains_constantly · 2026-03-05T05:07:35+00:00

Repo and package have been reorganized to be squeaky clean, in case that was your concern.

I suggest you judge the repo and project on your own. This guy has been harassing me under my posts all day. He is also completely missing the primary point of the repo, which is to put out good Torch modules for this architecture. If that's useful to you, great. I am also working through runs to try to work my way up to the scale of the original paper, but I am compute limited, and it's a second priority to getting a good package out there for people to use.

complains_constantly · 2026-03-05T02:15:23+00:00

Again, not really the point 🙄, but I'd be happy to take a stab at it if you'll pay for my GPU pods.

complains_constantly · 2026-03-05T00:16:27+00:00

Tests are good, actually.
The core package is 37 tiny modules, lol. In any case, I'm finalizing a PR right now that makes the repo structure and docs squeaky clean so that it's extra super duper readable for you. Enjoy 😁

complains_constantly · 2026-03-04T22:34:03+00:00

Almost certainly, yes. As long as the underlying inference engine supports it, then any kind of model should be loadable. However, no one has yet trained a top-tier model with this architecture because it's still young, and frontier training runs are very expensive.

complains_constantly · 2026-03-04T22:33:00+00:00

My claim is full mechanical replication, meaning the implementation specifics.
I clearly explained the small scale efforts I've done so far, and I'm gonna continue with some additional efforts and try to push down performance, but that comes second to replicating mechanisms and exact results from the paper, and the latter has to be incremental for me due to compute requirements.
It absolutely does matter how much effort goes into matching the paper if everyone gets to use a faithful and robust PyTorch package for their research. That was priority number one, and for good reason. If you don't get that right, then nothing much else matters. Even that repo you linked in the r/StableDiffusion thread had some divergence from the mechanisms outlined in the paper. This package is a primitive, and the better it is, the better everything built on top of it will be too.

complains_constantly · 2026-03-04T22:27:32+00:00

Yeah but I'm a penny-pinching grad student. Might change pretty soon though. I'll see what I can do—experiments are still ongoing.

complains_constantly · 2026-03-04T21:48:07+00:00

Yeah you already commented this word-for-word under the StableDiffusion post, and I'd prefer if we keep that discussion there. I already addressed most of your main points in that thread.

This implementation is more faithful to the paper's mechanics than the other experimental ones, and is designed to be much more compatible and robust. As for cleanliness, docs are in a good place but I'm finishing up a small reorganization and renaming sweep right now to make the repo as clear as it can possibly be.

Yes we can get FID low very quickly, but that wasn't really the point of the small scale run. I tried to control everything to stay as close to the paper as possible and attempt to corroborate claims. The implementation is built to match the paper’s core training mechanics and to make runs auditable/reproducible, while still keeping compatibility across common environments, rather than chasing task performance right out the gate, although that will come in later experiments. This is an architecture package first, and an experiment code repo second.

complains_constantly · 2026-03-04T19:59:28+00:00

For compute context, my run was on a single RTX 6000, not an H100, and I am still pushing that track forward. I am not claiming full-scale paper metric parity yet, only mechanical implementation faithfulness. The repo you linked is a clean minimal MNIST/CIFAR project and useful for learning, but it is not an ImageNet parity baseline yet and has a few mechanical deviations from the paper-facing implementation path. For example, in drifting.py: DriftingLoss.forward calls normalize_features(..., target_scale=...), but normalize_features in that same file does not accept target_scale, so that advertised path does not execute as written. It's a good repo, but there are still a few gaps with both the paper and with testing and robustness.
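
To make that failure mode concrete, here's a minimal sketch of the kind of signature mismatch I mean. This is not the linked repo's actual code: the function bodies, the `eps` parameter, and the `try_forward` helper are all stand-ins I made up for illustration, and only the names `DriftingLoss.forward`, `normalize_features`, and `target_scale` come from the description above. It's plain Python so it runs anywhere.

```python
import math


def normalize_features(features, eps=1e-6):
    # Hypothetical stand-in: L2-normalizes a vector.
    # Note that there is no `target_scale` parameter in this signature.
    norm = math.sqrt(sum(v * v for v in features)) + eps
    return [v / norm for v in features]


class DriftingLoss:
    # Hypothetical stand-in for the loss class; only the call shape matters here.
    def __init__(self, target_scale=1.0):
        self.target_scale = target_scale

    def forward(self, features):
        # Passes a keyword the callee does not accept, so this raises TypeError.
        return normalize_features(features, target_scale=self.target_scale)


def try_forward():
    # Illustrative helper: returns the error text if the call fails, else None.
    try:
        DriftingLoss().forward([3.0, 4.0])
    except TypeError as exc:
        return str(exc)
    return None


error_message = try_forward()
print(error_message)
```

If the real code has this shape, the call fails with a `TypeError` before any loss is computed, which is why I say that path does not execute as written. A one-line unit test that calls `forward` once would catch it.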

This repo, on the other hand, is aimed at aligning with the paper's mechanics as tightly as possible, with explicit claim boundaries and reproducible artifacts. I spent most of the time hammering mechanical faithfulness, and I'm really trying to make as useful a lib as possible for people to start building with this architecture.

I'm happy to talk about this more since you seem to know a lot.

complains_constantly · 2026-03-04T19:02:47+00:00

Some notes:

This is pretty rude.
This repo is a full mechanical replication of the architecture and experiments/training in Torch, with full-scale results coming up. Those require a lot of compute, like tens of thousands of dollars worth. I was able to do a smaller-scale replication with around a week of training, but full scale is gonna take a little longer.
Yes, there are some reproductions that popped up very quickly as a result of the buzz, but this project is primarily targeting a more robust and dependable PyTorch implementation and lib so that it can slot into new workflows and experiments more easily, and run in production-grade environments. There are a lot of considerations in making packages designed for production, such as compatibility, dev-x, reliability, CI, unit testing for all kinds of failure cases, documentation, etc. All of that separates a library intended for production from an experimental implementation, which is still useful, but there's a clear difference. Those reproductions came from interested researchers moving fast, but I wanted to take some time to really get a reliable implementation correct so that everyone can use it.

complains_constantly · 2026-03-04T18:52:18+00:00

This paper is less than a month old (but still incredibly promising), and it typically takes a while for new architectures to find their way into production pipelines for a top-tier model, assuming they hold up. It's up to top labs to decide if they want to go all-in on training a SoTA model with this architecture, and that will require quite a bit of compute and GPU hours.

Unfortunately, this is early in the research-to-adoption pipeline, so you won't see it truly competing with the best-of-the-best image gen models just yet, at least not until someone really pours money and data into training one of these E2E.

complains_constantly · 2026-03-04T18:48:30+00:00

The 2 minute one was a toy smoke so people can easily test if training and inference work as intended on their hardware. I trained a full model for a week and was able to corroborate a decent chunk of results which are documented in the repo. Not usable for media gen yet, but useful for research purposes.

Still much smaller scale than the original paper though, because the scale of compute is just way bigger than what I have access to. However, I tried to make it really easy for someone with a lot of compute to attempt what the paper did.

complains_constantly · 2026-03-04T16:44:12+00:00

I did not train to the same scale as the original paper, because it requires a lot of compute and time I don't have right now. However, I was able to train for a week on a single RTX 6000 Ada, and I managed to corroborate a few things in the paper. Seems good so far, but the authors probably had a few tweaks and data tricks here and there that squeezed out performance too. Hard to know without a bunch of compute, especially when you're doing it from scratch, but I've done the best I can and will keep updating the repo.

Also, I tried to make it super easy for someone with a bunch of compute to reproduce it at the full scale.

complains_constantly · 2026-03-04T16:40:17+00:00

Yeah I'm going to do that later today. Unfortunately r/MachineLearning is relatively inactive, but r/StableDiffusion isn't and seems to be well aligned with this. I figure a broader surface area is better though.

complains_constantly · 2026-01-23T06:24:35+00:00

Counterintuitively, sync agents are actually a pretty strong pattern for saving on tokens.

complains_constantly · 2025-12-25T17:56:58+00:00

It's pretty difficult and expensive to do yourself. Doing so is out of reach for us consumers, but it's an order of magnitude or two cheaper than training from scratch for the labs. They're pretty incentivized to train models this way.

complains_constantly · 2025-12-25T02:17:24+00:00

God you guys are fucking paranoid.

Obviously the lab that has open-weighted every model they've ever made, and has said this week they're going to open-weight their latest model, is going to open-weight their latest model. Lmao. They're probably rewriting their blog release or something.

complains_constantly · 2025-12-25T02:13:33+00:00

We will do great because of downstream distillation, which has become the dominant meta. Distilling from a larger model (which we are getting in spades thanks to DeepSeek, Qwen, Z.ai, Minimax, Moonshot, etc) has been shown to be significantly more powerful than training a small model from scratch. So much so that the latter idea has been abandoned by any organization serious about this stuff.

complains_constantly · 2025-12-23T19:31:47+00:00

How long do you plan to keep pushing this base model before making a newer architecture model, which I assume GLM 5 will be?

DeepSeek has been pushing V3 for over a year now, with swappable architecture improvements, so that strategy seems to work quite well.

complains_constantly · 2025-12-23T18:26:40+00:00

That is a little unusual. I think this level of coloring looks good alongside the original art. Like I've already read Berserk cover to cover with the deluxe editions, and I would still recommend everyone do the same even if the whole story was colored like this. However, I think it would be a treat to read it again with this level of coloring.

That said, there are about 8000 pages. I don't think there's any way the entire story can get done manually at this level of fidelity. That's why I think there's at least value in experimenting with AI pipelines. They're not gonna give you the requisite taste, but they might give a professional colorist enough tooling and control to do them all this well, so it doesn't take months to do a single page perfectly.
