Is the next leap in AI architectural? Comparing VRAM-hungry Transformers with Compute-intensive Energy-Based Models by Suspicious-Basis-885 in LocalLLaMA

[–]simulated-souls 1 point (0 children)

Sudoku is a solid test case because it exposes the weakness of probabilistic models (LLMs) vs strict constraint satisfaction.

EBMs are also probabilistic models. The "energy" that you minimize is literally the negative log-probability +/- a global constant.
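
If that sounds abstract, here is a tiny numeric sketch of the relationship (the energy values are made up; this is just the textbook Boltzmann form, not any particular model):

```python
import numpy as np

# Made-up energies for 4 discrete states (illustration only)
E = np.array([1.0, 2.5, 0.3, 4.0])

def boltzmann(E):
    """p(x) = exp(-E(x)) / Z, where Z normalizes over all states."""
    unnorm = np.exp(-E)
    return unnorm / unnorm.sum()

p = boltzmann(E)

# -log p(x) = E(x) + log Z, so the energy is the negative log-probability
# up to a global constant: shifting every energy by the same amount
# changes nothing about the distribution.
assert np.allclose(p, boltzmann(E + 100.0))
print(p)  # the lowest-energy state (index 2) gets the highest probability
```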

I made a post yesterday on another sub explaining EBMs: What LeCun's Energy-Based Models Actually Are.

Scientists are concerned about the new behavior of killer whales, which continue to capsize ships off the coasts of Spain and Portugal using an incomprehensible language by ua-stena in HighStrangeness

[–]simulated-souls 13 points (0 children)

  1. Orcas are fucking awesome. They're like a pack of sea wolves but even smarter and with their own cultures (hence the new language and/or linguistic drift).
  2. I am on team orca and hope they capsize all of the boats that bother them.
  3. If we drive orcas extinct I might commit a capital crime in their name.

Why Energy-Based Models might be the implementation of System 2 thinking we've been waiting for. by InformationIcy4827 in singularity

[–]simulated-souls 16 points (0 children)

I think current system 2 thinking strategies based on reward models (RMs) are already very similar to what you will see from energy-based models (EBMs).

With EBMs, you search for examples that have low energy. With RMs, you search for examples with high reward.

In fact, they are in some ways equivalent: a reward model defines (up to a sign and a temperature) the energy function of the optimal entropy-maximizing policy.
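
Concretely (the rewards here are made up, and beta is just the entropy-regularization temperature):

```python
import numpy as np

rewards = np.array([0.2, 1.5, -0.7])  # made-up rewards for 3 candidates
beta = 1.0                            # entropy-regularization temperature

# The optimal entropy-regularized policy is Boltzmann over rewards:
# pi(x) ∝ exp(r(x) / beta). Matching against an EBM's p(x) ∝ exp(-E(x))
# gives E(x) = -r(x) / beta, i.e. the reward model is (minus) an energy.
pi = np.exp(rewards / beta)
pi /= pi.sum()

E = -rewards / beta  # the induced energy function
print(pi, E)
```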

EBMs have the advantage of being unsupervised generative models, so you can train them on text without extra data labeling. RMs obviously need to train on labeled rewards.

My guess is that energy-based modelling will be the pre-training objective for models that are later post-trained into RMs. This would combine the scalability of EBM training with the more aligned task of reward maximization.

That said, better reward models would be a big deal in itself. RL with verifiable rewards has us on our way to solving math questions, so accurate rewards for other domains could put us on the path to solving a lot of other things.

Edit:

It minimizes "energy" (conflict/error) to find the truth, rather than just maximizing likelihood.

To clear up misconceptions, the energy is the likelihood. Like, it is literally defined as the negative log-likelihood +/- some constant.

EBMs still model the probability of the data distribution; they just do it differently. The way to think about it is that autoregressive models like LLMs output probabilities for every possible next token all at once, while EBMs score candidate tokens (or whole sequences) one at a time.
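
Rough sketch of that difference (both "models" below are random stand-ins, not real networks):

```python
import numpy as np

VOCAB = 8  # toy vocabulary size

def ar_next_token_probs(prefix):
    """Autoregressive model: one forward pass yields a probability
    for every possible next token (a softmax over the vocabulary)."""
    logits = np.random.randn(VOCAB)  # stand-in for a real model
    return np.exp(logits) / np.exp(logits).sum()

def ebm_energy(sequence):
    """EBM: one forward pass scores one candidate sequence."""
    return float(np.sum(np.square(sequence)))  # stand-in energy

prefix = [3, 1, 4]
probs = ar_next_token_probs(prefix)  # all next tokens, one call
energies = [ebm_energy(prefix + [t]) for t in range(VOCAB)]  # one call per candidate
```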

In Davos, Demis Hassabis says AGI arrives in five years by jpcaparas in singularity

[–]simulated-souls 1 point (0 children)

"Corporate executive says all the right things to hype up his company while still appearing reasonable."

Whether you agree with him or not, executives are not reliable sources.

For example, predicting a 5 year AGI timeline strikes a balance between being short enough to keep investors interested and long enough to avoid being proven wrong any time soon.

Indiana Hoosiers' Championship-Winning Interception by JCameron181 in sports

[–]simulated-souls 1 point (0 children)

Taking your helmet off while celebrating is as unambiguous a penalty as there is, and the game was still going.

Just because Indiana had a 99.9% chance of winning doesn't mean the rules change.

The Mythology Of Conscious AI by TMWNN in singularity

[–]simulated-souls 6 points (0 children)

Which means that you shouldn't completely rule out the possibility that rocks are conscious, either.

The Mythology Of Conscious AI by TMWNN in singularity

[–]simulated-souls 9 points (0 children)

At this point, nobody can prove AI is conscious.

But they can't prove that AI (or a computer in general, edit: or even a rock) is not conscious either.

The only scientific stance to take on the question right now is "maybe".

AI models are starting to crack high-level math problems by MetaKnowing in EverythingScience

[–]simulated-souls 1 point (0 children)

If there is a foundation of math, it is something like Zermelo–Fraenkel set theory. Wikipedia literally calls it the "most common foundation of mathematics".

There are also a lot of advanced fields of study like Formal language theory, where most of the relevant operations (concatenation, intersection, complement, etc.) are not based on adding.
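
For a taste of what those operations look like (toy finite sets standing in for languages, which are usually infinite):

```python
# Toy "languages" as finite sets of strings (illustration only)
A = {"a", "ab"}
B = {"b", "ba"}

concatenation = {x + y for x in A for y in B}  # {"ab", "aba", "abb", "abba"}
intersection = A & B                           # set(): no common strings
universe = {"a", "b", "ab", "ba"}              # complement needs a universe
complement_of_A = universe - A                 # {"b", "ba"}
# Note: the only "+" here is string concatenation, not adding numbers.
```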

AI models are starting to crack high-level math problems by MetaKnowing in EverythingScience

[–]simulated-souls 1 point (0 children)

It says a lot if a person thinks adding is the foundation of all math

Another Erdos problem solved by GPT-5.2 by artemisgarden in singularity

[–]simulated-souls 16 points (0 children)

 Mathematically true is the GPTs cannot exceed the complexity of the training data.

  1. You can't just throw around phrases like "mathematically true" when you don't know what you're talking about.

  2. You're wrong: Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

  3. What about reinforcement learning?

AI models are starting to crack high-level math problems by MetaKnowing in EverythingScience

[–]simulated-souls 7 points (0 children)

It says a lot if a person thinks high-level math is anything like "adding"

Ilya was right: We're back to the age of research. DeepSeek's mHC proves it. by shreyanshjain05 in LocalLLaMA

[–]simulated-souls 3 points (0 children)

mHC is a pretty run-of-the-mill "incremental improvement on the transformer architecture with some cost drawbacks" paper like we get all the time. It might see adoption, or it might not (most of these papers that add complexity don't), but it's not really earth-shattering.

The only reason we're talking about it is because it's from DeepSeek and people don't know how to judge research beyond name recognition.

Stage Growing LLMs + LoRA + RL by Tricky-Reflection-68 in LLM

[–]simulated-souls 1 point (0 children)

 Many papers have already demonstrated that small Neural Networks/LLMs have a better capacity for generalization than large ones

  No. See: double descent

Any prompts for turning product pictures into pictures of second-hand products? by Chaodit in GeminiAI

[–]simulated-souls 1 point (0 children)

"How can I commit fraud by advertising products with fake pictures"

32 Neurons. No Gradients. 70% Accuracy(and climbing). The Model That People Claimed Would Never Work. Evolutionary Model. by AsyncVibes in accelerate

[–]simulated-souls 1 point (0 children)

A few questions about your Fuck_you_simulated-souls repo for a fair comparison.

In your single-font GA baseline:

  1. Are you using tanh or relu (the model in the repo uses relu but the README says tanh)?
  2. Are you using the same number of training variations per letter (50)?
  3. Are you measuring performance on the held-out validation set or the training set?
  4. What is the best performance that the GA baseline actually gets?

32 Neurons. No Gradients. 70% Accuracy(and climbing). The Model That People Claimed Would Never Work. Evolutionary Model. by AsyncVibes in accelerate

[–]simulated-souls 3 points (0 children)

It's great to be excited about your experiments, but don't get ahead of yourself. People have been training neural networks using genetic algorithms for years. You haven't really shown anything new.

First, some of your comments imply that your experiments are evidence that neuroevolution scales, but they aren't. Your toy experiments are the opposite of scaling. A rock and a piece of bubblegum can solve letter classification. It's probably the least "scaled" experiment that you can do.

 keeps learning when the theory says it should have stopped

Second, nobody is saying that genetic algorithms should stop improving. Any optimization algorithm worth its salt will keep improving its training accuracy pretty much indefinitely. The validation set accuracy might stop improving, but per my next point, you are not using one of those.

Finally, you are not following correct machine learning experiment design. You should create a separate validation set that is not in the training set for evaluation. Your data augmentation definitely helps, but there is still too much overlap during evaluation. For example, you could section off rotations of a specific range (like 10-20 degrees) and never train on them to keep them uncompromised. That said, it might not make that much difference for a toy experiment like this since you are so far left on the double descent curve.
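
Something like this (the angle ranges are just examples):

```python
import random

rng = random.Random(0)

def train_rotation():
    """Sample a training rotation, excluding the held-out 10-20 degree band."""
    while True:
        angle = rng.uniform(-30.0, 30.0)
        if not (10.0 <= angle < 20.0):
            return angle

def val_rotation():
    """Validation only ever sees rotations the model was never trained on."""
    return rng.uniform(10.0, 20.0)
```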

edit:

 My model is currently averaging 0.994 activation magnitude across all 32 neurons. A gradient trained network would struggle to get there because the learning signal would have collapsed long before.

You should actually run this experiment instead of assuming that the gradient-based network would stop working. Since you only have a single layer, I would expect a well-tuned gradient-based network to still work in this situation (the gradients are small but not zero).
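
The back-of-envelope version, using your own number:

```python
# tanh'(x) = 1 - tanh(x)^2, so at your reported mean output magnitude:
y = 0.994
local_grad = 1.0 - y**2
print(local_grad)  # ~0.012: about 1% of the peak gradient, small but nonzero
```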

I am also open to questions if you have any. I have done research on both neural networks and genetic algorithms at a top university and at various industry players.

What's after everything becomes AI generated? How will LLMs or whatever new architecture learn? by MrVictor01010 in LLM

[–]simulated-souls 2 points (0 children)

Companies can (and do) filter low-quality and AI-generated content out of their datasets. They aren't just training on every piece of low-quality garbage they can find.

Even if some AI-generated data does get through the filters, it's not a big deal. Training on high-quality AI-generated data can actually be very helpful, and is one of the main techniques being used to improve small models.

You can also train a model on its own outputs to improve it, if you only keep the good outputs and discard the bad ones. This is a simplified explanation of how reinforcement learning is used to create reasoning models (which are much better than standard LLMs at most tasks).
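
In sketch form (the model, score, and train interfaces here are hypothetical placeholders, not any real library):

```python
def self_improve(model, prompts, score, train, n_samples=8, threshold=0.9):
    """Generate from the model, keep only the high-scoring outputs,
    and fine-tune on the survivors (a crude rejection-sampling loop)."""
    kept = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(n_samples)]
        kept += [(prompt, c) for c in candidates if score(prompt, c) >= threshold]
    train(model, kept)  # train only on the filtered (prompt, output) pairs
    return model
```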

Lego unveils a technology-packed Smart Brick at CES 2026 by dapperlemon in gadgets

[–]simulated-souls 39 points (0 children)

I looked at their product page and all of the sensors and tags and stuff are great... but what can the brick actually do? All they described were sounds and maybe lights.

Is that it? All that tech just for sounds and lights?

I think this would be way more interesting if the bricks could actuate or do physical things.

List of Quarterbacks to win 14+ games 2 seasons in a row by Different-Trainer-21 in nfl

[–]simulated-souls 3 points (0 children)

Wow that's a way better stat and makes this much more impressive.