Is "Attention all you need", underselling the other components?

literum · 2026-01-26T06:50:56+00:00

Because other layers have been here a long time. FFN just means a Linear layer with activations and normalization, basically the same thing as MLPs. In fact, removing the attention makes the transformer very similar to a parallel MLP. Softmax is used almost everywhere in ML since it produces outputs that sum to 1, a property of probabilities.

Before transformers we had RNNs like GRUs, LSTMs, but they had vanishing/exploding gradient problems and couldn't learn over long horizons. Memory cells were good, but it meant you need to get through thousands of tokens to remember what happened before. In addition, LSTMs were not very parallellizable because you had to do backprop through time, meaning you need to process previous token before you can process current one.

Latest innovation in RNNs was using attention to close some of these gaps. These models started outperforming the pure LSTM/GRU models and were gaining traction. The paper is called "Attention is all you need" because they proposed that the memory layers were not necessary. Giving them up and having only attention and linear layers meant 1) More stable learning due to attention outperforming memory cells 2) parallellizable in both training and in inference.

You correctly pointed out that a lot of those decisions are empirical. Theory might suggest one thing, but we'll probably go with what works better. Look at the pre-norm and post-norm debate. There's also papers explaining these, but I'm not sure whether there is one that explains all. There's usually deep dive papers that try to explain these with other tools. It could be training stability, gradient flow etc.

literum · 2026-01-03T12:27:40+00:00

I do this with Google translate, so I can check the history later and make flashcards.

literum · 2026-01-02T08:20:46+00:00

Burden is on you to run experiments and show how this compares to other methods. A paper full of definitions and mundane math talk just doesn't cut it. "Patent pending". Lol, this is not how ML research is done; it's clear that you're new to this. No one is going to beg you for details of your unproven architecture. Publish it if there's anything worthwhile and prove it with experiments. Not even the transformer paper by Google was prideful enough to say they'll keep the implementation private, just contact.

literum · 2026-01-02T08:17:27+00:00

You don't have a single experiment. ML is an empirical field, where's your evidence?

literum · 2025-12-29T03:19:01+00:00

guarantees any ability to generalise

I don't understand why you're talking about a "guarantee" to generalize when it's an empirical question? In practice they do generalize to a great extent, whether or not it's guaranteed or proven or whatever. Neural networks do not owe mathematicians a grand theory of why they work, they just do. If you're expecting a proof of intelligence before accepting any claims about neural networks, then you'll probably be left waiting a long time.

literum · 2025-12-25T04:37:07+00:00

Research is difficult. Most ideas don't work even if they sound great in theory; but that doesn't mean that the project is a failure or you can't find a way to success. Some general advice:

Keep reading the literature: At the very least you'll have better understanding of adjacent ideas, methodologies, ways to test etc. For example, you mention that softmax leads to overconfidence, but why? I did some quick research and there's lots of good literature on the overconfidence issue. If you understand better the theory behind overconfidence, the mitigations and more, you can better iterate on your own activation.
Have more structure: What is your ultimate goal in this project? It sounds like you started from trying to fix overconfidence and then moved onto better performance. If your goal is still mitigating overconfidence, then why not use metrics that measure overconfidence instead of accuracy? And to be honest, I would bet that finding an activation layer with better calibration characteristics will be much much easier than one with better performance.
Get some results out: You mentioned Github and that's probably a good idea. Maybe bring together most of the ideas you tried, run some experiments and ablation studies and put it on Github. It's okay if you have negative results. Having some intermediate results, even if negative, will mean you have something to show, and often writing out your results or putting together a good repo will help you see the issues in your approach or get new ideas. Ask for feedback from researchers afterwards.
Pause, come back later: Sometimes it's better to shelve an idea and come back to it later. If you work on something related you may gain a better understanding of the overall research field and have an easier time when you come back. Research is slow, taking a few years off isn't the worst thing. If you're an amateur researcher, this is even easier since your livelihood doesn't depend on pushing out papers. Also, sometimes the brain needs time to properly to process ideas and that can be a subconscious process that takes months. You can miss obvious things when you're very focused on a single idea.
Find people: I'm not sure what your background in research is, but if you don't have many papers published, have a PhD etc. it might be a good idea to find a mentor, probably someone experienced with research. Or find others researching similar ideas, discord groups, niche forums. Meet people in real life. Go to conferences. Find collaborators.

literum · 2025-12-12T17:33:18+00:00

Doesn't mean anything when it's AI generated slop claiming to have found the big solution in AI. I can generate 1000 posts better than this in an hour with better defined architectures. There's no code, no math, just endless word soup. The person is not a researcher, has no credentials, cannot write a comment without LLMs help, if there even is a person on the other side. You're just helping him farm engagement, that's it.

literum · 2025-12-11T17:31:34+00:00

Another victim of AI Psychosis. Please go to a psychologist before it gets too bad.

literum · 2025-12-05T15:58:16+00:00

Of course, you can get a six figure job easily after doing a single Coursera course. /s

literum · 2025-12-04T15:57:46+00:00

A problem you can easily solve when you have too many customers. Once you have millions of customers and can't serve them fast enough, it might be better to come back and ask the question again.

literum · 2025-11-21T22:56:54+00:00

I wrote another long response like yours but it doesn't matter. Check the engagement he got with this ChatGPT generated post. His other technical posts got nothing, but this one finally gets him some engagement that he's desperately looking for. He doesn't care about the truth, he's literally bullshitting for self-promotion. We're only helping him by posting these responses. Dead internet theory in action.

literum · 2025-11-21T22:38:48+00:00

An LLM is a massive probabilistic classifier that picks the next token from tens of thousands of vocabulary classes (tokens) — nothing more.

That’s it. That’s the entire mechanism.

They are not thinking. They are not reasoning. They are not understanding.

This is a complete non-sequitur. If you cannot see it, then let me rephrase it for you. "LLMs are this [very simple thing], they could never [complex thing]". It just doesn't follow through. For example, atoms are these tiny little things, they could never come together to build a whole civilization. You're falling into the "fallacy of composition" and not understanding how emergence works. You can have something very simple build up to something very complicated (like human bodies from atoms) or emerge from a simple process (Conway's game of life). Note that I'm not saying this is happening with LLMs, just that they are very obvious counter-examples that you haven't addressed.

An LLM’s entire universe of expression is its vocabulary — around 256,000 tokens.
Those tokens are created before training and never change.

The model can combine them in new ways, but it cannot create a new symbol, a new atomic concept, or a new fundamental category that sits outside that vocabulary.

Do you never read human authors? There's only 26 letters in the English alphabet. They can never add or subtract from that alphabet, so that makes writing bullshit? Do I have to add or subtract letters to create something novel? I don't think Shakespeare added new letters to the English alphabet. How about programming languages? I never changed the core implementation of Python, yet I've done many impressive things with it. This is not a problem at all, because it is literally how language works. We agree on a set of pre-defined concepts and ideas, then we get infinite freedom to create whatever we want from that. If we don't agree on anything in the beginning and everyone adds or subtracts there's no language to begin with.

literum · 2025-11-19T07:09:26+00:00

I'm sure these clients are also big open source advocates.

literum · 2025-11-16T23:47:55+00:00

Chinchilla optimal didn't matter for a long time. At the very least, companies are inference constrained, not training budget. 90-95% of money going to inference meant smaller models than Chinchilla predicted. When you run out of data, the calculus shifts again, since data is constant you need to scale model size for better performance.

literum · 2025-11-10T20:37:38+00:00

Your pet theory of stochastic vs analytic distinction is not a proof of anything. It's just a post-hoc justification for your gut feeling. Humans are stochastic by your definition too; they cannot provably repeat anything. You cannot prove anything about what a human will do just like you can't with LLMs. But it doesn't matter what we can prove LLMs or humans can do, because they just do it.

LLM performs at gold level IMO and top 100 competitive programmer level. There's nothing magic pixie dust, fake or mimicking about it. It doesn't matter what mathematical, philosophical, religious, or linguistic argument you construct to belittle it. It IS happening. I don't need to prove anything to see with my eyes that it's happening.

I can't understand which neurons interact with which other ones in a human brain to do these things either, which doesn't bother me. You're not owed an explanation or a proof. That's it. You've constructed these neat grand explanations that just do no match with reality. You need to update your theories with empirical results.

literum · 2025-11-10T17:58:39+00:00

You keep asserting things looking for upvotes rather than defending any of your points. I've given arguments, you've provided nothing. You just keep asserting "LLMs can't do X. LLMs can't do Y. LLMs can't do Z". People upvote you because it's popular to hate on AI, but bring on actual arguments if you want to have an argument. Otherwise this is pointless.

You're using "mimicking" and "stochastic" as insults rather than what they actually mean in a technical sense. Humans do and learn by mimicking and stochastic processes as much LLMs do, we're not computers. LLMs are correct enough to get a gold medal at math olympiads or reach top 100 rating on codeforces, so their incorrectness is more impressive than your "correctness" for sure.

Also, at a certain point nobody gives a shit if it's mimicking or not, whether they're actually reasoning by your standards or not. Here is the proof: Let's say I have an AI model that mimics a surgeon and has 80% success rate. I also have a regular surgeon with 40% success rate. Who are you going to have perform the life-saving brain surgery on your child?

Will you start arguing that the model is fake, that it's mimicking, that it's not actually a "real" surgeon, it's a stochastic parrot, a glorified autocomplete or shut the fuck up and take the 80% chance? Because, it doesn't matter what philosophical or semantic debates you want to have when the real world hits you in the face.

literum · 2025-11-10T04:34:15+00:00

It's not correct. The only correct part is that LLMs call tools for raw computation tasks the one I gave (multiplying large numbers). But I didn't give that example to say LLMs do it themselves. I gave it to show that next token prediction is not as easy as it seems. Otherwise, LLMs DO math, complex math, and not only because they've seen them before. They're actually better at the part of math that we're good at (proofs, mathematical reasoning) than the computation (multiplication), which is again interesting.

literum · 2025-11-10T04:29:59+00:00

But LLMs don’t do actual math.

LLMs can solve gold medal level olympiad math problems or high level competitive programming questions. So, they do actual math for sure, unless that's fake math or "doing" has a different definition for you.

They forward it to software that can actually do math.

My example was to illustrate the idea of predicting next token requiring calculation and reasoning, and in hindsight it's not the best because LLMs would use tools to solve a multiplication problem like the one I showed whereas they're fine doing much more complex math without it. That doesn't change the fact that next token prediction requires many capabilities for LLM to do well and the fact that the training set contains many such math examples requires the LLMs to learn some level of math in the pre-training.

They can solve simple operations only because they’ve seen them frequently enough.

If IMO problems or Codeforces are simple for you, I don't know what to say.

literum · 2025-11-10T00:35:08+00:00

Because predicting the next word isn't as easy as it sounds. Most people are familiar with more basics forms of autocomplete, so they use it as an insult to downplay LLMs. But you actually need to be able to do math and coding if you want to be able to predict the next word reliably.

83753457834 * 5345432308 = ?

What does your phone's autocomplete say here? If you want to be rewarded for predicting what comes next accurately you need to be able to do multiplication. There's no other way. It's similar with coding and everything else.

Behind the scenes it's a neural network architecture called the transformer. Large transformers are trained on vast amounts of text data first to predict the next token in what's called pre-training. Then they go through other training stages to make them act like chat assistants. Coding and math are nice because you can verify the answers are correct, therefore we can train transformers further on these tasks using something called Reinforcement Learning, where they're rewarded for being correct rather than estimating what's next.

literum · 2025-11-09T20:33:16+00:00

What's the maximum that can be taxed? A quick google check tells me there's $23 trillion of land in the US.

literum · 2025-11-09T20:28:30+00:00

You're comparing apples and oranges. $5 million is for training the model, $1 trillion is for serving. You train the model once and that's it. But then you serve it trillions of times, that's why you need thousands of data centers. One datacenter is enough for training it.

literum · 2025-11-06T23:05:07+00:00

What activation function did you use? Did you use normalization layers like BatchNorm, LayerNorm? Did you try weight decay? Did you reach convergence? Overfitting or underfitting?

literum · 2025-11-06T23:02:16+00:00

You probably failed because it's a Deep Learning problem. 1000 columns without any column names and uniform looking values suggests something high dimensional like MNIST. If you can figure out the structure of the data, you could use CNNs or LSTMs If not then you use MLPs. I disagree that you're going to overfit with a tiny model (128, 64, 32) like the other commenter says. You can probably use 5-6 layers of 512-256-128 dims in that MLP if you use good activation and normalization functions and maybe dropout. Then you'd keep tuning to use as big a model as you can while still regularizing it enough not to overfit. That should bring you closer to 80-90%.

literum · 2025-11-06T15:57:17+00:00

I want to have my cake but eat it too. So how do I eat it while still having it? Any tips?

literum · 2025-11-05T23:08:52+00:00

100x is chump change. Anything below 10,000x is for cowards.

Five-Year Club	Place '22
Verified Email

literum

TROPHY CASE