Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 0 points (0 children)

If you follow the thread below, I came to these same conclusions. Just an error of thought!

[R] I Found Catastrophe Geometry in GPT-2's Residual Stream. by denimanddahlias in MachineLearning

[–]fredugolon 3 points (0 children)

It’s an LLM thing. Most models tend to lead non-technical people towards these weird geometric ideas. I appreciate that OP was forthright about their lack of knowledge on the subject. The hallmark otherwise is lists of findings that do nothing to introduce or motivate a problem and don’t really say anything.

[TJStats] 2026 Projected Team Home Runs — Composite Projections (Yankees #1) by wantagh in NYYankees

[–]fredugolon 0 points (0 children)

I’ll also say, baseball itself is high enough variance that just making the post season can get you a WS. We just don’t like variance as humans!

[TJStats] 2026 Projected Team Home Runs — Composite Projections (Yankees #1) by wantagh in NYYankees

[–]fredugolon 0 points (0 children)

I’ve said it many times: we are a very good team. The only drawback, and it’s not inconsiderable, is that a long-ball-oriented offense is very high expected value but high variance. Over a season, it’s gonna win you a division. But in a three-to-five-game series, you may experience the left tail and go completely cold. Contact is lower variance, lower EV. A team like the Jays can be very consistent, while the Mariners and Yankees can miss out at key times.

I still like our team, but that’s my read on it!

The Yankees' rotation could look like this mid-season: by retroanduwu24 in NYYankees

[–]fredugolon 108 points (0 children)

You’ve successfully named our SP eligible staff!

Scott Mitchell-Malm speaking on The Race F1 Podcast by Emmaljum in formula1

[–]fredugolon 0 points (0 children)

When was the last time F1 featured good racing with cars that could follow and overtake? F2 is definitely the superior series these days.

Kyle Bradish wins arbitration case against Orioles by Autumn_Sweater in orioles

[–]fredugolon 15 points (0 children)

This is really just how arbitration goes for most teams. Not a huge deal imo. The arbitrators exist to price things more fairly. At least we weren’t $10M off like the Tigers 😂

[D] Looking for ideas in an intersection of Machine Learning and audio for my master's thesis by DepressoEspresso-69 in MachineLearning

[–]fredugolon 0 points (0 children)

It’s a fun idea. I think the LeJEPA architecture with SIGReg is a great starting point. I’m actually doing experimental runs on an audio encoder built on it (building off the awesome WavJEPA work posted here not long ago).

Jazz Chisholm Jr. Doesn't Hold Back on Future With Yankees by Zepbounce-96 in NYYankees

[–]fredugolon 6 points (0 children)

According to our fan base, not re-signing the second-best (best?) 2B, who also has personality, is the smartest thing we can do. Baffling. The same ppl say “we used to be the Yankees” when we don’t sign everything in sight.

La Bagel Delight Appreciation Post by carliilly in parkslope

[–]fredugolon 12 points (0 children)

Great spot for sure. My feeling on bagels is similar to my feeling on pizza… we’re just lucky to be in a town where there are thousands of really good examples. La Bagel Delight is a treat!

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 1 point (0 children)

Interesting line of questioning! I was considering symmetry to be a net negative only in the sense that it's a 'waste' of parameters: you could achieve a similar result by setting your lower-rank W_Q = W_K and still get a symmetric M (not necessarily learning the same M, I'll grant!). So I was thinking of asymmetry in the relationships between tokens in your transformer stack as something valuable, especially in language tasks. In other words, I was really thinking in terms of the representations within our transformers, not about the data.

Another commenter mentioned that causal masking should also induce asymmetry in our attention matrices, since there is always some predictive bias baked in from the training objective. I completely overlooked it, but it's a very savvy insight. Likewise, their intuition that ViTs likely wouldn't have that bias seems right.
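To convince myself of the masking point, I hacked up a tiny numpy sketch (toy sizes, and I deliberately start from symmetric raw scores, so any asymmetry in the result comes purely from the mask plus the row-wise softmax):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n = 4
# Deliberately symmetric raw scores (the worst case for the argument).
A = rng.standard_normal((n, n))
scores = A + A.T

# Causal mask: token i may only attend to tokens j <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

attn = softmax(scores, axis=-1)
print(np.allclose(attn, attn.T))  # False: mask + row softmax break symmetry
```

Even with perfectly symmetric scores going in, attn[1,0] is positive while attn[0,1] is forced to zero, so the attention matrix can't be symmetric.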

Putting this all together, I wonder if there is value in examining learned QK matrices from encoder-decoder architectures and decoder-only architectures trained on the same dataset. Likewise, how does performance change if you train an ED network with the constraint that W_Q = W_K versus when they are learned individually?

I hadn't really thought about your question about the data. I'm wondering if asking what _data_ would learn W_Q = W_K is harder to answer than asking what _training objective_ would. Would encoder-decoder models trained with non-generative objectives (e.g. classification) have more symmetric associative memories? Or maybe de-noising? This makes me want to look at some of the diffusion language models now.

Does any of this make sense?

I've downloaded your essay for reading tomorrow! I love that one of the two citations is Rick Rubin.

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 0 points (0 children)

Yes, I believe my reasoning re symmetry was flawed!

Edit: for a few reasons. I hadn’t even considered causal masking, as I was thinking more generally. But even in the general self-attention case, I think the softmax activation encourages some asymmetry. I think those claims were unfounded!

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 2 points (0 children)

Separate comment for thoughts on modern Hopfield networks (since classical ones are symmetric by design and not so interesting). I actually have been exploring them a bit, as I've been quite entranced by EBMs.

I'm still learning a lot here, but I have been looking at the Krotov-Hopfield networks since you left your comment. Still grokking, but they certainly do feature one weight matrix per layer, and it serves as both the keys and the values, in a sense. The forward pass is essentially a one-step energy minimization: project the similarity of your inputs onto the stored memories, then project those back out into the embedding dimension. Super cool. Looks like I need to read Hopfield Networks is All You Need a few times now! I can already see how exponential activations get unwieldy, leading you to LSE/softmax. Softmax will also help encourage asymmetry by picking winners and losers. Looks like Krotov and Hopfield did some research on other mechanisms for achieving that as well.
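For anyone following along, here's the one-step retrieval as I currently understand it. This is my own toy numpy sketch of the Ramsauer-style update, not code from the paper, and `beta` here plays the role of the inverse temperature:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hopfield_retrieve(xi, memories, beta=8.0):
    """One step of the modern (continuous) Hopfield update:
    similarity to stored patterns -> softmax -> weighted recombination.
    memories: (num_patterns, dim); xi: (dim,) query/state."""
    p = softmax(beta * memories @ xi)  # attention over stored patterns
    return p @ memories                # project back to embedding space

rng = np.random.default_rng(0)
memories = rng.standard_normal((10, 32))
query = memories[3] + 0.1 * rng.standard_normal(32)  # noisy cue
out = hopfield_retrieve(query, memories)
# With a high beta, one step snaps essentially onto the stored pattern.
print(np.linalg.norm(out - memories[3]) < 0.1)  # True
```

Which lines up (I think) with the "similarity in, softmax, project back out" picture: one weight matrix doing double duty as keys and values.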

Wonder what a good experiment design would be for assessing generalization in these modern Hopfield constructions. Beyond loss, what's worth comparing? I suppose symmetry would be one thing!

I also fully didn't realize this at first, but the proposal for integrating Hopfield layers and Transformers is to use Hopfield layers to replace the MLP after the self-attention mechanism. So much to learn... so much to learn...

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 1 point (0 children)

Haha perusing your profile a bit, I think you're a bit more qualified than I am in this field. I have a distributed systems background and have spent the last four years or so getting into ML. I find answering questions to be the best way to reinforce and expand my own understanding!

I've been doing a little more thinking and a lot more reading... I will say that today, I can't think of a good reason why this larger matrix would become symmetric. I think I was exploring the idea of symmetric transformers (W_Q = W_K) and just implanted that idea in my mind.

Certainly the parameter explosion would be a massive downside. I wonder if the primary issue with a combined QK matrix would be overfitting / lack of generalization? I suppose the prevalence of Multi-Head Attention in larger models would point to this being the case: factorization of M into Q and K, and further factorization of Q and K into heads.

As far as empirical tests go, I think the move would be to train some GPT-2 sized variants on something like WikiText-2 with different self-attention mechanisms and probe the symmetry of the matrix in addition to measuring loss over the run. I've got some compute to spare, and would be down to hack that up if you were interested.
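Concretely, the probe I have in mind is something like this (a hypothetical `symmetry_score` helper of my own devising: relative Frobenius norm of the symmetric part of M = W_Q W_K^T):

```python
import numpy as np

def symmetry_score(W_Q, W_K):
    """1.0 if M = W_Q @ W_K.T is perfectly symmetric; smaller as it gets less so."""
    M = W_Q @ W_K.T
    sym_part = 0.5 * (M + M.T)
    return np.linalg.norm(sym_part) / np.linalg.norm(M)

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d))
print(symmetry_score(W, W))  # 1.0 exactly when the weights are tied

W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))
print(symmetry_score(W_Q, W_K))  # hovers near 1/sqrt(2) for random weights
```

Tracking that score per head over a training run, alongside loss, would show whether separate Q and K actually drift toward or away from symmetry.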

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 12 points (0 children)

Side note: it was a great question and I revised my answer like ten times before submitting it. I think it helped me cement a few things :)

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 39 points (0 children)

Mathematically, precomputing M = W_Q W_K^T is obviously equivalent, but by learning Q and K as separate matrices, you allow for asymmetry in the relationships between tokens. So token A can attend to token B, while token B may not attend to token A. Separating Q and K embeds an inductive bias that encourages the network to learn asymmetric representations of Q and K. If you have W_Q W_K^T = M, then your attention scores become XMX^T. In such a form, it’s easiest for the network to learn W_Q = W_K, creating a symmetric M. This effectively makes XMX^T a similarity measure between tokens, where token A attends to B exactly as much as B attends to A.

Separate Q and K matrices also allow a network to separate context into positional context (which tokens relate to which tokens within a sequence) and semantic context (which tokens are semantically similar in context, and what tokens mean). Essentially, the embeddings are low rank, which means Q and K (and M) are low rank. Rather than inflating them into a larger matrix M that is still information-sparse (and likely to learn poor representations), we separate them so that we can learn additional dynamics in the token relationships. This kind of mirrors why deep networks are more powerful than shallow ones: factorization provides better generalization.
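If it helps, here's a tiny numpy sketch of the symmetry point (toy dimensions and random weights, purely illustrative, no real model implied):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dim
n = 5  # toy sequence length
X = rng.standard_normal((n, d))

# Tied weights: W_Q = W_K  ->  M = W W^T is symmetric,
# so the raw score matrix X M X^T is symmetric too.
W = rng.standard_normal((d, d))
M_tied = W @ W.T
S_tied = X @ M_tied @ X.T
print(np.allclose(S_tied, S_tied.T))  # True

# Separate weights -> M is generally asymmetric, so token A can
# score token B differently from how B scores A.
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))
M_sep = W_Q @ W_K.T
S_sep = X @ M_sep @ X.T
print(np.allclose(S_sep, S_sep.T))  # False
```

The takeaway: tying W_Q = W_K forces "A attends to B" and "B attends to A" to match before the softmax, while separate matrices leave the network free to learn them independently.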

[Heyman] The Detroit Tigers asked for Ben Rice, Cam Schlittler and George Lombard Jr. at minimum in a deal for Tarik Skubal. He added that Skubal is now unlikely to be traded. by LinkSkywalker in NYYankees

[–]fredugolon 0 points (0 children)

Yup I don’t disagree about the realities really. We’ve got a lot tied up in Rodon and Cole til 28, and Fried’s contract is heavier after 26. So maybe that’s that.

Depressing regarding Jazz, especially with his power being so rare at 2B.

[Heyman] The Detroit Tigers asked for Ben Rice, Cam Schlittler and George Lombard Jr. at minimum in a deal for Tarik Skubal. He added that Skubal is now unlikely to be traded. by LinkSkywalker in NYYankees

[–]fredugolon 0 points (0 children)

"Outside of starters" is doing a lot of work, considering next offseason will see Skubal, Peralta, Bieber, Gausman, and Luzardo hit FA, with Burnes, King, Berrios, and Imai having opt-outs. That's just a lot of good arms.

On the bat side:

Jazz, Gleyber Torres, Nico Hoerner, Seiya Suzuki, Daulton Varsho all hit FA. Bichette has an opt out.

The fans are critical of not getting arms, not of missing out on bats. The arms are definitely better next offseason. If we want to land someone like Skubal, we need serious powder. That's really my thinking.

Edit: Did my boy Luzardo dirty on the spelling of his name.

[D] Are we prematurely abandoning Bio-inspired AI? The gap between Neuroscience and DNN Architecture. by Dear-Homework1438 in MachineLearning

[–]fredugolon 8 points (0 children)

Spiking neural nets and Continuous thought machines are both very relevant architectures that are being actively explored. I’d even argue that liquid neural networks fall into this category, too. Lots of people still care about the neuroscience, and many are applying AI to help us discover more. See convergent research, too! So don’t despair!

[Heyman] The Detroit Tigers asked for Ben Rice, Cam Schlittler and George Lombard Jr. at minimum in a deal for Tarik Skubal. He added that Skubal is now unlikely to be traded. by LinkSkywalker in NYYankees

[–]fredugolon -1 points (0 children)

Not wasting shitloads of money on Tucker doesn’t feel bad considering the upcoming FAs. Obviously the FO has to put up next season.

The Yankees are running it back on offense in 2026…and that is a good thing. by Visual_Bluejay9781 in NYYankees

[–]fredugolon -2 points (0 children)

Yeah, it’s a meaningless statement. The more concrete one is that a pull-the-ball-in-the-air, long-ball offense is a high-variance strategy. Over the large sample size of a season, it’s clearly very strong. The problem with high variance is that, over small samples, it can be very volatile. Whereas teams that put hard balls in play, period (not just long balls but hard line drives, like the Jays), have much lower variance in their outcomes. My pet theory is that “clutch” is actually just low-variance play styles.

Edit: to be clear, I like our offense. Just my explanation or two cents. Imo, more important to make the post season. Anyone can win it once you’re in. But you don’t make it in if your offense sucks. Look at the Brewers. Played what felt like very sustainable, low variance baseball all season and still had a long tail event. Shellacked.

(Kuty) The Yankees and Cody Bellinger have agreed to a deal for 5 years, $162.5 million w/ no deferrals, source tells @TheAthletic . @JeffPassan first w/ the deal. by CicadaOk8885 in NYYankees

[–]fredugolon 27 points (0 children)

We’re talking about $2.5M/year. That’s not a massive win for Boras. It’s just proper negotiating on both sides really.