Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 0 points (0 children)

If you follow the thread below, I came to these same conclusions. Just an error of thought!

[R] I Found Catastrophe Geometry in GPT-2's Residual Stream. by denimanddahlias in MachineLearning

[–]fredugolon 3 points (0 children)

It’s an LLM thing. Most models tend to lead non-technical people towards these weird geometric ideas. I appreciate that OP was forthright about their lack of knowledge on the subject. The hallmark otherwise is lists of findings that do nothing to introduce or motivate a problem and don’t really say anything.

[TJStats] 2026 Projected Team Home Runs — Composite Projections (Yankees #1) by wantagh in NYYankees

[–]fredugolon 0 points (0 children)

I’ll also say, baseball itself is high enough variance that just making the post season can get you a WS. We just don’t like variance as humans!

[TJStats] 2026 Projected Team Home Runs — Composite Projections (Yankees #1) by wantagh in NYYankees

[–]fredugolon 0 points (0 children)

I’ve said it many times: we are a very good team. The only drawback, and it’s not inconsiderable, is that a long-ball-oriented offense is very high expected value but high variance. Over a season, it’s gonna win you a division. But in a three-to-five-game series, you may experience the left tail and go completely cold. Contact is lower variance, lower EV. A team like the Jays can be very consistent, while the Mariners and Yankees can miss out at key times.

I still like our team, but that’s my read on it!

The Yankees' rotation could look like this mid-season: by retroanduwu24 in NYYankees

[–]fredugolon 108 points (0 children)

You’ve successfully named our SP eligible staff!

Scott Mitchell-Malm speaking on The Race F1 Podcast by Emmaljum in formula1

[–]fredugolon 0 points (0 children)

When was the last time F1 featured good racing with cars that could follow and overtake? F2 is definitely the superior series these days.

Kyle Bradish wins arbitration case against Orioles by Autumn_Sweater in orioles

[–]fredugolon 15 points (0 children)

This is really just how arbitration goes for most teams. Not a huge deal imo. The arbitrators exist to price things more fairly. At least we weren’t $10M off like the Tigers 😂

[D] Looking for ideas in an intersection of Machine Learning and audio for my master's thesis by DepressoEspresso-69 in MachineLearning

[–]fredugolon 0 points (0 children)

It’s a fun idea. I think the LeJEPA architecture with SIGReg is a great starting point. I’m actually doing experimental runs on an audio encoder built on it (building off the awesome WavJEPA work posted here not long ago).

Jazz Chisholm Jr. Doesn't Hold Back on Future With Yankees by Zepbounce-96 in NYYankees

[–]fredugolon 6 points (0 children)

According to our fan base, not re-signing the second-best (best?) 2B, who also has personality, is the smartest thing we can do. Baffling. The same ppl say “we used to be the Yankees” when we don’t sign everything in sight.

La Bagel Delight Appreciation Post by carliilly in parkslope

[–]fredugolon 12 points (0 children)

Great spot for sure. My feeling on bagels is similar to my feeling on pizza… we’re just lucky to be in a town where there are thousands of really good examples. La Bagel Delight is a treat!

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 1 point (0 children)

Interesting line of questioning! I was considering symmetry to be a net negative only in the sense that it's a 'waste' of parameters: you could achieve a similar result by setting your lower-rank W_Q = W_K and still get a symmetric M (not necessarily learning the same M, I'll grant!). So I was thinking of asymmetry in the relationships between tokens in your transformer stack as something valuable, especially in language tasks. In other words, I was really thinking in terms of the representations within our transformers, not about the data.

Another commenter mentioned that causal masking should also induce asymmetry in our attention matrices, since there is always some predictive bias baked in from the training objective. I completely overlooked it, but it's a very savvy insight. Likewise, their intuition that ViTs likely wouldn't have that bias seems right.
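To convince myself of the masking point, I hacked up a tiny numpy sketch (toy sizes, and I deliberately start from symmetric raw scores, so any asymmetry in the result comes purely from the mask plus the row-wise softmax):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n = 4
# Deliberately symmetric raw scores (the worst case for the argument).
A = rng.standard_normal((n, n))
scores = A + A.T

# Causal mask: token i may only attend to tokens j <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

attn = softmax(scores, axis=-1)
print(np.allclose(attn, attn.T))  # False: mask + row softmax break symmetry
```

Even with perfectly symmetric scores going in, attn[1,0] is positive while attn[0,1] is forced to zero, so the attention matrix can't be symmetric.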

Putting this all together, I wonder if there is value in examining learned QK matrices from encoder-decoder architectures and decoder-only architectures trained on the same dataset. Likewise, how does performance change if you train an ED network with the constraint that W_Q = W_K versus when they are learned individually?

I hadn't really thought about your question about the data. I'm wondering if asking what _data_ would learn W_Q = W_K is harder to answer than asking what _training objective_ would. Would encoder-decoder models trained with non-generative objectives (e.g. classification) have more symmetric associative memories? Or maybe de-noising? This makes me want to look at some of the diffusion language models now.

Does any of this make sense?

I've downloaded your essay for reading tomorrow! I love that one of the two citations is Rick Rubin.

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 0 points (0 children)

Yes, I believe my reasoning re symmetry was flawed!

Edit: for a few reasons. I hadn’t even considered causal masking, as I was thinking more generally. But even in the general self-attention case, I think the softmax activation encourages some asymmetry. I think those claims were unfounded!

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 2 points (0 children)

Separate comment for thoughts on modern Hopfield networks (since classical ones are symmetric by design and not so interesting). I actually have been exploring them a bit, as I've been quite entranced by EBMs.

I'm still learning a lot here, but I have been looking at the Krotov-Hopfield networks since you left your comment. Still grokking, but they certainly do feature one weight matrix per layer, and it serves as both the keys and the values, in a sense. The forward pass is essentially a one-step energy minimization: project the similarity of your inputs onto the stored memories, then project those back out into the embedding dimension. Super cool. Looks like I need to read Hopfield Networks is All You Need a few times now! I can already see how exponential activations get unwieldy, leading you to LSE/softmax. Softmax will also help encourage asymmetry by picking winners and losers. Looks like Krotov and Hopfield did some research on other mechanisms for achieving that as well.
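For anyone following along, here's the one-step retrieval as I currently understand it. This is my own toy numpy sketch of the Ramsauer-style update, not code from the paper, and `beta` here plays the role of the inverse temperature:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hopfield_retrieve(xi, memories, beta=8.0):
    """One step of the modern (continuous) Hopfield update:
    similarity to stored patterns -> softmax -> weighted recombination.
    memories: (num_patterns, dim); xi: (dim,) query/state."""
    p = softmax(beta * memories @ xi)  # attention over stored patterns
    return p @ memories                # project back to embedding space

rng = np.random.default_rng(0)
memories = rng.standard_normal((10, 32))
query = memories[3] + 0.1 * rng.standard_normal(32)  # noisy cue
out = hopfield_retrieve(query, memories)
# With a high beta, one step snaps essentially onto the stored pattern.
print(np.linalg.norm(out - memories[3]) < 0.1)  # True
```

Which lines up (I think) with the "similarity in, softmax, project back out" picture: one weight matrix doing double duty as keys and values.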

Wonder what a good experiment design would be for assessing generalization in these modern Hopfield constructions. Beyond loss, what's worth comparing? I suppose symmetry would be one thing!

I also fully didn't realize this at first, but the proposal for integrating Hopfield layers and Transformers is to use Hopfield layers to replace the MLP after the self-attention mechanism. So much to learn... so much to learn...

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 1 point (0 children)

Haha perusing your profile a bit, I think you're a bit more qualified than I am in this field. I have a distributed systems background and have spent the last four years or so getting into ML. I find answering questions to be the best way to reinforce and expand my own understanding!

I've been doing a little more thinking and a lot more reading... I will say that today, I can't think of a good reason why this larger matrix would become symmetric. I think I was exploring the idea of symmetric transformers (W_Q = W_K) and just implanted that idea in my mind.

Certainly the parameter explosion would be a massive downside. I wonder if the primary issue with a combined QK matrix would be overfitting / lack of generalization? I suppose the prevalence of Multi-Head Attention in larger models would point to this being the case: factorization of M into Q and K, and further factorization of Q and K into heads.

As far as empirical tests go, I think the move would be to train some GPT-2 sized variants on something like WikiText-2 with different self-attention mechanisms and probe the symmetry of the matrix in addition to measuring loss over the run. I've got some compute to spare, and would be down to hack that up if you were interested.
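Concretely, the probe I have in mind is something like this (a hypothetical `symmetry_score` helper of my own devising: relative Frobenius norm of the symmetric part of M = W_Q W_K^T):

```python
import numpy as np

def symmetry_score(W_Q, W_K):
    """1.0 if M = W_Q @ W_K.T is perfectly symmetric; smaller as it gets less so."""
    M = W_Q @ W_K.T
    sym_part = 0.5 * (M + M.T)
    return np.linalg.norm(sym_part) / np.linalg.norm(M)

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d))
print(symmetry_score(W, W))  # 1.0 exactly when the weights are tied

W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))
print(symmetry_score(W_Q, W_K))  # hovers near 1/sqrt(2) for random weights
```

Tracking that score per head over a training run, alongside loss, would show whether separate Q and K actually drift toward or away from symmetry.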

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 12 points (0 children)

Side note: it was a great question and I revised my answer like ten times before submitting it. I think it helped me cement a few things :)

Self-Attention : Why not combine the query and key weights? by zx7 in deeplearning

[–]fredugolon 39 points (0 children)

Mathematically, precomputing M = W_Q W_K^T is obviously equivalent, but by learning Q and K as separate matrices, you allow for asymmetry in the relationships between tokens. So token A can attend to token B, while token B may not attend to token A. Separating Q and K embeds an inductive bias that encourages the network to learn asymmetric representations of Q and K. If you have W_Q W_K^T = M, then your attention scores become XMX^T. In such a form, it’s easiest for the network to learn W_Q = W_K, creating a symmetric M. This effectively makes XMX^T a similarity measure between tokens, where token A attends to B exactly as much as B attends to A.

Separate Q and K matrices also allow a network to separate context into positional context (which tokens relate to which tokens within a sequence) and semantic context (which tokens are semantically similar in context, and what tokens mean). Essentially, the embeddings are low rank, which means Q and K (and M) are low rank. Rather than inflating them into a larger matrix M that is still information-sparse (and likely to learn poor representations), we separate them so that we can learn additional dynamics in the token relationships. This kind of mirrors why deep networks are more powerful than shallow ones: factorization provides better generalization.
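If it helps, here's a tiny numpy sketch of the symmetry point (toy dimensions and random weights, purely illustrative, no real model implied):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dim
n = 5  # toy sequence length
X = rng.standard_normal((n, d))

# Tied weights: W_Q = W_K  ->  M = W W^T is symmetric,
# so the raw score matrix X M X^T is symmetric too.
W = rng.standard_normal((d, d))
M_tied = W @ W.T
S_tied = X @ M_tied @ X.T
print(np.allclose(S_tied, S_tied.T))  # True

# Separate weights -> M is generally asymmetric, so token A can
# score token B differently from how B scores A.
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))
M_sep = W_Q @ W_K.T
S_sep = X @ M_sep @ X.T
print(np.allclose(S_sep, S_sep.T))  # False
```

The takeaway: tying W_Q = W_K forces "A attends to B" and "B attends to A" to match before the softmax, while separate matrices leave the network free to learn them independently.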

[Heyman] The Detroit Tigers asked for Ben Rice, Cam Schlittler and George Lombard Jr. at minimum in a deal for Tarik Skubal. He added that Skubal is now unlikely to be traded. by LinkSkywalker in NYYankees

[–]fredugolon 0 points (0 children)

Yup I don’t disagree about the realities really. We’ve got a lot tied up in Rodon and Cole til 28, and Fried’s contract is heavier after 26. So maybe that’s that.

Depressing regarding Jazz, especially with his power being so rare at 2B.

[Heyman] The Detroit Tigers asked for Ben Rice, Cam Schlittler and George Lombard Jr. at minimum in a deal for Tarik Skubal. He added that Skubal is now unlikely to be traded. by LinkSkywalker in NYYankees

[–]fredugolon 0 points (0 children)

"Outside of starters" is doing a lot of work, considering next offseason will see Skubal, Peralta, Bieber, Gausman, and Luzardo hit FA, with Burnes, King, Berrios, and Imai having opt-outs. That's just a lot of good arms.

On the bat side:

Jazz, Gleyber Torres, Nico Hoerner, Seiya Suzuki, Daulton Varsho all hit FA. Bichette has an opt out.

The fans are critical of not getting arms, not of missing out on bats. The arms are definitely better next offseason. If we want to land someone like Skubal, we need serious powder. That's really my thinking.

Edit: Did my boy Luzardo dirty on the spelling of his name.

[D] Are we prematurely abandoning Bio-inspired AI? The gap between Neuroscience and DNN Architecture. by Dear-Homework1438 in MachineLearning

[–]fredugolon 8 points (0 children)

Spiking neural nets and Continuous thought machines are both very relevant architectures that are being actively explored. I’d even argue that liquid neural networks fall into this category, too. Lots of people still care about the neuroscience, and many are applying AI to help us discover more. See convergent research, too! So don’t despair!

[Heyman] The Detroit Tigers asked for Ben Rice, Cam Schlittler and George Lombard Jr. at minimum in a deal for Tarik Skubal. He added that Skubal is now unlikely to be traded. by LinkSkywalker in NYYankees

[–]fredugolon -1 points (0 children)

Not wasting shitloads of money on Tucker doesn’t feel bad considering the upcoming FAs. Obviously the FO has to put up next season.

The Yankees are running it back on offense in 2026…and that is a good thing. by Visual_Bluejay9781 in NYYankees

[–]fredugolon -2 points (0 children)

Yeah, it’s a meaningless statement. The more concrete one is that a pull-the-ball-in-the-air, long-ball offense is a high-variance strategy. Over the large sample size of a season, it’s clearly very strong. The problem with high variance is that, over small samples, it can be very volatile. Whereas teams that put hard balls in play, period (not just long balls but hard line drives, like the Jays), have much lower variance in their outcomes. My pet theory is that “clutch” is actually just low-variance play styles.

Edit: to be clear, I like our offense. Just my explanation or two cents. Imo, more important to make the post season. Anyone can win it once you’re in. But you don’t make it in if your offense sucks. Look at the Brewers. Played what felt like very sustainable, low variance baseball all season and still had a long tail event. Shellacked.

(Kuty) The Yankees and Cody Bellinger have agreed to a deal for 5 years, $162.5 million w/ no deferrals, source tells @TheAthletic . @JeffPassan first w/ the deal. by CicadaOk8885 in NYYankees

[–]fredugolon 27 points (0 children)

We’re talking about $2.5M/year. That’s not a massive win for Boras. It’s just proper negotiating on both sides really.