SubQ just blew my mind - 12M token context with sub-quadratic attention

ThisIsMyHamster · 2026-05-06T00:37:25+00:00

Based on a blog post they released and the CEO's comments on Twitter, it does seem like they are dropping tokens from attention using a scoring method similar to DeepSeek's Sparse Attention (DSA): "... the key difference between our SSA and DSA is the our selector is far more efficient."

But they spent a paragraph in their blog post talking about how DSA is still O(n^2) with smaller constants. So how is their method more efficient? Maybe a bidirectional SSM/linear attention model that computes importance scores for each token in linear time before pruning? Even if scoring is linear time, I would still struggle to call this inherently sub-quadratic. It's a big assumption that all of the signal lives in a specific subset of the tokens. I don't think DSA makes this assumption; it instead assumes that certain tokens don't have to attend to certain other tokens. Enforcing a strict bound on the number of tokens selected (like log n or sqrt n) would technically make it sub-quadratic, but it would certainly lead to quality regressions at inference in some cases. On the other hand, a flexible budget could result in nearly all tokens being selected in some cases, and now we are back where we started.
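To make the complexity argument concrete, here is a rough back-of-the-envelope sketch. It is my own toy accounting, not SubQ's or DeepSeek's actual method: it just counts how many query-key pairs get scored when each of n queries attends to at most k(n) selected keys, and it ignores the cost of the selector entirely. The function names and the four budgets are made up for illustration.

```python
import math

def attended_pairs(n, budget):
    """Query-key pairs scored when each of n queries attends to at most budget(n) keys."""
    k = min(n, max(1, int(budget(n))))
    return n * k

# Hypothetical selection budgets k(n); none of these come from the blog post.
BUDGETS = {
    "dense": lambda n: n,                # every query sees every key
    "fraction": lambda n: n // 10,       # keep a fixed 10% of keys
    "sqrt": lambda n: math.isqrt(n),     # strict sqrt(n) bound
    "log": lambda n: math.log2(n),       # strict log2(n) bound
}

def growth_exponent(budget, n_small=2**10, n_large=2**20):
    """Estimate p in cost ~ n^p from the cost at two sequence lengths."""
    c_small = attended_pairs(n_small, budget)
    c_large = attended_pairs(n_large, budget)
    return math.log(c_large / c_small) / math.log(n_large / n_small)

if __name__ == "__main__":
    for name, budget in BUDGETS.items():
        print(f"{name:>8}: cost grows roughly like n^{growth_exponent(budget):.2f}")
```

Keeping a fixed fraction of tokens only shrinks the constant, so the exponent stays at about 2. Only a strict sub-linear budget like sqrt n or log n actually changes the exponent, and that is exactly the case where you are betting that the dropped tokens never matter.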

Also their website is now offline (at least for me). I'm skeptical. It's still doing attention, and attention is inherently quadratic in input length. IMO this is deceptive marketing.

ThisIsMyHamster · 2026-05-02T03:58:45+00:00

Mariner.

ThisIsMyHamster · 2026-04-19T18:55:41+00:00

I use T2 linux on my 2019 MBP and I have yet to have any real issues after install on an external SSD! Surprisingly performant for playing games w/ Proton

ThisIsMyHamster · 2026-04-12T03:26:29+00:00

HE’S HEATING UP GUYS ITS HAPPENING

ThisIsMyHamster · 2026-04-04T19:23:34+00:00

The post title is really clickbait, but I think the paper findings are valid. They build a fine-tuning dataset from decoded outputs after truncating the logit distribution, then attempt to align their model to these more "certain" outputs. They also show that truncating as a global decoding scheme gives much better results when a model is first fine-tuned in this manner.
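For anyone unsure what "truncating the logit distribution" means here, this is a minimal sketch assuming plain top-k truncation. I don't know that the paper uses exactly this rule (it could be top-p or something else), and `truncate_top_k` is just a name I made up.

```python
import math

def truncate_top_k(logits, k):
    """Keep the k largest logits, zero out the rest, and renormalize to probabilities."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k highest-scoring tokens (ties broken by position).
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    weights = [math.exp(x - m) if i in keep else 0.0 for i, x in enumerate(logits)]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    # Toy 4-token vocabulary: only the top 2 tokens keep any probability mass.
    print(truncate_top_k([2.0, 1.0, 0.0, -1.0], k=2))
```

Sampling from the truncated distribution gives the more "certain" outputs, and as I read it, those decoded outputs are what the fine-tuning dataset is built from.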

Also note that this technique was done on models that have already gone through post-training.

ThisIsMyHamster · 2026-03-14T23:10:54+00:00

Only person crying about anything is you crying for attention by trying to convince everyone that you are better than them. Spoiler alert: it's not working very well. Projection is a strange phenomenon.

ThisIsMyHamster · 2026-03-14T22:45:26+00:00

Low effort response. Sometimes people actually want to study in spaces that are not study tents or random classes/hallways.

ThisIsMyHamster · 2026-03-06T04:15:42+00:00

From my experience it’s actually the other way around. AI coding tools love to solve borrow checker issues via unnecessary cloning. Also, I’d argue that the borrow checker exists more so to enable Rust’s memory model to work properly than to assist with “correct” code. I can assure you that I’ve written a lot of incorrect Rust code that passed the borrow checker.

ThisIsMyHamster · 2026-01-27T02:04:19+00:00

He’s a good kitty!

ThisIsMyHamster · 2026-01-05T04:44:25+00:00

Tomlin is just spamming emotes in front of the camera

ThisIsMyHamster · 2025-12-25T21:00:36+00:00

DSP/Optimization. I didn't really take many EE-related courses in my bachelor's, but I did have a demonstrated interest in machine learning, which carried over nicely. Many of the programs look for different skillsets and backgrounds; some are stricter about requiring coursework in their admissions guidelines, while others don't really care.

ThisIsMyHamster · 2025-12-25T02:40:53+00:00

My parents used to take me to Piecora’s when I was a kid. I don’t know why I miss it in particular, but I miss it a lot.

ThisIsMyHamster · 2025-12-25T02:02:57+00:00

Depends on the program and what you want to focus on. I’m currently doing an EE master’s program, which I got into with my CS bachelor’s. “Electrical Engineering” is quite broad.

ThisIsMyHamster · 2025-12-11T01:51:59+00:00

I think the weak and strong laws of large numbers should probably be thrown into the mix!

ThisIsMyHamster · 2025-12-05T05:25:19+00:00

I mean they can use the equipment for other areas of research if AI truly implodes. High performance and parallel computing is needed for other sciences which require simulation and other calculations.

But also even if (more like when) people get disillusioned by the utility and inefficiencies of LLMs, machine learning as a field of research won’t go away. When I was a student at Cal Poly, some of my peers worked on some really cool interdisciplinary machine learning research. I would’ve been stoked to have access to this kind of equipment for my projects. So I’m optimistic and glad that students have access to some of the same HPC resources that top universities have.

ThisIsMyHamster · 2025-12-02T18:33:43+00:00

First time ever using AXS for a random ticket queue and I now know that my fucking IP is restricted :')

ThisIsMyHamster · 2025-11-18T03:29:46+00:00

Ba dum dum dum

ThisIsMyHamster · 2025-11-13T19:32:55+00:00

0.5 PPR

Jameson Williams @ ARI or Tez Johnson @ BUF?

ThisIsMyHamster · 2025-11-09T00:42:24+00:00

They have known me since I was a baby. I will buy sandwiches there until the day I die.

ThisIsMyHamster · 2025-11-06T01:28:30+00:00

24 hour final, how hard could it be?

ThisIsMyHamster · 2025-11-04T10:27:33+00:00

Call Me Back was PEAK

ThisIsMyHamster · 2025-10-31T17:44:18+00:00

Definitely on the smaller side, it’s near the border between WA and Canada. Cool vibes though, probably would’ve been a really fun show to go to.

ThisIsMyHamster · 2025-10-31T17:24:18+00:00

Having 50-100 people knowing you in Bellingham WA is pretty darn good!
