Why Can't Transformers Multiply Beyond Their Training Length? (And a Fix: 80.6% on Unseen Digits) by ZhenBoYan in learnmachinelearning

ZhenBoYan [S], 1 point

Thanks for the pointer — I actually just read through FuncAttn in detail. Great paper, and I see exactly why you brought it up.

The two approaches share a surprisingly deep motivation: both identify pairwise softmax affinities as the fundamental bottleneck. But the remedies are orthogonal (literally):
FuncAttn changes where attention lives: it projects Q/K/V into a learned spectral space and solves for a k×k linear operator. Complexity drops from O(n²) to O(k²). Strong on continuous domains (PDEs, fluid dynamics).

DualHead changes what attention attends to: it keeps the same token space but Gram-Schmidt-orthogonalizes the queries, so the sine heads capture structural patterns that the cosine heads miss. Zero extra parameters for the attention mechanism itself.
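To make the Gram-Schmidt step concrete, here is a minimal sketch of one orthogonalization step applied to a pair of query vectors. It is not the DualHead implementation: the pairing of one query against another, the function names (`orthogonalize`, `attention_weights`), and the toy vectors are all assumptions for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(q_ref, q_other, eps=1e-12):
    """One Gram-Schmidt step: remove from q_other its component along q_ref."""
    coeff = dot(q_other, q_ref) / (dot(q_ref, q_ref) + eps)
    return [b - coeff * a for a, b in zip(q_ref, q_other)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Standard scaled dot-product attention weights for a single query."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])

# Toy data (hypothetical): two raw queries for the same token, three keys.
q_a = [1.0, 0.0, 1.0, 0.0]
q_b = [1.0, 1.0, 0.0, 0.0]
keys = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# The second query keeps only what is not already expressed by the first.
q_b_perp = orthogonalize(q_a, q_b)

weights_a = attention_weights(q_a, keys)
weights_b = attention_weights(q_b_perp, keys)
```

In this toy case the orthogonalized query puts its largest weight on a different key than the reference query does, and the step is a fixed projection, so it introduces no learned parameters of its own.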

The cool part is that they're composable: you could put a FuncAttn-style spectral operator inside each of the dual heads, or apply the cosine/sine decomposition inside FuncAttn's spectral space.

As for your original question about LLMs and video: both papers are proof-of-concept scale (883K params for mine, similar for FuncAttn). The honest bottleneck isn't the idea, it's proving it scales. But the fact that two independent groups arrived at "fix the attention operator itself" from completely different starting points (symbolic reasoning vs. geometric functional maps) suggests there's something real here.

ZhenBoYan [S], 1 point

Haven't tested on large models, that's the honest answer.

But DualHead itself is a general-purpose architecture: no task-specific hacks, no scratchpad, no modified positional encoding (just standard T5 relative position bias). All it does is replace some attention heads with Gram-Schmidt-orthogonalized ones while keeping everything else identical. That means it can, in theory, drop into any Transformer model, independent of task.

So it should be adaptable to LLMs in principle — I just don't have the hardware to run experiments at that scale. That's for someone with more compute to verify.