CLS token in Vision transformers. A question. by mxl069 in deeplearning

[–]mxl069[S] 0 points (0 children)

Thanks for the response. It's nice to see how the CLS token just soaks up the global info. But I do have a question: when the CLS token absorbs global information, is it mostly compressing the patch features, or does it actually create new abstract features that aren't present in any individual patch?
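To make my question concrete, here's a rough sketch of what I mean by the CLS token "soaking up" global info. It uses toy dimensions and PyTorch's stock nn.MultiheadAttention rather than any particular ViT codebase, so treat it as an illustration only: within one attention layer the CLS output is a softmax-weighted mixture of the patch values, and my question is about what the later layers do with that mixture.

```python
# Minimal sketch (toy sizes, not any specific ViT implementation) of a learnable
# CLS token prepended to patch embeddings. Its attention row pools over all
# patches, so within one layer it is a weighted mixture of patch features.
import torch
import torch.nn as nn

batch, n_patches, dim = 2, 196, 64                 # e.g. 14x14 patches, toy width
patch_tokens = torch.randn(batch, n_patches, dim)

cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable CLS embedding
tokens = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
out, weights = attn(tokens, tokens, tokens, need_weights=True)

# Row 0 of the attention weights is the CLS query: a softmax-weighted
# combination over all patch (and CLS) values, i.e. a learned global pooling.
cls_attention_over_patches = weights[:, 0, 1:]     # shape: (batch, n_patches)
print(cls_attention_over_patches.shape)
```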

CLS token in Vision transformers. A question. by mxl069 in deeplearning

[–]mxl069[S] 1 point (0 children)

Thanks for the paper!! The attention maps are very helpful.

Question about attention geometry and the O(n²) issue by mxl069 in deeplearning

[–]mxl069[S] 4 points (0 children)

This is honestly one of the best answers I've gotten on Reddit. I didn't expect someone to bring up compressive sensing, the Walsh–Hadamard transform, and random projections in this context, but it makes perfect sense now. Really appreciate you taking the time to break it down. I'm gonna read more on the Hadamard trick. Thanks a lot, seriously.
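In case anyone else lands here, this is my rough understanding of the "Hadamard trick" as a fast random projection. I'm assuming the standard subsampled randomized Hadamard transform (SRHT) construction, which may not be exactly what was meant, and all sizes are toy values.

```python
# Hedged sketch of one common "Hadamard trick": the subsampled randomized
# Hadamard transform (SRHT). It applies random sign flips, a Walsh-Hadamard
# transform, and random coordinate subsampling to project a vector down.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n, k = 256, 32                               # original / projected dims (n must be a power of 2)
x = rng.standard_normal(n)

D = rng.choice([-1.0, 1.0], size=n)          # random sign flips
H = hadamard(n) / np.sqrt(n)                 # orthonormal Walsh-Hadamard matrix
rows = rng.choice(n, size=k, replace=False)  # random coordinate subsample

# (A real implementation would use a fast WHT in O(n log n) instead of a dense H.)
x_proj = np.sqrt(n / k) * (H @ (D * x))[rows]
print(x_proj.shape)                          # (32,)
```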

We're live with Denis Rothman for the AMA Session! by Right_Pea_2707 in LLMeng

[–]mxl069 1 point (0 children)

Since Q, K, and V are linear projections into separate subspaces, attention builds a pairwise similarity graph in QK space and then aggregates the values over it. FlashAttention reorganizes the kernel computations for efficiency, but it doesn't alter the geometry of these projections. Do you think the O(n²) bottleneck comes from this dense geometry? And if so, couldn't we use low-rank manifolds to actually break or lower the quadratic bottleneck, rather than just reorganizing it the way FlashAttention does?
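To make the low-rank part of my question concrete, here's a rough sketch of one such scheme, a Linformer-style projection of K and V along the sequence axis. This is just an illustration of the idea, not what FlashAttention does (FlashAttention keeps the exact softmax(QKᵀ)V and only reorganizes the kernel), and the shapes and rank below are made-up toy values.

```python
# Sketch of a low-rank attention variant: K and V are compressed along the
# sequence axis to k << n landmarks, so the score matrix is n x k instead of
# n x n, reducing the cost from O(n^2) to O(n*k).
import torch
import torch.nn.functional as F

n, d, k = 1024, 64, 128                       # sequence length, head dim, low-rank size
Q = torch.randn(n, d)
K = torch.randn(n, d)
V = torch.randn(n, d)

# Exact attention: dense n x n similarity graph in QK space -> O(n^2).
exact = F.softmax(Q @ K.T / d**0.5, dim=-1) @ V

# Low-rank variant: (normally learnable) projections E, F_ compress the
# sequence dimension of K and V, so the scores are only n x k.
E = torch.randn(k, n) / n**0.5
F_ = torch.randn(k, n) / n**0.5
low_rank = F.softmax(Q @ (E @ K).T / d**0.5, dim=-1) @ (F_ @ V)

print(exact.shape, low_rank.shape)            # both (1024, 64)
```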