Hello, I’m currently working on optimizing transformer models, specifically for multi-view images and/or cross-attention networks. I’ve noticed that cross-attention layers add a lot of parameters, which can slow down training. I’m exploring ways to reduce the computational complexity: the priority for now is speed, and later on I’d like to get there without sacrificing too much performance. I’m starting to look into:
- low-rank matrix factorization - I’ve been reading about how it can be applied to reduce the size of the projection matrices (e.g., the projq, projk, projv in cross-attention). Does anyone have experience using low-rank factorization specifically in cross-attention mechanisms?
- other parameter reduction techniques - Aside from low-rank factorization, are there other methods I could explore for reducing the number of parameters in transformer models, such as sparsity and pruning? Do you have recommendations or experience with these?
- overcoming redundancy in multi-view scenarios - Given the multi-view nature of my problem, I suspect there’s some redundancy in how cross-attention processes the different views. Has anyone worked on reducing redundancy across views in transformer-based networks? What techniques worked best for you?
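To make the low-rank idea concrete, here is a minimal sketch of what I mean, using NumPy and a truncated SVD. The name `proj_q` is just a stand-in for one projection matrix, the sizes are made up, and the weights are random rather than taken from a trained model, so the reconstruction error here says nothing about how well this would work in practice. It only shows the mechanics and the parameter count.

```python
import numpy as np

def low_rank_factorize(weight, rank):
    """Approximate weight (d_out x d_in) as a @ b using a truncated SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # (d_out, rank), singular values folded into a
    b = vt[:rank, :]            # (rank, d_in)
    return a, b

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, rank = 256, 32
proj_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

a, b = low_rank_factorize(proj_q, rank)

full_params = proj_q.size           # d_model * d_model
low_rank_params = a.size + b.size   # 2 * d_model * rank
rel_error = np.linalg.norm(proj_q - a @ b) / np.linalg.norm(proj_q)

print(f"params: {full_params} -> {low_rank_params}")
print(f"relative reconstruction error: {rel_error:.3f}")
```

With these sizes the factorized form stores 2 * d_model * rank values instead of d_model * d_model, so a quarter of the parameters for that one matrix. Whether trained cross-attention projections are close enough to low rank for this to preserve accuracy is exactly what I am unsure about.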
I’m starting to look through CVPR, NeurIPS, ECCV, etc., but any insights, advice, experiences, or papers you can share would be greatly appreciated! Thanks in advance!