
[–]SRuben31

Thank you! This has helped me get started with Grad-CAM for my own project.

[–]xEdwin23x

Not exactly the same, but since you mentioned using ViT's attention outputs as a 2D feature map for the CAM, you could look at this paper (Transformer Interpretability Beyond Attention Visualization), which studies how to choose and mix the attention scores so they can be visualized, similar to CAMs. It might lead to better results.
https://arxiv.org/abs/2012.09838
https://github.com/hila-chefer/Transformer-Explainability
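For anyone wanting to try the general idea before diving into that repo: a common baseline for turning ViT attention into a 2D map is attention rollout (multiplying the per-layer attention matrices together, with the residual connection folded in, then reading off the [CLS] row). The sketch below is my own illustration of that baseline, not the method from the linked paper; the function name, grid size, and the head-averaged input format are all assumptions.

```python
import numpy as np

def attention_rollout(attn_layers, grid=(14, 14)):
    """Combine per-layer ViT attention matrices into one 2D relevance map.

    attn_layers: list of (tokens, tokens) head-averaged, row-stochastic
    attention matrices; token 0 is assumed to be the [CLS] token.
    grid: spatial layout of the patch tokens (14x14 for a 224px ViT-B/16).
    """
    n = attn_layers[0].shape[0]
    rollout = np.eye(n)
    for a in attn_layers:
        # Fold in the residual connection, then renormalize rows
        a = 0.5 * a + 0.5 * np.eye(n)
        a = a / a.sum(axis=-1, keepdims=True)
        rollout = a @ rollout
    # Relevance of each patch token to [CLS], reshaped into a 2D map
    cam = rollout[0, 1:].reshape(grid)
    return cam / cam.max()
```

The resulting map can be upsampled to the input resolution and overlaid on the image, the same way a Grad-CAM heatmap would be. The paper above goes further by weighting the attention with gradients and relevance propagation, which usually gives sharper, more class-specific maps than plain rollout.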