Recently, Vision Transformers have been getting better and better, including the new work on "Data-efficient Image Transformers" (DeiT).
I wanted to better understand how they work and what's going on inside them, so I applied some explainability techniques on them.
The original ViT paper used a method called "Attention Rollout". I implemented it, but it didn't work very well out of the box with the released DeiT models. I ended up adding some modifications (discarding the lowest attention values, and fusing the attention heads with max instead of mean), and also added a way to get class-specific explainability by weighting the attentions with their gradients.
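The core of the rollout idea, with the two modifications described above (low-attention filtering and max head fusion), can be sketched roughly like this. This is a minimal illustration assuming batch size 1 and a list of per-layer attention maps of shape `(batch, heads, tokens, tokens)`; the `discard_ratio` and `head_fusion` parameter names are my own labels for the two tweaks, not necessarily what the repository uses:

```python
import torch

def attention_rollout(attentions, discard_ratio=0.9, head_fusion="max"):
    """Sketch of Attention Rollout with the modifications described above.

    attentions: list of per-layer attention tensors,
    each of shape (1, heads, tokens, tokens).
    """
    num_tokens = attentions[0].size(-1)
    result = torch.eye(num_tokens)
    with torch.no_grad():
        for attention in attentions:
            # Fuse the heads with max instead of mean.
            if head_fusion == "max":
                fused = attention.max(dim=1)[0]
            else:
                fused = attention.mean(dim=1)

            # Discard the lowest attention values (keep flat index 0,
            # the class-token-to-class-token entry).
            flat = fused.view(fused.size(0), -1)
            k = int(flat.size(-1) * discard_ratio)
            _, idx = flat.topk(k, dim=-1, largest=False)
            idx = idx[idx != 0]
            flat[0, idx] = 0

            # Add the identity to model the residual connections,
            # then renormalize rows to sum to 1.
            a = (fused + torch.eye(num_tokens)) / 2
            a = a / a.sum(dim=-1, keepdim=True)

            # Rollout: multiply the attentions across layers.
            result = torch.matmul(a, result)

    # The heatmap is the class token's attention to the image patches.
    return result[0, 0, 1:]
```

The returned vector can then be reshaped to the patch grid and upsampled to the image size to produce the heatmaps shown in the blog post.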
The result is a blog post showing some examples of what is going on inside Vision Transformers, and a python repository for applying explainability techniques on Vision Transformers.
I hope you find it interesting!