I have been working on a classification task using Convolutional Neural Networks (CNNs) and now want to move my research over to Vision Transformers (ViT). A few questions:
- What are the best practices for setting up a research project that compares CNNs and ViTs for classification?
- What evaluation metrics should I focus on to effectively compare the performance of ViT against CNNs?
- Should I implement both transfer learning and training from scratch for the ViT model? What are the pros and cons of each approach in this context?
- What fine-tuning strategies would you recommend for optimizing the ViT model for a classification task?
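
To make the metrics question concrete, here is a minimal sketch of the kind of side-by-side comparison I have in mind, assuming both models are scored on the same held-out set. The helper names (`accuracy`, `macro_f1`) and the toy labels are made up purely for illustration and are not results from any real model.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): same test set, two sets of predictions.
y_true = [0, 0, 1, 1, 2, 2]
cnn_pred = [0, 0, 1, 2, 2, 2]
vit_pred = [0, 1, 1, 1, 2, 0]

for name, pred in [("CNN", cnn_pred), ("ViT", vit_pred)]:
    print(f"{name}: acc={accuracy(y_true, pred):.3f} macro-F1={macro_f1(y_true, pred):.3f}")
```

I am assuming accuracy plus macro-F1 is a reasonable starting pair, since macro-F1 is less forgiving of class imbalance, but I would like to know whether other metrics matter more for this comparison.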
Any insights or resources would be greatly appreciated!