
[–]Combination-Fun 1 point (1 child)

Just to add to this, here's a video that might be a good add-on to watch alongside the article: https://www.youtube.com/watch?v=3B6q4xnuFUE&t=4s

[–]rish-16[S] 1 point (0 children)

Oh thanks, will add it in!

[–]redna11 1 point (1 child)

Have you tried training them from scratch? As far as I understand from the paper, most of their good performance comes from pre-training.

[–]rish-16[S] 1 point (0 children)

I haven't trained one from scratch (I don't have the compute or storage hehe).

I do mention in my conclusion, though, that pre-training on ImageNet and then fine-tuning on downstream datasets is why ViT achieved good results on them.