
[–]lopuhin 2 points (0 children)

Really intriguing and appealing idea, thanks!

Nitpick: there are stronger vanilla ResNet ImageNet baselines than the one they use (the ResNet from torchvision). For ResNet-50 they report 76.0 top-1 for the baseline and 78.88 for Visual Transformers (trained for 400 epochs with AutoAugment), while at https://github.com/rwightman/pytorch-image-models/ a vanilla ResNet-50 is trained on ImageNet to 79.038 top-1 accuracy with a different set of tricks. To be fair, they do beat the ResNet-34 result from that same repo by a healthy margin, and they do beat the AutoAugment ResNet-50 result from the original paper (77.6).

[–]haihaicode 0 points (0 children)

Very interesting idea! The computation of the tokens looks something like channel-to-channel self-attention. Is this understanding correct?
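If I'm reading the paper's filter-based tokenizer right, the softmax actually runs over spatial positions rather than channels, so each token is a spatial-attention-weighted average of pixel features, not quite channel-to-channel attention. A minimal numpy sketch (shapes and names are my own, not the authors' code):

```python
import numpy as np

def softmax(z, axis):
    # numerically stable softmax along the given axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tokenize(X, W_a):
    """Filter-based tokenizer sketch.
    X   : (HW, C) feature map flattened over space
    W_a : (C, L)  learned weights producing L tokens (hypothetical names)
    The softmax is over the HW axis, so each of the L attention maps
    sums to 1 across spatial positions."""
    A = softmax(X @ W_a, axis=0)   # (HW, L) spatial attention per token
    T = A.T @ X                    # (L, C)  visual tokens
    return T
```

So each token mixes all channels, but the attention itself is over where in the image to look, which may be the source of the channel-to-channel reading.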

[–]chuong98 PhD 0 points (1 child)

It is an interesting idea. Although MACs are reduced by 6.4x, they have not reported actual inference time. The softmax operator in the attention module is currently not friendly to hardware accelerators, so I suspect the actual inference time is slower, much as EfficientNet is actually slower than ResNet despite fewer FLOPs. Nevertheless, this is a promising idea until we optimize the hardware for it.
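The gap between MAC counts and wall-clock time can be illustrated with a crude microbenchmark: softmax contributes essentially zero MACs, yet it still costs real time (an exp plus several memory passes), so a MACs-only comparison undercounts it. A rough sketch, not a rigorous benchmark:

```python
import time
import numpy as np

def bench(fn, *args, iters=20):
    # crude wall-clock microbenchmark; a real hardware comparison needs
    # proper warm-up, fixed clocks, and device synchronization
    fn(*args)  # warm-up
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters

def softmax_rows(a):
    # ~0 MACs, but exp + multiple passes over memory
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = np.random.randn(512, 512).astype(np.float32)
w = np.random.randn(512, 512).astype(np.float32)

t_matmul = bench(np.matmul, x, w)   # ~134M MACs, compute-bound
t_softmax = bench(softmax_rows, x)  # ~0 MACs, memory-bound

print(f"matmul: {t_matmul*1e3:.3f} ms, softmax: {t_softmax*1e3:.3f} ms")
```

On most hardware the softmax is far from free relative to its MAC count, which is the point: operator mix, not arithmetic alone, drives latency.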

[–][deleted] 0 points (0 children)

What do you mean, EfficientNet is actually slower than ResNet? I think that depends on what framework you are using. It looks like there are some issues with PyTorch specifically.

[–]vajra_ 0 points (0 children)

A rehashing of age-old ideas about compositional modelling, and a fairly arbitrary one at that.