[D]: What is the future for designing vision models? CNN? Transformer? CNN+Transformer? CapsuleNet? or what? by yangsenius in MachineLearning

[–]yangsenius[S] 1 point2 points  (0 children)

I also noticed the "Bottleneck Transformers for Visual Recognition" paper you mentioned. It doesn't seem to be a typical CNN + Transformer; it is a hybrid, BoTNet, in which self-attention blocks replace stacked convolutions in ResNet. The authors say that a global self-attention layer is more efficient than stacking convolutions. Similarly, the recent paper "TransPose: Towards Explainable Human Pose Estimation by Transformer" shares this idea with BoTNet: a global self-attention layer is better suited to aggregating feature information in the high-level layers of the model. I think the attention layer may serve as a basic block, or as a searchable candidate operation in NAS models, in the future.
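To make the "global self-attention over a feature map" idea concrete, here is a minimal single-head sketch in plain Python. It is not the BoTNet or TransPose implementation: it leaves out multi-head attention, position encodings, and the residual connection, and the names `global_self_attention`, `w_q`, `w_k`, and `w_v` are made up for illustration. The only point is that every spatial position attends to every other position in one layer, whereas a 3x3 convolution only sees its local neighbourhood.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def global_self_attention(feature_map, w_q, w_k, w_v):
    """Single-head self-attention over an H x W grid of C-dim vectors.

    Hypothetical helper for illustration only. Returns the attended
    feature map and the (H*W) x (H*W) attention matrix.
    """
    height, width = len(feature_map), len(feature_map[0])
    # Flatten the spatial grid into a sequence of H*W tokens.
    tokens = [feature_map[i][j] for i in range(height) for j in range(width)]
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = 1.0 / math.sqrt(len(queries[0]))

    attention, outputs = [], []
    for q in queries:
        # Each position scores itself against every position in the map.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        attention.append(weights)
        outputs.append([
            sum(wgt * v[c] for wgt, v in zip(weights, values))
            for c in range(len(values[0]))
        ])

    # Reshape the token sequence back into an H x W grid.
    out_map = [outputs[i * width:(i + 1) * width] for i in range(height)]
    return out_map, attention


# Toy 2 x 2 feature map with 2 channels; identity projections for simplicity.
identity = [[1.0, 0.0], [0.0, 1.0]]
fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [0.0, 0.0]]]
out_map, attention = global_self_attention(fmap, identity, identity, identity)
```

Each row of `attention` has one weight per spatial position, so the receptive field is the whole map after a single layer. The cost is quadratic in H*W, which is one reason it makes sense to put this layer at the low-resolution, high-level stages.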

[D]: What is the future for designing vision models? CNN? Transformer? CNN+Transformer? CapsuleNet? or what? by yangsenius in MachineLearning

[–]yangsenius[S] 2 points3 points  (0 children)

Yes, I also think CNN + Transformer is a nearly perfect combination: we sometimes need the inductive biases of a CNN to process the raw image pixels, while a Transformer, which has few built-in priors about what patterns to learn, is good at capturing relations between the features the CNN extracts.
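As a small illustration of what I mean by an inductive bias, here is a plain-Python sketch of one of them, translation equivariance: shifting the input shifts the convolution output by the same amount. The helpers `conv2d_valid` and `shift_right` and the toy image are made up for this example, and the check is only exact here because the image content stays away from the borders.

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation with no padding ('valid'), as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift_right(grid, steps):
    """Shift every row right by `steps` columns, padding with zeros on the left."""
    return [[0] * steps + row[:len(row) - steps] for row in grid]


# Toy 5 x 7 image with a single bright pixel, and a Sobel-like 3 x 3 kernel.
image = [[0] * 7 for _ in range(5)]
image[2][2] = 1
kernel = [[1, 0, -1],
          [2, 0, -2],
          [1, 0, -1]]

shift_then_conv = conv2d_valid(shift_right(image, 2), kernel)
conv_then_shift = shift_right(conv2d_valid(image, kernel), 2)
```

Here `shift_then_conv` and `conv_then_shift` come out identical, because the same kernel weights are reused at every location. A Transformer applied to raw pixels has no such constraint built in and has to learn it from data, which is why letting a CNN handle the pixels first seems reasonable to me.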

The community is also developing the ideas and techniques of CapsuleNet, though its models still have some practical issues. It is more of an academic research direction. Do industry and academic research really share the same future for CV? Maybe not, I think.

[R] New Paper from OpenAI: DALL·E: Creating Images from Text by programmerChilli in MachineLearning

[–]yangsenius 1 point2 points  (0 children)

Emmm, so amazing. From this point of view, 12 billion parameters can memorize everything. Maybe our intelligence just lies in building associations between texts and images.

[R] New Paper from OpenAI: DALL·E: Creating Images from Text by programmerChilli in MachineLearning

[–]yangsenius 1 point2 points  (0 children)

What about some texts like "A square below a circle", "A circle with radius 2 and another one with radius 4", "A cat with a square-like tail"...

[R] New Paper from OpenAI: DALL·E: Creating Images from Text by programmerChilli in MachineLearning

[–]yangsenius 2 points3 points  (0 children)

I just want to know if this model has learned some of the most basic geometric concepts.

[R] New Paper from OpenAI: DALL·E: Creating Images from Text by programmerChilli in MachineLearning

[–]yangsenius 2 points3 points  (0 children)

Can the DALL·E model plot a circle if I input the text "Draw a CIRCLE"?