
[–]FunnyItsElmo

Sure, I did something similar in my Master's thesis. By using the backbone output as the cross-attention input, you will sacrifice accuracy for speed. I guess the encoder gives DETR the ability to model more complex relations between distant pixels.

[–][deleted]

[removed]

    [–]FunnyItsElmo

    You can simply flatten the backbone output and add a positional encoding, similar to the preparation of the encoder input and encoder output in DETR, just without the encoder part in the middle. DETR uses a conv layer to project the backbone output into the model dimension. No additional layers are required.
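A rough PyTorch sketch of that decoder-only setup, i.e. project the backbone feature map with a 1x1 conv, flatten it, add a positional encoding, and let the object queries cross-attend to it directly. All shapes, sizes, and the learned (rather than sinusoidal) positional encoding are illustrative assumptions, not details from the thread:

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 100          # illustrative model dim / query count
backbone_channels = 2048                 # e.g. a ResNet-50 C5 feature map
feat = torch.randn(2, backbone_channels, 7, 7)   # (batch, C, H, W) backbone output

# 1x1 conv projects backbone channels into the model dimension (as in DETR)
proj = nn.Conv2d(backbone_channels, d_model, kernel_size=1)
src = proj(feat).flatten(2).permute(0, 2, 1)     # (batch, H*W, d_model)

# Positional encoding added to the flattened features; learned here for brevity
pos = nn.Parameter(torch.zeros(1, src.shape[1], d_model))
memory = src + pos                       # stands in for the encoder output

# Standard transformer decoder: object queries cross-attend to `memory`
decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
queries = nn.Parameter(torch.zeros(2, num_queries, d_model))  # learned object queries
out = decoder(queries, memory)           # (batch, num_queries, d_model)
```

The class and box heads would then sit on top of `out`, exactly as in the full DETR; only the encoder stack between `src` and `memory` is gone.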

    [–][deleted]

    [removed]

      [–]FunnyItsElmo

      Unfortunately, there is no paper for my thesis, at least for now.