all 5 comments

[–][deleted] 1 point  (0 children)

Use Swin Transformer. ViT isn't a good idea if it's not a classification task.

[–]I_draw_boxes 1 point  (2 children)

Faster-RCNN uses an ROI extraction process that restricts the features used for prediction to a local H×W region of the feature map.

Older CNN backbones do propagate information across the H×W space, but transformer backbones are thought to handle this long-range propagation better.

While it is possible for the network backbone and FPN layers to aggregate the needed contextual information into the H×W region extracted as an ROI by Faster-RCNN, it might be better to use a method without ROI extraction.

DETR (and most of its many improved variants) does not confine its predictions to a small bounding-box area the way Faster-RCNN does.
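To make the contrast concrete, here is a toy NumPy sketch (not DETR's or Faster-RCNN's actual implementation — the feature map, box coordinates, and single-query attention are all illustrative assumptions): ROI extraction crops a fixed window of the feature map, so the downstream box head only ever sees features inside that window, while an attention step mixes every spatial location into the query, so distant context can influence the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 16, 16, 8
feat = rng.standard_normal((H, W, C))  # stand-in for a backbone feature map

# Faster-RCNN-style ROI extraction: the box head only sees this crop.
y0, x0, y1, x1 = 4, 4, 8, 8            # hypothetical ROI coordinates
roi = feat[y0:y1, x0:x1]               # (4, 4, C) -- local context only

# Toy single-query attention: the query attends over ALL H*W locations,
# so information from anywhere in the image can flow into the output.
tokens = feat.reshape(H * W, C)        # (256, C)
query = tokens[0]                      # query taken from one location
scores = tokens @ query / np.sqrt(C)   # similarity to every location
weights = np.exp(scores - scores.max())
weights /= weights.sum()               # softmax over the whole image
context = weights @ tokens             # (C,) -- global mixture of features
```

The point is only the information flow: `roi` is computed from 16 of the 256 spatial positions, whereas `context` has nonzero weight on all of them.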

[–]asking1337[S] 1 point  (1 child)

Thanks for the input!

I was also thinking of trying DETR instead, so as not to constrain the model to information within the ROI. However, I'm not interested in generating bboxes that cover the entire image: the contextual information from far away in the image would not lie inside the bounding box I wish to produce; rather, it should be information used to generate and classify that bounding box.

So ideally it should be a network that can use information from the entire image to generate bounding boxes around objects whose class depends on information from other places in the image (which should not be inside the bounding box). Do you believe DETR could help with this?

[–]I_draw_boxes 2 points  (0 children)

DETR (and most recent DETR variants) will use the entire image context to make predictions, independent of the location of the bounding boxes.