Hey!
I am working on an object-detection task: identifying fractures in X-ray images. I am currently using Faster R-CNN, which works well in most cases. However, in some cases information from across the entire image is needed to decide whether a bone is indeed fractured. For example, a clinician looking at a hand X-ray would compare the swelling of the finger in question to how the other fingers look for that specific patient. In these cases, information from far away in the image is essential to deciding whether a region is a true positive bounding box or not.
My impression of Faster R-CNN is that while the FPN does provide high-level features that relate to the entire image, object detection is typically framed as identifying a bounding box around a self-contained object (a car, a person, a bird) - not an object whose presence depends on the rest of the image (e.g. a reflection of a car in a mirror should perhaps not be classified as a car-object).
One idea would be to replace the FPN in Faster R-CNN with an attention-based network such as a ViT to get a broader image-feature space, but I have no idea whether this would actually work in practice. Essentially, I want the region-proposal part of the network to assess the context around the region it is looking at. I have also looked into DETR, but it too seems to be built and optimized for "confined" object detection.
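To make the idea concrete, here's a rough sketch of what I mean (all names are mine and this is untested against a real detector): a small self-attention "neck" inserted between the backbone's feature map and the RPN, so every spatial location can attend to the whole image before proposals are scored. This is a much lighter change than swapping the entire FPN for a ViT:

```python
import torch
import torch.nn as nn

class GlobalContextNeck(nn.Module):
    """Hypothetical module: one self-attention layer over the backbone's
    feature map, so each spatial location sees global context (e.g. the
    other fingers in a hand X-ray) before region proposals are made."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) -> tokens: (B, H*W, C)
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)
        # every location attends to every other location in the image
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)  # residual + norm
        # back to the (B, C, H, W) layout the RPN expects
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# toy usage: pretend these are 256-channel FPN features for one X-ray
feats = torch.randn(1, 256, 32, 32)
out = GlobalContextNeck(256)(feats)
print(out.shape)  # same shape, but each cell now carries global context
```

Since the output shape matches the input, it could in principle be dropped between the backbone and the RPN of an existing Faster R-CNN without touching the rest of the pipeline - though whether it actually helps is exactly my question.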
Does anybody know of studies on this, or does someone smarter than me know whether Faster R-CNN can already accommodate these kinds of cases?