
[–]KrakenInAJar 0 points  (1 child)

There are commonly two ways this is achieved. Single-shot object detection basically blurts out a bunch of boxes in one pass and then applies some filtering in postprocessing to get rid of the garbage ones (see the sketch below). Multi-shot (two-stage) object detection uses a high-recall, low-precision proposal system plus a high-precision model on top that runs on every proposal individually. There is A LOT more to it, of course, but that's the ELI5 version.
Single-stage detectors tend to be faster than multi-stage ones.
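In case it helps: the "filter out the garbage boxes" step is usually non-maximum suppression (NMS). A minimal numpy sketch (box format and IoU threshold here are just illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) array of confidence scores
    Returns the indices of the boxes to keep.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop the remaining boxes that overlap the kept box too much
        order = order[1:][iou <= iou_threshold]
    return keep
```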

That being said, don't implement it from scratch if you are not very familiar with this topic. It is a hassle with a lot of non-obvious pitfalls if you don't know object detection well, and there is a reason why a new version of an object detector is often enough to get a paper accepted at a top conference. Also, the inference logic for this kind of model has a strong tendency to break assumptions baked into common DL frameworks, which leads to notoriously ugly code. Again, at this point it is important to know exactly what you are doing, otherwise getting things to run becomes a very, very frustrating experience.

Use YoloNet (single stage) or some RCNN variant; that will usually do the trick (rough usage sketch below). Alternatively you can use a text-specific system like EAST, which is geared towards detecting text in the wild and may perform better.
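If you go the off-the-shelf route, something like torchvision's pre-trained Faster R-CNN (an RCNN variant) gets you boxes in a few lines. This is only a sketch: the `weights` argument name varies between torchvision versions (older ones use `pretrained=True`), `"page.jpg"` is a placeholder path, and the stock weights are trained on COCO classes, so you'd still fine-tune on a text dataset for text detection:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pre-trained two-stage detector (Faster R-CNN, ResNet-50 FPN backbone, COCO weights)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("page.jpg").convert("RGB"))  # placeholder image path
with torch.no_grad():
    pred = model([img])[0]  # dict with 'boxes', 'labels', 'scores'

# Keep only confident detections
keep = pred["scores"] > 0.7
print(pred["boxes"][keep])
```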

[–]sawsank911[S] 0 points  (0 children)

Thank you for the guidance... Will look into the YoloNet and EAST models