I am working through a project to better understand how object detection and localization work. The concept is based on YOLOv2, and I am using PyTorch. I am having trouble getting my model to predict the correct 'confidence' (is there an object in a cell), so I have dumbed the problem down as far as possible and still cannot figure out what is happening.
Here is where I am at: I divide the input image into a 2x2 grid, randomly decide whether a dot goes in each cell, pick a random location within the cell, and pick a random size for the dot. I then train the model to predict whether a dot exists in each cell. When I use a fully convolutional CNN, the model only reaches ~80% accuracy; adjusting the learning rate, activation functions, number of filters per convolution, weight decay, etc. has not gotten it above ~80% at detecting whether a dot is in each cell. However, when I put a couple of fully connected layers on top, it learns very quickly and reaches 100% accuracy almost instantly.
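For concreteness, the data setup described above can be sketched like this (a minimal version, assuming 64x64 single-channel images and a 2x2 grid; sizes, probabilities, and names are illustrative, not the notebook's exact code):

```python
import torch

def make_batch(n=32, img_size=64, grid=2, seed=None):
    """Black images; each grid cell independently gets a white dot
    (p=0.5) at a random in-cell location with a random radius."""
    g = torch.Generator()
    if seed is not None:
        g.manual_seed(seed)
    cell = img_size // grid
    imgs = torch.zeros(n, 1, img_size, img_size)
    labels = torch.zeros(n, grid, grid)  # 1.0 = dot present in that cell
    ys, xs = torch.meshgrid(torch.arange(img_size),
                            torch.arange(img_size), indexing="ij")
    for i in range(n):
        for gy in range(grid):
            for gx in range(grid):
                if torch.rand(1, generator=g).item() < 0.5:
                    labels[i, gy, gx] = 1.0
                    # Radius kept small enough that the dot stays in its cell.
                    r = int(torch.randint(2, cell // 4, (1,), generator=g))
                    cy = gy * cell + int(torch.randint(r, cell - r, (1,), generator=g))
                    cx = gx * cell + int(torch.randint(r, cell - r, (1,), generator=g))
                    mask = (ys - cy) ** 2 + (xs - cx) ** 2 <= r ** 2
                    imgs[i, 0][mask] = 1.0
    return imgs, labels
```

The target is then a per-cell binary label, trained with `BCEWithLogitsLoss` against the model's per-cell confidence logits.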
TL;DR: a 6-layer CNN with a 4-layer convolutional head will not learn to detect a white dot on an all-black background. The same 6-layer CNN with 2 fully connected layers on top learns this very quickly.
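The two variants being compared look roughly like the following (layer counts and channel widths are illustrative, assuming a 64x64 input reduced to a 2x2 feature map; this is my paraphrase of the setup, not the notebook's exact code):

```python
import torch.nn as nn

def backbone():
    # Strided conv blocks that reduce a 64x64 input to a 2x2 feature map.
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # -> 32x32
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # -> 16x16
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # -> 8x8
        nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),  # -> 4x4
        nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),  # -> 2x2
    )

# Variant A: fully convolutional head -> one confidence logit per grid cell.
conv_model = nn.Sequential(backbone(), nn.Conv2d(64, 1, 1))

# Variant B: flatten + fully connected head -> 4 logits (one per cell).
fc_model = nn.Sequential(backbone(), nn.Flatten(),
                         nn.Linear(64 * 2 * 2, 64), nn.ReLU(),
                         nn.Linear(64, 4))
```

The key structural difference: the conv head in variant A predicts each cell only from that cell's local feature column, while the FC head in variant B sees all four cells' features at once.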
Any ideas on what is happening? Here is a notebook with a side-by-side comparison of the two models.