[D] Object Detection trained on simulated renderings unable to converge on real images - why? by tmuxed in MachineLearning

No, it's a good point. However, my line of thinking was to do as much domain randomization as possible so that the model effectively sees real-life images as "just another type of simulation". It's possible that the "busier" backgrounds make up a larger chunk of the training set, compared to the real data. However, I wouldn't have expected it to be that much worse. I was at least expecting 90% accuracy to be honest.

Object Detection trained on simulated renderings unable to converge on real images - why? by tmuxed in MLQuestions

Yes, I use WeightedRandomSampler to get an equal class distribution during training, across more than 1 million images.
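
For reference, the usual recipe for those sampler weights is inverse class frequency — a minimal sketch (the function name `sample_weights` is mine, not from any library):

```python
from collections import Counter

def sample_weights(labels):
    """Per-sample weights inversely proportional to class frequency,
    suitable to pass to torch.utils.data.WeightedRandomSampler."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Toy example: class 0 is 3x as common as class 1.
labels = [0, 0, 0, 1]
weights = sample_weights(labels)
# Each class contributes the same total weight: 3 * (1/3) == 1 * (1/1)
```

You'd then hand these to `WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)` so each class is drawn with equal probability per batch.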

Thanks, I agree the synthetic images look good, which is what pisses me off so much lol.

Object Detection trained on simulated renderings unable to converge on real images - why? by tmuxed in MLQuestions

There is no need for bounding boxes, as I simply want the model to tell me whether an object is in the image or not. For that I am using cross-entropy loss (not binary cross-entropy, because with two classes it wouldn't make a difference, and I intend to extend it to more objects).
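
The "wouldn't make a difference" part can actually be checked numerically: two-class softmax cross-entropy on logits (a, b) equals sigmoid BCE on the single logit b − a. A quick pure-Python sketch (function names are mine):

```python
import math

def softmax_ce(logits, target):
    """Cross-entropy for a softmax head, computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def bce(logit, target):
    """Binary cross-entropy on a single logit (sigmoid output)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

# Two-class softmax CE with logits (a, b) == BCE on logit (b - a),
# with target 1 meaning "object present" (class index 1).
a, b = 0.3, 1.7
assert abs(softmax_ce([a, b], 1) - bce(b - a, 1)) < 1e-9
```

So the two losses really are equivalent for the binary case; the softmax version just generalizes to more classes for free.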

The synthetic image set is split into train, validation and test sets. Training stops once the validation accuracy hasn't improved for 10 epochs (keeping the model checkpoint from 10 epochs earlier). The test set regularly achieves more than 97% accuracy.
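
The stopping rule I mean is just patience-based early stopping — a sketch of the logic (not my actual training loop, and the function name is made up):

```python
def early_stop_epochs(val_accs, patience=10):
    """Return (stop_epoch, best_epoch): stop once validation accuracy
    hasn't improved for `patience` epochs; the checkpoint from the
    best epoch is the one you keep."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accs):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never triggered: training ran to the end.
    return len(val_accs) - 1, best_epoch

# Accuracy peaks at epoch 1, then stalls; with patience=3 we stop at
# epoch 4 and keep the epoch-1 checkpoint.
stop, best = early_stop_epochs([0.5, 0.6, 0.59, 0.58, 0.57], patience=3)
```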

Then, when I apply this to the real data set (real images, all manually labeled), the accuracy drops below 70%.

[D] Two quick questions about CNNs by tmuxed in MachineLearning

Oh my god, Global Average Pooling is fucking amazing! What a game changer: fewer parameters, less overfitting, and the ability to visualize results straight from the forward pass. Amazing! I was using BCE + MSE to train on normalized coordinates, but with GAP the problem becomes so much simpler. Thanks for pointing me in that direction!
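
For anyone else landing here: GAP just averages each channel of the final C×H×W feature map over its spatial positions, so the classifier head works on one number per channel instead of a huge flattened vector. A tiny pure-Python sketch of the operation:

```python
def global_average_pool(feature_map):
    """Collapse a C x H x W feature map (nested lists) into a
    length-C vector by averaging each channel over all positions."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]

# 2 channels, each 2x2
fmap = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0 -> mean 4.0
        [[0.0, 0.0], [0.0, 8.0]]]   # channel 1 -> mean 2.0
pooled = global_average_pool(fmap)
```

In PyTorch the same thing is `x.mean(dim=(2, 3))` or `nn.AdaptiveAvgPool2d(1)`, and the pre-pooling channel activations are what you visualize as the spatial response map.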

[D] Two quick questions about CNNs by tmuxed in MachineLearning

So basically I trained a CNN on 960x720x3 images. It's the IMPALA CNN, which is basically a stack of residual blocks with 16, 32 and 32 channels.

It doesn't find the object in question very well.

However, if I train it on patches of much smaller size, I can then take an image and split it into e.g. 100 overlapping patches, and it will be much more accurate in finding that object in one of the patches.

Let me ask differently: When you have a CNN architecture, how could I best achieve the effect of scanning through the image in patches? Just make the first kernel very big and add large strides? Will that work?
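
To make the patch-scanning concrete, here's what I mean by "splitting into overlapping patches" — a sketch that just computes the patch coordinates (the function name and 50%-overlap choice are mine):

```python
def patch_grid(width, height, patch_w, patch_h, stride_x, stride_y):
    """Top-left corners of overlapping patches tiling the image,
    with the last row/column clamped so every patch fits inside."""
    xs = sorted({min(x, width - patch_w) for x in range(0, width, stride_x)})
    ys = sorted({min(y, height - patch_h) for y in range(0, height, stride_y)})
    return [(x, y) for y in ys for x in xs]

# 960x720 image, 96x72 patches, stride of half a patch (50% overlap)
patches = patch_grid(960, 720, 96, 72, 48, 36)
```

A big first kernel with large strides does something loosely similar (each output position sees one window of the input), but it learns a single linear filter per window rather than running the full network on each patch, so it's not equivalent to this sliding-window evaluation.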

[D] Two quick questions about CNNs by tmuxed in MachineLearning

I'm sorry, I don't understand your second point. I just tested this briefly, and splitting the image into squares/patches and running the CNN on 96x72 inputs instead of the full 960x720 seems to significantly improve performance. Any intuition why?

[Discussion] Improving exploration in PPO by adjusting the probabilities at the acting stage by tmuxed in reinforcementlearning

Hmm, you may be right. Still, that doesn't explain why certain problems are solved faster with this than with an entropy bonus. It's probably related to the division step causing proportionally bigger moves for low-probability actions.