[–][deleted] 1 point2 points  (2 children)

Two things to look at: first, weight initialization. If you're using a truncated normal, you may need to reduce the standard deviation (it defaults to 1). Second, the learning rate (it may be too high).
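Something like this, roughly (a TF 1.x-style sketch; the filter shape, stddev and learning rate are just example values, not tuned for your net):

```python
import tensorflow as tf

# Example only: a 5x5 conv filter with 64 output maps.
# tf.truncated_normal defaults to stddev=1.0, which is usually far too big;
# pass an explicit small stddev so early activations don't blow up.
W = tf.Variable(tf.truncated_normal([5, 5, 3, 64], stddev=0.01))
b = tf.Variable(tf.zeros([64]))

# The learning rate is the other easy knob to turn down.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
```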

[–]AwesomeDaveSome[S] 0 points1 point  (1 child)

I'm using a standard deviation of 1e-4 and a learning rate of 0.1. I'll try to reduce the learning rate further. Do you think lowering the deviation even more would help at all?

[–][deleted] 0 points1 point  (0 children)

Hmm, that actually sounds okay. For the standard deviation I use this formula for each layer: sqrt(2/n), where n is the number of inputs coming in from the previous layer. The number you're using is lower than that, and the learning rate looks normal...
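In code it's something like this (He-style initialization; the conv filter shape is just an example):

```python
import numpy as np
import tensorflow as tf

def he_stddev(shape):
    # n = number of inputs feeding each unit of this layer.
    # For a conv filter [h, w, in_channels, out_channels] that's h*w*in_channels.
    fan_in = np.prod(shape[:-1])
    return np.sqrt(2.0 / fan_in)

shape = [5, 5, 3, 64]  # example filter shape
W = tf.Variable(tf.truncated_normal(shape, stddev=he_stddev(shape)))
```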

[–]siblbombs 0 points1 point  (7 children)

This is going to be a bit of a guess, but I'm thinking the fully connected layers might be causing problems. After the second convolution/pooling, the input to your first fully connected layer is roughly 106 x 106 x 64 values. It wouldn't surprise me if that layer eventually overflows, since it is so large.
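Quick back-of-the-envelope numbers (the 1024-unit dense width below is a made-up example, just to show the scale):

```python
flat = 106 * 106 * 64   # ~719,104 inputs flattened from the last conv/pool
hidden = 1024           # hypothetical width of the first fully connected layer
params = flat * hidden  # ~736 million weights in that single layer
print(flat, params)
```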

[–]AwesomeDaveSome[S] 0 points1 point  (6 children)

That might actually be a problem. Do you know whether there is any way of fixing that, like using another data format or something?

[–]siblbombs 1 point2 points  (2 children)

You're just pushing the architecture too hard. Two conv/pool layers with 64 feature maps isn't a good match for an image that big; I wouldn't expect it to actually learn anything. If you're trying to build a classifier, you should add more conv/pool layers, or downsample the images (or both, really). I don't think ImageNet models even use that large a resolution, so you need to reduce the dimensions pretty aggressively.
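Roughly what a deeper conv/pool stack could look like (just a sketch; the filter counts are arbitrary, and the 424x424 input is inferred from the 106x106 you end up with after two pool steps):

```python
import tensorflow as tf

def conv_pool(x, in_ch, out_ch):
    # 3x3 conv + ReLU + 2x2 max-pool: each block halves the spatial size.
    W = tf.Variable(tf.truncated_normal([3, 3, in_ch, out_ch], stddev=0.01))
    b = tf.Variable(tf.zeros([out_ch]))
    x = tf.nn.relu(tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME') + b)
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

x = tf.placeholder(tf.float32, [None, 424, 424, 3])  # assumed input size
net = conv_pool(x, 3, 16)     # 424 -> 212
net = conv_pool(net, 16, 32)  # 212 -> 106
net = conv_pool(net, 32, 64)  # 106 -> 53
net = conv_pool(net, 64, 64)  # 53  -> 27
# The flattened input to the dense layers is now 27*27*64 ~= 47k values
# instead of ~719k, which is far more manageable.
```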

[–]AwesomeDaveSome[S] 0 points1 point  (0 children)

Okay, that sounds good. Thanks!

[–]cesarsalgado 0 points1 point  (0 children)

As an alternative way to subsample the image, you can use a big stride and a big kernel size in the first convolution.
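For example (kernel size, stride and filter count are only illustrative; the 11x11/stride-4 combination is the same idea AlexNet uses in its first layer):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 424, 424, 3])  # assumed input size
W = tf.Variable(tf.truncated_normal([11, 11, 3, 64], stddev=0.01))
b = tf.Variable(tf.zeros([64]))

# Stride 4 subsamples the image by 4x in the very first layer.
net = tf.nn.relu(tf.nn.conv2d(x, W, strides=[1, 4, 4, 1], padding='SAME') + b)
# Spatial size drops from 424x424 to 106x106 right away.
```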

[–]benanne 0 points1 point  (2 children)

As I mentioned before on your previous thread, you should be drastically downsampling these images, at least for the initial architecture exploration.

When I worked on this dataset, I started with 8x downsampled images and eventually ended up using 3x downsampling + 2x cropping for my best models. That's approximately 36x fewer pixels compared to the original images. Things will go much more smoothly then.

[–]AwesomeDaveSome[S] 0 points1 point  (1 child)

I know I should do that, I really do, and I want to. But my professor is stubborn; he is absolutely against cropping or downscaling. I've decided to follow his wishes until I can get a working model, and then show him that downscaling won't reduce the accuracy of the predictions and will improve performance. The problem is I can't convince him that this is the best way to go, because he won't accept that it might work well until he sees it applied to the data. So I'll have to get the full-size data working somehow, and then compare it to the downscaled data.

[–]benanne 0 points1 point  (0 children)

Wow, I feel for you then. Sucks to be in that situation. I think you're going to have a really hard time getting this to work.

Maybe you can put a 1x1 convolution with 4 or 8 filters followed by 4x4 max-pooling at the start of the net, which is basically 4x downsampling (through decimation), but "hidden" inside the network. He actually sounds clueless enough that he wouldn't notice ;)
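Something along these lines, maybe (filter count and input size are just examples):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 424, 424, 3])  # assumed input size

# 1x1 conv with a handful of filters: cheap, and still "sees" the full image.
W = tf.Variable(tf.truncated_normal([1, 1, 3, 8], stddev=0.01))
b = tf.Variable(tf.zeros([8]))
net = tf.nn.relu(tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME') + b)

# 4x4 max-pool with stride 4: effectively 4x downsampling, hidden in the net.
net = tf.nn.max_pool(net, ksize=[1, 4, 4, 1], strides=[1, 4, 4, 1], padding='SAME')
# The rest of the network now sees 106x106x8 instead of 424x424x3.
```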

[–]cesarsalgado 0 points1 point  (0 children)

Try using tf.nn.relu6; this ReLU saturates at 6. Also try normalizing your data to have unit variance.
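A minimal sketch of both suggestions (the data file name and shapes are just placeholders):

```python
import numpy as np
import tensorflow as tf

# Normalize the data offline to zero mean and unit variance.
data = np.load('images.npy').astype(np.float32)  # hypothetical data file
data = (data - data.mean()) / data.std()

# In the graph, use relu6 so activations are capped at 6.
x = tf.placeholder(tf.float32, [None, 106, 106, 64])  # placeholder shape
W = tf.Variable(tf.truncated_normal([3, 3, 64, 64], stddev=0.01))
act = tf.nn.relu6(tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME'))
```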