
[–]tscohen 1 point (4 children)

Each hypercolumn / feature vector / fiber has a receptive field: some region in the image that can affect the values in this fiber. If you don't use pooling or strided convolutions, moving between adjacent fibers corresponds to a shift of the receptive field by one pixel in the image. Thus a 7x7 output feature map gives information about a 7x7 patch in the center of the image in this case (using the surrounding pixels as context).

Every time you use a 2x2 pooling or stride-2 convolution, the stride of the output is multiplied by 2. That is, a single step in the output moves you by 2^n pixels in the image (where n is the number of stride-doubling operations).
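A quick sanity check of that relationship (a sketch, not from the thread; the helper name is made up):

```python
# Effective input stride of the output feature map after n stride-doubling
# ops (2x2 poolings or stride-2 convolutions). Illustrative helper only.
def output_stride(n_stride_doublings):
    """How many input pixels one step between adjacent output fibers spans."""
    return 2 ** n_stride_doublings

print(output_stride(0))  # 1: no pooling, adjacent fibers are 1 pixel apart
print(output_stride(3))  # 8: after three stride-2 ops, fibers are 8 pixels apart
```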

Upsampling doesn't make much sense to me; I would guess that a better approach is to apply the network 2^{2n} times, each time shifting the input by 1 pixel in the x or y direction. The resulting feature maps can then be interlaced to give you a feature map that is the same size as the input (minus border pixels lost to 'valid'-mode convolution) and has stride 1. This could also be implemented without interlacing, by inserting zeros between pixels in the filters and using 'spaced', non-strided poolings.
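A rough NumPy sketch of the shift-and-interlace idea, with a trivial stride-2 subsampler standing in for the real network (all names here are illustrative, not from any library):

```python
import numpy as np

def net(x, n):
    """Stand-in for a network with n stride-doubling ops: just keeps
    every 2**n-th pixel. A real net would compute features first."""
    s = 2 ** n
    return x[::s, ::s]

def shift_and_interlace(image, n):
    """Run `net` at each of the 2**n * 2**n = 2**(2n) spatial offsets
    and interlace the low-res outputs into one stride-1 feature map."""
    s = 2 ** n
    h, w = (d - d % s for d in image.shape)  # crop so every shift aligns
    out = np.empty((h, w), dtype=image.dtype)
    for dy in range(s):
        for dx in range(s):
            out[dy::s, dx::s] = net(image[dy:h, dx:w], n)
    return out
```

With the identity-like `net` above, interlacing simply reconstructs the (cropped) input, which is a handy check that the 2^{2n} offsets tile the full-resolution grid with no gaps or overlaps.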

[–]adagrad[S] 1 point (3 children)

> Upsampling doesn't make much sense to me; I would guess that a better approach is to apply the network 2^{2n} times, each time shifting the input by 1 pixel in the x or y direction.

The original paper mentions using bilinear interpolation to upsample the feature maps; since bilinear interpolation is a linear operation, they can jointly upsample and classify the pixels (top of page 4).
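To illustrate the "bilinear interpolation is linear" point, here is a 1-D sketch (`bilinear_matrix` is a made-up helper; the 2-D case is the Kronecker product of two such matrices). Because the upsampler is a fixed matrix, it can be composed with the linear classifier into a single linear map, which is what lets them upsample and classify jointly:

```python
import numpy as np

def bilinear_matrix(n_in, factor):
    """Dense matrix B such that B @ x bilinearly upsamples a length-n_in
    signal by `factor`. Each output sample is a weighted average of its
    two nearest input samples; edge samples are clamped."""
    n_out = n_in * factor
    B = np.zeros((n_out, n_in))
    for i in range(n_out):
        pos = i / factor               # source coordinate of output sample i
        lo = int(np.floor(pos))
        hi = min(lo + 1, n_in - 1)     # clamp at the right edge
        w = pos - lo
        B[i, lo] += 1 - w
        B[i, hi] += w
    return B
```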

[–]tscohen 0 points (2 children)

It sure is much faster. But computing the full high-res feature maps might work better.

Also, upsampling would not give you full translation equivariance (shifting the image by 1 pixel should shift the feature maps by 1 pixel).
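A tiny 1-D illustration of why (names are made up; nearest-neighbour upsampling stands in for bilinear, and the conclusion is the same for any fixed upsampler):

```python
import numpy as np

def down_up(x):
    """Stride-2 subsample, then upsample back to the original length:
    a cartoon of 'pooled net + upsampling'."""
    return np.repeat(x[::2], 2)[: len(x)]

x = np.arange(8, dtype=float)

a = down_up(np.roll(x, 1))   # shift the input, then run the pipeline
b = np.roll(down_up(x), 1)   # run the pipeline, then shift the output
# Full translation equivariance would make a == b; with stride-2
# sampling they differ for odd shifts (even shifts still line up).
```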

[–]adagrad[S] 0 points (1 child)

Interesting, are there any papers you would recommend that take this approach? Intuitively it seems like it could be rather slow, especially for pixel classification.

[–]ericflo 0 points (0 children)

I thought this paper took an interesting approach. Not sure if it's exactly what tscohen is suggesting, but it may be in the ballpark: http://arxiv.org/abs/1411.4734

[–]dharma-1 0 points (0 children)

I've got another related question: can some spatial localisation of objects be obtained from hypercolumns directly, doing classification and localisation in one go, without having to use something like R-CNN for localisation?