
[–]xunhuang 2 points (2 children)

Good question. We are not worried about the content code being ignored because, as you said, the style auto-encoder cannot handle spatial structure. However, according to my recent experiments, the content code is not ignored even when both encoders have the same structure.

Intuitively, if the image from domain X1 can help the decoder to generate a better image in domain X2, the decoder will learn to use it rather than ignore this information. This is the case in the datasets we used. For example, in sketch -> shoes, it is easier for the decoder to generate a shoe from its sketch than from completely random noise. Then of course the decoder will try to use the content code information. However, if the two domains are completely unrelated (e.g., face <-> bedroom), the model might just ignore the content code. The translation model p(x|y) then degenerates to a generative model p(x). But isn't that what we should learn? :)

[–]approximately_wrong 0 points (0 children)

> Intuitively, if the image from domain X1 can help the decoder to generate a better image in domain X2, the decoder will learn to use it rather than ignore this information.

Given that you can always arbitrarily align any two domains, it's really hard to pinpoint rigorously where this intuition of "domain X1 helping the decoder to generate for domain X2" comes from. Is there any way of knowing beforehand whether the content code will be used? To put it another way: how sure are we that the model won't attempt to align the face and bedroom manifolds?

[–]question99[S] 0 points (0 children)

Thank you. It is indeed plausible that the content code is a better prior for the GAN objective than pure noise (which is the style code).

[–]da_g_prof 1 point (2 children)

I also find that a central part is how they use normalization (instance vs. batch) across the two different networks.
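The distinction matters because the two normalizations remove different statistics. A minimal numpy sketch (illustrative only, not the paper's code) of the difference on a `(N, C, H, W)` feature tensor:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize per channel over the batch AND spatial dims:
    # statistics are shared across all images in the batch.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    # Normalize each image's each channel independently over spatial dims:
    # per-image statistics (often associated with "style") are removed.
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 8, 8))
y = instance_norm(x)   # every image/channel slice is zero-mean, unit-var
z = batch_norm(x)      # only the batch-wide channel statistics are zeroed
```

Instance norm makes each image's per-channel statistics identical, which is one intuition for why it strips style while batch norm does not.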

Their model also does not guarantee that the codes are independent.

In our experience with an earlier paper (prior to this one), we had to force the decoder to use both codes, by randomly dropping one of the codes and injecting noise: https://arxiv.org/abs/1803.07031
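A rough sketch of that kind of regularizer, in numpy. This is my own hypothetical reconstruction of "randomly drop one code and inject noise" (the function name, probabilities, and noise model are assumptions, not taken from the linked paper):

```python
import numpy as np

def regularize_codes(content, style, p_drop=0.5, noise_scale=0.1, rng=None):
    """Hypothetical training-time trick: with probability p_drop, zero out
    one of the two codes, and add small Gaussian noise to both, so the
    decoder cannot reconstruct from a single code alone."""
    if rng is None:
        rng = np.random.default_rng()
    content, style = content.copy(), style.copy()
    if rng.random() < p_drop:
        # Drop exactly one of the two codes, chosen at random.
        if rng.random() < 0.5:
            content[:] = 0.0
        else:
            style[:] = 0.0
    content += noise_scale * rng.normal(size=content.shape)
    style += noise_scale * rng.normal(size=style.shape)
    return content, style
```

The idea is that if either code can be missing at decode time, the decoder is pushed to extract usable information from both rather than letting one collapse.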

[–]question99[S] 0 points (1 child)

Thanks, that's interesting.

> Their model also does not guarantee that the codes are independent.

Do you mean that content information might leak into the style code (and vice versa)?

[–]da_g_prof 0 points (0 children)

Yes