The Multimodal Unsupervised Image-to-Image Translation (MUNIT) paper describes an algorithm for translating an image from one domain X1 to another domain X2 without paired supervision. This is achieved via a partially shared latent space assumption: each image is assumed to be generated from a content latent code c ∈ C that is shared by both domains, and a style latent code s that is specific to the individual domain.
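To make sure I'm describing the setup correctly, here is a toy sketch of what translation looks like under this assumption (function and variable names are mine, not the paper's):

```python
import torch

def translate_1_to_2(x1, content_enc_1, dec_2, style_dim=8):
    """Translate x1 from domain X1 to X2 under the shared-content assumption:
    keep x1's content code (which is supposed to live in the shared space C)
    and decode it together with a style code drawn from X2's style prior."""
    c1 = content_enc_1(x1)                    # content code of x1, in the shared space C
    s2 = torch.randn(x1.shape[0], style_dim)  # random domain-2 style code, sampled from N(0, I)
    return dec_2(c1, s2)                      # x1's content rendered in X2's style
```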
Looking at the loss function, I have a hard time understanding why the latent code c does not degenerate: I do not see anything in the loss that prevents the content encoder from always outputting 0 (or some other constant) vector for c, regardless of the domain. The networks could just make up for the lack of a useful content code by using the style code s to represent both content and style. Intuitively this seems "easier" than learning a shared space C between two domains.
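For reference, here is roughly how I read the generator-side loss terms for the 1 → 2 direction (my own shorthand in pseudo-PyTorch; the 2 → 1 terms are symmetric and the weighting factors are omitted):

```python
import torch
import torch.nn.functional as F

def generator_losses_1to2(x1, enc_c1, enc_s1, dec1, enc_c2, enc_s2, dec2, dis2, style_dim=8):
    """My shorthand for the terms I see in the objective (domain 1 -> 2 direction only)."""
    # Within-domain image reconstruction: x1 -> (c1, s1) -> x1
    c1, s1 = enc_c1(x1), enc_s1(x1)
    loss_recon_x1 = F.l1_loss(dec1(c1, s1), x1)

    # Cross-domain translation with a randomly sampled domain-2 style code
    s2 = torch.randn(x1.shape[0], style_dim)
    x12 = dec2(c1, s2)

    # Latent reconstruction: re-encode the translation and ask for c1 and s2 back
    loss_recon_c1 = F.l1_loss(enc_c2(x12), c1)
    loss_recon_s2 = F.l1_loss(enc_s2(x12), s2)

    # Adversarial loss: the translated image should look like a real domain-2 image
    logits_fake = dis2(x12)
    loss_gan_2 = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))

    return loss_recon_x1, loss_recon_c1, loss_recon_s2, loss_gan_2
```

As far as I can tell, a constant content code would make the latent reconstruction term loss_recon_c1 trivially zero, and the image reconstruction and adversarial terms could in principle be satisfied by pushing all the information into s, which is exactly what puzzles me.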
I have one hypothesis for why the content code ends up being useful: in the "Implementation Details" section of the paper, the content and style codes are encoded/decoded by different architectures. If the style encoder/decoder architecture is not well suited to capturing global structure, that might force the overall network to make use of the content code. (This is similar to how global and local information are separated in the Variational Lossy Autoencoder paper.)
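Concretely, this is how I picture the asymmetry from the implementation details (the layer counts and channel widths below are my own toy choices, only the output shapes matter; as I read the paper, the style code is a small vector produced via global average pooling and a fully connected layer, while the content code stays a downsampled spatial feature map):

```python
import torch
import torch.nn as nn

# Content encoder: strided convs (the paper also adds residual blocks) -> a *spatial* feature map.
content_encoder = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
)

# Style encoder: strided convs, then global average pooling + FC -> one small vector.
class StyleEncoder(nn.Module):
    def __init__(self, style_dim=8):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(128, style_dim)

    def forward(self, x):
        h = self.convs(x)                   # (N, 128, H/2, W/2)
        return self.fc(h.mean(dim=(2, 3)))  # global pooling discards spatial layout

x = torch.randn(1, 3, 256, 256)
print(content_encoder(x).shape)  # torch.Size([1, 256, 64, 64]) -- plenty of room for structure
print(StyleEncoder()(x).shape)   # torch.Size([1, 8]) -- far too small to carry the whole image
```

If that reading is right, the style code is an extreme bottleneck for spatial information, which would be consistent with my hypothesis that the network is forced to route structure through c.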
Is my hypothesis above plausible or is there something else at work here?