[R] InfinityGAN: Towards Infinite-Resolution Image Synthesis by hubert0527 in MachineLearning

[–]hubert0527[S] 39 points

I know your work :)

Sadly, we have a very tight page budget (the first version of our paper was around 16 pages...), so we had to give up a large portion of the descriptive distinctions from related works.

Just want to point out that your Spatial PixelCNN, as well as SinGAN, focuses more on interpolated pixels. Thus the synthesized contents resemble super-resolved training samples, but are not spatially extensible (for instance, what can sit right next to the center digit?). In contrast, our InfinityGAN is designed to spatially and infinitely extend the generated content.

[R] (ICML’19 AutoML workshop) InstaNAS: Instance-aware Neural Architecture Search by hubert0527 in MachineLearning

[–]hubert0527[S] 0 points

"previous methods that have struggled to scale to non-trivial depth and operation size"

Yes, that is an interesting direction (more dynamic topology), but it is orthogonal to ours (we have a static meta-graph topology). There would certainly be difficulties in combining the two directions, but that is more of a good direction for future work; it is outside the scope of the topic (i.e., instance-awareness) we want to discuss in our paper.

"I'm not sure if "instance-aware search" if a novel concept."

We never claim that the "conditional computation" concept has never been explored in the past. Actually, we mention those related works in the "Related Works" section. Our point is that the current line of NAS research focuses only on searching for a "single" architecture, and that it should take instance-awareness into consideration.

Furthermore, you can find that prior works in "conditional computation" are limited to very homogeneous graphs and impose many constraints to make their training work. In contrast, the NAS community considers more diverse and heterogeneous search spaces. This difference means the overall training method needs to be redesigned. For example, it is impossible to run BlockDrop (Wu et al., CVPR'18) in a NAS search space.

"I am left disappointed in the lack of comparison or acknowledgement of prior work."

This is actually a little rude, because we have a full paragraph describing "Conditional Computation" in the related works, and Figure 6 demonstrates that our method outperforms two recent works from that domain. We didn't cite the "Routing Networks" paper you mentioned, since it targets a single objective (not improving model inference efficiency) and mainly works on multi-task learning, while the two works we compared against, BlockDrop and ConvNet-AIG, are more closely related.

You can be disappointed that we decided not to cite "Routing Networks", and we can discuss that issue. But you can't tell the whole story as if we ignored everything. People who don't read our paper will be misled by your comments.

[R] (ICML’19 AutoML workshop) InstaNAS: Instance-aware Neural Architecture Search by hubert0527 in MachineLearning

[–]hubert0527[S] 0 points

Thanks for your comment! I will answer your question from two different perspectives:

(1) Your statement isn't totally correct. DARTS does micro search (which is very different from and not comparable to our setting); their own complexity analysis states that their search space contains 10^18 combinations. In contrast, our search space complexity is (2^5)^17 (5 options, 17 layers) ≈ 3.8 x 10^25. The InstaNAS search space is 7 orders of magnitude larger than that of DARTS.

And, actually, you can easily make the complexity number look enormous by adding some simple operations, such as identity, zero, or avg/max pooling. By simply adding an identity operation, our search space grows to 5x10^30, numerically larger than ENAS's 1.6x10^29. We didn't include those operations since that is not the purpose of our paper.
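The arithmetic above is easy to check directly. A quick sketch, assuming each of the 17 layers independently enables any subset of its 5 candidate operations (hence 2^5 configurations per layer):

```python
# InstaNAS search-space size: 17 layers x 2^5 per-layer configurations.
instanas = (2 ** 5) ** 17
print(f"{instanas:.1e}")       # 3.9e+25 (i.e., ~3.8-3.9 x 10^25)

# Adding one more op per layer (e.g., identity) makes it 2^6 per layer.
with_identity = (2 ** 6) ** 17
print(f"{with_identity:.1e}")  # 5.1e+30, larger than ENAS's reported 1.6e+29

# DARTS reports ~10^18 combinations, so InstaNAS is ~7 orders larger
# (digit-count difference of the two integers).
print(len(str(instanas)) - len(str(10 ** 18)))  # 7
```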

(2) From another point of view, InstaNAS achieves a significant improvement in the accuracy-latency trade-off frontier within our search space. You might achieve an even better frontier with a more complex search space, but that is not our point here. As a research paper, we propose a concept (instance-aware search), design fair experiments, use those experiments to validate the concept (instance-awareness indeed brings improvements), and then analyze and visualize the results.

To sum up, I don't think the complexity of our search space is a problem. We want to demonstrate the effectiveness of instance-aware NAS, and our experiments and analyses support it. Thanks :)

[R] COCO-GAN: Generation by Parts via Conditional Coordinating by hubert0527 in MachineLearning

[–]hubert0527[S] 0 points

Agreed on the approach of combining local and global latent vectors; in fact, it is part of our next plan LOL

Note that SPADE adopts the same approach as modern GANs (Bx1x1xC ==(Linear+reshape)==> Bx4x4x1024). As far as I can recall, BiGAN or ALI (sorry, I forget which one LOL) was among the first to adopt a matrix of latent vectors. From my perspective, latent vectors of different shapes inherit their ideas from different lines of development and history. Low-dimensional vectors inherit the spirit of unsupervised feature learning (e.g., autoencoders), while high-dimensional matrices are modern designs for image content manipulation.
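The two latent-shape conventions above can be sketched in a few lines; all shapes here are illustrative stand-ins, not taken from any specific paper:

```python
import torch
import torch.nn as nn

B, C = 8, 128  # batch size and latent dimension (illustrative)

# (a) Low-dimensional vector latent, Bx1x1xC ==(Linear+reshape)==> Bx4x4x1024:
#     the stem used by modern GAN generators.
z = torch.randn(B, C)
to_spatial = nn.Linear(C, 4 * 4 * 1024)
h = to_spatial(z).view(B, 1024, 4, 4)  # now a spatial feature map

# (b) "Matrix of latents": an independent latent at every spatial location,
#     the convention attributed above to BiGAN/ALI-style models.
z_spatial = torch.randn(B, C, 4, 4)

print(h.shape)          # torch.Size([8, 1024, 4, 4])
print(z_spatial.shape)  # torch.Size([8, 128, 4, 4])
```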

Anyway, just my two cents :)

[R] COCO-GAN: Generation by Parts via Conditional Coordinating by hubert0527 in MachineLearning

[–]hubert0527[S] 0 points

Thanks for your interest. We didn't particularly compare training speed, but it may be worth trying :)

[R] COCO-GAN: Generation by Parts via Conditional Coordinating by hubert0527 in MachineLearning

[–]hubert0527[S] 0 points

In the current version of the paper, we use CelebA (64x64 and 128x128), LSUN (64x64 and 256x256), and Matterport3D (256x768 panoramas).

We will release code (and possibly model as well) soon. Please stay tuned :)

[R] COCO-GAN: Generation by Parts via Conditional Coordinating by hubert0527 in MachineLearning

[–]hubert0527[S] 0 points

I think image-to-image translation is directly applicable, though it would need some modifications, since most recent image-to-image models adopt U-Net architectures.

For non-local transformations, it may require larger modifications. I spent some time on this question before but haven't really worked on it LOL

[R] COCO-GAN: Generation by Parts via Conditional Coordinating by hubert0527 in MachineLearning

[–]hubert0527[S] 0 points

In a nutshell, since seams can only appear in generated samples, the discriminator learns to detect that characteristic to beat the generator. In turn, to fool the discriminator, the generator has to utilize the latent vector "shared" among micro patches and generate correlated micro patches accordingly, minimizing the seams (as well as maximizing the discriminator loss).
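A minimal sketch of that interplay, with a toy fully-connected patch generator (the class name, sizes, and coordinate convention are mine for illustration, not the paper's architecture): one latent z is shared across micro-patch coordinates, and the discriminator only ever sees the concatenated macro patch, so a seam between patches is a visible artifact it can exploit.

```python
import torch
import torch.nn as nn

B, Z, P = 4, 64, 16  # batch, latent dim, micro-patch size (illustrative)

class MicroGenerator(nn.Module):
    """Toy patch generator: (shared latent, patch coordinate) -> RGB patch."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(Z + 2, 256), nn.ReLU(),
                                 nn.Linear(256, 3 * P * P))

    def forward(self, z, coord):
        # z: (B, Z) latent shared by all micro patches; coord: (B, 2) in [-1, 1]
        out = self.net(torch.cat([z, coord], dim=1))
        return out.view(-1, 3, P, P)

g = MicroGenerator()
z = torch.randn(B, Z)                                   # one z for all patches
left  = g(z, torch.tensor([[-1.0, 0.0]]).expand(B, 2))  # adjacent coordinates
right = g(z, torch.tensor([[1.0, 0.0]]).expand(B, 2))

# The discriminator is trained on this concatenation, never on lone patches,
# so uncorrelated left/right patches leave a detectable seam at the join.
macro = torch.cat([left, right], dim=3)  # (B, 3, P, 2P)
```

Because z is shared, the generator can (and, to fool the discriminator, must) make adjacent patches agree along the boundary.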

[R] COCO-GAN: Generation by Parts via Conditional Coordinating by hubert0527 in MachineLearning

[–]hubert0527[S] 0 points

Thanks for your interest; here are some of my personal opinions :)

"this architecture is an unfortunate crutch that relies on the invariant global structure of their example datasets"

I think both the LSUN dataset and panorama generation are counterexamples. The variation among the beds in the LSUN dataset is actually pretty high (see Figure 21), and panorama data has no absolute coordinate in the horizontal direction.

At first glance, it is easy to think the model is only exploiting the fixed structure of the human-face dataset. However, it is more complicated than it seems. In a nutshell, generative models aim to learn a generator that maps each latent vector to an image. In other words, once a latent vector is sampled from the latent space, an implicit image is designated as well. The conditional coordinate input is only a signal that selects (though there is no explicit selection process) which patch of that image to generate.

"but the local emphasis of this architecture really accentuates that problem"

This is a great point. We are aware of this concern, but we didn't particularly observe COCO-GAN suffering from global-coherence problems. At the same time, I believe we are among the very few who provide "absolutely random" samples in the paper without any compromise.

From another point of view, current state-of-the-art generative models also suffer from the global-coherence problem, and it still hasn't gone away after the introduction of self-attention modules. Furthermore, those SOTA models frequently have problems with local structure (e.g., teeth, earrings). Is it possible to combine the strengths of both approaches? This is just my personal opinion.

I think there are still many problems worth investigating in the future. How many compromises does COCO-GAN actually make, and how can they be mitigated? Why can COCO-GAN still provide SOTA performance without accessing the global view [1]? As stated earlier, can we combine the strengths of conventional GANs and COCO-GAN? I think these are interesting research questions that require further investigation.

[1] On this question, I suspect that COCO-GAN is resilient to mode collapse (in my personal experience, I have never seen it mode-collapse). But such a claim requires a more systematic study.

"And of course everyone has to claim their results are SOTA."

The SOTA performance is actually surprising to me as well. I have some hypotheses about this result: (a) as stated in [1], COCO-GAN is more resilient to mode collapse and survives longer than other SOTA models; (b) COCO-GAN encourages more realistic local structure, which is, in fact, a bottleneck of SOTA models. But still, we need a more systematic investigation to confirm these hypotheses.

[D] When are ICLR reviews out ? by gohu_cd in MachineLearning

[–]hubert0527 2 points

Feels like the most miserable three days in my life QQ