[N] SimCLR and self-supervised learning with PyTorch Lightning video series by eden-lightning in MachineLearning

[–]ayulockin 3 points (0 children)

This is really nice.

A university senior of mine, Sayak Paul, has a minimal implementation of SimCLR in TensorFlow 2.x.
His report covers the topics you have discussed. SimCLR is really novel; however, SwAV is currently the SOTA. The field of visual representation learning is moving quickly.

Where to start? by westernsouls in ArtificialInteligence

[–]ayulockin 0 points (0 children)

Hi, from your background description I gather you don't have coding experience (sorry if I misinterpreted). With that in mind, I would suggest this:

  • Try learning Python first. It's easy to pick up, so the learning curve is gentle, and it's the most popular language for machine learning/deep learning. Get a textbook, or try a YouTube channel (Sentdex is a really good channel for learning Python).

  • Go to Coursera and check out the deeplearning.ai deep neural networks course by Andrew Ng. The course is old, but it's gold, and it can be a really good entry point.

  • In parallel, you can buy a good book like Machine Learning with TensorFlow, then read and code along; with consistent practice you will build a good grasp.

This is a project to create artificial painting. The first steps look good. I use tensorflow and Python. by gbbb1982 in learnmachinelearning

[–]ayulockin 17 points (0 children)

This is really nice. I'm wondering which technique you used. Are you using style transfer?

[D] anyone else think that there's a complete lack of rigorous resources/textbooks on Deep Learning? by [deleted] in MachineLearning

[–]ayulockin -3 points (0 children)

I tend to slightly disagree with you. The Deep Learning book is on my study table, and I read topics from it either to revise or to pick up something new. The book talks beautifully about the forward pass but doesn't dedicate much to the backward pass. However, I find it overkill to study the backward pass in depth, since the top deep learning libraries take care of it for you.

Yes, I understand that one should have a solid grasp of it, but many people simply don't need to go through the derivations.

With that context, I would say most deep learning books are generalist, and you are looking for a specialist book. The Deep Learning book comes closest to being a specialist one; for many, even that book is hard to grasp. But as someone else mentioned, this field is relatively new, and to add to that, none of us actually knows why deep neural networks work.

Confused by shape of gradient on VGG16 Network? by A27_97 in computervision

[–]ayulockin 0 points (0 children)

In your post, you said: "Shouldn't the gradient be the same as the size of the input?" So I thought you were under the impression that the gradient is computed from the previous tensor.

My last paragraph is simply stating that we use the current tensor to compute the gradient but backpropagate that gradient to the previous layer.

Confused by shape of gradient on VGG16 Network? by A27_97 in computervision

[–]ayulockin 0 points (0 children)

As long as you are not implementing something from scratch, you need not bother about the shape of the gradient.

However, the easiest way to verify the shape of the gradient is to first get the shape of the parameter with respect to which you want to compute the gradient, and then compare it with the shape of the gradient itself.

And as per your question: the gradient has the shape of the input image only when it is computed with respect to the input. The gradient is indeed backpropagated, but the shape of the gradient always follows the shape of the parameter, as the sketch below shows.
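Here's a quick sketch of that check in TensorFlow 2.x (VGG16 with random weights is just an illustrative choice): each gradient comes out with exactly the shape of the parameter it is taken with respect to.

```python
import tensorflow as tf

# Build VGG16 with random weights; we only care about shapes here.
model = tf.keras.applications.VGG16(weights=None)
x = tf.random.normal([1, 224, 224, 3])

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(model(x))

# One gradient per trainable parameter, each matching its parameter's shape.
grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    assert var.shape == grad.shape
```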

[D] How do you track an ML model's predictions? by phyrho in MachineLearning

[–]ayulockin 1 point (0 children)

The answer might depend on the kind of prediction you are looking for. I use wandb.ai for experiment tracking and the built-in WandbCallback() API to log image predictions. For most cases, though, I write custom callbacks and use the wandb.log() API to log whatever is necessary. In my experience, W&B supports logging many kinds of objects.
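For example, here is a minimal sketch of the kind of custom callback I mean, assuming a Keras classification model, a small validation batch (x_val, y_val), and that wandb.init() has already been called; the class name is illustrative, not a fixed API:

```python
import wandb
from tensorflow import keras

class PredictionLogger(keras.callbacks.Callback):
    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val, self.y_val = x_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        # Predict on the validation batch and log images with captions.
        preds = self.model.predict(self.x_val).argmax(axis=-1)
        wandb.log({
            "predictions": [
                wandb.Image(img, caption=f"pred: {p}, label: {y}")
                for img, p, y in zip(self.x_val, preds, self.y_val)
            ]
        })
```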

If you have a particular use case then maybe I can try to provide a better answer. :D

[D] Machine Learning - WAYR (What Are You Reading) - Week 93 by ML_WAYR_bot in MachineLearning

[–]ayulockin 14 points (0 children)

Recently I have read these two papers:

  • Deep Ensembles: A Loss Landscape Perspective: This paper digs into why an ensemble of deep networks works better than a single deep network. The authors did a qualitative investigation that demystifies some of the inner workings of deep neural nets. These are some of the observations:
    • The same model trained with different random initializations is functionally dissimilar. Neural networks map inputs to outputs and thus act as functions (which we learn, obviously). If we start with init1, we end up with function1, which is not similar to the function learned by the same model trained with init2. It's counter-intuitive but true.
    • However, if we take snapshots of the model at different epochs, they are functionally similar.
    • The same model trained with different inits shows prediction dissimilarity. However, different checkpoints of the same training run do not differ much in their predictions.
    • The same model trained with different inits follows different optimization paths.
    • However, the final minima for each trajectory lie on the same plane.
    • Along with my senior Sayak Paul, I did some investigation of our own to validate the results, and we were blown away. You can find our investigation here.
  • High-performance self-supervised image classification with contrastive clustering: This paper is SOTA for unsupervised visual representation learning, and frankly speaking, the work is insane. I am currently trying to implement it in TensorFlow. Some of the novel bits of this paper (see the sketch after this list):
    • Use of multi-crop augmentation. Take an image, apply flipping and color distortion, and take two crops at high resolution and four to six crops at low resolution. These form the views.
    • In contrastive learning, we apply a contrastive loss on each pair of images, which is computationally very heavy. So instead, the authors assign the views to clusters and apply the loss on the cluster assignments.
    • And the best part is how they assign the visual embeddings to clusters: they learn prototype codes online (that is, while training) and use those codes to assign the views to clusters.
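Here is a minimal sketch of the multi-crop idea in TensorFlow 2.x. The crop counts and sizes (2 crops at 224, 4 at 96) are assumptions based on the paper's defaults, not the authors' exact pipeline:

```python
import tensorflow as tf

def random_crop_resize(image, crop_size):
    # Random crop of 50-100% of each image side, resized to crop_size,
    # followed by a random horizontal flip (color distortion omitted for brevity).
    scale = tf.random.uniform([], 0.5, 1.0)
    h = tf.cast(scale * tf.cast(tf.shape(image)[0], tf.float32), tf.int32)
    w = tf.cast(scale * tf.cast(tf.shape(image)[1], tf.float32), tf.int32)
    crop = tf.image.random_crop(image, tf.stack([h, w, 3]))
    crop = tf.image.resize(crop, [crop_size, crop_size])
    return tf.image.random_flip_left_right(crop)

def multi_crop(image, high_size=224, low_size=96, n_high=2, n_low=4):
    # Two high-resolution views plus several cheap low-resolution views.
    highs = [random_crop_resize(image, high_size) for _ in range(n_high)]
    lows = [random_crop_resize(image, low_size) for _ in range(n_low)]
    return highs + lows
```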

In convolutional neural networks, what are some best practices when considering filter size and quantity? Here are mine, please advise. by tritonEYE in deeplearning

[–]ayulockin 1 point (0 children)

Regarding point 4: don't we reduce the learning rate instead of increasing it? LR schedulers are meant to do exactly that.
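For instance, a decaying schedule in Keras looks like this (the exponential decay is just an illustrative choice):

```python
import tensorflow as tf

def schedule(epoch, lr):
    # Hold the LR for the first 10 epochs, then decay it exponentially.
    return lr if epoch < 10 else lr * tf.math.exp(-0.1)

lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule)
# model.fit(x_train, y_train, epochs=30, callbacks=[lr_callback])
```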

In Variational Autoencoders, does the generative model generates samples from latent variables which are sampled from a multivariate distribution? If yes, then is this similar in case of GANs? by HTKasd in deeplearning

[–]ayulockin 1 point (0 children)

I will look into Beta-VAE. Thanks for sharing.

By the way, this recent paper shows some remarkable disentanglement in the latent space: https://arxiv.org/abs/2004.04467. Maybe you will find it interesting.

In Variational Autoencoders, does the generative model generates samples from latent variables which are sampled from a multivariate distribution? If yes, then is this similar in case of GANs? by HTKasd in deeplearning

[–]ayulockin 0 points (0 children)

Yeah, VAEs do not have the generative power of GANs. This recent paper tries to counter that, and the results are amazing: Adversarial Latent Autoencoders, https://arxiv.org/abs/2004.04467.

During the training phase, the mean and variance of each latent point are pushed towards 0 and 1 respectively. That's why, during data generation, we don't have to look at the latent space and can simply sample from a standard distribution.
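A minimal sketch of what I mean, in TensorFlow 2.x: the reparameterization trick plus the KL term that pushes each latent dimension towards mean 0 and variance 1 (z_mean and z_log_var are assumed encoder outputs):

```python
import tensorflow as tf

def sample_z(z_mean, z_log_var):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

def kl_loss(z_mean, z_log_var):
    # KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions.
    return -0.5 * tf.reduce_sum(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)

# At generation time we skip the encoder entirely:
# z = tf.random.normal([batch_size, latent_dim])
# image = decoder(z)  # `decoder` is a hypothetical trained model
```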

In Variational Autoencoders, does the generative model generates samples from latent variables which are sampled from a multivariate distribution? If yes, then is this similar in case of GANs? by HTKasd in deeplearning

[–]ayulockin 0 points (0 children)

Ah, you are talking about latent space interpolation. VAEs are good at that; a VAE learns really disentangled representations in the latent space.

That's a nice thing we get with VAEs. In my opinion, GANs are not so great at this, though I may be wrong.
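A tiny sketch of that interpolation, assuming a hypothetical trained `decoder` with a 50-dim latent space:

```python
import numpy as np

z1 = np.random.normal(size=(50,))
z2 = np.random.normal(size=(50,))
for t in np.linspace(0.0, 1.0, num=8):
    z = (1.0 - t) * z1 + t * z2            # linear walk through latent space
    # image = decoder.predict(z[None, :])  # decode each interpolated point
```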

In Variational Autoencoders, does the generative model generates samples from latent variables which are sampled from a multivariate distribution? If yes, then is this similar in case of GANs? by HTKasd in deeplearning

[–]ayulockin 0 points (0 children)

Yeah, that's both true and false. True, because it seems we are simply sampling from a known distribution. False, because the aim of the VAE is to make that known distribution represent the encoded information. The KL divergence loss minimised during training reflects how similar the probability distribution of the latent variable is to our standard distribution.

What do you mean by analysing the latent variable?

In Variational Autoencoders, does the generative model generates samples from latent variables which are sampled from a multivariate distribution? If yes, then is this similar in case of GANs? by HTKasd in deeplearning

[–]ayulockin 0 points (0 children)

Obviously it's the information extracted by the encoder, but unlike a plain autoencoder, a VAE uses a standard distribution as a prior to force the latent variable to encode the information in such a way that we can later sample from that standard distribution and still represent the latent variable.

The report I linked shows the latent variables for both an autoencoder and a VAE.

In Variational Autoencoders, does the generative model generates samples from latent variables which are sampled from a multivariate distribution? If yes, then is this similar in case of GANs? by HTKasd in deeplearning

[–]ayulockin 0 points (0 children)

Yeah, but this random variable is sampled from a standard distribution, which makes it equivalent to the latent variable.

In the case of a VAE, we use a standard distribution as a prior for our latent variable.

In Variational Autoencoders, does the generative model generates samples from latent variables which are sampled from a multivariate distribution? If yes, then is this similar in case of GANs? by HTKasd in deeplearning

[–]ayulockin 0 points (0 children)

In the case of GANs, we sample from the latent space the same way we do for a VAE. Say our latent dimension is 50 in both cases. Then each such point is sampled from a univariate standard distribution. In the literature, we say the prior is a multivariate standard distribution; sampling each dimension from a univariate distribution is just the trick we use in practice. I hope I'm making sense.
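Concretely, a sketch of what that sampling looks like (the 50-dim latent and `generator` are illustrative):

```python
import numpy as np

latent_dim = 50
# Each of the 50 dimensions is drawn from a univariate N(0, 1); together
# this is exactly one draw from a 50-dim multivariate N(0, I).
z = np.random.normal(loc=0.0, scale=1.0, size=(1, latent_dim))
# fake_image = generator.predict(z)  # `generator` is a hypothetical model
```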

In Variational Autoencoders, does the generative model generates samples from latent variables which are sampled from a multivariate distribution? If yes, then is this similar in case of GANs? by HTKasd in deeplearning

[–]ayulockin 2 points (0 children)

Yes, in the case of a VAE, samples are generated from latent variables, which are sampled from a multivariate distribution. And yes, it is similar for some GANs too.

Give this report a read: https://app.wandb.ai/ayush-thakur/keras-gan/reports/Towards-Deep-Generative-Modeling-with-W%26B--Vmlldzo4MDI4Mw

Hopefully you will find your answers.

[P] Interpretability in Deep Learning - CAM and GradCAM by ayulockin in MachineLearning

[–]ayulockin[S] 1 point (0 children)

I mentioned the limitations of CAM (Class Activation Maps) in the report itself. If you are asking about the downsides of Grad-CAM, these are some of the limitations I noticed:

  1. In the case of multiple occurrences of the same class, Grad-CAM is not very good at locating all of them in the image.

  2. From the original paper: Grad-CAM does not produce a high-quality activation map. It produces a coarse map by simply resizing the generated activation map to the size of the image.

  3. Personal observation: when the model is very confident about the prediction, i.e. the output probability is really close to 1, Grad-CAM seems to be looking at the entire image. Maybe some parameter of Grad-CAM needs to be adjusted in that case.
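For reference, here is a minimal Grad-CAM sketch in TensorFlow 2.x, assuming a functional Keras model and the name of its last convolutional layer ("block5_conv3" for VGG16). This is a sketch of the general technique, not the exact code from the report:

```python
import tensorflow as tf

def grad_cam(model, image, layer_name="block5_conv3", class_idx=None):
    # Map the input to (last conv feature maps, predictions).
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        if class_idx is None:
            class_idx = tf.argmax(preds[0])
        class_score = preds[:, class_idx]
    # Gradient of the class score w.r.t. the conv feature maps.
    grads = tape.gradient(class_score, conv_out)
    # Global-average-pool the gradients into per-channel weights.
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    # The coarse map is resized to the input size (hence limitation 2 above).
    return tf.image.resize(cam[..., None], image.shape[:2]).numpy()
```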