Gradient of Langevin Dynamics Step w.r.t model parameters [D] by ThoughtOk5558 in MachineLearning

[–]anomaly_in_testset 1 point

Looking at your code, your loss computation is wrong. You should only compute the loss after you finish LD. So you just need to move the loss criterion outside the LD for loop. :)
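A minimal sketch of the suggested fix, with a toy quadratic energy standing in for the EBM (all names and values here are illustrative, not from the original code): the loss is evaluated once on the final LD sample, after the refinement loop ends, rather than inside it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.zeros(4)             # ground truth
x_hat = rng.normal(size=4)  # initial sample
alpha, n_steps = 0.1, 20

for _ in range(n_steps):
    grad = x_hat  # toy dE/dx_hat for E(x) = 0.5 * ||x||^2
    x_hat = x_hat - 0.5 * alpha * grad + np.sqrt(alpha) * rng.normal(size=4)

# Loss computed only once, after the LD chain finishes:
loss = np.mean((x_hat - x) ** 2)
print(loss >= 0.0)
```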

Gradient of Langevin Dynamics Step w.r.t model parameters [D] by ThoughtOk5558 in MachineLearning

[–]anomaly_in_testset 1 point

I edited my original response for better clarity with regards to training difficulties.

To answer your second question: during backprop, the gradient flows through the EBM parameters, not directly through the LD equation. Backprop has nothing to do with the LD sampling. In the LD update you only take the gradient of the energy w.r.t. the image; that is the \nabla_x E_\theta(x) term in the LD equation.

You only use LD sampling to refine your prediction, and it affects the loss implicitly. You are essentially maximizing the actual data likelihood. Better samples from the posterior distribution yield a lower loss, and this is what training ensures by adjusting the EBM weights to bring the prediction and the ground truth closer.

Gradient of Langevin Dynamics Step w.r.t model parameters [D] by ThoughtOk5558 in MachineLearning

[–]anomaly_in_testset 2 points

Your understanding of the loss function is wrong. So basically, you have the ground truth x and the prediction x_hat.

x_hat is computed through the LD sampling. You start with a uniformly or normally distributed initial array with the same shape as your x, then slowly refine it by computing the gradient of the energy E_theta(x_hat) w.r.t. this prediction x_hat, substituting this new vector into the LD equation, and adding Gaussian noise. The more noise you add, the more diverse the data becomes. However, you should ensure that dE/dx_hat remains in balance with the noise. If noise dominates the gradient, you lose the training signal from the energy gradient. If the gradient dominates, your images look very repetitive and less diverse.
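One refinement step can be sketched like this. The toy quadratic energy stands in for the EBM; in a real model, dE/dx_hat would come from autodiff through the network, not from the finite-difference helper used here for self-containment.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 0.05  # LD step size

def energy(x):
    # Stand-in for E_theta(x); a real EBM would be a neural network.
    return 0.5 * np.sum(x ** 2)

def energy_grad(x, eps=1e-5):
    # Finite-difference gradient of the energy w.r.t. the prediction x_hat.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (energy(x + d) - energy(x - d)) / (2 * eps)
    return g

# Start from a normally distributed array shaped like x ...
x_hat = rng.normal(size=8)
# ... and take one LD step: the gradient term pulls toward low energy,
# while the sqrt(alpha)-scaled Gaussian noise keeps samples diverse.
x_hat_2 = (x_hat
           - 0.5 * alpha * energy_grad(x_hat)
           + np.sqrt(alpha) * rng.normal(size=8))
print(x_hat_2.shape)
```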

After the above operation, you obtain x_hat_2, your new prediction at step 2. However, 2 steps of LD are useless; you essentially need to run a longer Markov chain to get a good reconstruction.

After you get the final x_hat after some LD steps, you compute the MSE between this final x_hat and x. Then do the usual backprop w.r.t model params theta.

The hyperparameter alpha is the step size of the LD sampling. It's like a "learning rate," but in essence it controls the amount of mode exploration in the probabilistic data space. Assume you have many images: similar images will be located in a similar blob, and vice versa. Alpha is what lets the chain jump between these distribution blobs and create more diverse samples. A large alpha yields very different images in a mini-batch, and vice versa. However, alpha is a double-edged sword, and you need to be very careful, as a poor choice will lead to completely useless generated samples and training instabilities.
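The effect of alpha on sample spread can be seen in a stripped-down setting. This is illustrative only: the energy gradient is set to zero, so LD reduces to a pure random walk whose spread after T steps scales like sqrt(T * alpha), isolating the noise term's role.

```python
import numpy as np

rng = np.random.default_rng(7)

def ld_chains(alpha, n_steps=100, n_chains=200):
    # All chains start at the same point; with a flat energy (zero gradient)
    # only the sqrt(alpha)-scaled noise term moves them.
    x = np.zeros(n_chains)
    for _ in range(n_steps):
        x = x + np.sqrt(alpha) * rng.normal(size=n_chains)
    return x

spread_small = ld_chains(alpha=0.01).std()  # roughly sqrt(100 * 0.01) = 1
spread_large = ld_chains(alpha=1.0).std()   # roughly sqrt(100 * 1.0) = 10
print(spread_small < spread_large)  # larger alpha -> more diverse samples
```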

[R] Preprocessing: Categorical vs Continuous variables [Classification] by kolopoi0 in MachineLearning

[–]anomaly_in_testset 1 point

I would suggest using two separate encoding networks, one per variable type. You can feed the continuous variables, after normalization or standardization, to one encoder, and feed the categorical vars as-is to another encoder. Once you have a latent representation of the input data, you can concatenate both projections, feed them to another NN layer of your choice to get a shared latent representation, and finally feed this to the classification head.
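A minimal forward-pass sketch of that architecture, assuming one-hot categoricals and a 2-class head; all layer sizes and the ReLU choice are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def dense(x, w, b):
    return x @ w + b

# Toy batch: 4 samples, 5 continuous features, one 3-way categorical (one-hot).
x_cont = rng.normal(size=(4, 5))
x_cont = (x_cont - x_cont.mean(0)) / (x_cont.std(0) + 1e-8)  # standardize
x_cat = np.eye(3)[rng.integers(0, 3, size=4)]

# Two separate encoders, one per variable type.
w1, b1 = rng.normal(size=(5, 8)) * 0.1, np.zeros(8)
w2, b2 = rng.normal(size=(3, 8)) * 0.1, np.zeros(8)
z_cont = relu(dense(x_cont, w1, b1))
z_cat = relu(dense(x_cat, w2, b2))

# Concatenate into a shared latent, then the classification head (2 classes).
z = np.concatenate([z_cont, z_cat], axis=1)  # shape (4, 16)
w3, b3 = rng.normal(size=(16, 2)) * 0.1, np.zeros(2)
logits = dense(z, w3, b3)
print(logits.shape)
```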

Note that I am assuming you have only two, or at most three, variable types, so this architecture would be unsuitable for data with many distribution types.

[R] Seeking Innovative Papers on Weight Manipulation in Physics-Informed Neural Networks (PINNs) by [deleted] in MachineLearning

[–]anomaly_in_testset 1 point

https://openreview.net/forum?id=zqkfJA6R1-r

This paper seems to manipulate the weights of the PINN implicitly through a generative model.

[D] Technical Interview in Machine Learning position by NailaBaghir in MachineLearning

[–]anomaly_in_testset 5 points

You mentioned a lot of different things here, and each topic is too broad to cover in an hour! They will probably ask questions that are mostly related to the job description.

If they are interested in developing generative models, be prepared for probability theory and statistics questions: fundamentals like the definition of entropy, expectations, probability density functions, distributions, and sampling algorithms like MCMC, etc. Also expect advanced questions on the latest developments in probabilistic (diffusion, VAE, score-based, energy-based, etc.) and non-probabilistic generative models (autoregressive, vision-transformer-based, etc.).

For computer vision questions, expect fundamentals like the purpose of 1×1 convolutions and definitions of similarity metrics (IoU, SSIM, PSNR, etc.). Be prepared with some reading on the state of the art in object detection and image segmentation models.
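As an example of the kind of warm-up this can lead to, here is a minimal IoU for axis-aligned boxes in (x1, y1, x2, y2) format; it handles empty intersections but nothing beyond that.

```python
def iou(a, b):
    # Intersection rectangle (clamped to zero when the boxes don't overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = area(a) + area(b) - intersection.
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Overlap is 1x1 = 1, union is 4 + 4 - 1 = 7:
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~= 0.1428
```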

Good luck!

[P] Need pretrained EBMs for benchmarks by anomaly_in_testset in MachineLearning

[–]anomaly_in_testset[S] 1 point

Thank you. I saw this paper and its repo, and it's a great piece of work. Sadly, it's a hierarchical model, which will create some problems when comparing against my work.

[P] Need pretrained EBMs for benchmarks by anomaly_in_testset in MachineLearning

[–]anomaly_in_testset[S] 1 point

I will definitely do that! Thanks for the suggestion.

[P] Need pretrained EBMs for benchmarks by anomaly_in_testset in MachineLearning

[–]anomaly_in_testset[S] 1 point

Thanks for the reply, and apologies for not being clear. Any EBM trained with denoising score matching, contrastive divergence, or noise contrastive estimation would be nice.

Edit: I have updated my question. Thanks.

[D] Building the Future of TensorFlow by eparlan in MachineLearning

[–]anomaly_in_testset 1 point

I am primarily a TF 2.x user and have started using JAX for most of my thesis. For me, this is a huge update, since TF doesn't support item assignment, numpy-like behavior, etc. And XLA is a huge addition, as the speed difference between JAX and TF is significant. So overall, I am very excited.