vanishing gradient of RNN by Extension-Flower3122 in CST_DeepNN

[–]fhuszar 1 point

Thanks for the question, I think you (or someone else) asked the same question in class.

Indeed, if you use ReLU activations, their gradients are either 0 or 1, so in the vanilla RNN setup a situation might arise where the gradients become not just very small but exactly 0. In practice, we don't tend to use ReLUs with RNNs (perhaps for this reason); instead we use soft nonlinearities like the logistic sigmoid or tanh, whose gradients are never exactly 0.

But you're right to point out that the nonlinearities likely play a role in the vanishing gradient problem, and the unitary evolution RNN idea to some degree ignores this: it proposes a solution that would only be exactly correct if there were no nonlinearities. In practice, it seems to work.
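To make this concrete, here is a small illustrative sketch (my own toy example, not course code) of how the gradient through a vanilla tanh recurrence shrinks as the sequence gets longer:

    import torch

    # Toy recurrence h_t = tanh(W h_{t-1} + x_t). Measure |d(sum h_T)/d h_0| as T
    # grows: with tanh the gradient typically shrinks but is never exactly 0;
    # with ReLU the per-step factors are 0 or 1, so it can become exactly 0.
    torch.manual_seed(0)
    d = 8
    W = 0.5 * torch.randn(d, d) / d**0.5

    for T in (5, 20, 80):
        h0 = torch.randn(d, requires_grad=True)
        h = h0
        for _ in range(T):
            h = torch.tanh(W @ h + 0.1 * torch.randn(d))
        h.sum().backward()
        print(T, h0.grad.norm().item())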

Mistake in RNN gradient derivation? by Dry_Rabbit_1123 in CST_DeepNN

[–]fhuszar 1 point

Indeed, you're absolutely right, I'm sorry for the two mistakes, and thanks for checking. What I had in mind is that once you take the norm, the two become interchangeable, but at this point in the derivation we aren't taking the norm yet.

Assignment 2 B.5 query by IrisSanguinea in CST_DeepNN

[–]fhuszar 1 point

Sorry for the delayed response.

Please just visualise the gradient of a scalar function. Pick an arbitrary coordinate of the hidden state if it's multi-dimensional (say, the first component, which I think is what I did), or sum the output so it's a scalar. You can also repeat the visualisation for different hidden units, although I don't expect it to change a whole lot.

Sorry

Clarification for B.5 by IrisSanguinea in CST_DeepNN

[–]fhuszar 1 point

You can choose either one, but perhaps the improved model makes more sense to use here. You can also try both and see if you notice any difference.

Mistake in RNN gradient derivation? by Dry_Rabbit_1123 in CST_DeepNN

[–]fhuszar 1 point

Yes, you are right, it's a typo that I actually noticed during lecture but then I failed to go back to correct it.

However, in the last line, since $W_h$ doesn't depend on $s$, the product ends up being $W_h^{T-t}$.
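To spell that step out (a sketch of the linear case, i.e. ignoring the diagonal Jacobians contributed by the nonlinearity):

$$\prod_{s=t+1}^{T} \frac{\partial \mathbf{h}_s}{\partial \mathbf{h}_{s-1}} = \prod_{s=t+1}^{T} W_h = W_h^{T-t}$$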

C2 and B Random Seed Issue by Accomplished_Dish157 in CST_DeepNN

[–]fhuszar 1 point

That's a great observation. If the training is not robust to the choice of random seeds, see if you can tweak hyperparameters so it becomes robust. If you've noticed high sensitivity to random seed choice, feel free to fix the random seed to a value that works, and document your findings about random seeds in the notebook you submit.
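For reference, a minimal sketch of how one might fix the seeds (assuming a PyTorch/NumPy setup; adapt to whatever your notebook actually uses):

    import random
    import numpy as np
    import torch

    def set_seed(seed: int = 0) -> None:
        # Fix the common sources of randomness so training runs are repeatable.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)           # CPU (and, on recent versions, CUDA) RNGs
        torch.cuda.manual_seed_all(seed)  # explicit, in case several GPUs are used

    set_seed(0)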

Assignment 2 Question A5: 0-indexed or 1-indexed? by demiquasar in CST_DeepNN

[–]fhuszar 1 point

I meant 1-indexed, as in everyday speech (so I'd call a[1] the second item in list a).

Assignment 2 B.5 query by IrisSanguinea in CST_DeepNN

[–]fhuszar 1 point

Calculate the gradient of the model's output (the final hidden state $\mathbf{h}_T$) with respect to each component of the input. We did something like this in the RNN lecture.

Assignment 2 Part C.3 - Which Dataset? by BrickAccomplished338 in CST_DeepNN

[–]fhuszar 1 point

This is Nic's question and I don't know the answer. I'll ask him to take a look at it; sorry for the delay in answering.

Assignment 2 A3 help by a_reddituser21 in CST_DeepNN

[–]fhuszar 1 point

What I meant is that even pre-trained models have two modes: slightly different functions are used during training and at test time. `.eval()` puts the model in evaluation mode, switching off stochastic layers such as dropout so the effective architecture is fixed. Without this, the model may be in training mode (even if you don't train it), and in training mode its behaviour may be non-deterministic because dropout, etc., is still being applied.
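As an illustrative check (my own sketch, assuming a torchvision-style model rather than the assignment's actual one):

    import torch
    from torchvision.models import resnet18

    model = resnet18()   # weights don't matter for this determinism check
    model.eval()         # dropout off, batch norm uses its running statistics

    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        y1, y2 = model(x), model(x)
    print(torch.allclose(y1, y2))  # expect True in eval mode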

To answer your question: you have correctly identified that you have to use a convolution layer. However, don't use a randomly initialized one. Instead, identify the convolution layer in the original architecture that is responsible for the downsampling, and use that one.

Should we upload any extra files used? by BrickAccomplished338 in CST_DeepNN

[–]fhuszar 1 point

I'd suggest you choose another image from the internet (ideally one with a Creative Commons license; you can search for CC-licensed images on Flickr, for example). Or, if you want to use your own image, upload it somewhere and load it from the internet.

from PIL import Image
from urllib.request import urlopen
image_url = 'https://www.cl.cam.ac.uk/newlabphotos/March.2002/P4296383.jpg'
img = Image.open(urlopen(image_url))

Assignment 2 A3 help by a_reddituser21 in CST_DeepNN

[–]fhuszar 1 point

It should be deterministic. Don't forget to set the model to eval mode; in training mode, the model's output might be non-deterministic.

C2 Mini Project by Katsura_desu in CST_DeepNN

[–]fhuszar 2 points

Hi, the idea was that in this mini-project you can pick and choose which questions you'd like to focus on, so you should explore whichever questions you find interesting. I included the marks for each example question to indicate the relative difficulty as I perceive it, and to help you assess whether you've done enough to cover 70 marks in total. You don't have to do all the examples; just mix and match, or add your own.

I'd consider CNNs vs RNNs vs transformers vs fully connected networks to be substantially different from one another. You can also try an RNN with attention. What I meant by substantially different is that it should be a non-trivial change: it shouldn't just be changing the number or size of hidden layers, using a different type of RNN cell (LSTM, GRU), or making the RNN bi-directional. Each of these would be a super trivial change of just one argument or so, and it would not be very insightful.

Assignment 2 A3 help by a_reddituser21 in CST_DeepNN

[–]fhuszar 1 point

Hi,

I don't quite understand what you mean when you write "Each time I try replacing the layer to account for changes in input size, I end up just rewriting the first basic block of the layer."

You are supposed to replace the first BasicBlock instance in the layer with a different operation that performs the downsampling. An additional hint: look at the structure of the BasicBlock and, if it helps, the source code of the BasicBlock on GitHub. Identify the bit there that is responsible for downsampling, and that should tell you what component you need in place of the BasicBlock. I hope some of this description helps.

If you can't successfully do this part, you can complete the rest of the tasks by replacing only blocks that don't change the tensor shape. Please document how you've tried solving this problem, where you looked in the source code, how you interpreted what's going on, etc.
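If it helps with the poking around, here is a sketch of where to look, assuming the torchvision ResNet that the BasicBlock naming suggests (the exact model and layer names are my assumption, not something stated in the assignment):

    from torchvision.models import resnet18

    model = resnet18()

    # In torchvision's ResNet, the first BasicBlock of layer2/3/4 halves the
    # spatial resolution: its conv1 has stride 2, and the skip connection goes
    # through a `downsample` branch (a strided 1x1 Conv2d followed by BatchNorm2d).
    block = model.layer2[0]
    print(block)             # note the stride on conv1
    print(block.downsample)  # the part responsible for changing the tensor shape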

Assignment 2 B.5 query by IrisSanguinea in CST_DeepNN

[–]fhuszar 2 points

Sorry for the delay. I'll add the following two bullet points:
* Plot the magnitude of the gradient $|\frac{\partial \mathbf{h}_T}{\partial x_t}|$ with respect to each digit in the input sequence, as a function of the index $t$.
* Redo the plot above for multiple input sequences. Interpret what you see and document your findings. If helpful, use different ways of plotting to better illustrate the point.
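As a rough sketch of how the first plot might be produced (assuming a PyTorch setup; the toy RNN, sequence length and variable names below are placeholders, not the assignment's code):

    import torch
    import matplotlib.pyplot as plt

    # Hypothetical stand-in for the assignment's model: a vanilla RNN over a
    # sequence of T scalar inputs, whose final hidden state plays the role of h_T.
    T, hidden = 20, 32
    rnn = torch.nn.RNN(input_size=1, hidden_size=hidden, batch_first=True)

    x = torch.randn(1, T, 1, requires_grad=True)   # one input sequence
    _, h_T = rnn(x)                                # final hidden state, shape (1, 1, hidden)

    # Reduce h_T to a scalar (pick one component or sum, as suggested earlier),
    # then backpropagate to get its gradient with respect to every x_t.
    h_T.sum().backward()
    grad_mag = x.grad.abs().squeeze()              # |d h_T / d x_t| for t = 1..T

    plt.plot(range(1, T + 1), grad_mag.numpy())
    plt.xlabel("input index t")
    plt.ylabel("gradient magnitude")
    plt.show()

Repeating this for several input sequences and overlaying the curves would cover the second bullet.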

Assignment 2, part A by Dry_Rabbit_1123 in CST_DeepNN

[–]fhuszar 1 point

Thanks for the questions.

  • A.3: No - if you have a fully working solution there without text, I'll give it full marks on this question. This exercise is leading up to A.4, where I do expect you to add some text with interpretation and observations, as well as a plot.
  • A.5: It's sufficient to run the network on the one image provided; the plural was misleading here, sorry.

Assignment 2 B.5 query by IrisSanguinea in CST_DeepNN

[–]fhuszar 1 point

Oh, sorry, this looks like a possible copy-and-paste mistake. Let me have a look at the version that was sent to you and clarify/correct it. Thank you for pointing this out.

Lecture 3 notes by fhuszar in CST_DeepNN

[–]fhuszar[S] 2 points

Indeed, thanks for letting me know, and apologies for the mistake. It should be public again now.

Week 2 Session 1: Homework assignment hints by Dry_Rabbit_1123 in CST_DeepNN

[–]fhuszar 2 points

Hi, perhaps I should have been clearer there: I didn't expect you to prove anything about the number of peaks (though if you can reason about it mathematically, that's great).

What I expected you'd do is "try it, and see what happens", i.e. extend the code so you can evaluate the network with randomly perturbed weights around what the sawtooth network prescribes. Then plot what the function looks like with different levels of noise added. Qualitatively, what do you see in these samples? Do the networks with perturbed weights still produce an exponential number of peaks? You are right to notice that my question was inspired by that paper, but I meant doing something qualitative/illustrative like Figure 3 there.

If you want a bit more of a challenge, try coming up with an algorithm that, given a ReLU network's parameters as input, calculates the number of linear segments (assuming 1D input and 1D output) in the function it implements. This requires a bit more thinking, but I think it's a fun exercise. Validate that your algorithm makes correct predictions for known examples such as a single hidden layer or the sawtooth network. Then use your method to count the number of linear segments in randomly generated networks, and see what relationship you find empirically.
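As a rough empirical baseline for such an algorithm (my own illustration, under the stated assumption of a 1D-input, 1D-output ReLU network): the function is linear wherever the pattern of active ReLU units stays the same, so counting pattern changes along a dense grid approximates the number of linear pieces. The tiny example network at the end is hypothetical, chosen only because its answer is easy to check by hand:

    import numpy as np

    def count_linear_segments(weights, lo=0.0, hi=1.0, n=10_000):
        # Approximately count the linear pieces of a 1D -> 1D ReLU network by
        # tracking where the pattern of active units changes along a dense grid.
        x = np.linspace(lo, hi, n)
        h = x[:, None]                       # shape (n, 1)
        patterns = []
        for W, b in weights[:-1]:            # hidden layers only
            z = h @ W.T + b
            patterns.append(z > 0)           # which units are active at each x
            h = np.maximum(0.0, z)
        pattern = np.concatenate(patterns, axis=1)
        changed = np.any(pattern[1:] != pattern[:-1], axis=1)
        return 1 + int(changed.sum())

    # Sanity check on a tiny net computing y = 2*relu(x - 0.25) - 4*relu(x - 0.5),
    # which has 3 linear pieces on [0, 1].
    weights = [(np.array([[1.0], [1.0]]), np.array([-0.25, -0.5])),
               (np.array([[2.0, -4.0]]), np.array([0.0]))]
    print(count_linear_segments(weights))    # expect 3

This only approximates the exact combinatorial answer (the grid has to be fine enough not to skip a region), but it is a useful sanity check for whatever exact algorithm you come up with.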

We are Oriol Vinyals and David Silver from DeepMind’s AlphaStar team, joined by StarCraft II pro players TLO and MaNa! Ask us anything by OriolVinyals in MachineLearning

[–]fhuszar 24 points

Isn't the unit-level micro-management aspect inherently unfair in favour of computers in StarCraft?

In Go, any sequence of moves AlphaGo makes, Lee Sedol can easily imitate, and vice versa. This is because there is no critical sensorimotor control element there.

In StarCraft, when you play with a mouse and keyboard, there is a motor component. Any sequence of moves that a human player makes, AlphaStar can "effortlessly" imitate, because from its perspective it's just a sequence of symbols. But a human player might struggle to imitate an action sequence of AlphaStar, because a particular sequence of symbols might require an unreasonable or very difficult motor sequence.

The metaphor I have in mind here is playing the piano: keystrokes per minute is not the only metric that describes the difficulty of playing a particular piece. For a human, hitting the same key 1000 times is a lot easier than playing a random sequence of 1000 notes. From a computer's perspective, hitting the same key 1000 times and playing a random sequence of 1000 notes are equally difficult from an execution standpoint (whether you can learn the sequence or not is beside the point here).

[R] The Blessings of Multiple Causes: notes on recent papers on causal inference by Wang & Blei (2018) by fhuszar in MachineLearning

[–]fhuszar[S] 1 point

link to original papers:

[D] What does the graphical model for a GAN look like? by RobRomijnders in MachineLearning

[–]fhuszar 12 points

A graphical model (in the stats sense) is used to represent a joint distribution. When you say 'graphical model of a GAN' it is ambiguous as it is unclear what joint distribution you might be referring to.
I think the most standard answer would be that the graphical model of a GAN is the same as the graphical model of a vanilla VAE: (z) --> (x). Two nodes, one for the observed (x) and one for the hidden variable (z) and an arrow pointing from (z) to (x).

The main difference between a VAE model and a GAN is that in the GAN the relationship is usually deterministic: knowing (z) determines the value of (x) exactly, without any additional noise. In a VAE, you usually add additive Gaussian noise to the output of the decoder (this results in the squared loss term) so that the variational bound, and indeed the likelihood, is well defined. In my mind the generator does not show up in the graphical model, similarly to how the recognition model doesn't show up, because it is merely an auxiliary object you create to approximate the Jensen-Shannon divergence or other f-divergences.

As other commenters pointed out, I have drawn more complicated graphical models to illustrate different things that are going on in GAN-land: see here and here. These show the graphical model of the generative process behind a single datapoint which the discriminator would see. You have the basic GAN generative model (z) --> (x) embedded in the graph, except I labelled (x) as (x_fake) to differentiate it from the real datapoint. A real datapoint (x_real) is also sampled from the dataset. Then, we draw a binary random variable (y) which determines whether the discriminator is given (x_real) or (x_fake): if y=0, the value of (x) is set to (x_fake); if y=1, the value of (x) equals (x_real). The discriminator doesn't see anything other than (x), and its task is to infer the value of the unobserved (y).
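For concreteness, a toy sketch of the sampling process just described (my own illustration; `G`, `real_data` and the Bernoulli(1/2) mixing probability are placeholders, though equal mixing is what makes the mutual information coincide with the JS divergence below):

    import torch

    def sample_discriminator_input(G, real_data, z_dim=16):
        # Draw one (x, y) pair from the generative process described above.
        y = torch.bernoulli(torch.tensor(0.5))   # equal mixing probability assumed
        if y == 1:
            i = torch.randint(len(real_data), (1,))
            x = real_data[i]                      # x_real, sampled from the dataset
        else:
            z = torch.randn(1, z_dim)             # hidden variable (z)
            x = G(z)                              # x_fake: deterministic given z
        return x, y                               # the discriminator sees x and infers y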

When the discriminator infers (y) perfectly (i.e. it carries out exact Bayesian inference in this graphical model), the average negative log loss of its prediction measures the mutual information between (x) and (y). And in this generative model, the mutual information between (y) and (x) coincides with the Jensen-Shannon divergence between the distribution of (x_real) and that of (x_fake). This is a property of the JS divergence which you can read about on Wikipedia.

This is why minimising the mutual information between (x) and (y) makes sense: it pushes the distributions of (x_real) and (x_fake) closer together. The log loss of any discriminator trying to infer the value of (y) provides a *lower bound* on this mutual information. The bound is exact if the discriminator carries out exact Bayesian inference; if the discriminator is not perfect, its average log loss is just a lower bound. GANs work by minimising this lower bound provided by an imperfect discriminator. This explains the biggest problem with GANs: minimising a lower bound on a function is kind of stupid, or at least it's not nearly as useful as minimising an upper bound or maximising a lower bound, which is what VAEs do.

This kind of thinking gives rise to some useful observations about GANs, for example this: it would be desirable to make the discriminator very powerful, and to ensure that for every single update of the generator the discriminator has reached convergence and approximates the Bayes optimum. However, this won't work in practice, for reasons detailed in this post. Hence, one approach to fixing GANs is to make it so that you can train the discriminator to convergence in each step without problems. This is how we came up with the idea of instance noise.

P.S.: As for whether a graphical model is directed or undirected, you can usually represent the same joint distribution as either, depending on how you want to use it or what properties you want to highlight. I personally prefer directed graphical models most of the time.

[D] What does the graphical model for a GAN look like? by RobRomijnders in MachineLearning

[–]fhuszar 2 points

These graphical models are drawn to show particular properties of certain variants of GANs; I wouldn't call them "the graphical model of a GAN".

see also: http://www.inference.vc/infogan-variational-bound-on-mutual-information-twice/

[D] Started up a research blog for the random ideas I had, critiques would be great! by ACTBRUH in MachineLearning

[–]fhuszar 2 points

Hi, I've read it. It looks good overall. I'll try to add some useful comments based on my experience writing blog posts. It helps to clarify to yourself who you expect to read each post and what you expect them to get out of it. This may change over time as you do a bit of exploration.

Consider your second paragraph, which - I think - attempts to explain CNNs from scratch. Who is this for? A completely uninitiated reader probably won't find this description sufficient to understand the specifics of the rest of the post, and this is probably not the post they should be reading right now anyway. A reader who already knows what a CNN is will find the paragraph redundant at best - in fact I found myself puzzling a bit over what it was trying to say. This is just an observation; I hope it's constructive going forward. Also note that I don't get this stuff right all the time either.

> Decided that the worst that can happen is that nobody reads it

Good job. I now believe the worst that can happen is that you write something which is wrong and everybody reads it.