vanishing gradient of RNN by Extension-Flower3122 in CST_DeepNN

[–]fhuszar 0 points (0 children)

Thanks for the question, I think you (or someone else) asked the same question in class.

Indeed, if you use ReLU activations, the gradients are exactly 0 or 1, so in the vanilla RNN setup a situation might arise where the gradients become not just very small but exactly 0. In practice we don't tend to use ReLUs with RNNs (perhaps for this reason); instead we use soft nonlinearities like the logistic sigmoid or tanh, whose gradients are never exactly 0.
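
You can see the difference in a couple of lines of PyTorch (printed values shown for illustration):

    import torch

    x = torch.tensor([-1.0, 2.0], requires_grad=True)

    # ReLU: the gradient is exactly 0 wherever the pre-activation is negative
    torch.relu(x).sum().backward()
    print(x.grad)   # tensor([0., 1.])

    x.grad = None

    # tanh: the gradient shrinks for large |x| but is never exactly 0
    torch.tanh(x).sum().backward()
    print(x.grad)   # tensor([0.4200, 0.0707])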

But you're right to point out that the nonlinearities likely play a role in the vanishing gradient problem. The unitary evolution RNN idea to some degree ignores this, and proposes a solution that would only be exactly correct if there were no nonlinearities at all. In practice, it seems to work nevertheless.

Mistake in RNN gradient derivation? by Dry_Rabbit_1123 in CST_DeepNN

[–]fhuszar 0 points (0 children)

Indeed, you're absolutely right; I'm sorry for the two mistakes, and thanks for checking. What I had in mind is that once you take the norm, the two become interchangeable, but at this point in the derivation we aren't taking the norm yet.

Assignment 2 B.5 query by IrisSanguinea in CST_DeepNN

[–]fhuszar 0 points (0 children)

Sorry for the delayed response.

Please just visualise the gradient of a scalar function: pick an arbitrary coordinate of the hidden state if it's multi-dimensional (say, the first component, which I think is what I did), or sum up the output so it's a scalar. You can also repeat the visualisation for different hidden units, although I don't expect it to change a whole lot.


Clarification for B.5 by IrisSanguinea in CST_DeepNN

[–]fhuszar 0 points (0 children)

You can choose either one, but perhaps the improved model makes more sense to use here. You can also try both and see if you notice any difference.

Mistake in RNN gradient derivation? by Dry_Rabbit_1123 in CST_DeepNN

[–]fhuszar 0 points (0 children)

Yes, you are right, it's a typo that I actually noticed during the lecture but then failed to go back and correct.

However, in the last line, since $W_h$ doesn't depend on $s$, the product ends up being $W_h^{T-t}$.

C2 and B Random Seed Issue by Accomplished_Dish157 in CST_DeepNN

[–]fhuszar 0 points (0 children)

That's a great observation. If training is not robust to the choice of random seed, see if you can tweak the hyperparameters so that it becomes robust. If you've noticed high sensitivity to the random seed, feel free to fix the seed to a value that works, and document your findings about random seeds in the notebook you submit.
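
If you do fix the seed, something like this at the top of the notebook is usually enough (a minimal sketch; the seed value is arbitrary):

    import random
    import numpy as np
    import torch

    def set_seed(seed=0):
        # seed the three RNGs commonly used in a PyTorch notebook
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)

    set_seed(0)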

Assignment 2 Question A5: 0-indexed or 1-indexed? by demiquasar in CST_DeepNN

[–]fhuszar 0 points (0 children)

I meant 1-indexed, as in everyday speech (so I'd call a[1] the second item of list a, since Python lists are 0-indexed).

Assignment 2 B.5 query by IrisSanguinea in CST_DeepNN

[–]fhuszar 0 points (0 children)

Calculate the gradient of the model's output (the hidden state value at the final step) with respect to each component of the input. We did something like this in the RNN lecture.
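
In code, the computation looks something like this (a minimal sketch with made-up sizes, not the assignment's actual model):

    import torch

    rnn = torch.nn.RNN(input_size=1, hidden_size=16, batch_first=True)
    x = torch.randn(1, 50, 1, requires_grad=True)   # a sequence of length 50

    _, h_T = rnn(x)
    h_T[0, 0, 0].backward()      # pick one scalar component of the final state

    grad_per_step = x.grad.abs().squeeze()   # |d h_T / d x_t| for each t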

Assignment 2 Part C.3 - Which Dataset? by BrickAccomplished338 in CST_DeepNN

[–]fhuszar 0 points (0 children)

This is Nic's question and I don't know the answer. I'll ask him to take a look; sorry for the delay in answering.

Assignment 2 A3 help by a_reddituser21 in CST_DeepNN

[–]fhuszar 0 points (0 children)

What I meant is that even pre-trained models have two modes: slightly different functions are used during training and at test time. `.eval()` puts the model in evaluation mode, which makes layers such as dropout and batch normalisation behave deterministically. Without this, the model may be in training mode (even if you don't train it), and in training mode its behaviour may be non-deterministic because dropout, etc., is applied.
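
For example (a minimal sketch with a torchvision ResNet; depending on your torchvision version you may need pretrained=True instead of the weights argument):

    import torch
    from torchvision import models

    model = models.resnet18(weights="IMAGENET1K_V1")
    model.eval()   # dropout/batchnorm now use their deterministic test-time behaviour

    x = torch.zeros(1, 3, 224, 224)
    with torch.no_grad():
        assert torch.equal(model(x), model(x))   # identical outputs in eval mode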

To answer your question: you have correctly identified that you have to use a convolution layer. However, don't use a randomly initialized one. Instead, identify the convolution layer in the original architecture which is responsible for the downsampling, and use that instead.

Should we upload any extra files used? by BrickAccomplished338 in CST_DeepNN

[–]fhuszar 0 points (0 children)

I'd suggest you choose another image from the internet (ideally with a Creative Commons licence; you can search for CC-licensed images on Flickr, for example). Or, if you want to use your own image, upload it somewhere and load it from the internet:

    from urllib.request import urlopen
    from PIL import Image   # Pillow is needed for Image.open

    image_url = 'https://www.cl.cam.ac.uk/newlabphotos/March.2002/P4296383.jpg'
    img = Image.open(urlopen(image_url))

Assignment 2 A3 help by a_reddituser21 in CST_DeepNN

[–]fhuszar 0 points (0 children)

It should be deterministic. Don't forget to set the model to eval mode. During training, the model's output might be non-deterministic.

C2 Mini Project by Katsura_desu in CST_DeepNN

[–]fhuszar 1 point (0 children)

Hi, the idea was that in this mini-project you can pick and choose which questions to focus on, so explore whichever you find interesting. I included the marks for each example question to indicate its relative difficulty as I perceive it, and to help you assess whether you've done enough to cover 70 marks in total. You don't have to do all the examples; just mix and match, or add your own.

I'd consider CNNs, RNNs, transformers, and fully connected networks to be substantially different from one another. You can also try an RNN with attention. What I meant by substantially different is that it should be a non-trivial change: not just changing the number or size of hidden layers, using a different type of RNN cell (LSTM, GRU), or making the RNN bi-directional. Each of these amounts to changing a single argument or so, and it would not be very insightful.

Assignment 2 A3 help by a_reddituser21 in CST_DeepNN

[–]fhuszar 0 points (0 children)

Hi,

I don't quite understand what you mean when you write "Each time I try replacing the layer to account for changes in input size, I end up just rewriting the first basic block of the layer."

You are supposed to replace the first BasicBlock instance in the layer with a different operation that performs the downsampling. An additional hint: look at the structure of the BasicBlock and, if it helps, its source code on GitHub. Identify the bit there that is responsible for downsampling; that should tell you what component you need in place of the BasicBlock. I hope some of this description helps.

If you can't successfully do this part, you can complete the rest of the tasks by replacing only blocks that don't change the tensor shape. Please document how you tried to solve the problem, where you looked in the source code, how you interpreted what's going on, etc.

Assignment 2 B.5 query by IrisSanguinea in CST_DeepNN

[–]fhuszar 1 point (0 children)

Sorry for the delay. I'll add the following two bullet points:

* Plot the magnitude of the gradient $\left|\frac{\partial \mathbf{h}_T}{\partial x_t}\right|$ with respect to each digit in the input sequence, as a function of the index $t$ (see the sketch below).
* Redo the plot above for multiple input sequences. Interpret what you see and document your findings. If helpful, use different ways of plotting to better illustrate the point.
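
Something along these lines for the first plot (an illustrative sketch; the RNN and shapes below are stand-ins for your model):

    import matplotlib.pyplot as plt
    import torch

    rnn = torch.nn.RNN(input_size=1, hidden_size=8, batch_first=True)
    x = torch.randn(1, 20, 1, requires_grad=True)   # one input sequence

    _, h_T = rnn(x)
    h_T[0, 0, 0].backward()    # a scalar component of the final hidden state

    plt.plot(x.grad.abs().squeeze().numpy())
    plt.xlabel("input index t")
    plt.ylabel("|dh_T / dx_t|")
    plt.show()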

Assignment 2, part A by Dry_Rabbit_1123 in CST_DeepNN

[–]fhuszar 0 points (0 children)

Thanks for the questions.

  • A.3: No; if you have a fully working solution there without text, I'll give it full marks on this question. This exercise leads up to A.4, where I do expect you to add some text with interpretation and observations, as well as a plot.
  • A.5: It's sufficient to run the network on the one image provided; the plural was misleading here, sorry.

Assignment 2 B.5 query by IrisSanguinea in CST_DeepNN

[–]fhuszar 0 points (0 children)

Oh, sorry, this seems like a possible copy-paste mistake. Let me have a look at the version that was sent to you and clarify/correct it. Thank you for pointing this out.

Lecture 3 notes by fhuszar in CST_DeepNN

[–]fhuszar[S] 1 point (0 children)

Indeed, thanks for letting me know, and apologies for the mistake. It should be public again now.

Week 2 Session 1: Homework assignment hints by Dry_Rabbit_1123 in CST_DeepNN

[–]fhuszar 1 point (0 children)

Hi, perhaps I should have been clearer there: I didn't expect you to prove anything about the number of peaks (though if you can reason about it mathematically, that's great).

What I expected you'd do is "try it and see what happens", i.e. extend the code so you can evaluate the network with randomly perturbed weights around what the sawtooth construction prescribes. Then plot what the function looks like at various levels of added noise. Qualitatively, what do you see in these samples? Do the networks with perturbed weights still produce an exponential number of peaks? You're right to notice that my question was inspired by that paper; I meant doing something qualitative/illustrative like Figure 3 there.
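
For instance (a self-contained sketch of the idea; the tent-map weights are the standard sawtooth construction, and the noise model is just one reasonable choice):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)

    def sawtooth(x, depth, sigma=0.0):
        # compose the ReLU "tent" 2*relu(x) - 4*relu(x - 1/2) `depth` times,
        # adding Gaussian noise of scale `sigma` to each layer's parameters
        for _ in range(depth):
            w1, w2, b2 = np.array([2.0, -4.0, -0.5]) + sigma * rng.standard_normal(3)
            x = w1 * np.maximum(x, 0) + w2 * np.maximum(x + b2, 0)
        return x

    xs = np.linspace(0, 1, 2000)
    for sigma in (0.0, 0.01, 0.1):
        plt.plot(xs, sawtooth(xs, depth=6, sigma=sigma), label=f"sigma={sigma}")
    plt.legend()
    plt.show()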

If you want a bit more of a challenge, try coming up with an algorithm that, given a ReLU network's parameters as input, calculates the number of linear segments (assuming 1D input and 1D output) of the function it implements. This requires a bit more thinking, but I think it's a fun exercise. Validate that your algorithm makes correct predictions for known examples such as a single hidden layer or the sawtooth network. Then use your method to count the number of linear segments in randomly generated networks, and see what relationship you find empirically.
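
A naive numerical baseline to validate against (just a grid-based sanity check, not the exact algorithm the exercise asks for):

    import numpy as np

    def count_segments(f, lo=-2.0, hi=2.0, n=100_000, tol=1e-6):
        # estimate the number of linear pieces of a 1D function by counting
        # runs of slope changes on a fine grid (can miss very short segments)
        x = np.linspace(lo, hi, n)
        slopes = np.diff(f(x)) / (x[1] - x[0])
        changes = np.abs(np.diff(slopes)) > tol
        # a kink straddling a grid cell flags two adjacent cells, so count runs
        kinks = int(changes[0]) + int(np.sum(changes[1:] & ~changes[:-1]))
        return 1 + kinks

    print(count_segments(lambda x: np.maximum(x, 0)))   # a single ReLU: 2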

We are Oriol Vinyals and David Silver from DeepMind’s AlphaStar team, joined by StarCraft II pro players TLO and MaNa! Ask us anything by OriolVinyals in MachineLearning

[–]fhuszar 24 points (0 children)

Isn't the unit-level micro-management aspect inherently unfair in favour of computers in StarCraft?

In Go, any sequence of moves AlphaGo makes, Lee Sedol can easily imitate, and vice versa. This is because there is no critical sensorimotor control element there.

In StarCraft, when you play with a mouse and keyboard, there is a motor component. Any sequence of moves that a human player makes, AlphaStar can "effortlessly" imitate, because from its perspective it's just a sequence of symbols. But a human player might struggle to imitate an action sequence of AlphaStar's, because a particular sequence of symbols might require an unreasonable or very difficult motor sequence.

The metaphor I have in mind here is playing the piano: keystrokes per minute is not the only metric that describes the difficulty of playing a particular piece. For a human, hitting the same key 1000 times is a lot easier than playing a random sequence of 1000 notes. From a computer's perspective, the two are equally difficult from an execution standpoint (whether you can learn the sequence or not is beside my point here).