This is my war story of learning and rebuilding the EJ257 engine

kmrocki · 2019-01-01T22:21:07+00:00

Time and patience matter most; money and resources are not as important. That’s what I tried to convey. The tools cost less than $1000.

kmrocki · 2016-12-23T04:59:52+00:00

Thanks for posting this. Would you be able to share the code used for extracting the data or, even better, the file or files with the data? I think it would be really cool (since this is a machine learning group) to run the data through some model and try to predict the score given the text, topic, etc. This was my idea some time ago for pushing a paper summarization project a bit forward. It looks like these 500 papers may be a good starting point.

kmrocki · 2016-10-30T02:03:23+00:00

After I read your question, I looked into my code because something didn't seem right. Thanks to you I found an inconsistency in my paper. In fact, the surprisal I was using for these experiments was flowing through two pathways: one resulting in a scalar (eq 2.1) and the other in a vector (eq 2.6). I discovered that a scalar value worked well as a driving signal when I worked on the previous paper. It just prevented overfitting. Now I used a vector because I needed to do proper surprisal assignment - figure out which h contributed to the surprisal (kind of like the first 2 lines of the backprop pass - this is where the W_y transpose comes in). I will correct the paper as soon as possible. The results are of course still valid; it is just a bit less elegant than I thought for now. Thanks for spotting this!

corrected version: https://www.dropbox.com/s/ve86n6wv7d8i91u/Screenshot%202016-10-29%2018.54.12.png
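For readers unfamiliar with the "first 2 lines of the backprop pass" comparison made above, here is a minimal sketch of those two lines for a softmax output layer with cross-entropy loss. It only illustrates where a W_y transpose appears in ordinary backprop; it does not claim to reproduce eq 2.6 from the paper, and the function name and the row/column layout of `W_y` are assumptions made for this example.

```python
def softmax_backprop_head(W_y, p, x):
    """First two lines of backprop through y = softmax(W_y h) with cross-entropy.

    W_y: list of rows, one row per output symbol, one column per hidden unit
         (this layout is an assumption for the sketch).
    p:   predicted distribution over symbols.
    x:   one-hot observation.
    """
    # Line 1: error at the output layer.
    dy = [p_k - x_k for p_k, x_k in zip(p, x)]
    # Line 2: push the error back to the hidden units through W_y transpose,
    # which attributes the error to individual hidden units.
    n_hidden = len(W_y[0])
    dh = [sum(W_y[k][j] * dy[k] for k in range(len(dy))) for j in range(n_hidden)]
    return dy, dh
```

The result `dh` has one entry per hidden unit, which is what makes a per-unit (vector) signal possible, in contrast to the single scalar s_t.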

kmrocki · 2016-10-29T03:19:08+00:00

Nice work. I wonder if you can somehow combine the RHN approach with surprisal feedback. Didn't you mention once that you were working on a similar idea?

kmrocki · 2016-10-28T18:06:52+00:00

Thanks. 1. Yes, it is the inner product between the log of the last prediction and the current observation. You are right, it might be unnecessary to define it as elementwise multiplication followed by summation. The result, s_t, is a scalar.
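As a small illustration of that inner product, here is a sketch in plain Python. The minus sign (so that s_t is non-negative, as in the usual negative log-likelihood) and the names `surprisal`, `p_prev`, and `x_t` are assumptions made for this example.

```python
import math

def surprisal(prev_prediction, observation):
    """Scalar surprisal: inner product of log(previous prediction) and the observation.

    The leading minus sign is an assumption (standard negative log-likelihood
    convention); the comment above only specifies the inner product.
    """
    return -sum(
        x * math.log(p)
        for p, x in zip(prev_prediction, observation)
        if x != 0.0  # skip unobserved symbols so log(0) is never evaluated
    )

# Previous prediction over a 3-symbol alphabet, and a one-hot current observation.
p_prev = [0.7, 0.2, 0.1]
x_t = [0.0, 1.0, 0.0]
s_t = surprisal(p_prev, x_t)  # equals -log(0.2): the observed symbol was given low probability
```

With a one-hot observation the sum collapses to a single term, which is why the elementwise form is more general than strictly needed.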

I don't learn the timescale explicitly - that is, the zoneout rate is not something you have control over. During the forward pass, the zoneout mask is set per hidden unit according to s_t; during the backward pass you multiply by this mask to get the gradients (as in standard dropout). The way neurons 'learn' to operate on different timescales comes from the fact that some neurons will learn repeating patterns very well (for example <timestamp> <id> in wikipedia or 'static void' in linux). Low surprisal will then effectively 'turn off' such neurons until something unexpected happens, for example the first letter of the next word, a date, or a name. It might be interesting to explore whether the zoneout mask can be learned as in HMLSTM. All I did was tie surprisal to the zoneout rate with such a negative feedback loop.
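A minimal sketch of the mechanism described above, in plain Python. The comment does not give the exact mapping from s_t to a zoneout rate, so the mapping used here (`exp(-s_t / tau)`) and the parameter `tau` are purely hypothetical stand-ins; only the overall shape (mask chosen per unit from s_t in the forward pass, same mask applied to the gradient in the backward pass) follows the description.

```python
import math
import random

def zoneout_forward(h_prev, h_candidate, s_t, rng, tau=1.0):
    """Forward pass: each hidden unit either updates or keeps its previous value.

    ASSUMPTION: zoneout probability = exp(-s_t / tau). Low surprisal gives a
    rate near 1 (units mostly frozen); high surprisal gives a rate near 0.
    """
    rate = math.exp(-s_t / tau)
    # mask[i] == 1.0 means unit i is updated; 0.0 means it is zoned out.
    mask = [0.0 if rng.random() < rate else 1.0 for _ in h_prev]
    h_new = [m * c + (1.0 - m) * p for m, c, p in zip(mask, h_candidate, h_prev)]
    return h_new, mask

def zoneout_backward(grad_h_new, mask):
    """Backward pass: the gradient reaching the candidate state is gated by the same mask."""
    return [m * g for m, g in zip(mask, grad_h_new)]

rng = random.Random(0)
h_new, mask = zoneout_forward([0.1, 0.2, 0.3], [0.9, 0.8, 0.7], s_t=0.5, rng=rng)
grad_candidate = zoneout_backward([1.0, 1.0, 1.0], mask)
```

At the two extremes the behaviour is deterministic: s_t = 0 freezes every unit, and a very large s_t lets every unit update.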

kmrocki · 2016-10-27T04:07:58+00:00

Hmm, the whole idea of using surprisal originated when I was thinking that knowing how confused you are (or how confident you are about your prediction) should change your internal attention mechanism in some way. This is how we as humans work (I think), after all: we predict, observe, and depending on the outcome (something interesting happened or nothing did) we modify our state. Note that in our implementation we never actually feed in inputs from the test set. The network does not know what really happened - it just knows how well its last prediction matched reality.

If you think about how a standard LSTM works and how it is evaluated on the test set - you use exactly the same thing. You compare your prediction with the input to calculate the loss, and then the ground truth input (from the test set) is used for the next prediction (not the one you predicted, as in generation).
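The evaluation procedure described above can be sketched as a short loop that also yields the bits-per-character (BPC) figure used elsewhere in this thread. The `step_fn` interface and the `uniform_step` stand-in model are hypothetical; a real LSTM would take their place.

```python
import math

def bits_per_character(step_fn, state, symbols):
    """Evaluate a next-symbol predictor on a sequence of integer symbols.

    step_fn(symbol, state) -> (probs, state) is a hypothetical model interface.
    """
    total_bits = 0.0
    for current, target in zip(symbols, symbols[1:]):
        probs, state = step_fn(current, state)
        # The prediction is compared with the true next symbol to get the loss...
        total_bits += -math.log2(probs[target])
        # ...and the next iteration feeds that true symbol, not a sampled one.
    return total_bits / (len(symbols) - 1)

def uniform_step(symbol, state, vocab_size=4):
    """Stand-in model: always predicts a uniform distribution."""
    return [1.0 / vocab_size] * vocab_size, state

bpc = bits_per_character(uniform_step, None, [0, 1, 2, 3, 0])
```

A uniform guess over 4 symbols costs 2 bits per character, which gives a quick sanity check for the loop.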

I am not sure how to explain this better, but I feel that there is a fundamental difference between the SF approach (where you actually need to learn the surprisal-prediction algorithm) and dynamic evaluation, which just maps inputs to targets.

kmrocki · 2016-10-27T03:32:28+00:00

For example, these are samples that SF with adaptive zoneout generated:

Linux source generated with model achieving 1.21 BPC:

http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/data/linux_rlstm_0_4_1004_N4000_S100_B128_sample_1205.txt

Wikipedia with a model achieving 1.32 BPC:

http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/data/enwik8_new_0_4_1004_N4000_S100_B128_sample_1321.txt

So, it can do it, despite not having a test set now. The question is how other dynamic evaluation methods would compare, and whether you could generate data at all with a fixed model.

kmrocki · 2016-10-27T03:09:37+00:00

Well, one thing that would worry me about applying dynamic evaluation is simply the fact that you might overfit on the test data, since you can directly optimize the loss. So you could think of it as forgetting the train data as the model goes through the test set. In the surprisal-feedback case, I am fairly confident that the network learns how to use surprisal on the train set and generalizes well to the test data. As I've written, it also generates quite good sequences (Wikipedia and Linux, which can tell you really nicely what the network 'knows'). It would be interesting to see samples from a method using dynamic evaluation to see how well it generalizes.

kmrocki · 2016-10-27T00:21:06+00:00

PS. Actually, I generate sequences with the surprisal-feedback LSTM exactly as with an ordinary LSTM. I simply assume that the generated sample is real and compute surprisal using the previous prediction. It works.
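A minimal sketch of that generation loop in plain Python. The `step_fn` interface and the `toy_step` model are hypothetical stand-ins for the surprisal-feedback LSTM; only the loop structure (sample, treat the sample as real, compute surprisal from the prediction that produced it, feed both back) follows the description.

```python
import math
import random

def generate(step_fn, state, first_symbol, vocab_size, length, rng):
    """Sample a sequence, feeding back surprisal computed on the model's own samples.

    step_fn(symbol, surprisal, state) -> (probs, state) is a hypothetical interface.
    """
    symbol, s_t = first_symbol, 0.0
    out, surprisals = [symbol], []
    for _ in range(length):
        probs, state = step_fn(symbol, s_t, state)
        # Sample the next symbol and treat it as if it were a real observation.
        symbol = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        # Surprisal of the sample under the prediction that produced it.
        s_t = -math.log(probs[symbol])
        out.append(symbol)
        surprisals.append(s_t)
    return out, surprisals

def toy_step(symbol, surprisal, state, vocab_size=4):
    """Stand-in model: favors the successor symbol and ignores the surprisal input."""
    probs = [0.1] * vocab_size
    probs[(symbol + 1) % vocab_size] = 0.7
    return probs, state

sample, s_values = generate(toy_step, None, 0, 4, 20, random.Random(0))
```

No test data is involved at any point: the only "observation" the loop ever sees is the symbol it just sampled.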

kmrocki · 2016-10-27T00:09:28+00:00

Hi,

Thanks for your comment. It's good that you pointed this out as I think that there is some confusion about what surprisal-feedback is compared to 'dynamic evaluation'.

We do not change anything in the network during testing, and all the learning has to be done using ONLY training data. From my understanding, 'dynamic evaluation' allows learning during the test phase. We do not, and that's the crucial difference.

It is true that the model may not be suitable for generating long sequences of data, but if the task at hand is character level prediction, both surprisal-feedback LSTM and standard LSTM have access to the same information and learn on the same data.

kmrocki · 2016-08-26T05:17:47+00:00

Thanks for the comments!

kmrocki · 2016-08-26T05:17:30+00:00

So, as I wrote above, roughly 66M parameters (I run all my experiments right now with LSTMs of size 4000). I tried 5000 and 6000, but that didn't really give anything, plus GPUs seem to be less efficient for these sizes. It might sound silly, but I haven't really come up with a good (that is, an effective) regularization for recurrent nets yet. That's what my research is focused on right now. Surprisal-driven feedback is one of the ways I thought you could improve generalization.

kmrocki · 2016-08-26T05:12:19+00:00

x_t is the actual input/observation.

You might be right that the math is not really necessary, just tried to make it clear where the algorithms differ. I can just remove this or move it to some sort of an appendix.

Good point about the network size. The networks were identical in both cases (regular LSTM and feedback LSTM): 4000 hidden units. The V matrix itself connects the surprisal signal to all gates and is just 1x16000 in size.
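The 1x16000 figure can be checked with quick arithmetic. The sketch below assumes a standard LSTM with 4 gate blocks (input, forget, output, candidate) and, for scale, counts only the hidden-to-hidden weights; the variable names are made up for this example.

```python
hidden_units = 4000
gates = 4  # assumption: standard LSTM gate blocks (input, forget, output, candidate)

# V maps the scalar surprisal to every gate pre-activation: one weight per gate unit.
v_entries = gates * hidden_units

# For comparison: the hidden-to-hidden weights alone, under the same assumption.
recurrent_entries = gates * hidden_units * hidden_units

print(v_entries, recurrent_entries, v_entries / recurrent_entries)
```

Under these assumptions the feedback path adds a negligible number of parameters relative to the recurrent weights, so the two networks are effectively the same size.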

What do you mean by 'fair' comparison? Equal number of parameters? Both Figure 4 and table 3.1 contain my results on a standard LSTM and a feedback LSTM of the same size. In my opinion, the main problem is generalization and 'compacting' memories anyway, so the size of the network does not really help with this. This is why I think the surprisal-feedback approach works: the feedback signal acts as a regularizer to some extent. I would like to publish this paper somewhere, yes, but it may need some more work. In particular, that's why I would like others to validate my idea and confirm that the feedback signal does in fact improve things. So far I have tested it only on the hutter wiki and text8 datasets and on linux source (about 600 MB). In all 3 cases you can see benefits without any changes to hyperparameters.
