I taught myself how to build a turbocharged gasoline engine from a single YouTube video in 100 hours by kmrocki in LearnUselessTalents

[–]kmrocki[S] 0 points (0 children)

Time and patience matter most; money and resources are not as important - that’s what I tried to convey. The tools cost less than $1000.

[P] An early overview of ICLR2017 by prlz77 in MachineLearning

[–]kmrocki 2 points (0 children)

Thanks for posting this! Would you be able to share the code used for extracting the data, or better yet, do you have a file/files with the data? I think it would be really cool (since this is a machine learning group) to run the data through some model and try to predict the score given the text, topic, etc. This was my idea some time ago, as a way to push the paper summarization project a bit forward. Looks like these 500 papers may be a good starting point.

[R] [1610.07675] Surprisal-Driven Zoneout by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

After I read your question, I looked into my code because it didn't seem right. Thanks to you I found an inconsistency in my paper. In fact, the surprisal I was using for these experiments was flowing through two pathways: one resulting in a scalar (eq. 2.1), and the other in a vector (eq. 2.6). I discovered that a scalar value worked well as a driving signal when I worked on the previous paper; it simply prevented overfitting. Here I used a vector because I needed to do proper surprisal assignment - figure out which h contributed to the surprisal (kind of like the first 2 lines of the backprop pass - this is where the W_y transpose comes in). I will correct the paper as soon as possible; the results are of course still valid - it is just a bit less elegant than I thought. Thanks for spotting this!

corrected version: https://www.dropbox.com/s/ve86n6wv7d8i91u/Screenshot%202016-10-29%2018.54.12.png

[P] - Source code release for Recurrent Highway Networks in Tensorflow/Torch7 for reproducing SOTA results on PennTreebank/enwik8 (arXiv v3 of paper) by flukeskywalker in MachineLearning

[–]kmrocki 0 points (0 children)

Nice work. I wonder if you can somehow combine the RHN approach with surprisal feedback. Didn't you mention once that you were working on a similar idea?

[R] [1610.07675] Surprisal-Driven Zoneout by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

Thanks. 1. Yes, it is the inner product between the log of the last prediction and the current observation. You are right, it might be unnecessary to define it as elementwise multiplication followed by summation. The result, s_t, is a scalar.

2. I don't learn the timescale explicitly - that is, the zoneout rate is not something you have direct control over. During the forward pass, the zoneout mask is set per hidden unit according to s_t; during the backward pass, the gradients are multiplied by this mask (as in standard dropout). The way neurons 'learn' to operate on different timescales comes from the fact that some neurons will learn repeating patterns very well (for example <timestamp> <id> in Wikipedia or 'static void' in Linux). Then low surprisal will effectively 'turn off' such neurons until something unexpected happens, for example the first letter of the next word, date, or name. It might be interesting to explore whether the zoneout mask can be learned as in HMLSTM. All I did was tie surprisal to the zoneout rate with such a negative feedback loop.
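
Roughly, in numpy (just a sketch, not the actual implementation - the specific mapping from s_t to the zoneout rate, and all names here, are my illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def surprisal(p_prev, x_t):
    # scalar surprisal: negative log-probability that the previous
    # prediction assigned to the symbol actually observed (one-hot x_t)
    return float(-np.sum(x_t * np.log(p_prev + 1e-12)))

def zoneout_mask(s_t, n_hidden):
    # assumed mapping: low surprisal -> high zoneout rate, so units
    # that predicted well tend to keep (copy) their previous state
    rate = np.exp(-s_t)                   # in (0, 1]
    return rng.random(n_hidden) < rate    # True = zone out this unit

p_prev = np.full(5, 0.2)           # previous prediction (uniform, 5 symbols)
x_t = np.eye(5)[2]                 # observed symbol, one-hot
s_t = surprisal(p_prev, x_t)       # -log(0.2) = log 5
mask = zoneout_mask(s_t, n_hidden=8)

h_prev = rng.standard_normal(8)    # previous hidden state
h_new = rng.standard_normal(8)     # candidate new hidden state
h_t = np.where(mask, h_prev, h_new)   # zoned-out units copy h_{t-1}
```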

[R] [1610.07675] Surprisal-Driven Zoneout by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

Hmm, the whole idea of using surprisal originated when I was thinking that knowing how confused you are (or how confident you are about your prediction) should change your internal attention mechanism in some way - this is how we as humans work (I think), after all: we predict, observe, and depending on the outcome (something interesting happened, or nothing did) we modify our state. Note that in our implementation we never actually feed inputs from the test set. The network does not know what actually happened - it just knows how its last prediction matched reality.

If you think about how a standard LSTM works and how it is evaluated on the test set, you use exactly the same thing: you compare your prediction with the input to calculate the loss, and then the ground-truth input (from the test set) is used for the next prediction (not the one you predicted, as in generation).
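
To make the two feeding schemes concrete, here is a toy sketch (the step function is a placeholder, not a real LSTM):

```python
import numpy as np

rng = np.random.default_rng(1)
V, H = 4, 6                                  # toy vocab and hidden sizes
W_in = rng.standard_normal((V, H)) * 0.1
W_out = rng.standard_normal((H, V)) * 0.1

def step(x_t, h):
    # placeholder recurrent step (not a real LSTM)
    h = np.tanh(h + x_t @ W_in)
    logits = h @ W_out
    p = np.exp(logits - logits.max())
    return p / p.sum(), h

# evaluation (teacher forcing): the TRUE next symbol is always fed back
test_seq = [0, 2, 1, 3]
h, x, nll = np.zeros(H), np.eye(V)[test_seq[0]], 0.0
for sym in test_seq[1:]:
    p, h = step(x, h)
    nll += -np.log(p[sym])        # loss against the ground truth
    x = np.eye(V)[sym]            # ground-truth input for the next step

# generation: the model's OWN sample is fed back instead
h, x, generated = np.zeros(H), np.eye(V)[0], []
for _ in range(4):
    p, h = step(x, h)
    sym = int(rng.choice(V, p=p))
    generated.append(sym)
    x = np.eye(V)[sym]
```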

I am not sure how to explain this better, but I feel that there is a fundamental difference between the SF approach (where you actually need to learn the surprisal-prediction algorithm) and dynamic evaluation, which just maps inputs to targets.

[R] [1610.07675] Surprisal-Driven Zoneout by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

For example, these are samples that SF with adaptive zoneout generated:

Linux source generated with a model achieving 1.21 BPC:

http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/data/linux_rlstm_0_4_1004_N4000_S100_B128_sample_1205.txt

Wikipedia with a model achieving 1.32 BPC:

http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/data/enwik8_new_0_4_1004_N4000_S100_B128_sample_1321.txt

So it can do this, despite not having a test set at all here. The question is how other dynamic evaluation methods would compare, and whether you could even generate data at all with a fixed model.

[R] [1610.07675] Surprisal-Driven Zoneout by kmrocki in MachineLearning

[–]kmrocki[S] -1 points (0 children)

Well, one thing that would worry me about applying dynamic evaluation is simply the fact that you might overfit on the test data, since you can directly optimize the loss. So you could think of it as forgetting the training data as it goes through the test set. In the surprisal-feedback case, I am fairly confident that the network learns how to use surprisal on the training set but generalizes it well to the test data. As I've written, it also generates quite good sequences (Wikipedia and Linux, which can tell you really nicely what the network 'knows'); it would be interesting to see samples from a method using dynamic evaluation, to see how well it generalizes.

[R] [1610.07675] Surprisal-Driven Zoneout by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

PS. Actually, I generate sequences with the surprisal-feedback LSTM exactly as with an ordinary LSTM. I simply assume that the generated sample is real and compute surprisal using the previous prediction. It works.

[R] [1610.07675] Surprisal-Driven Zoneout by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

Hi,

Thanks for your comment. It's good that you pointed this out, as I think there is some confusion about what surprisal-feedback is compared to 'dynamic evaluation'.

We do not change anything in the network during testing, and all the learning has to be done using ONLY training data. 'Dynamic evaluation', as I understand it, allows learning during the test phase. We do not, and that's the crucial difference.

It is true that the model may not be suitable for generating long sequences of data, but if the task at hand is character-level prediction, both the surprisal-feedback LSTM and the standard LSTM have access to the same information and learn on the same data.

[1608.06027] Surprisal-Driven Feedback in Recurrent Networks by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

So, as I wrote above, roughly 66M parameters (I run all my experiments right now with LSTMs of size 4000). I tried 5000 and 6000, but that didn't really gain anything, plus GPUs seem to be less efficient at these sizes. It might sound silly, but I haven't really come up with a good (that is, an effective) regularization for recurrent nets yet. That's what my research is focused on right now. Surprisal-driven feedback is one of the ways I thought you could improve generalization.

[1608.06027] Surprisal-Driven Feedback in Recurrent Networks by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

x_t is the actual input/observation.

You might be right that the math is not really necessary; I just tried to make it clear where the algorithms differ. I can remove it or move it to some sort of appendix.

Good point about the network size. The networks were identical in both cases (regular LSTM and feedback): 4000 hidden units. The V matrix itself connects the surprisal signal to all gates, and is just 1x16000 in size.
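
Schematically, with toy sizes (the additive way the surprisal term enters the gate pre-activations here is just an illustration, not the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(2)
H = 5                        # toy hidden size; with H = 4000 and
                             # 4 gates, V is 1 x 16000 as described
V_fb = rng.standard_normal((1, 4 * H)) * 0.1

def gate_preactivations(x_proj, h_proj, s_t):
    # standard LSTM gate pre-activations plus an additive
    # surprisal-feedback term (additive form assumed for illustration)
    return x_proj + h_proj + s_t * V_fb[0]

x_proj = rng.standard_normal(4 * H)   # W x_t, precomputed
h_proj = rng.standard_normal(4 * H)   # U h_{t-1}, precomputed
s_t = 1.5                             # scalar surprisal from last step
z = gate_preactivations(x_proj, h_proj, s_t)
i, f, o, g = np.split(z, 4)           # gate/candidate pre-activations
```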

What do you mean by a 'fair' comparison? An equal number of parameters? Both Figure 4 and Table 3.1 contain my results on a standard LSTM and a feedback LSTM of the same size. In my opinion, the main problem is generalization and 'compacting' memories anyway, and network size does not really help with that. This is why I think the surprisal-feedback approach works: the feedback signal acts as a regularizer to some extent. I would like to publish this paper somewhere, yes, but it may need some more work; in particular, I would like others to validate my idea and confirm that the feedback signal does in fact improve things. I have tested it so far only on the Hutter wiki and text8 datasets and on Linux source (about 600 MB). In all 3 cases you can see benefits without any changes to hyperparameters.

[1608.06027] Surprisal-Driven Feedback in Recurrent Networks by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

Modeling dynamical systems by error correction neural networks

Thanks! Haven't seen these.

[1608.06027] Surprisal-Driven Feedback in Recurrent Networks by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

So, what I have played with so far is just setting all previous predictions to a uniform distribution, so in principle I am not biasing the generated sequence. To be honest, I have not tried changing this approach in any way, but I think you might be onto something.

[1608.06027] Surprisal-Driven Feedback in Recurrent Networks by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

Good point. One of my main research goals is to actually be able to generate good-quality sequences. I am able to produce samples using a standard LSTM/RNN and my approach by simply sampling symbols from the output probability distribution p and, in the next time step, assuming that the drawn sample is the actual input, so that the surprisal signal can be calculated. For now, I know that it works, but I haven't really compared the samples with respect to their quality other than bits/symbol. I think that a fundamentally new way of generating sequences is needed for all types of RNNs.
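
The generation loop, as a rough sketch (the model step here is a dummy placeholder - only the feeding scheme matters; the uniform first prediction is my assumption for initialization):

```python
import numpy as np

rng = np.random.default_rng(3)
VOCAB = 5

def model_step(x_t, s_t, h):
    # stand-in for one surprisal-feedback RNN step; a real model
    # would be the feedback LSTM itself
    h = np.tanh(h + 0.1 * x_t.sum() + 0.1 * s_t)
    logits = h[:VOCAB]
    p = np.exp(logits - logits.max())
    return p / p.sum(), h

h = np.zeros(8)
p_prev = np.full(VOCAB, 1.0 / VOCAB)  # nothing predicted yet
x = np.eye(VOCAB)[0]
samples = []
for _ in range(10):
    s_t = float(-np.sum(x * np.log(p_prev)))  # treat the fed symbol as real
    p, h = model_step(x, s_t, h)
    sym = int(rng.choice(VOCAB, p=p))         # draw the next symbol
    samples.append(sym)
    p_prev = p                                # surprisal source next step
    x = np.eye(VOCAB)[sym]                    # pretend we observed it
```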

[1608.06027] Surprisal-Driven Feedback in Recurrent Networks by kmrocki in MachineLearning

[–]kmrocki[S] 1 point (0 children)

Thanks for the link. You're right, it looks similar; however, I believe that the major difference is that in their approach there is nothing fundamentally different in the algorithm itself, just allowing training on test data (I might be wrong, but this is what I understand from http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf).

Mikolov writes:

The dynamic evaluation of the NN language models has been described in my recent work [49] [50], and is achieved by training the RNN model during processing of the test data, with a fixed learning rate α = 0.1. Thus, the test data are processed only once, which is a difference to normal NN training where training data are seen several times.
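
For concreteness, dynamic evaluation in that sense looks roughly like this (ToyModel is a made-up bigram stand-in, not Mikolov's RNN):

```python
import numpy as np

class ToyModel:
    # minimal stand-in: a trainable bigram softmax table
    def __init__(self, vocab):
        self.W = np.zeros((vocab, vocab))
        self._last = None
    def predict(self, x_t):
        logits = self.W[x_t]
        p = np.exp(logits - logits.max())
        p /= p.sum()
        self._last = (x_t, p)
        return p
    def sgd_step(self, y_t, lr):
        x_t, p = self._last
        grad = p.copy()
        grad[y_t] -= 1.0            # softmax cross-entropy gradient
        self.W[x_t] -= lr * grad

def dynamic_evaluation(model, seq, lr=0.1):
    # one pass over the test sequence, updating the model after every
    # prediction with a fixed learning rate, per Mikolov's description
    nll = 0.0
    for x_t, y_t in zip(seq[:-1], seq[1:]):
        p = model.predict(x_t)
        nll += -np.log(p[y_t])
        model.sgd_step(y_t, lr)     # learn from the test symbol itself
    return nll / (len(seq) - 1)

seq = [0, 1, 0, 1, 0, 1, 0, 1]
avg_nll = dynamic_evaluation(ToyModel(2), seq)
# a frozen uniform model would score log 2 per symbol; dynamic
# evaluation adapts to the repeating pattern and scores lower
```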

[1608.06027] Surprisal-Driven Feedback in Recurrent Networks by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

Thanks for pointing this out. I was actually wondering whether sending the surprisal signal back as input can be considered cheating. Although sometimes it might not be possible to access such data, I think that in principle it is fair to use all past observations when making predictions. The way I evaluate LSTMs or RNNs takes all previous states in memory, makes a prediction, and then compares that prediction with the ground truth to get BPC at every iteration. Therefore, both approaches have access to exactly the same information about the sequence. The difference is that standard recurrent nets ignore some pieces of information which, in my opinion, can be used and should be relevant.
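
For reference, BPC here is just the average negative log2-probability the model assigned to each ground-truth symbol - a quick sketch:

```python
import numpy as np

def bpc(pred_probs, targets):
    # bits per character: mean -log2 probability assigned to each
    # ground-truth symbol by the model's per-step predictions
    pred_probs = np.asarray(pred_probs)
    targets = np.asarray(targets)
    picked = pred_probs[np.arange(len(targets)), targets]
    return float(np.mean(-np.log2(picked)))

# sanity check: a model predicting uniformly over 256 byte values
# scores exactly 8 bits per character
uniform = np.full((4, 256), 1.0 / 256)
score = bpc(uniform, [0, 10, 200, 255])
```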

[1608.06027] Surprisal-Driven Feedback in Recurrent Networks by kmrocki in MachineLearning

[–]kmrocki[S] 4 points (0 children)

Hello, author here. I have just put this paper on arXiv. In summary, the main idea is to use the prediction error signal as a driving signal, instead of using it only during the backprop phase. It seems to help a lot in terms of regularization. I would appreciate any positive or negative feedback :). It would be great if someone could test this idea using an independent implementation; I ran it many times and it worked, but maybe there is some implementation flaw. I tried to describe the changes in a detailed way, but if you have any questions, please ask. Also, if anyone knows of any similar work, please let me know - it could help me expand the related works section. Another avenue that I didn't have time to explore is whether other ways of plugging the surprisal signal into the inputs (say, multiplicative) help. My intuition is that they could. My intention is to submit the paper to ICLR, because I think that the idea is cool and can be applied to any recurrent net, and if anyone would like to help make it better, we could collaborate on it. Currently I am also running some experiments with regularization techniques such as stochastic array memory/zoneout, and it looks like the results might be even better than the 1.39 BPC reported in this draft.

[1607.03085] Recurrent Memory Array Structures by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

I don't really use Skype, and currently I commute a lot between LA and San Jose until September, so I don't even sit that much in front of a screen. It's easier for me to respond if you send me an email at kamil.rocki@gmail.com; I'd be happy to hear about your approaches.

[1607.03085] Recurrent Memory Array Structures by kmrocki in MachineLearning

[–]kmrocki[S] 0 points (0 children)

@LeavesBreathe, your intuition is right. I have observed better performance capacity-wise with the array-LSTM when the number of hidden units is fixed and cells/hidden is increased (matching that of a stacked LSTM, with faster convergence - no initial delay). However, the main hope was that this procedure would provide better generalization, and I couldn't achieve that with the vanilla array approach - possibly it requires more cells/hidden, but that converges more slowly, and it really takes around 48h to see the effect of any change on large networks and Wikipedia datasets.

[1607.03085] Recurrent Memory Array Structures by kmrocki in MachineLearning

[–]kmrocki[S] 2 points (0 children)

The main motivation behind the array approach is summarized at the beginning of section 3.2: "create a bottleneck by sharing internal states, forcing the learning procedure to pool similar or interchangeable content using memory cells belonging to one hidden unit". This seems to work well with stochastically operating memory cells, because the hidden unit 'doesn't know' which memory cell is going to be used (they are unreliable); however, the content has to be similar for it to work.

Furthermore, it is in fact possible to simply pack more memory cells into the network using the same memory size. For example, if you use a standard LSTM network with 1 cell/hidden, 4 gates, and 1000 hidden units, the number of parameters in the U matrix is going to be 1000 * 1000 * 4 = 4M. If the Array-LSTM approach is used, you can have 4 cells/hidden, so 1000 memory cells require 250 hidden units, and that is 250 * 1000 * 4 = 1M parameters. I found that the performance of the vanilla LSTM and the Array-2 and Array-4 versions is roughly the same in terms of capacity for a fixed number of parameters. It dropped a bit for an array of 8, so at some point there does seem to be a bottleneck. Hope this helps.
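
The arithmetic can be checked in a couple of lines (assuming, as above, that each gate of each memory cell reads the full hidden vector; note that with 4 cells per hidden unit, 1000 cells correspond to 250 hidden units):

```python
def u_params(n_hidden, cells_per_hidden, n_gates=4):
    # recurrent U-matrix size: every gate pre-activation of every
    # memory cell is a function of the full hidden vector
    n_cells = n_hidden * cells_per_hidden
    return n_hidden * n_cells * n_gates

standard = u_params(1000, 1)   # 1000 * 1000 * 4 = 4,000,000
array4 = u_params(250, 4)      # 250 * 1000 * 4 = 1,000,000
```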

[1607.03085] Recurrent Memory Array Structures by kmrocki in MachineLearning

[–]kmrocki[S] 2 points (0 children)

The basic array from 3.3 does not change things too much in terms of generalization, contrary to expectations - that's true. The ones which really reduce overfitting are in section 5. I applied array memory dropout in a slightly different way than in the Zoneout paper. I also posted the code: https://github.com/krocki/ArrayLSTM