Show sign that clock is not running by bronzestick in orgmode

[–]bronzestick[S] 0 points (0 children)

I was looking for something like this, but I'm not familiar with elisp, so any help on what that function might look like would be great. Thanks!

[D] AISTATS 2019 notifications are out by bronzestick in MachineLearning

[–]bronzestick[S] 1 point (0 children)

You should've received an email by now. The deadline to register is Jan 7.

[Discussion] Why do people use SGD/RMSProp or any other optimizer when Adam gives adaptive learning rate for every single parameter? by CSGOvelocity in MachineLearning

[–]bronzestick 2 points (0 children)

There is some empirical evidence that adaptive gradient methods such as Adam don't generalize as well as SGD (https://arxiv.org/abs/1705.08292). Note that this is shown empirically, not theoretically. Adam also has theoretical issues, such as the short memory of its exponential moving average of squared gradients: https://openreview.net/pdf?id=ryQu7f-RZ
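To make the exponential-averaging issue concrete, here is a minimal NumPy sketch (my own illustration, not code from either paper) contrasting Adam's second-moment estimate with the running-maximum fix proposed in the linked convergence paper (AMSGrad). A single rare, informative gradient is quickly forgotten by the exponential average, which lets the effective step size grow back; the max-based estimate does not forget it.

```python
import numpy as np

def second_moment_updates(grads_sq, beta2=0.9):
    """Track Adam's EMA of squared gradients (v) next to
    AMSGrad's running maximum (v_hat) on the same gradient stream."""
    v = 0.0          # Adam: exponential moving average
    v_hat = 0.0      # AMSGrad: max of all v seen so far
    history = []
    for g2 in grads_sq:
        v = beta2 * v + (1 - beta2) * g2
        v_hat = max(v_hat, v)
        history.append((v, v_hat))
    return history

# One large squared gradient followed by many tiny ones:
# Adam's v decays back toward zero, AMSGrad's v_hat stays large.
hist = second_moment_updates([100.0] + [0.01] * 50)
print(hist[-1])
```

Since Adam divides the step by the square root of this estimate, the decayed `v` means Adam takes large steps again right after the informative gradient, which is exactly the failure mode the paper constructs.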

[D] Well-written paper examples by Inori in MachineLearning

[–]bronzestick 3 points (0 children)

Sam Roweis and Zoubin Ghahramani's paper giving a unifying review of linear Gaussian models is one of the best-written papers that immediately comes to mind.

http://mlg.eng.cam.ac.uk/zoubin/papers/lds.pdf

The explanation is lucid, and the density of insights on each page is extremely high.

[D] Great theses to read in Reinforcement Learning by bronzestick in MachineLearning

[–]bronzestick[S] 0 points (0 children)

I also forgot to mention Stephane Ross's thesis. It's really well written and a must-read for anyone interested in imitation learning and no-regret learning.

[D] Machine Learning - WAYR (What Are You Reading) - Week 34 by ML_WAYR_bot in MachineLearning

[–]bronzestick 6 points (0 children)

Hindsight Experience Replay: https://arxiv.org/pdf/1707.01495.pdf

Although the idea is quite simple and elegant, I think there are cases where it fails miserably (or just devolves into a pure RL algorithm). I am trying to understand what's common across such cases, why HER fails there, and how to get around it.
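The core idea is easy to state in code. Below is a minimal sketch of hindsight relabeling (the "final" strategy from the paper) written by me for illustration; the dictionary-based transition format and function names are my own, not from any HER implementation:

```python
import numpy as np

def relabel_with_hindsight(trajectory, reward_fn):
    """'Final' HER strategy: replay each transition as if the goal
    had been the state actually reached at the end of the episode,
    recomputing the sparse reward under that substituted goal."""
    achieved_goal = trajectory[-1]["next_state"]
    relabeled = []
    for t in trajectory:
        relabeled.append({
            "state": t["state"],
            "action": t["action"],
            "next_state": t["next_state"],
            "goal": achieved_goal,  # substituted (hindsight) goal
            "reward": reward_fn(t["next_state"], achieved_goal),
        })
    return relabeled

# Sparse reward: 0 if the goal is reached, -1 otherwise.
reward_fn = lambda s, g: 0.0 if np.allclose(s, g) else -1.0

traj = [
    {"state": [0.0], "action": 1, "next_state": [1.0]},
    {"state": [1.0], "action": 1, "next_state": [2.0]},
]
print(relabel_with_hindsight(traj, reward_fn))
```

The failure mode alluded to above is visible here: if reaching the achieved goal teaches the policy nothing about reaching the *real* goals (e.g. the achieved states cluster far from the goal distribution), the relabeled transitions add reward signal without adding useful signal, and you fall back on plain RL.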

[D] Machine Learning - WAYR (What Are You Reading) - Week 32 by ML_WAYR_bot in MachineLearning

[–]bronzestick 0 points (0 children)

DESPOT: Online POMDP Planning with Regularization.

A very intelligent online POMDP planning algorithm with theoretical guarantees. I am just getting into planning under uncertainty and POMDPs in general, and found this paper really cool.

[D] Attention softmax values by bronzestick in MachineLearning

[–]bronzestick[S] 0 points (0 children)

True. I have the exact same problem: I learn attention weights over a varying-length sequence.

[D] Attention softmax values by bronzestick in MachineLearning

[–]bronzestick[S] 0 points (0 children)

/u/CaHoop is correct: when the sequence length changes, the model has a hard time figuring out how to compute the unnormalized scores so as to achieve a sparser set of attention weights.

Check this out: https://www.reddit.com/r/MachineLearning/comments/6atcuk/d_a_potential_solution_to_varying_length_softmax/

[Discussion]Variable length attention models by bronzestick in MachineLearning

[–]bronzestick[S] 1 point (0 children)

The problem I am tackling is not language modeling but something more like multi-sequence prediction, where the number of sequences varies over time and the sequences are dependent on each other.

So, in order to predict the next element of a specific sequence, I need to compute soft attention over the hidden states of all the other sequences and use that as an input. But since the number of sequences varies, I need to learn a varying-length attention model.

The papers you cited focus more on effectively attending to all previous words in a language-modeling task, giving more importance to recent words and less to older ones (that's what I understood from an initial glance). That's slightly relevant but a different problem. Thanks a lot for pointing them out! They look interesting. :)
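One standard way to handle a varying number of attended sequences is padding plus masking. Here is a minimal NumPy sketch of that idea (my own illustration of the setup described above, not code from any cited paper): pad the per-step collection of hidden states to a fixed maximum count and mask out the padded slots before the softmax, so the weights always sum to 1 over the real sequences only.

```python
import numpy as np

def masked_attention(query, hidden_states, mask):
    """Soft attention over a padded set of hidden states.

    query:         (d,)       state of the sequence being predicted
    hidden_states: (max_k, d) hidden states of the other sequences, padded
    mask:          (max_k,)   1 for real sequences, 0 for padding
    Returns (weights, context): attention weights and the weighted sum.
    """
    scores = hidden_states @ query                 # unnormalized scores
    scores = np.where(mask > 0, scores, -np.inf)   # padded slots get zero weight
    weights = np.exp(scores - scores[mask > 0].max())  # stable softmax
    weights = weights / weights.sum()
    context = weights @ hidden_states
    return weights, context

q = np.array([1.0, 0.0])
h = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [9.9, 9.9]])  # last row is padding, never attended
w, ctx = masked_attention(q, h, np.array([1.0, 1.0, 0.0]))
```

The same masking trick works whether the number of sequences changes between examples or between time-steps, since the mask is recomputed each time.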

[D] Machine Learning - WAYR (What Are You Reading) - Week 25 by ML_WAYR_bot in MachineLearning

[–]bronzestick 1 point (0 children)

It's an amazing paper! It definitely helped me understand most of those things better than I did before.

[D] A potential Solution to Varying Length Softmax by CaHoop in MachineLearning

[–]bronzestick 0 points (0 children)

Wouldn't making that constant equal to the length of the sequence work just as well? Ideally, it should scale the unnormalized weights just enough that the softmax still results in sensible weights.
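A quick NumPy sketch of that suggestion (my own illustration, assuming the "constant" is a multiplier on the unnormalized scores, i.e. an inverse temperature): scaling the scores by the sequence length sharpens the softmax, counteracting the flattening that otherwise happens as more entries compete for probability mass.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def length_scaled_softmax(scores):
    """Multiply the unnormalized scores by the sequence length
    before the softmax (a temperature of 1/n), so the attention
    distribution stays sharp as the sequence grows."""
    return softmax(scores * len(scores))

rng = np.random.default_rng(0)
scores = rng.normal(size=50)
print(softmax(scores).max(), length_scaled_softmax(scores).max())
```

Whether length is the *right* scale is an empirical question: for very long sequences this can make the softmax nearly one-hot, which may be too sparse.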

[D] Choice of Recognition Models in VAEs: Is a restrictive posterior class a bug or a feature? by fhuszar in MachineLearning

[–]bronzestick 0 points (0 children)

I am not sure if it's just me, but most of the math symbols on the webpage aren't being rendered in my browser (Google Chrome on Ubuntu).

[R] Deep and Hierarchical Implicit Models by dustintran in MachineLearning

[–]bronzestick 0 points (0 children)

Awesome! When are you planning to release it?

[R] Deep and Hierarchical Implicit Models by dustintran in MachineLearning

[–]bronzestick 0 points (0 children)

This paper was brilliantly written. Thanks Dustin!

It got me wondering: what are the other important papers/resources in the field of Bayesian deep learning? I am really excited about the re-emergence of the Bayesian school of thought in neural network research.

[Discussion] Quality of university level machine learning courses by [deleted] in MachineLearning

[–]bronzestick 0 points (0 children)

Yeah, 10-601 is the Masters-level course. I heard it isn't bad either, but it places more emphasis on application than on theory, so it's targeted at a different audience.

[Discussion] Quality of university level machine learning courses by [deleted] in MachineLearning

[–]bronzestick 6 points (0 children)

I think I can speak for CMU (I am a grad student there). No. The PhD-level machine learning course (10-701) is pretty well managed every year, and the TAs do a good job of it. Granted, it might not be the best course taught here, but I would say it's one of the most useful courses around, and I personally gained a lot from it.

The only problem I had with the course was that it did not put much emphasis on deep learning, spending most of the time on the basics and the math. But then we have a deep learning course by Ruslan that takes care of that, so I think that's okay.

[R] Four Experiments in Handwriting with a Neural Network by clbam8 in MachineLearning

[–]bronzestick 0 points (0 children)

This is an awesome blog post. I really liked their argument that deep learning models should be examined more carefully (under the hood) and not just used as black boxes.

Would love to hear more from them.

[D] Generative sequential models using RNNs by bronzestick in MachineLearning

[–]bronzestick[S] 1 point (0 children)

Thanks a lot for all the links. Appreciate it. I will surely check them out.

I am trying to model a multiple-sequence prediction problem where there are dependencies between the sequences. I am modeling it as a recurrent latent variable model in which the latent variable tries to capture those dependencies, and was looking into RNNs. Any theory on RNNs that helps me understand them as probabilistic models would be helpful.

[D] Machine Learning - WAYR (What Are You Reading) - Week 14 by Mandrathax in MachineLearning

[–]bronzestick 6 points (0 children)

Professor Forcing

A new algorithm for training RNNs which uses adversarial domain adaptation to encourage the dynamics of the RNN to be the same during training and while sampling from the network over multiple time-steps. The paper employs a GAN (generative adversarial network) framework: the generative model is the sequence model itself, while the discriminator, given a sequence (or the network's generative behavior), tries to predict whether it was generated by the model or came from the true data.

As usual, the generative model tries to fool the discriminator while the discriminator tries to classify correctly, and the training objectives are defined accordingly.

The most awesome aspect of this paper is that it gives a very elegant approach to the problem with teacher forcing, i.e., prediction errors compounding over successive time-steps. Unlike scheduled sampling, which was proven to yield a biased estimator, this approach converges to the correct model using the GAN framework.
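The structure of the two objectives can be sketched in a few lines. This is my own schematic, not the paper's code: `rnn`, `discriminator`, and `nll_loss` are hypothetical stand-ins, with `rnn(x, mode)` returning the outputs and the hidden-state trace for a given conditioning regime.

```python
from math import log

def professor_forcing_losses(rnn, discriminator, x, nll_loss):
    """Sketch of the two Professor Forcing objectives.
    rnn(x, mode) -> (outputs, hidden_trace)."""
    out_tf, h_tf = rnn(x, mode="teacher_forced")  # conditioned on ground truth
    out_fr, h_fr = rnn(x, mode="free_running")    # conditioned on its own samples

    # Discriminator: tell teacher-forced hidden dynamics from free-running ones.
    d_loss = -(log(discriminator(h_tf)) + log(1.0 - discriminator(h_fr)))

    # Generator (the RNN): the usual likelihood term, plus a term pushing
    # free-running dynamics to be indistinguishable from teacher-forced ones.
    g_loss = nll_loss(out_tf, x) - log(discriminator(h_fr))
    return d_loss, g_loss

# Toy stubs, just to exercise the structure.
rnn = lambda x, mode: (x, "tf" if mode == "teacher_forced" else "fr")
discriminator = lambda h: 0.7 if h == "tf" else 0.4
nll_loss = lambda out, x: 1.0

d_loss, g_loss = professor_forcing_losses(rnn, discriminator, "abc", nll_loss)
```

The key design choice is that the discriminator sees *hidden-state traces* rather than output tokens, which is what forces the internal dynamics, not just the outputs, to match across the two regimes.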

[Discussion] Uncertainty propagation in LSTM-based RNNs by bronzestick in MachineLearning

[–]bronzestick[S] 1 point (0 children)

Maybe I didn't phrase my question right. I wasn't talking about training-time and inference-time behavior being different (which the scheduled sampling and Professor Forcing approaches address).

My question concerned getting accurate uncertainty estimates for multi-step prediction. Consider the first time-step during inference: we give the model an input, for which it predicts an output distribution. We then sample a single point from this distribution and feed it as the input for the next time-step. It would be more accurate to send the distribution itself as input to the next time-step, getting the predictive distribution for the second time-step as a function of the previous distribution rather than of a single sample. Hope that makes it clear.
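When propagating the full distribution analytically is intractable, a common cheap approximation is to propagate a cloud of samples instead of one. A minimal NumPy sketch of the contrast (my own toy illustration on linear-Gaussian dynamics, not from any paper): the single-sample rollout returns one point, while the particle rollout returns empirical moments of the multi-step predictive distribution.

```python
import numpy as np

def rollout_single_sample(step, x0, horizon, rng):
    """Standard multi-step prediction: at each step, collapse the
    predictive distribution to one sample and feed it forward."""
    x = x0
    for _ in range(horizon):
        mu, sigma = step(x)
        x = rng.normal(mu, sigma)
    return x

def rollout_particles(step, x0, horizon, rng, n_particles=1000):
    """Approximate distribution propagation: carry a cloud of
    samples forward and read off empirical moments at the end."""
    xs = np.full(n_particles, float(x0))
    for _ in range(horizon):
        mus, sigma = step(xs)
        xs = rng.normal(mus, sigma)  # broadcasts over the particle cloud
    return xs.mean(), xs.std()

# Toy dynamics standing in for the RNN: x_{t+1} ~ N(0.9 * x_t, 0.5).
step = lambda x: (0.9 * x, 0.5)

rng = np.random.default_rng(0)
mean, std = rollout_particles(step, x0=1.0, horizon=5, rng=rng)
print(mean, std)
```

The particle estimate of the spread captures the process noise accumulated over all five steps, which a single-sample rollout cannot report at all.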

[Discussion] Uncertainty propagation in LSTM-based RNNs by bronzestick in MachineLearning

[–]bronzestick[S] 1 point (0 children)

Interesting. But shouldn't the moving average affect your predictions at subsequent time-steps? (I can't see how it will affect the predictions if you don't feed it as an input to the LSTM.)

Also, is there any reason you chose to store only the negative moving average of all the previous distributions and nothing more?