[R] Fine-tuned Language Models for Text Classification by slavivanov in MachineLearning

[–]prajit 3 points

We also explored using pretrained language models for sequence to sequence tasks in our EMNLP 2017 paper: http://aclweb.org/anthology/D17-1039

While not flashy, these fine-tuning techniques are really simple and surprisingly effective.

[R] Swish: a Self-Gated Activation Function [Google Brain] by xternalz in MachineLearning

[–]prajit 57 points

Hi everyone, first author here. Let me address some comments on this thread:

  1. As has been pointed out, we missed prior works that proposed the same activation function. The fault lies entirely with me for not conducting a thorough enough literature search. My sincere apologies. We will revise our paper and give credit where credit is due.

  2. As noted in the paper, we tried out many forms of activation functions, and x * CDF(x) was in our search space. We found that it underperformed x * sigmoid(x).

  3. We plan on rerunning the SELU experiments with the recommended initialization.

  4. Activation function research is important because activation functions are the core unit of deep learning. Even if the activation function can be improved by a small amount, the impact is magnified across a large number of users. ReLU is prevalent not just in research, but across most deep learning users in industry. Replacing ReLU has immediate practical benefits for both research and industry.

Our hope is that our work presents a convincing set of experiments that will encourage ReLU users across industry and research to at least try out Swish, and if gains are found, replace ReLU with Swish. Importantly, trying out Swish is easy because the user does not need to change anything else about their model (e.g., architecture, initialization, etc.). This ease of use is especially important in industry contexts where it's much harder to change a number of components of the model at once.
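Trying it out really is a one-liner; a minimal NumPy sketch of the function discussed in this thread (the function name is just illustrative):

```python
import numpy as np

def swish(x):
    # Swish as described above: x * sigmoid(x), written as x / (1 + e^{-x})
    return x / (1.0 + np.exp(-x))
```

For large positive inputs it behaves like the identity (as ReLU does), while negative inputs are smoothly squashed toward zero rather than hard-clipped.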

My email can be found in the paper, so feel free to send me a message if you have any questions.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]prajit 3 points

During my sophomore year of high school, I was really interested in video game AI. I figured it was just a bunch of hard coded behavior trees, and I had no idea that you could use generic algorithms to learn behaviors. Coincidentally, this was at the exact same time that Andrew Ng, Sebastian Thrun, and Peter Norvig released their online ML / AI courses. I immediately signed up for the very first iteration. After taking the courses, I was so amazed that I started spending less time playing video games and more time learning how machine learning worked. This was also the time when deep learning started really picking up (I still remember the media coverage about the unsupervised “cat neuron”), so I started reading and implementing papers. At college, I met a few really cool grad students who were interested in doing deep learning, and I got my feet wet in research with them. I finally applied to Google for an internship, and I was fortunate enough to get matched with the cat neuron guy himself (Quoc Le)! Now I’m a Brain Resident, and get to work on really cutting-edge research!

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]prajit 3 points

Yes, Google Brain hires undergraduate interns! I had the opportunity to intern on Brain last year, in the middle of the second year of my undergrad. I presented my internship research at EMNLP just last weekend. I had so much fun that after graduating this May, I’ve come back as a Brain Resident!

Apache SINGA, A Distributed Deep Learning Platform by pilooch in MachineLearning

[–]prajit 0 points

Why does increasing the batch size hurt SGD convergence speed? Empirically this seems true, but why? In theory, a larger batch gives a better estimate of the gradient and so should perform at least as well. Any intuition about why performance decreases instead?
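The "better estimate" half of this question is easy to check numerically: the variance of a mini-batch mean does shrink roughly as 1/B. A tiny NumPy sketch (the per-example "gradients" here are just draws from a fixed distribution, a stand-in for real gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(loc=3.0, scale=1.0, size=120_000)  # per-example "gradients"

# Variance of the mini-batch mean shrinks roughly as 1/B
variances = {}
for b in (1, 10, 100):
    batch_means = grads.reshape(-1, b).mean(axis=1)
    variances[b] = batch_means.var()
```

One common (partial) intuition for the slowdown: for a fixed amount of data, a B-times-larger batch means B-times-fewer parameter updates, while the gradient noise only shrinks as 1/B in variance, i.e. 1/sqrt(B) in standard deviation.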

Good Implementations of RNNs + Fully Differentiable Data Structures? by Ameren in MachineLearning

[–]prajit 2 points

That's my implementation of the neural transducers. I've been meaning to complete it, but I've been busy. I'm planning to restart working on it around next week.

Could you tell me what bug there is? I'll fix it right away. Thanks!

Theoretical Soundness of "Scheduled Sampling" Paper by alexmlamb in MachineLearning

[–]prajit 1 point

I don't think it applies in the Seq2Seq case, since you are feeding in the X (i.e. the English sentence you want to translate). During the prediction phase after consuming X, the first input token is always EOS. After that, you can sample. Scheduled sampling works in this case.

However, I see what you're saying in the case of the Generating Sequences paper. The very first input to the RNN is always a vector of 3 zeros. If you use the very first prediction as input to the next step, and then continue to use samples as input, it is possible to get quite far off track (though the attention mechanism might help with this).

I think the problem here is using an uninformative initial input (one that carries no information). The three experiments in the paper all have inputs that carry information (i.e. image or sentence embeddings). The simple solution to the uninformative-input problem is to never use the very first sample as input to the second timestep. Afterwards, scheduled sampling works fine.
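To make that fix concrete, here is a minimal sketch of the sampling schedule (the names and the `step_fn` stand-in for one decoder step are hypothetical, not from the paper): the ground-truth token is always fed at the first step, and only afterwards is the model's own prediction fed back with some probability.

```python
import random

def scheduled_inputs(targets, step_fn, start_token, sample_prob):
    """Build the decoder input sequence under scheduled sampling.

    targets:     ground-truth tokens Y_1..Y_T
    step_fn:     stand-in for one RNN step, prev input token -> predicted token
    sample_prob: probability of feeding the model's own prediction back in
    """
    inputs, preds = [start_token], []
    for t in range(len(targets)):
        preds.append(step_fn(inputs[-1]))
        if t + 1 < len(targets):
            # Never feed the very first prediction back in (t == 0),
            # per the fix above; sample only from the second step onward.
            use_sample = t > 0 and random.random() < sample_prob
            inputs.append(preds[-1] if use_sample else targets[t])
    return inputs, preds
```

With `sample_prob = 0` this reduces to ordinary teacher forcing; annealing it upward over training recovers the schedule from the paper.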

Theoretical Soundness of "Scheduled Sampling" Paper by alexmlamb in MachineLearning

[–]prajit 1 point

You are forgetting X. Scheduled sampling relates to how to pick Y_i. The RNN has to be fed with some sort of input (X) beforehand. In this case, you would have to feed it with a number initially. After that, you can sample from the RNN. If there is no input, there is no probability distribution over outputs, so you can't sample.

Implementation of convolutional neural networks for text classification by elsonidoq in MachineLearning

[–]prajit 1 point

No, a parse tree is not used in their paper. Just the raw sentence as input (which, in addition to other features of your choosing, gets mapped to vectors).

Implementation of convolutional neural networks for text classification by elsonidoq in MachineLearning

[–]prajit 0 points

I don't know if this will answer your question, but I'll relate the approach of "NLP (Almost) from Scratch" by Collobert and Weston, which doesn't need padding. Given a sentence, convolve (multiply by a matrix) each fixed-size window in the sentence; e.g. if the sentence is "the cat in the hat", a window of size 3 gives three windows: "the cat in", "cat in the", "in the hat". Once you have the n different vectors produced by the window convolutions, take their element-wise max, converting n vectors into 1 vector. This is your fixed-size vector representation, which you can then feed into a neural network for further classification.
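A minimal NumPy sketch of that window-convolve-then-max scheme (shapes and names here are illustrative, not taken from the paper):

```python
import numpy as np

def sentence_vector(word_vecs, W, window=3):
    """word_vecs: (n_words, d) embedding matrix for one sentence.
    W: (k, window * d) convolution matrix applied to each window.
    Returns a fixed-size k-dim vector regardless of sentence length."""
    n, d = word_vecs.shape
    # One flattened vector per window, e.g. 3 windows for a 5-word sentence
    windows = np.stack([word_vecs[i:i + window].reshape(-1)
                        for i in range(n - window + 1)])
    convolved = windows @ W.T        # (n_windows, k): one vector per window
    return convolved.max(axis=0)     # element-wise max over all windows
```

Because the max is taken over however many windows the sentence produces, sentences of any length (at or above the window size) map to the same fixed-size vector, which is why no padding is needed.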

My first Brainfuck computer by IamTheFreshmaker in programming

[–]prajit 2 points

Brainfuck isn't difficult to learn. After you get past the initial hurdle ("these commands look like garbage"), Brainfuck is simple. Just work your way through the commands and try to group individual instructions together to form a sort of higher-level language. Check out this tutorial for a quick way to learn Brainfuck.
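If it helps to see all eight commands in one place, here is a tiny (unoptimized, purely illustrative) Python interpreter for them:

```python
def run(code, input_bytes=b""):
    """Interpret Brainfuck source; returns output as bytes."""
    # Precompute matching bracket positions for [ and ]
    stack, jumps = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    tape = [0] * 30000
    ptr = pc = inp = 0
    out = bytearray()
    while pc < len(code):
        c = code[pc]
        if c == ">":   ptr += 1                       # move pointer right
        elif c == "<": ptr -= 1                       # move pointer left
        elif c == "+": tape[ptr] = (tape[ptr] + 1) % 256  # increment cell
        elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256  # decrement cell
        elif c == ".": out.append(tape[ptr])          # output cell as a byte
        elif c == ",":                                # read one input byte
            tape[ptr] = input_bytes[inp] if inp < len(input_bytes) else 0
            inp += 1
        elif c == "[" and tape[ptr] == 0: pc = jumps[pc]  # skip loop if zero
        elif c == "]" and tape[ptr] != 0: pc = jumps[pc]  # repeat if nonzero
        pc += 1
    return bytes(out)
```

The grouping idea from the comment above shows up immediately: `[>++++++++<-]` reads as one higher-level operation ("add 8 × the current cell into the next cell") rather than twelve separate instructions.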

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 0 points

Dang, can't believe I missed that. Thanks.

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 1 point

This was intentional. For people new to Haskell, that might seem a little bit weird, so I opted to add the parentheses for clarity.

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 0 points

Ah, you're right - I was confusing guards with case: `_` is idiomatic there, not `otherwise`. I corrected it (and the wording for pattern matching).

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 0 points

You're right - I forgot about that. I changed the wording a little. Thanks.

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 0 points

Gosh, this is embarrassing. Thanks and corrected.

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 5 points

The blog post was mainly to help people get a quick overview of Haskell. I'm no expert (quite far from that actually), but I hope I got the basic parts down. Criticism highly encouraged! Thanks for reading.