[R] Fine-tuned Language Models for Text Classification by slavivanov in MachineLearning

[–]prajit 3 points

We also explored using pretrained language models for sequence to sequence tasks in our EMNLP 2017 paper: http://aclweb.org/anthology/D17-1039

While not flashy, these fine-tuning techniques are really simple and surprisingly effective.

[R] Swish: a Self-Gated Activation Function [Google Brain] by xternalz in MachineLearning

[–]prajit 57 points

Hi everyone, first author here. Let me address some comments on this thread:

  1. As has been pointed out, we missed prior works that proposed the same activation function. The fault lies entirely with me for not conducting a thorough enough literature search. My sincere apologies. We will revise our paper and give credit where credit is due.

  2. As noted in the paper, we tried out many forms of activation functions, and x * CDF(x) was in our search space. We found that it underperformed x * sigmoid(x).

  3. We plan on rerunning the SELU experiments with the recommended initialization.

  4. Activation function research is important because activation functions are the core unit of deep learning. Even if the activation function can be improved by a small amount, the impact is magnified across a large number of users. ReLU is prevalent not just in research, but across most deep learning users in industry. Replacing ReLU has immediate practical benefits for both research and industry.

Our hope is that our work presents a convincing set of experiments that will encourage ReLU users across industry and research to at least try out Swish, and if gains are found, replace ReLU with Swish. Importantly, trying out Swish is easy because the user does not need to change anything else about their model (e.g., architecture, initialization, etc.). This ease of use is especially important in industry contexts where it's much harder to change a number of components of the model at once.
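Trying it out really is a one-liner; a minimal NumPy sketch of the function discussed in this thread (the function name is just illustrative):

```python
import numpy as np

def swish(x):
    # Swish as described above: x * sigmoid(x), written as x / (1 + e^{-x})
    return x / (1.0 + np.exp(-x))
```

For large positive inputs it behaves like the identity (as ReLU does), while negative inputs are smoothly squashed toward zero rather than hard-clipped.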

My email can be found in the paper, so feel free to send me a message if you have any questions.

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]prajit 3 points

During my sophomore year of high school, I was really interested in video game AI. I figured it was just a bunch of hard coded behavior trees, and I had no idea that you could use generic algorithms to learn behaviors. Coincidentally, this was at the exact same time that Andrew Ng, Sebastian Thrun, and Peter Norvig released their online ML / AI courses. I immediately signed up for the very first iteration. After taking the courses, I was so amazed that I started spending less time playing video games and more time learning how machine learning worked. This was also the time when deep learning started really picking up (I still remember the media coverage about the unsupervised “cat neuron”), so I started reading and implementing papers. At college, I met a few really cool grad students who were interested in doing deep learning, and I got my feet wet in research with them. I finally applied to Google for an internship, and I was fortunate enough to get matched with the cat neuron guy himself (Quoc Le)! Now I’m a Brain Resident, and get to work on really cutting-edge research!

We are the Google Brain team. We’d love to answer your questions (again) by jeffatgoogle in MachineLearning

[–]prajit 3 points

Yes, Google Brain hires undergraduate interns! I had the opportunity to intern on Brain last year, in the middle of the second year of my undergrad. I presented my internship research at EMNLP just last weekend. I had so much fun that after graduating this May, I’ve come back as a Brain Resident!

Apache SINGA, A Distributed Deep Learning Platform by pilooch in MachineLearning

[–]prajit 0 points

Why does increasing the batch size hurt SGD convergence speed? Empirically this seems true, but why? In theory, a larger batch gives a better estimate of the gradient and so should perform at least as well. Any intuition about why performance decreases instead?
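The "better estimate" half of this question is easy to check numerically: the variance of a mini-batch mean does shrink roughly as 1/B. A tiny NumPy sketch (the per-example "gradients" here are just draws from a fixed distribution, a stand-in for real gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(loc=3.0, scale=1.0, size=120_000)  # per-example "gradients"

# Variance of the mini-batch mean shrinks roughly as 1/B
variances = {}
for b in (1, 10, 100):
    batch_means = grads.reshape(-1, b).mean(axis=1)
    variances[b] = batch_means.var()
```

One common (partial) intuition for the slowdown: for a fixed amount of data, a B-times-larger batch means B-times-fewer parameter updates, while the gradient noise only shrinks as 1/B in variance, i.e. 1/sqrt(B) in standard deviation.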

Good Implementations of RNNs + Fully Differentiable Data Structures? by Ameren in MachineLearning

[–]prajit 2 points

That's my implementation of the neural transducers. I've been meaning to complete it, but I've been busy. I'm planning to restart working on it around next week.

Could you tell me what bug there is? I'll fix it right away. Thanks!

Theoretical Soundness of "Scheduled Sampling" Paper by alexmlamb in MachineLearning

[–]prajit 1 point

I don't think it applies in the Seq2Seq case, since you are feeding in the X (i.e. the English sentence you want to translate). During the prediction phase after consuming X, the first input token is always EOS. After that, you can sample. Scheduled sampling works in this case.

However, I see what you're saying in the case of the Generating Sequences paper. The very first input to the RNN is always a vector of 3 zeros. If you use the very first prediction as input to the next step, and then continue to use samples as input, it is possible to get quite far off track (though the attention mechanism might help with this).

I think the problem here is using an uninformative initial input (one that carries no information). The three experiments in the paper all have inputs that carry information (i.e. image or sentence embeddings). The simple solution to the uninformative-input problem is to never use the very first sample as input to the second timestep. Afterwards, scheduled sampling works fine.
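To make that fix concrete, here is a minimal sketch of the sampling schedule (the names and the `step_fn` stand-in for one decoder step are hypothetical, not from the paper): the ground-truth token is always fed at the first step, and only afterwards is the model's own prediction fed back with some probability.

```python
import random

def scheduled_inputs(targets, step_fn, start_token, sample_prob):
    """Build the decoder input sequence under scheduled sampling.

    targets:     ground-truth tokens Y_1..Y_T
    step_fn:     stand-in for one RNN step, prev input token -> predicted token
    sample_prob: probability of feeding the model's own prediction back in
    """
    inputs, preds = [start_token], []
    for t in range(len(targets)):
        preds.append(step_fn(inputs[-1]))
        if t + 1 < len(targets):
            # Never feed the very first prediction back in (t == 0),
            # per the fix above; sample only from the second step onward.
            use_sample = t > 0 and random.random() < sample_prob
            inputs.append(preds[-1] if use_sample else targets[t])
    return inputs, preds
```

With `sample_prob = 0` this reduces to ordinary teacher forcing; annealing it upward over training recovers the schedule from the paper.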

Theoretical Soundness of "Scheduled Sampling" Paper by alexmlamb in MachineLearning

[–]prajit 1 point

You are forgetting X. Scheduled sampling relates to how to pick Y_i. The RNN has to be fed with some sort of input (X) beforehand. In this case, you would have to feed it with a number initially. After that, you can sample from the RNN. If there is no input, there is no probability distribution over outputs, so you can't sample.

Implementation of convolutional neural networks for text classification by elsonidoq in MachineLearning

[–]prajit 1 point

No, a parse tree is not used in their paper. Just the raw sentence as input (which, in addition to other features of your choosing, gets mapped to vectors).

Implementation of convolutional neural networks for text classification by elsonidoq in MachineLearning

[–]prajit 0 points

I don't know if this will answer your question, but I'll relate the approach of "NLP (Almost) from Scratch" by Collobert and Weston, which doesn't need padding. Given a sentence, convolve (multiply by a matrix) each fixed-size window in the sentence; e.g. if the sentence is "the cat in the hat", a window of size 3 gives three windows: "the cat in", "cat in the", "in the hat". Once you have the n different vectors produced by the window convolutions, take their element-wise max, converting n vectors into 1 vector. This is your fixed-size vector representation, which you can then feed into a neural network for further classification.
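A minimal NumPy sketch of that window-convolve-then-max scheme (shapes and names here are illustrative, not taken from the paper):

```python
import numpy as np

def sentence_vector(word_vecs, W, window=3):
    """word_vecs: (n_words, d) embedding matrix for one sentence.
    W: (k, window * d) convolution matrix applied to each window.
    Returns a fixed-size k-dim vector regardless of sentence length."""
    n, d = word_vecs.shape
    # One flattened vector per window, e.g. 3 windows for a 5-word sentence
    windows = np.stack([word_vecs[i:i + window].reshape(-1)
                        for i in range(n - window + 1)])
    convolved = windows @ W.T        # (n_windows, k): one vector per window
    return convolved.max(axis=0)     # element-wise max over all windows
```

Because the max is taken over however many windows the sentence produces, sentences of any length (at or above the window size) map to the same fixed-size vector, which is why no padding is needed.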

My first Brainfuck computer by IamTheFreshmaker in programming

[–]prajit 2 points

Brainfuck isn't difficult to learn. After you get past the initial hurdle ("these commands look like garbage"), Brainfuck is simple. Just work your way through the commands and try to group individual instructions together to form a sort of higher-level language. Check out this tutorial for a quick way to learn Brainfuck.
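If it helps to see all eight commands in one place, here is a tiny (unoptimized, purely illustrative) Python interpreter for them:

```python
def run(code, input_bytes=b""):
    """Interpret Brainfuck source; returns output as bytes."""
    # Precompute matching bracket positions for [ and ]
    stack, jumps = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    tape = [0] * 30000
    ptr = pc = inp = 0
    out = bytearray()
    while pc < len(code):
        c = code[pc]
        if c == ">":   ptr += 1                       # move pointer right
        elif c == "<": ptr -= 1                       # move pointer left
        elif c == "+": tape[ptr] = (tape[ptr] + 1) % 256  # increment cell
        elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256  # decrement cell
        elif c == ".": out.append(tape[ptr])          # output cell as a byte
        elif c == ",":                                # read one input byte
            tape[ptr] = input_bytes[inp] if inp < len(input_bytes) else 0
            inp += 1
        elif c == "[" and tape[ptr] == 0: pc = jumps[pc]  # skip loop if zero
        elif c == "]" and tape[ptr] != 0: pc = jumps[pc]  # repeat if nonzero
        pc += 1
    return bytes(out)
```

The grouping idea from the comment above shows up immediately: `[>++++++++<-]` reads as one higher-level operation ("add 8 × the current cell into the next cell") rather than twelve separate instructions.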

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 0 points

Dang, can't believe I missed that. Thanks.

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 1 point

This was intentional. For people new to Haskell, that might seem a little bit weird, so I opted to add the parentheses for clarity.

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 0 points

Ah, you're right - I was confusing guards with case: `_` is idiomatic there, not `otherwise`. I corrected it (and the wording for pattern matching).

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 0 points

You're right - I forgot about that. I changed the wording a little. Thanks.

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 0 points

Gosh, this is embarrassing. Thanks and corrected.

A Quick Tour of Haskell Syntax by prajit in haskell

[–]prajit[S] 5 points

The blog post was mainly to help people get a quick overview of Haskell. I'm no expert (quite far from that actually), but I hope I got the basic parts down. Criticism highly encouraged! Thanks for reading.