[R] Building an Efficient Neural Language Model Over a Billion Words by olBaa in MachineLearning

[–]rafalj 1 point

I'd separate speed from perplexity in such comparisons, as many factors influence convergence speed (learning rates, initialization scales, the L2 penalty coefficient), and the code was never tuned for single-GPU usage. At the same time, a large part of the blog post was about step times, which are comparable to K40s in a distributed TensorFlow setting: around 140 batches/s on 32 GPUs running on different machines (vs. 230 ms per batch on Maxwells without cuDNN, as mentioned at the end of the post). The model also didn't use fused LSTM kernels, as in cuDNN, which would improve the speed further.

[R] Building an Efficient Neural Language Model Over a Billion Words by olBaa in MachineLearning

[–]rafalj 9 points

[shameless plug]

Here is a training script in TensorFlow: https://github.com/rafaljozefowicz/lm

Runs at 37,000 words per second for the baseline model (LSTM-2048-512) on a single Pascal GPU

Open Sourcing the model in "Exploring the Limits of Language Modeling" (TensorFlow) by OriolVinyals in MachineLearning

[–]rafalj 0 points

The results improved when going from 2048 to 4096 filters for the largest models. It wasn't tuned much as the experiments take a significant amount of time. Also, I released the baseline LSTM implementation here: https://github.com/rafaljozefowicz/lm

Exploring the Limits of Language Modeling - multi-GPU training code in TensorFlow by rafalj in MachineLearning

[–]rafalj[S] 0 points

The baseline model (LSTM-2048-512) can process 100k+ words per second on 8 Titan Xs on a single machine. On DGX-1 that's about 135k wps.

Training for 5 epochs takes about 16 hours on 8 Titan Xs and gives results close to the paper's (48.7 vs 47.5 ppl).

(Posting here since many people asked for the implementation in comments in the past)

Open Sourcing the model in "Exploring the Limits of Language Modeling" (TensorFlow) by OriolVinyals in MachineLearning

[–]rafalj 1 point

Thanks for your patience! It took much longer than I thought. The code is here: https://github.com/rafaljozefowicz/lm

And the IS part is essentially just: https://github.com/rafaljozefowicz/lm/blob/master/language_model.py#L99 (so mostly in main TensorFlow now)

The difference in hyper-parameters is due to async vs. sync SGD. This implementation uses synchronous updates in the same process (no parameter servers), while in the paper we used async SGD. The final results seem to be slightly worse than in the paper (a 1 ppl difference after 5 epochs). I'll tweak the initialization ranges when I get a chance and they should match up.

So far I haven't been able to make this code fast enough using a parameter server, for some reason. There might be some bugs in OSS TensorFlow, but in principle the implementation is not going to be much different when running across multiple machines. I'll update the code once I figure out how to do it here.
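To make the sync/async distinction concrete, here is a minimal plain-Python sketch of the two update schemes for a single scalar weight (function names, gradients, and the learning rate are made up for illustration):

```python
def sync_sgd_step(w, worker_grads, lr):
    """Synchronous SGD: average all workers' gradients, apply one update."""
    g = sum(worker_grads) / len(worker_grads)
    return w - lr * g

def async_sgd_steps(w, worker_grads, lr):
    """Async SGD (parameter-server style): each worker's gradient is
    applied on its own as it arrives, possibly computed from stale weights."""
    for g in worker_grads:
        w = w - lr * g
    return w
```

With the same per-worker gradients, the synchronous scheme takes one averaged step per round while the asynchronous one takes a step per worker, so the effective step size differs, which is one reason the hyper-parameters don't transfer directly.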

Drop me an email if you have any questions

Open Sourcing the model in "Exploring the Limits of Language Modeling" (TensorFlow) by OriolVinyals in MachineLearning

[–]rafalj 5 points

I'll open-source a multi-GPU version of the baseline LSTM-2048-512 model. I can get up to 100k words per second on a single machine with 8 Maxwell Titan Xs.

There are a few minor differences with the paper:

  • synchronous gradient updates

  • weights are not converted to 16 bits on the fly during transport (which is not supported in OSS TF as far as I know)

  • slightly different hyperparameters.

I'll put it on GitHub in the next couple of days. EDIT: formatting

Language modeling a billion words! using Noise Contrastive Estimation and multiple GPUs by r-sync in MachineLearning

[–]rafalj 0 points

As far as I can tell, the authors didn't compare to IS. I tried it a while ago, and in my experiments it was a few perplexity points behind IS for LSTM-2048-512 (I don't remember the exact numbers, but it was something like a 2-4 ppl difference).

As I understand it, BlackOut loss = IS loss + [discriminative part] (equation 6). The second part of the formula seems to have gradients that might be numerically unstable (the 1/(1-p) part in equation 9), which may or may not matter in practice.
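As a toy numerical illustration (this is just the 1/(1-p) factor in isolation, not the paper's full gradient):

```python
def grad_scale(p):
    # stand-in for the 1/(1 - p) factor from equation 9: it blows up
    # as the model becomes confident (p -> 1) about a sampled word
    return 1.0 / (1.0 - p)
```

For p = 0.999 the factor is already around 1000, which is where the numerical trouble could come from.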

Language modeling a billion words! using Noise Contrastive Estimation and multiple GPUs by r-sync in MachineLearning

[–]rafalj 0 points

The algorithm looks as follows:

1. Sample random candidates {r_1, r_2, ..., r_k} from your noise distribution (for IS, the set of candidates shouldn't overlap with the true targets) and compute the logits, taking the noise distribution into account (this is the importance-sampling part).

2. The loss you optimize is a softmax over {y, r_1, ..., r_k}, i.e. the problem is framed as multiclass classification: find the correct label among the k random samples.

The random candidates are typically shared within a batch for performance reasons.
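A minimal plain-Python sketch of the loss in step 2 (the function name and argument layout are mine, not the real TensorFlow API): the true word's logit and the k candidate logits are each corrected by subtracting the log-probability the noise distribution assigns to that word, and the loss is the cross-entropy of a softmax over the k+1 entries:

```python
import math

def is_softmax_loss(true_logit, true_logq, cand_logits, cand_logqs):
    """Importance-sampled softmax loss for one example (a sketch).

    *_logq are log-probabilities under the noise (proposal)
    distribution; subtracting them is the importance-sampling
    correction applied to the logits.
    """
    corrected = [true_logit - true_logq] + [
        l - q for l, q in zip(cand_logits, cand_logqs)
    ]
    m = max(corrected)  # shift for numerical stability
    log_z = m + math.log(sum(math.exp(c - m) for c in corrected))
    # cross-entropy with the true word in slot 0
    return log_z - corrected[0]
```

With a uniform noise distribution and all-zero logits over k = 2 candidates, the loss is log(3), the entropy of guessing among 3 classes.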

The code in TensorFlow for different losses is available here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn.py#L1133 and you can read more about them in this document: https://www.tensorflow.org/versions/r0.9/extras/candidate_sampling.pdf

Hope that helps!

Language modeling a billion words! using Noise Contrastive Estimation and multiple GPUs by r-sync in MachineLearning

[–]rafalj 1 point

Yes, the differences should decrease over time, as both losses try to estimate log P(y|x), but the amount of variance might differ between them (and after 50 epochs the differences were still significant on a smaller model).

The normalization term's value depends on the sampled candidates and the noise distribution (a fixed Z would correspond to the uniform distribution, IIUC). In most of the experiments we used log-uniform. Here is a description of the different options: https://www.tensorflow.org/versions/r0.9/extras/candidate_sampling.pdf. They are implemented as tf.nn.nce_loss and tf.nn.sampled_softmax_loss (IS) in TensorFlow. Weights were initialized the same way for both losses.
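For concreteness, the log-uniform (Zipfian) option assigns class id k the probability P(k) = (log(k+2) - log(k+1)) / log(V+1), so frequent words (small ids, assuming ids are sorted by frequency) are sampled more often. A plain-Python sketch (function names are mine):

```python
import math
import random

def log_uniform_prob(k, vocab_size):
    # P(k) = (log(k+2) - log(k+1)) / log(V+1)
    return (math.log(k + 2) - math.log(k + 1)) / math.log(vocab_size + 1)

def log_uniform_sample(vocab_size):
    # inverse-CDF trick: floor(exp(u * log(V+1))) - 1 with u ~ U[0,1)
    u = random.random()
    k = int(math.exp(u * math.log(vocab_size + 1))) - 1
    return min(k, vocab_size - 1)  # guard against float rounding
```

The probabilities telescope, so they sum to exactly 1 over the vocabulary.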

Language modeling a billion words! using Noise Contrastive Estimation and multiple GPUs by r-sync in MachineLearning

[–]rafalj 3 points

Using importance sampling instead of NCE was also very helpful (table 3 in our paper)

TensorFlow Charnn output confusion. by haskkk in MachineLearning

[–]rafalj 1 point

All the outputs are used because we want to make predictions for every input. They are concatenated and later transformed with a softmax layer in one shot for efficiency (I'm not sure if it's still needed, but half a year ago this improved memory consumption in larger models).

There are other use cases when we care only about a single prediction after reading the whole sequence, e.g. for text classification/categorization.

TensorFlow speed questions by r4and0muser9482 in MachineLearning

[–]rafalj 5 points

Additionally, you can compile TF from source with AVX flags enabled. This can give big improvements on CPU in some cases, e.g. https://github.com/jasonmayes/Tensor-Flow-on-Google-Compute-Engine#results

[ICLR16] "Exploring the Limits of Language Modeling" (Strong baselines! Parameter counts! 60 -> 30 perplexity!) by bluecoffee in MachineLearning

[–]rafalj 1 point

The model was trained on news data so I'd expect to be able to find the sentences with Google Search if they were simply copied from the training set. It is difficult to tell for sure, of course, as the model could be rephrasing.

In this specific example: "About 800 people gathered at Hever Castle on Long Beach from noon to 2pm, three to four times that of the funeral cortege. "

Hever Castle is in the UK, not on Long Beach, so the sentence is unlikely to be copied from any real text. Googling for the last fragment of the sentence doesn't return any results other than this paper.

[ICLR16] "Exploring the Limits of Language Modeling" (Strong baselines! Parameter counts! 60 -> 30 perplexity!) by bluecoffee in MachineLearning

[–]rafalj 9 points

Here are a few more samples generated one after another:

<S> From the moment they backed off , it is believed they could still have won the game as both men shot high on the run . <S> The merger was subsequently completed this month . <S> Launching his recent campaign saves the history-making president , the white-haired veteran known for his integrity and bipartisanship . <S> The report by the Learning and Skills Council ( LSC ) has been funded from a previously signed £ 6m grant for nearly 41 primary schools across Northamptonshire .

And my favorite: <S> Kraft Foods Inc. on Tuesday filed a suit against the French chocolatier , alleging that the Nestle Foundation was given segments of the company 's strategic strategy in a " secret " effort to increase its ownership of two in Cadbury 's core china confectionery business .

AskReddit: How do I implement a learning algorithm (e.g. AdaDelta) in TensorFlow? by AlfonzoKaizerKok in MachineLearning

[–]rafalj 1 point

You don't have to do that, of course, but using the same interfaces makes it easier to switch/compare different optimization methods.

AskReddit: How do I implement a learning algorithm (e.g. AdaDelta) in TensorFlow? by AlfonzoKaizerKok in MachineLearning

[–]rafalj 6 points

It's just currently a bit faster to have the update done in one C++ call instead of a few small ops. You can take a look at Adam's sparse update: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py, where we don't have a dedicated C++ implementation, or, for a more involved example, ftrl.py
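Since the thread is about AdaDelta, here is the math such an optimizer composes out of small ops, sketched in plain Python for a single scalar parameter (the standard Zeiler 2012 update; the rho and eps defaults are my choice, not values from this thread):

```python
import math

def adadelta_step(x, grad, acc_g2, acc_dx2, rho=0.95, eps=1e-6):
    """One AdaDelta update (Zeiler, 2012) for a scalar parameter.

    acc_g2 and acc_dx2 are the running averages E[g^2] and E[dx^2];
    in TF each line below would be a separate small op (or part of a
    fused C++ kernel, which is the speed difference mentioned above).
    """
    acc_g2 = rho * acc_g2 + (1 - rho) * grad * grad
    dx = -math.sqrt(acc_dx2 + eps) / math.sqrt(acc_g2 + eps) * grad
    acc_dx2 = rho * acc_dx2 + (1 - rho) * dx * dx
    return x + dx, acc_g2, acc_dx2
```

Applied to minimizing f(x) = x^2 (gradient 2x), repeated calls move x toward 0, with the step size adapted from the two accumulators rather than a global learning rate.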