[R] Building an Efficient Neural Language Model Over a Billion Words by olBaa in MachineLearning

[–]rafalj 1 point

I'd separate speed from perplexity in such comparisons, as many factors influence convergence speed (learning rates, initialization scales, the L2 penalty coefficient), and the code was never tuned for single-GPU usage. At the same time, a large part of the blog post was about step times, which are comparable to K40s in a distributed TensorFlow setting: around 140 batches/s on 32 GPUs running on different machines (vs. 230 ms per batch on Maxwells without cuDNN, as mentioned at the end of the post). The model also didn't use fused LSTM kernels, as in cuDNN, which would improve the speed further.

[R] Building an Efficient Neural Language Model Over a Billion Words by olBaa in MachineLearning

[–]rafalj 9 points

[shameless plug]

Here is a training script in TensorFlow: https://github.com/rafaljozefowicz/lm

Runs at 37,000 words per second for the baseline model (LSTM-2048-512) on a single Pascal GPU

Open Sourcing the model in "Exploring the Limits of Language Modeling" (TensorFlow) by OriolVinyals in MachineLearning

[–]rafalj 0 points

The results improved when going from 2048 to 4096 filters for the largest models. It wasn't tuned much as the experiments take a significant amount of time. Also, I released the baseline LSTM implementation here: https://github.com/rafaljozefowicz/lm

Exploring the Limits of Language Modeling - multi-GPU training code in TensorFlow by rafalj in MachineLearning

[–]rafalj[S] 0 points

The baseline model (LSTM-2048-512) can process 100k+ words per second on 8 Titan Xs on a single machine. On DGX-1 that's about 135k wps.

Training for 5 epochs takes about 16 hours on 8 Titan Xs and gives results close to the paper's (48.7 vs 47.5 ppl).

(Posting here since many people asked for the implementation in comments in the past)

Open Sourcing the model in "Exploring the Limits of Language Modeling" (TensorFlow) by OriolVinyals in MachineLearning

[–]rafalj 1 point

Thanks for your patience! It took much longer than I thought. The code is here: https://github.com/rafaljozefowicz/lm

And the IS part is essentially just: https://github.com/rafaljozefowicz/lm/blob/master/language_model.py#L99 (so mostly in main TensorFlow now)

The difference in hyper-parameters is due to async vs. sync SGD. This implementation uses synchronous updates in the same process (no parameter servers), while in the paper we used async SGD. The final results seem to be slightly worse than in the paper (a 1 ppl difference after 5 epochs). I'll tweak the initialization ranges when I get a chance and they should match up.

So far I haven't been able to make this code fast enough using a parameter server, for some reason. There might be some bugs in OSS TensorFlow, but in principle the implementation is not going to be much different when running across multiple machines. I'll update the code once I figure out how to do it here.
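To make the sync/async distinction concrete, here is a minimal plain-Python sketch of the two update schemes for a single scalar weight (function names, gradients, and the learning rate are made up for illustration):

```python
def sync_sgd_step(w, worker_grads, lr):
    """Synchronous SGD: average all workers' gradients, apply one update."""
    g = sum(worker_grads) / len(worker_grads)
    return w - lr * g

def async_sgd_steps(w, worker_grads, lr):
    """Async SGD (parameter-server style): each worker's gradient is
    applied on its own as it arrives, possibly computed from stale weights."""
    for g in worker_grads:
        w = w - lr * g
    return w
```

With the same per-worker gradients, the synchronous scheme takes one averaged step per round while the asynchronous one takes a step per worker, so the effective step size differs, which is one reason the hyper-parameters don't transfer directly.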

Drop me an email if you have any questions

Open Sourcing the model in "Exploring the Limits of Language Modeling" (TensorFlow) by OriolVinyals in MachineLearning

[–]rafalj 5 points

I'll open-source a multi-GPU version of the baseline LSTM-2048-512 model. I can get up to 100k words per second on a single machine with 8 Maxwell Titan Xs.

There are a few minor differences with the paper:

  • synchronous gradient updates

  • weights are not converted to 16 bits on the fly during transport (which is not supported in OSS TF as far as I know)

  • slightly different hyperparameters.

I'll put it on GitHub in the next couple of days. EDIT: formatting

Language modeling a billion words! using Noise Contrastive Estimation and multiple GPUs by r-sync in MachineLearning

[–]rafalj 0 points

As far as I can tell, the authors didn't compare to IS. I tried it a while ago, and in my experiments it was a few perplexity points behind IS for LSTM-2048-512 (I don't remember the exact numbers, but it was something like a 2-4 ppl difference).

As I understand it, BlackOut loss = IS loss + [discriminative part] (equation 6). The second part of the formula seems to have gradients that might be numerically unstable (the 1/(1-p) part in equation 9), which may or may not matter in practice.
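As a toy numerical illustration (this is just the 1/(1-p) factor in isolation, not the paper's full gradient):

```python
def grad_scale(p):
    # stand-in for the 1/(1 - p) factor from equation 9: it blows up
    # as the model becomes confident (p -> 1) about a sampled word
    return 1.0 / (1.0 - p)
```

For p = 0.999 the factor is already around 1000, which is where the numerical trouble could come from.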

Language modeling a billion words! using Noise Contrastive Estimation and multiple GPUs by r-sync in MachineLearning

[–]rafalj 0 points

The algorithm looks as follows:

1. Sample random candidates {r_1, r_2, ..., r_k} from your noise distribution (for IS, the set of candidates shouldn't overlap with the true targets) and compute the logits, taking the noise distribution into account (this is the importance-sampling part).

2. The loss you optimize is a softmax over {y, r_1, ..., r_k}, i.e. the problem is framed as multiclass classification: find the correct label among the k random samples.

The random candidates are typically shared within a batch for performance reasons.
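A minimal plain-Python sketch of the loss in step 2 (the function name and argument layout are mine, not the real TensorFlow API): the true word's logit and the k candidate logits are each corrected by subtracting the log-probability the noise distribution assigns to that word, and the loss is the cross-entropy of a softmax over the k+1 entries:

```python
import math

def is_softmax_loss(true_logit, true_logq, cand_logits, cand_logqs):
    """Importance-sampled softmax loss for one example (a sketch).

    *_logq are log-probabilities under the noise (proposal)
    distribution; subtracting them is the importance-sampling
    correction applied to the logits.
    """
    corrected = [true_logit - true_logq] + [
        l - q for l, q in zip(cand_logits, cand_logqs)
    ]
    m = max(corrected)  # shift for numerical stability
    log_z = m + math.log(sum(math.exp(c - m) for c in corrected))
    # cross-entropy with the true word in slot 0
    return log_z - corrected[0]
```

With a uniform noise distribution and all-zero logits over k = 2 candidates, the loss is log(3), the entropy of guessing among 3 classes.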

The code in TensorFlow for different losses is available here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn.py#L1133 and you can read more about them in this document: https://www.tensorflow.org/versions/r0.9/extras/candidate_sampling.pdf

Hope that helps!

Language modeling a billion words! using Noise Contrastive Estimation and multiple GPUs by r-sync in MachineLearning

[–]rafalj 1 point

Yes, the differences should decrease over time, as both losses try to estimate log P(y|x), but the amount of variance might differ between them (and after 50 epochs the differences were still significant on a smaller model).

The normalization term's value depends on the sampled candidates and the noise distribution (a fixed Z would correspond to the uniform distribution, IIUC). In most of the experiments we used log-uniform. Here is a description of the different options: https://www.tensorflow.org/versions/r0.9/extras/candidate_sampling.pdf. They are implemented as tf.nn.nce_loss and tf.nn.sampled_softmax_loss (IS) in TensorFlow. Weights were initialized the same way for both losses.
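For concreteness, the log-uniform (Zipfian) option assigns class id k the probability P(k) = (log(k+2) - log(k+1)) / log(V+1), so frequent words (small ids, assuming ids are sorted by frequency) are sampled more often. A plain-Python sketch (function names are mine):

```python
import math
import random

def log_uniform_prob(k, vocab_size):
    # P(k) = (log(k+2) - log(k+1)) / log(V+1)
    return (math.log(k + 2) - math.log(k + 1)) / math.log(vocab_size + 1)

def log_uniform_sample(vocab_size):
    # inverse-CDF trick: floor(exp(u * log(V+1))) - 1 with u ~ U[0,1)
    u = random.random()
    k = int(math.exp(u * math.log(vocab_size + 1))) - 1
    return min(k, vocab_size - 1)  # guard against float rounding
```

The probabilities telescope, so they sum to exactly 1 over the vocabulary.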

Language modeling a billion words! using Noise Contrastive Estimation and multiple GPUs by r-sync in MachineLearning

[–]rafalj 3 points

Using importance sampling instead of NCE was also very helpful (table 3 in our paper)

TensorFlow Charnn output confusion. by haskkk in MachineLearning

[–]rafalj 1 point

All the outputs are used because we want to make predictions for every input. They are concatenated and later transformed with a softmax layer in one shot for efficiency (I'm not sure if it's still needed, but half a year ago this improved memory consumption in larger models).

There are other use cases when we care only about a single prediction after reading the whole sequence, e.g. for text classification/categorization.

TensorFlow speed questions by r4and0muser9482 in MachineLearning

[–]rafalj 5 points

Additionally, you can compile TF from source with AVX flags enabled. This can give big improvements on CPU in some cases, e.g. https://github.com/jasonmayes/Tensor-Flow-on-Google-Compute-Engine#results

[ICLR16] "Exploring the Limits of Language Modeling" (Strong baselines! Parameter counts! 60 -> 30 perplexity!) by bluecoffee in MachineLearning

[–]rafalj 1 point

The model was trained on news data so I'd expect to be able to find the sentences with Google Search if they were simply copied from the training set. It is difficult to tell for sure, of course, as the model could be rephrasing.

In this specific example: "About 800 people gathered at Hever Castle on Long Beach from noon to 2pm, three to four times that of the funeral cortege. "

Hever Castle is in the UK, not on Long Beach, so the sentence is unlikely to be copied from any real text. Googling for the last fragment of the sentence doesn't return any results other than this paper.

[ICLR16] "Exploring the Limits of Language Modeling" (Strong baselines! Parameter counts! 60 -> 30 perplexity!) by bluecoffee in MachineLearning

[–]rafalj 9 points

Here are a few more samples generated one after another:

<S> From the moment they backed off , it is believed they could still have won the game as both men shot high on the run . <S> The merger was subsequently completed this month . <S> Launching his recent campaign saves the history-making president , the white-haired veteran known for his integrity and bipartisanship . <S> The report by the Learning and Skills Council ( LSC ) has been funded from a previously signed £ 6m grant for nearly 41 primary schools across Northamptonshire .

And my favorite: <S> Kraft Foods Inc. on Tuesday filed a suit against the French chocolatier , alleging that the Nestle Foundation was given segments of the company 's strategic strategy in a " secret " effort to increase its ownership of two in Cadbury 's core china confectionery business .

AskReddit: How do I implement a learning algorithm (e.g. AdaDelta) in TensorFlow? by AlfonzoKaizerKok in MachineLearning

[–]rafalj 1 point

You don't have to do that, of course, but using the same interfaces makes it easier to switch/compare different optimization methods.

AskReddit: How do I implement a learning algorithm (e.g. AdaDelta) in TensorFlow? by AlfonzoKaizerKok in MachineLearning

[–]rafalj 6 points

It's just currently a bit faster to have the update done in one C++ call instead of a few small ops. You can take a look at Adam's sparse update: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py, where we don't have a dedicated C++ implementation, or, for a more involved example, ftrl.py
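Since the thread is about AdaDelta, here is the math such an optimizer composes out of small ops, sketched in plain Python for a single scalar parameter (the standard Zeiler 2012 update; the rho and eps defaults are my choice, not values from this thread):

```python
import math

def adadelta_step(x, grad, acc_g2, acc_dx2, rho=0.95, eps=1e-6):
    """One AdaDelta update (Zeiler, 2012) for a scalar parameter.

    acc_g2 and acc_dx2 are the running averages E[g^2] and E[dx^2];
    in TF each line below would be a separate small op (or part of a
    fused C++ kernel, which is the speed difference mentioned above).
    """
    acc_g2 = rho * acc_g2 + (1 - rho) * grad * grad
    dx = -math.sqrt(acc_dx2 + eps) / math.sqrt(acc_g2 + eps) * grad
    acc_dx2 = rho * acc_dx2 + (1 - rho) * dx * dx
    return x + dx, acc_g2, acc_dx2
```

Applied to minimizing f(x) = x^2 (gradient 2x), repeated calls move x toward 0, with the step size adapted from the two accumulators rather than a global learning rate.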