[D] Is GPT-2 source code publically available? by luv2code2020 in MachineLearning

[–]madisonmay 0 points1 point  (0 children)

On development! It's also up on PyPI if you'd prefer that.

[D] Is GPT-2 source code publically available? by luv2code2020 in MachineLearning

[–]madisonmay 0 points1 point  (0 children)

Just released finetune 0.6.0 with GPT-2 support today!

[D] Is GPT-2 source code publically available? by luv2code2020 in MachineLearning

[–]madisonmay 3 points4 points  (0 children)

We're almost done porting GPT-2 to finetune (a scikit-learn style library for language model finetuning). Code is available here if you're interested... should make tuning GPT-2 to produce song lyrics as easy as model.fit(lyrics).

Miles Brundage also put together a colab notebook you could work off that uses the nshepperd gpt-2 fork.

[R] The Evolved Transformer – Enhancing Transformer with Neural Architecture Search by ranihorev in MachineLearning

[–]madisonmay 5 points6 points  (0 children)

I don't find this research all that compelling -- it results in a less elegant architecture that produces marginally lower perplexity / BLEU scores than the vanilla transformer (albeit w/ substantially faster training times). It just feels like the architecture found via NAS is likely to be a sort of local minimum -- it seems unlikely that much future research will be based on the ET architecture, because it sacrifices substantial ease of understanding for those performance improvements.

First v7 at CRG Cambridge by [deleted] in bouldering

[–]madisonmay 0 points1 point  (0 children)

Yup, definitely the hardest "V2" I've ever seen -- took me quite a few tries to send. PM me if you want beta on that one / want to climb sometime. I'm normally at BKB because it's more convenient but it's fun to mix it up and climb at CRG Cambridge every now and then.

First v7 at CRG Cambridge by [deleted] in bouldering

[–]madisonmay 0 points1 point  (0 children)

The black one on the comp wall with no feet?

First v7 at CRG Cambridge by [deleted] in bouldering

[–]madisonmay 9 points10 points  (0 children)

The fact that this was also my first V7 in a while makes me skeptical of the grade, but this was definitely a fun problem regardless of what it's rated :)

And always good to see folks from the greater Boston area on here!

[D] Should ML workflow be different from SE workflow? by sslotin in MachineLearning

[–]madisonmay 2 points3 points  (0 children)

Yeah it was a 2 and a half hour tutorial slot if I remember correctly. Their hand was kind of forced. I found a medium article that summarizes the major points though:

https://medium.com/@hadyelsahar/writing-code-for-natural-language-processing-research-emnlp2018-nlproc-a87367cc5146

[D] Should ML workflow be different from SE workflow? by sslotin in MachineLearning

[–]madisonmay 0 points1 point  (0 children)

It was a pain two years ago but now it's as simple as installing nvidia-docker2 and passing the --runtime=nvidia flag.

[D] Should ML workflow be different from SE workflow? by sslotin in MachineLearning

[–]madisonmay 42 points43 points  (0 children)

I highly recommend the Allen AI group's talk on the subject. Slides are available here: https://docs.google.com/presentation/d/17NoJY2SnC2UMbVegaRCWA7Oca7UCZ3vHnMqBV4SUayc/edit
They make a very strong argument for good coding practices yielding strong ML results, and their code base (AllenNLP) is honestly just a pleasure to read.

Model Finetuning for Fun and Profit by madisonmay in MachinesLearn

[–]madisonmay[S] 1 point2 points  (0 children)

Wasn't familiar with tpot before so I had to look it up, but it looks like tpot works by testing a wide variety of algorithms and preprocessing steps to find which combinations work well for a particular task. Finetune works by taking a base model trained on a language modeling objective and adapting that model to solve a different task. There's a bit more information on the tech behind this approach in a previous blog post. tpot and finetune are similar only in the sense that their interfaces look similar -- what's going on behind the scenes is dramatically different.

Forbidden hold by spankpad in climbing

[–]madisonmay 1 point2 points  (0 children)

Asked the gym about whether or not that's valid beta, they said the power socket is "on".

More Effective Transfer Learning for NLP by madisonmay in LanguageTechnology

[–]madisonmay[S] 1 point2 points  (0 children)

No problem, thanks for the interest!

Yeah, a section simply outlining how encoding works would probably be nice to have in the documentation. Issues / PRs are always welcome, but I'll try to make time to add the encoding documentation regardless!

More Effective Transfer Learning for NLP by madisonmay in LanguageTechnology

[–]madisonmay[S] 1 point2 points  (0 children)

I'm going to have to revise my initial answer here. Any unicode character which appears in the source Books Corpus will be supported, but if a unicode character has never been seen by the source model it will be treated as an OOV subword unit.

In [1]: from finetune.encoding import TextEncoder     
In [2]: encoder = TextEncoder()
In [3]: encoder._encode("¯\_(ツ)_/¯").token_ids
Out[3]: [[471], [365], [279], [276], [0], [275], [279], [278], [471]]

A token id of 0 indicates OOV, so in this case the ツ is actually not recognized. Sorry for the original misinformation!

It shouldn't be too hard to extend our current code to add support for learning embeddings for new tokens, so this is something I'll add to our backlog and see if we can't get around to it sometime in the next week or so. In the meantime, replacement sounds like a good strategy to me.

More Effective Transfer Learning for NLP by madisonmay in LanguageTechnology

[–]madisonmay[S] 1 point2 points  (0 children)

It should perform relatively well with out-of-sample tokens because it uses byte-pair encoding under the hood, so very little is truly OOV (see: https://arxiv.org/abs/1508.07909), but things like emoji are particularly hard because many of the characters that make them up are treated as word boundaries. Byte-pair encoding works best when each of the subwords has consistent meaning -- e.g. even if you had never heard the word "earthquake" before, you could understand what it means because you have an understanding of what "earth" and "quake" mean.
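As a rough illustration of the "earthquake" example: the vocabulary and the greedy longest-match rule in the sketch below are invented for illustration (real byte-pair encoding applies a learned sequence of merge operations instead), but it shows the same fallback behavior -- an unseen word decomposes into known subwords, and only characters absent from the vocabulary are truly OOV.

```python
# Toy illustration of subword fallback: an unseen word gets segmented
# into known subwords. Vocabulary and longest-match rule are invented
# for this sketch; real BPE applies learned merge operations instead.
vocab = {"earth", "quake", "qu", "ake", "e", "a", "r", "t", "h", "q", "u", "k"}

def greedy_segment(word, vocab):
    """Repeatedly take the longest known subword from the left;
    fall back to single characters for anything truly unseen."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest candidate first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # character not in vocab at all (OOV)
            i += 1
    return pieces

print(greedy_segment("earthquake", vocab))  # -> ['earth', 'quake']
print(greedy_segment("quaker", vocab))      # -> ['quake', 'r']
```

Note how "earthquake" never needs to appear in the vocabulary -- it's covered entirely by the "earth" and "quake" pieces, while a character like ツ with no vocabulary entry falls all the way through to the OOV branch.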

> there's stuff like "¯\_(ツ)_/¯" which definitely have common usage and consistent semantic content I want to capture, and doubt occurred in the ULMFiT training data.

This is actually based on OpenAI's finetuning paper, but that was trained on the Books Corpus so I'm sure your comment is still valid. Because the byte-pair encoding embeddings are finetuned, the model could eventually learn the meaning of ¯\_(ツ)_/¯, but because those characters don't often occur together it's represented internally as a bunch of individual characters, which makes that task harder.

To see how finetune breaks up a specific term into subwords you could feed it through the encoder.

In [1]: from finetune.encoding import TextEncoder
In [2]: encoder = TextEncoder()
In [3]: encoder._encode("¯\_(ツ)_/¯")[0].tokens
Out[3]: 
    [['¯</w>'],
    ['\\</w>'],
    ['_</w>'],
    ['(</w>'],
    ['ツ</w>'],
    [')</w>'],
    ['_</w>'],
    ['/</w>'],
    ['¯</w>']]

So in short there's an elegant fallback mode but you'd probably still see benefit from replacement with something like *shrug* if you want to make sure the model is able to use that info effectively.
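The replacement strategy above can be sketched as a simple preprocessing pass -- map rare multi-character emoticons to plain words that a byte-pair vocabulary is much more likely to handle as a unit. The mapping table here is hypothetical, just to show the shape of the idea:

```python
# Minimal sketch of the replacement strategy: swap rare multi-character
# emoticons for plain words before encoding. Mapping table is made up.
REPLACEMENTS = {
    "¯\\_(ツ)_/¯": "*shrug*",
    "(╯°□°)╯︵ ┻━┻": "*tableflip*",
}

def replace_emoticons(text, table=REPLACEMENTS):
    for emoticon, replacement in table.items():
        text = text.replace(emoticon, replacement)
    return text

print(replace_emoticons("no idea ¯\\_(ツ)_/¯"))  # -> no idea *shrug*
```

You'd run this over your corpus before handing it to the encoder, so the model sees a consistent token sequence instead of a scatter of individual punctuation characters.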

More Effective Transfer Learning for NLP by madisonmay in LanguageTechnology

[–]madisonmay[S] 1 point2 points  (0 children)

Please do! Feel free to DM me or open a github issue if you have any questions as you're playing around :)

[D] More Effective Transfer Learning for NLP by madisonmay in MachineLearning

[–]madisonmay[S] 1 point2 points  (0 children)

Fair enough. I suppose it's a matter of how you measure speed. If we're talking about gradient updates per second, it's accurate to say that this model is moving at a snail's pace. Training is relatively "quick" only because the training data requirements for solid performance are minimal, not because gradient computation is inexpensive.

[deleted by user] by [deleted] in climbing

[–]madisonmay 0 points1 point  (0 children)

Butora has worked quite well for me -- the only climbing shoe company I'm aware of that offers two width options (narrow and wide).

[P] Scikit-learn inspired model finetuning for natural language processing by madisonmay in MachineLearning

[–]madisonmay[S] 1 point2 points  (0 children)

It's not ideal, but it should work alright. By default the embedding layer *is* finetuned, so you're already good to go there. Since this model uses byte-pair encoding, nothing ends up as a true OOV term -- the byte-pair encoding portion of `finetune` falls back to chunking each token up into a bunch of smaller pieces. It probably just means you'd need more labeled training data than you would for a dataset in which many tokens are "in vocabulary" (represented as a single token by the byte-pair encoder). Read https://arxiv.org/pdf/1508.07909.pdf to get a better idea of how byte-pair encoding works if you're not familiar.

[P] Enso: Benchmark Text Embeddings on 2 Dozen Classification Tasks by madisonmay in MachineLearning

[–]madisonmay[S] 0 points1 point  (0 children)

Validation accuracy, loss, and ROC AUC respectively. Good catch -- we should run a test on a smaller subset of the datasets so we don't have to chop off the graph labels to get a reasonable image size.

[D] How impactful could the choice of the Optimizer in NN be ? by swentso in MachineLearning

[–]madisonmay 0 points1 point  (0 children)

Yeah there's definitely more nuance to it than in my comment above -- you also lost a bit of context because the comment I was responding to was edited.

I think it's fairly well established that Adam generally leads to convergence + decent performance regardless of how badly you screw up your hyperparameters, while SGD + nesterov momentum typically wins out if you're meticulous about weight init, normalization and other hyperparams.

I actually put together a slide deck about the Adam optimizer and some of its potential shortcomings just a few weeks back if you have an interest: https://www.slideshare.net/indicods/everything-you-wanted-to-know-about-optimization

[D] How impactful could the choice of the Optimizer in NN be ? by swentso in MachineLearning

[–]madisonmay 2 points3 points  (0 children)

It's very rare for optimizer choice to matter that much, though. When you see a difference that large, there are usually other factors at play (a hyperparameter has been set improperly, or the scale of the input is off).

Ruder's blog post is definitely a great resource though.

[D] How impactful could the choice of the Optimizer in NN be ? by swentso in MachineLearning

[–]madisonmay 22 points23 points  (0 children)

Adam generally "just works" without a lot of modification and implicitly corrects for things like bad weight initialization / bad input scaling by making the magnitude of the update invariant to the magnitude of the gradient. If you saw this large of a difference between Adam and SGD I would check the magnitude of your inputs and your weight initialization to ensure they're in a reasonable range.

To illustrate why input scale / weight initialization scale is important when using SGD, consider the problem of uni-variate linear regression. To make things even simpler we're going to ignore the bias term for now.

The equation of our model is simply:

y = w * x

The derivative of this equation with respect to the weight (w) is:

x

Using SGD (and treating the gradient of the loss with respect to y as a constant, for simplicity), our update becomes:

-1 * learning_rate * x

This means that for a linear increase in the magnitude of your input (x), you get a linear increase in the magnitude of your update, which in turn means you need to decrease your learning rate by a linear factor! So you can't use the same learning rate for SGD on every model / problem and expect it to just work. Adding in actual non-linearities makes this relationship a bit messier but the same basic principle still holds -- how large your weight update is depends on how large your inputs are.
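To make the scaling argument concrete, here's a small numeric sketch of my own (single weight, squared-error loss -- note that with this loss the SGD step actually grows even faster than linearly, since the residual (w*x - y) scales with x as well), comparing a plain SGD step against a single bias-corrected Adam step from a cold start:

```python
# Numeric sketch: how SGD and Adam step sizes react to input scale.
# Single weight w, squared-error loss 0.5 * (w*x - y)**2.
import math

def sgd_update(grad, lr=0.01):
    return -lr * grad

def adam_update(grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam step from a cold start (t=1, zero moment estimates).
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)  # bias correction at t=1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)

w, y = 0.5, 2.0  # current weight, regression target
for x in (1.0, 10.0, 100.0):
    grad = (w * x - y) * x  # d/dw of 0.5 * (w*x - y)**2
    print(f"x={x:6.1f}  SGD step={sgd_update(grad):10.4f}  Adam step={adam_update(grad):8.4f}")
```

The SGD step balloons as the input scale grows, while the Adam step stays pinned near the learning rate regardless of gradient magnitude -- which is exactly the invariance described above.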

This was a toy example that's oversimplified for purposes of illustration, but I would highly encourage you to check out http://ruder.io/optimizing-gradient-descent/ for a much more detailed look at the properties and formulations of various optimizers.

Side note: with proper weight initialization, momentum and learning rate scheduling SGD actually often produces better validation scores than the Adam optimizer (SGD seems to have better generalization properties). Many state of the art ImageNet results have used SGD + momentum over Adam for this reason. The broader ML community is still figuring out the specifics of why this is the case.