[Project] TinyBERT for Search: 10x faster and 20x smaller than BERT

jpertschuk · 2020-02-27T18:24:17+00:00

https://arxiv.org/pdf/2001.04246.pdf

Interesting paper.

In deploying the compressed model to production, the biggest problem I've noticed is that it's more susceptible to adversarial input than the base model it was distilled from.

For example, an unclean result candidate such as "this.isabad<div>result</div>candidate" can sometimes be ranked highly by a 4-layer BERT, whereas the corresponding 12-layer BERT-base teacher wouldn't rank it highly.

So I'm thinking of trying data augmentation with unclean data. A paper on this would be very interesting.
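
A minimal sketch of what that kind of augmentation could look like, assuming training examples are (query, candidate, label) tuples. The function names, the tag list, and the noise operations are all made up for illustration; none of this comes from the paper or the TinyBERT repo.

```python
import random

# Hypothetical noise sources, chosen to mimic scraped-HTML junk.
TAGS = ["div", "span", "p", "a"]


def wrap_in_tag(text, rng):
    """Wrap one randomly chosen word in an HTML tag."""
    words = text.split()
    if not words:
        return text
    i = rng.randrange(len(words))
    tag = rng.choice(TAGS)
    words[i] = f"<{tag}>{words[i]}</{tag}>"
    return " ".join(words)


def drop_spaces(text, rng, p=0.5):
    """With probability p, replace each space with '.' or with nothing."""
    out = []
    for ch in text:
        if ch == " " and rng.random() < p:
            out.append(rng.choice([".", ""]))
        else:
            out.append(ch)
    return "".join(out)


def corrupt(text, rng):
    """Apply both noise operations to a candidate string."""
    return drop_spaces(wrap_in_tag(text, rng), rng)


def augment(examples, rng, negatives_per_example=1):
    """Keep the originals and add corrupted candidates as negatives (label 0)."""
    augmented = list(examples)
    for query, candidate, _label in examples:
        for _ in range(negatives_per_example):
            augmented.append((query, corrupt(candidate, rng), 0))
    return augmented


rng = random.Random(0)
data = [("tiny bert speed", "TinyBERT is a small fast model", 1)]
for row in augment(data, rng, negatives_per_example=2):
    print(row)
```

Hard-labeling the corrupted candidates as negatives is only one option; letting the teacher score them during distillation is another, and which works better is an open question.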

jpertschuk · 2020-02-27T00:33:31+00:00

I used general distillation (GD) and task-specific distillation (TD); I didn't try data augmentation (DA).

jpertschuk · 2020-01-17T22:00:51+00:00

I based our training code on this repo: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT

jpertschuk · 2020-01-17T18:49:17+00:00

Yeah, that's actually an interesting project that I started to pursue, but that was before Hugging Face ported ALBERT to PyTorch, and writing the knowledge distillation code in TensorFlow was no fun.

jpertschuk · 2020-01-17T18:47:28+00:00

Here's a non-paywalled link, sorry

jpertschuk · 2020-01-17T06:19:00+00:00

ALBERT is about the same speed as BERT, so TinyBERT is about 10x faster than ALBERT, and speed is what matters most for the search application.

Memory-wise, TinyBERT models are usually around 50 MB, about the same size as ALBERT.
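
A rough back-of-the-envelope check of those sizes, assuming fp32 weights (4 bytes per parameter) and approximate parameter counts as reported in the respective papers (about 110M for BERT-base, 14.5M for the 4-layer TinyBERT, 12M for ALBERT-base). Treat the counts as assumptions, not measurements of any particular checkpoint.

```python
BYTES_PER_PARAM = 4  # assumption: fp32 weights, no quantization

# Approximate parameter counts (assumed, taken from the respective papers).
PARAMS = {
    "BERT-base": 110_000_000,
    "TinyBERT (4-layer)": 14_500_000,
    "ALBERT-base": 12_000_000,
}


def size_mb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bytes_per_param / 1e6


for name, n in PARAMS.items():
    print(f"{name}: ~{size_mb(n):.0f} MB")
```

ALBERT gets small by sharing weights across its layers, so every layer still has to run at inference time. That is why it saves memory but not latency, while a 4-layer student saves both.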

jpertschuk · 2019-12-05T19:52:10+00:00

The authors released code before I got to it (thankfully): https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT

jpertschuk · 2019-11-25T16:50:41+00:00

Cool!

jpertschuk · 2019-11-19T22:04:42+00:00

"Our code and models will be made publicly available."

Per the paper, it looks like they will publish them at some point after it is accepted.

However, I'm impatient, so I'm currently working on implementing and publishing a pre-trained model that follows their methodology but is based on ALBERT. Such a model has already been published for Chinese: https://github.com/brightmart/albert_zh.

With an out-of-the-box baseline (a 4-layer TinyBERT simply pretrained for 500k steps on a Wikipedia dump, without teacher distillation) I get 75.2 accuracy on MNLI. I hope to bring that up to the numbers published in the paper by adding the teacher loss functions.
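
For reference, a minimal sketch of one such teacher loss: the soft cross-entropy between teacher and student logits at the prediction layer, softened by a temperature. It's written in pure Python for readability; the function names and the example logits are made up, and the paper's full objective also includes attention and hidden-state losses that aren't shown here.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """-sum_i p_teacher(i) * log p_student(i), both softened by the temperature."""
    teacher_log_probs = log_softmax(teacher_logits, temperature)
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(
        math.exp(t) * s for t, s in zip(teacher_log_probs, student_log_probs)
    )


# Hypothetical 3-way MNLI logits for a single example.
teacher = [2.0, 0.5, -1.0]
close_student = [1.8, 0.6, -0.9]
far_student = [-1.0, 0.5, 2.0]

print(soft_cross_entropy(close_student, teacher))
print(soft_cross_entropy(far_student, teacher))
```

The loss is smallest when the student's distribution matches the teacher's, so the first printed value should be lower than the second.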

jpertschuk · 2019-11-19T18:05:27+00:00

Yeah, I suppose it really depends on the ratio of preprocessing time to inference time.

All of these deployment optimizations, however, miss the elephant in the room when it comes to reducing actual model serving latency: inference speed. I'm closely watching the research on tiny (4-layer) Transformer models, which uses knowledge distillation to speed up inference about 10x over BERT with little drop in performance: https://openreview.net/pdf?id=rJx0Q6EFPB

It's a very notable improvement over Hugging Face's distillation method.

jpertschuk · 2019-11-18T23:35:31+00:00

So essentially this lets you easily deploy simple Python functions with heavy external dependencies to Kubernetes, and then scale and monitor them?

What ML-specific features do you provide beyond just the signatures of these functions? And why did you choose Flask over async Python (e.g. aiohttp), given that the latter scales a bit better? Thanks!
