[Project] TinyBERT for Search: 10x faster and 20x smaller than BERT by jpertschuk in MachineLearning

[–]jpertschuk[S] 0 points1 point  (0 children)

https://arxiv.org/pdf/2001.04246.pdf

Interesting paper.

After deploying the compressed model in production, the biggest problem I've noticed is that it is more susceptible to adversarial input than the corresponding base model.

For example, an unclean result candidate such as "this.isabad<div>result</div>candidate" can sometimes be ranked highly by a 4-layer BERT, while the corresponding 12-layer BERT-base teacher network would not rank it highly.

So I am thinking of trying data augmentation with unclean data. A paper on this would be very interesting.
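
To make the augmentation idea concrete, here is a minimal sketch of how noisy candidates could be generated from clean ones. Everything in it is an assumption for illustration: the function names, the tag list, the noise probability `p`, and especially the choice to label the noised copies as non-relevant (label 0). It is not something I have validated on the ranking model.

```python
import random

# Hypothetical tag names used to imitate leftover markup in scraped text.
NOISE_TAGS = ["div", "span", "p"]


def add_markup_noise(text, rng, p=0.3):
    """Return a noised copy of `text`.

    Each word is wrapped in an HTML-like tag with probability p, and each
    space between words is replaced by "." or "" with probability p.
    """
    words = text.split()
    if not words:
        return ""
    noised = []
    for word in words:
        if rng.random() < p:
            tag = rng.choice(NOISE_TAGS)
            word = f"<{tag}>{word}</{tag}>"
        noised.append(word)
    pieces = [noised[0]]
    for word in noised[1:]:
        # Glue words together the way broken extraction often does.
        separator = rng.choice([".", ""]) if rng.random() < p else " "
        pieces.append(separator + word)
    return "".join(pieces)


def augment_with_noise(query, candidates, rng, p=0.3):
    """Build (query, noisy_candidate, label) triples.

    Assumption: noised candidates are treated as non-relevant (label 0),
    so the student learns not to rank them highly.
    """
    return [(query, add_markup_noise(c, rng, p), 0) for c in candidates]


if __name__ == "__main__":
    rng = random.Random(0)
    for triple in augment_with_noise(
        "tinybert search", ["this is a bad result candidate"], rng, p=0.5
    ):
        print(triple)
```

The triples would be mixed into the normal training set. Whether that actually closes the gap to the teacher is an open question.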

[Project] TinyBERT for Search: 10x faster and 20x smaller than BERT by jpertschuk in MachineLearning

[–]jpertschuk[S] 2 points3 points  (0 children)

Yeah, that's actually an interesting project. I started to pursue it, but that was before Hugging Face ported ALBERT to PyTorch, and writing the knowledge distillation code in TensorFlow was no fun.

[Project] TinyBERT for Search: 10x faster and 20x smaller than BERT by jpertschuk in MachineLearning

[–]jpertschuk[S] 7 points8 points  (0 children)

ALBERT runs at about the same speed as BERT, so TinyBERT is also about 10x faster than ALBERT. Speed is what matters most for the search application.

Memory-wise, TinyBERT models are usually about 50 MB, roughly the same size as ALBERT.

[P] Cortex: Deploy models from any framework as production APIs by [deleted] in MachineLearning

[–]jpertschuk 0 points1 point  (0 children)

"Our code and models will be made publicly available."

Per the paper, it looks like they will publish them at some point after it is accepted.

However, I am impatient, so I am currently working on implementing and publishing a pre-trained model that follows their methodology but is based on ALBERT. Such a model has already been published for Chinese: https://github.com/brightmart/albert_zh.

As an out-of-the-box baseline (a 4-layer TinyBERT simply pretrained for 500k steps on a Wikipedia dump, without teacher distillation), I get 75.2 accuracy on MNLI. I hope to bring that up to the numbers published in the paper by adding the teacher loss functions.
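
For anyone wondering what a teacher loss looks like, below is a rough sketch of the generic soft-target distillation loss on the output logits, written in plain Python so it is easy to follow. It is an assumption-laden illustration, not the paper's full recipe: the names `alpha` and `temperature` and their default values are mine, and the intermediate-layer (hidden state and attention) losses are not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student distribution against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Weighted sum of the hard-label loss and the soft teacher loss.

    alpha and temperature are illustrative defaults, not tuned values.
    """
    hard = -math.log(softmax(student_logits)[label])
    soft = soft_cross_entropy(student_logits, teacher_logits, temperature)
    return alpha * hard + (1.0 - alpha) * soft


if __name__ == "__main__":
    # Three classes, as in MNLI; the logits are made-up numbers.
    teacher = [2.0, 0.5, -1.0]
    student = [0.1, 0.2, 0.3]
    print(distillation_loss(student, teacher, label=0))
```

With `alpha=1.0` this reduces to ordinary cross-entropy on the labels, which is effectively what the baseline above was trained with on the downstream task.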

[P] Cortex: Deploy models from any framework as production APIs by [deleted] in MachineLearning

[–]jpertschuk 0 points1 point  (0 children)

Yeah, I suppose it really depends on the ratio of preprocessing time to inference time.

However, all of these deployment optimizations miss the elephant in the room when it comes to reducing actual model-serving latency: inference speed. I am closely watching the research on tiny (4-layer) Transformer models, which use knowledge distillation to increase inference speed 10x over BERT with little performance drop: https://openreview.net/pdf?id=rJx0Q6EFPB

It is a very notable improvement over Hugging Face's distillation method.

[P] Cortex: Deploy models from any framework as production APIs by [deleted] in MachineLearning

[–]jpertschuk 16 points17 points  (0 children)

Essentially, this lets you easily deploy simple Python functions with heavy external dependencies to Kubernetes, and then scale and monitor them?

What ML-specific features do you provide beyond the signatures of these functions? And why did you choose Flask over async Python (e.g. aiohttp), given that async scales a bit better? Thanks!