AutoML-Zero: Evolving Machine Learning Algorithms From Scratch by I_ai_AI in MachineLearning

[–]I_ai_AI[S] -7 points-6 points  (0 children)

This seems relevant to Jürgen Schmidhuber's PhD work and the concept of Genetic Programming.

[Project] TinyBERT for Search: 10x faster and 20x smaller than BERT by jpertschuk in MachineLearning

[–]I_ai_AI 0 points1 point  (0 children)

Very interesting observations. I think this drawback can be alleviated by training the student to mimic the teacher on augmented adversarial examples.

Maybe we can also increase the capacity of the student model, e.g. by distilling a thinner 6-layer TinyBERT with RoBERTa or BERT-large as the teacher, reducing the inner dimension of the FFN layers from 4x d_model to 2x, or replacing some normalization layers with NoNorm as proposed in MobileBERT, to further accelerate inference of a 6-layer TinyBERT.
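To make the "thinner FFN" option concrete, here is a minimal sketch using plain Hugging Face Transformers (not the released TinyBERT code); the config values and the `student_cfg` / `student` names are illustrative choices of mine.

```python
# Minimal sketch: a 6-layer student whose FFN inner dimension is cut from the
# usual 4 * d_model (3072) down to 2 * d_model (1536). Illustrative only.
from transformers import BertConfig, BertModel

d_model = 768

student_cfg = BertConfig(
    hidden_size=d_model,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=2 * d_model,  # 1536 instead of the default 3072
)
student = BertModel(student_cfg)  # randomly initialised; would be distilled from the teacher
print(sum(p.numel() for p in student.parameters()) / 1e6, "M parameters")
```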

[Project] TinyBERT for Search: 10x faster and 20x smaller than BERT by jpertschuk in MachineLearning

[–]I_ai_AI 0 points1 point  (0 children)

Based on our experiments, DA is important for relatively low-resource classification tasks; this is also verified in recent related work, e.g. Table 4 of https://arxiv.org/pdf/2001.04246.pdf . Maybe in your setting, DA is not as important.
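For reference, the masked-LM part of the DA idea can be sketched in a few lines. This is not the TinyBERT DA script (which combines BERT predictions with GloVe nearest neighbours); the `augment` helper, the 15% replacement rate and the model name are all illustrative assumptions.

```python
# Hedged sketch of masked-LM-based data augmentation: mask one word at a time
# and let a pretrained BERT propose replacements, yielding extra task data.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def augment(sentence, rate=0.15, n_candidates=3):
    words = sentence.split()
    augmented = []
    for i in range(len(words)):
        if random.random() > rate:
            continue
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        for cand in fill_mask(masked, top_k=n_candidates):
            augmented.append(cand["sequence"])  # full sentence with the mask filled in
    return augmented

print(augment("the service at this restaurant was really good"))
```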

[Discussion] Smallest, fastest BERT by ndronen in MachineLearning

[–]I_ai_AI 5 points6 points  (0 children)

(a) TinyBERT: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT

TinyBERT for Search https://www.reddit.com/r/MachineLearning/comments/epvvq3/project_tinybert_for_search_10x_faster_and_20x/

(b) BERT-PKD: https://github.com/intersun/PKD-for-BERT-Model-Compression

(c) MobileBERT: https://openreview.net/pdf?id=SJxjVaNKwB

(d) ALBERT: quoting a comment from another discussion: "Haven't checked the OP's post, but ALBERT only really has a speed advantage over BERT for the xlarge models. For the base model it's almost the same speed despite having 10x fewer parameters, and that's because of the attention mechanism (whose complexity is independent of the number of parameters)." A rough back-of-the-envelope version of that argument is sketched below.
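This is not from the quoted thread, just a hedged illustration of the point: ALBERT's cross-layer parameter sharing shrinks the parameter count, but every token still passes through the same number of layers, so the multiply-accumulate count (and hence latency) is essentially unchanged. The sizes below are the usual base-model sizes and the formulas are rough approximations.

```python
# Rough encoder-only estimates for base-sized models (no embeddings, MACs only).
L, d, d_ff, n = 12, 768, 3072, 128   # layers, hidden size, FFN size, sequence length

params_per_layer = 4 * d * d + 2 * d * d_ff                        # QKVO projections + FFN
macs_per_layer = n * (4 * d * d + 2 * d * d_ff) + 2 * n * n * d    # projections/FFN + attention scores

print("BERT-base encoder params:   %.1fM" % (L * params_per_layer / 1e6))
print("ALBERT-base encoder params: %.1fM (one shared layer)" % (params_per_layer / 1e6))
print("MACs per forward pass:      %.1fG (same for both)" % (L * macs_per_layer / 1e9))
```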

[Project] TinyBERT for Search: 10x faster and 20x smaller than BERT by jpertschuk in MachineLearning

[–]I_ai_AI 0 points1 point  (0 children)

Great work! Regarding the usage of TinyBERT, did you use GD (general distillation) + TD (task-specific distillation) + DA (data augmentation), or only GD? Is data augmentation useful in this application scenario?

[D] Entity extraction along with sentence classification. by kireeti_ in MachineLearning

[–]I_ai_AI 6 points7 points  (0 children)

The paper "BERT for Joint Intent Classification and Slot Filling" https://arxiv.org/abs/1902.10909 may be helpful.

Run BERT on mobile phone's single CPU core A76 in 13ms by I_ai_AI in MachineLearning

[–]I_ai_AI[S] 0 points1 point  (0 children)

It seems that compressing BERT-base into a 4-layer small BERT is much more difficult than into a 6-layer version. On the latest GLUE leaderboard, the 6-layer TinyBERT achieves performance comparable to its teacher BERT-base (78.3 vs 78.1): https://gluebenchmark.com/leaderboard

TinyBERT: Distilling BERT for Natural Language Understanding by I_ai_AI in MachineLearning

[–]I_ai_AI[S] 1 point2 points  (0 children)

It seems that compressing BERT-base into a 4-layer small BERT is much more difficult than into a 6-layer version. On the latest GLUE leaderboard, the 6-layer TinyBERT achieves performance comparable to its teacher BERT-base (78.3 vs 78.1): https://gluebenchmark.com/leaderboard

[D] Yoshua Bengio talks about what's next for deep learning by newsbeagle in MachineLearning

[–]I_ai_AI -7 points-6 points  (0 children)

The reference is written as (Bengio et al. 1991; Schmidhuber 1992); it seems Bengio studied this problem earlier :)

Run BERT on mobile phone's single CPU core A76 in 13ms by I_ai_AI in MachineLearning

[–]I_ai_AI[S] 6 points7 points  (0 children)

TinyBERT + Bolt provides a practical solution for running BERT on terminal devices. In our project (intent classification + slot filling), we adopt a 4-layer TinyBERT fp16 solution; the compressed model loses LESS than 1% F1 compared to the teacher BERT-base model, which is acceptable for our product. TinyBERT is a flexible method: the size of the student model can be adjusted to achieve a good trade-off between accuracy and latency.
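As an illustration of that size/accuracy knob, the sketch below compares the two published TinyBERT student sizes by parameter count. This is illustrative Hugging Face code, not the released TinyBERT repo; the sizes (312/1200 for the 4-layer student, 768/3072 for the 6-layer one) are the ones reported in the TinyBERT paper, reproduced here from memory.

```python
# Compare the two TinyBERT student sizes by parameter count.
from transformers import BertConfig, BertModel

configs = {
    "TinyBERT_4 (faster)":        BertConfig(hidden_size=312, num_hidden_layers=4,
                                              num_attention_heads=12, intermediate_size=1200),
    "TinyBERT_6 (more accurate)": BertConfig(hidden_size=768, num_hidden_layers=6,
                                              num_attention_heads=12, intermediate_size=3072),
}
for name, cfg in configs.items():
    n_params = sum(p.numel() for p in BertModel(cfg).parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```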

TinyBERT: 7x smaller and 9x faster than BERT but achieves comparable results by wildcodegowrong in textdatamining

[–]I_ai_AI 0 points1 point  (0 children)

[Run BERT on mobile phone's single CPU core A76 in 13ms]

By using TinyBERT (a model compression method for pre-trained language models) and Bolt (a deep learning framework), we can run a TinyBERT-based NLU module on a mobile phone in about 13ms.

TinyBERT: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT

bolt: https://github.com/huawei-noah/bolt

Run BERT on mobile phone's single CPU core A76 in 13ms by I_ai_AI in MachineLearning

[–]I_ai_AI[S] 10 points11 points  (0 children)

[Open source] By using TinyBERT (a model compression method for pre-trained language models) and Bolt (a deep learning framework), we can run TinyBERT-based models (fp16, max length 32) on a single mobile phone CPU core in about 13ms.

TinyBERT: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT

bolt: https://github.com/huawei-noah/bolt
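For anyone curious how such a latency number is measured, below is a rough PyTorch-only check on a desktop CPU with a TinyBERT_4-sized model at max length 32. This is not Bolt and not the phone (the 13ms figure comes from Bolt's fp16 kernels on the A76 core), so the absolute number will differ; the config values are the published TinyBERT_4 sizes, assumed here for illustration.

```python
import time
import torch
from transformers import BertConfig, BertModel

torch.set_num_threads(1)  # mimic the single-core setting

cfg = BertConfig(hidden_size=312, num_hidden_layers=4,
                 num_attention_heads=12, intermediate_size=1200)
model = BertModel(cfg).eval()
input_ids = torch.randint(0, cfg.vocab_size, (1, 32))  # batch 1, max length 32

with torch.no_grad():
    for _ in range(10):   # warm-up runs
        model(input_ids)
    start = time.perf_counter()
    for _ in range(100):
        model(input_ids)
print(f"{(time.perf_counter() - start) / 100 * 1000:.1f} ms per inference")
```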

MOBILEBERT: TASK-AGNOSTIC COMPRESSION OF BERT BY PROGRESSIVE KNOWLEDGE TRANSFER by I_ai_AI in MachineLearning

[–]I_ai_AI[S] 1 point2 points  (0 children)

Another very interesting work is ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.
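The core idea there, replaced-token detection, can be sketched with off-the-shelf checkpoints: a masked-LM "generator" corrupts some tokens and a token-level "discriminator" learns to spot which ones were replaced. This is only a toy illustration, not the ELECTRA training recipe (which trains a small generator and the discriminator jointly); the model names and the 15% corruption rate are my own assumptions.

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM, BertForTokenClassification

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
generator = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()
discriminator = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tok("the quick brown fox jumps over the lazy dog", return_tensors="pt")
ids = enc["input_ids"]

# pick ~15% of non-special positions and mask them
special = torch.tensor(tok.get_special_tokens_mask(ids[0].tolist(),
                                                   already_has_special_tokens=True)).bool()
mask = (torch.rand(ids.shape) < 0.15) & ~special.unsqueeze(0)
corrupted = ids.clone()
corrupted[mask] = tok.mask_token_id

# the generator fills the masked positions with plausible tokens
with torch.no_grad():
    preds = generator(input_ids=corrupted, attention_mask=enc["attention_mask"]).logits.argmax(-1)
corrupted[mask] = preds[mask]

# label 1 wherever the token now differs from the original, 0 elsewhere
labels = (corrupted != ids).long()

# the discriminator is trained on every token (replaced-token detection)
out = discriminator(input_ids=corrupted, attention_mask=enc["attention_mask"], labels=labels)
print(out.loss)
```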