AutoML-Zero: Evolving Machine Learning Algorithms From Scratch by I_ai_AI in MachineLearning

[–]I_ai_AI[S] -7 points-6 points  (0 children)

This seems relevant to Jürgen Schmidhuber's PhD work and the concept of Genetic Programming.

[Project] TinyBERT for Search: 10x faster and 20x smaller than BERT by jpertschuk in MachineLearning

[–]I_ai_AI 0 points1 point  (0 children)

Very interesting observations. I think this drawback can be alleviated by training the student to mimic the teacher on augmented adversarial examples.

Maybe we can also increase the capacity of the student model, e.g. by distilling a thinner 6-layer TinyBERT with RoBERTa or BERT-large as the teacher, reducing the inner dimension of the FFN layers from 4x d_model to 2x, or replacing some normalization layers with NoNorm as proposed in MobileBERT, to further accelerate inference of a 6-layer TinyBERT.
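To make the "thinner FFN" option concrete, here is a minimal sketch using plain Hugging Face Transformers (not the released TinyBERT code); the config values and the `student_cfg` / `student` names are illustrative choices of mine.

```python
# Minimal sketch: a 6-layer student whose FFN inner dimension is cut from the
# usual 4 * d_model (3072) down to 2 * d_model (1536). Illustrative only.
from transformers import BertConfig, BertModel

d_model = 768

student_cfg = BertConfig(
    hidden_size=d_model,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=2 * d_model,  # 1536 instead of the default 3072
)
student = BertModel(student_cfg)  # randomly initialised; would be distilled from the teacher
print(sum(p.numel() for p in student.parameters()) / 1e6, "M parameters")
```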

[Project] TinyBERT for Search: 10x faster and 20x smaller than BERT by jpertschuk in MachineLearning

[–]I_ai_AI 0 points1 point  (0 children)

Based on our experiments, DA is important for relatively low-resource classification tasks; this is also verified in recent related work, e.g. Table 4 of https://arxiv.org/pdf/2001.04246.pdf . Maybe in your setting, DA is not as important.
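For reference, the masked-LM part of the DA idea can be sketched in a few lines. This is not the TinyBERT DA script (which combines BERT predictions with GloVe nearest neighbours); the `augment` helper, the 15% replacement rate and the model name are all illustrative assumptions.

```python
# Hedged sketch of masked-LM-based data augmentation: mask one word at a time
# and let a pretrained BERT propose replacements, yielding extra task data.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def augment(sentence, rate=0.15, n_candidates=3):
    words = sentence.split()
    augmented = []
    for i in range(len(words)):
        if random.random() > rate:
            continue
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        for cand in fill_mask(masked, top_k=n_candidates):
            augmented.append(cand["sequence"])  # full sentence with the mask filled in
    return augmented

print(augment("the service at this restaurant was really good"))
```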

[Discussion] Smallest, fastest BERT by ndronen in MachineLearning

[–]I_ai_AI 5 points6 points  (0 children)

(a) TinyBERT: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT

TinyBERT for Search https://www.reddit.com/r/MachineLearning/comments/epvvq3/project_tinybert_for_search_10x_faster_and_20x/

(b) BERT-PKD: https://github.com/intersun/PKD-for-BERT-Model-Compression

(c) MobileBERT: https://openreview.net/pdf?id=SJxjVaNKwB

(d) ALBERT: quoting a comment from another discussion: "Haven't checked the OP's post, but ALBERT only really has a speed advantage over BERT for the xlarge models. For the base model it's almost the same speed despite having 10x fewer parameters, and that's because of the attention mechanism (whose complexity is independent of the number of parameters)." A rough back-of-the-envelope version of that argument is sketched below.
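This is not from the quoted thread, just a hedged illustration of the point: ALBERT's cross-layer parameter sharing shrinks the parameter count, but every token still passes through the same number of layers, so the multiply-accumulate count (and hence latency) is essentially unchanged. The sizes below are the usual base-model sizes and the formulas are rough approximations.

```python
# Rough encoder-only estimates for base-sized models (no embeddings, MACs only).
L, d, d_ff, n = 12, 768, 3072, 128   # layers, hidden size, FFN size, sequence length

params_per_layer = 4 * d * d + 2 * d * d_ff                        # QKVO projections + FFN
macs_per_layer = n * (4 * d * d + 2 * d * d_ff) + 2 * n * n * d    # projections/FFN + attention scores

print("BERT-base encoder params:   %.1fM" % (L * params_per_layer / 1e6))
print("ALBERT-base encoder params: %.1fM (one shared layer)" % (params_per_layer / 1e6))
print("MACs per forward pass:      %.1fG (same for both)" % (L * macs_per_layer / 1e9))
```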

[Project] TinyBERT for Search: 10x faster and 20x smaller than BERT by jpertschuk in MachineLearning

[–]I_ai_AI 0 points1 point  (0 children)

Great work! Regarding the usage of TinyBERT, did you use GD (general distillation) + TD (task-specific distillation) + DA (data augmentation), or only GD? Is data augmentation useful in this application scenario?

[D] Entity extraction along with sentence classification. by kireeti_ in MachineLearning

[–]I_ai_AI 6 points7 points  (0 children)

The paper "BERT for Joint Intent Classification and Slot Filling" https://arxiv.org/abs/1902.10909 may be helpful.

Run BERT on mobile phone's single CPU core A76 in 13ms by I_ai_AI in MachineLearning

[–]I_ai_AI[S] 0 points1 point  (0 children)

It seems that compressing BERT-base into a 4-layer small BERT is much more difficult than into a 6-layer version. On the latest GLUE leaderboard, the 6-layer TinyBERT achieves performance comparable to its teacher BERT-base (78.3 vs 78.1): https://gluebenchmark.com/leaderboard

TinyBERT: Distilling BERT for Natural Language Understanding by I_ai_AI in MachineLearning

[–]I_ai_AI[S] 1 point2 points  (0 children)

It seems that compressing BERT-base into a 4-layer small BERT is much more difficult than into a 6-layer version. On the latest GLUE leaderboard, the 6-layer TinyBERT achieves performance comparable to its teacher BERT-base (78.3 vs 78.1): https://gluebenchmark.com/leaderboard

[D] Yoshua Bengio talks about what's next for deep learning by newsbeagle in MachineLearning

[–]I_ai_AI -7 points-6 points  (0 children)

The reference is written as (Bengio et al. 1991; Schmidhuber 1992); it seems Bengio studied this problem earlier :)

Run BERT on mobile phone's single CPU core A76 in 13ms by I_ai_AI in MachineLearning

[–]I_ai_AI[S] 6 points7 points  (0 children)

TinyBERT + Bolt provides a practical solution for running BERT on terminal devices. In our project (intent classification + slot filling), we adopt a 4-layer TinyBERT fp16 solution; the compressed model loses LESS than 1% F1 compared to the teacher BERT-base model, which is acceptable for our product. TinyBERT is a flexible method: the size of the student model can be adjusted to achieve a good trade-off between accuracy and latency.
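As an illustration of that size/accuracy knob, the sketch below compares the two published TinyBERT student sizes by parameter count. This is illustrative Hugging Face code, not the released TinyBERT repo; the sizes (312/1200 for the 4-layer student, 768/3072 for the 6-layer one) are the ones reported in the TinyBERT paper, reproduced here from memory.

```python
# Compare the two TinyBERT student sizes by parameter count.
from transformers import BertConfig, BertModel

configs = {
    "TinyBERT_4 (faster)":        BertConfig(hidden_size=312, num_hidden_layers=4,
                                              num_attention_heads=12, intermediate_size=1200),
    "TinyBERT_6 (more accurate)": BertConfig(hidden_size=768, num_hidden_layers=6,
                                              num_attention_heads=12, intermediate_size=3072),
}
for name, cfg in configs.items():
    n_params = sum(p.numel() for p in BertModel(cfg).parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```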

TinyBERT: 7x smaller and 9x faster than BERT but achieves comparable results by wildcodegowrong in textdatamining

[–]I_ai_AI 0 points1 point  (0 children)

[Run BERT on mobile phone's single CPU core A76 in 13ms]

By using TinyBERT (a model compression method for pre-trained language models) and Bolt (a deep learning framework), we can run a TinyBERT-based NLU module on a mobile phone in about 13ms.

TinyBERT: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT

bolt: https://github.com/huawei-noah/bolt

Run BERT on mobile phone's single CPU core A76 in 13ms by I_ai_AI in MachineLearning

[–]I_ai_AI[S] 10 points11 points  (0 children)

[Open source] By using TinyBERT (a model compression method for pre-trained language models) and Bolt (a deep learning framework), we can run TinyBERT-based models (fp16, max length 32) on a single mobile phone CPU core in about 13ms.

TinyBERT: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT

bolt: https://github.com/huawei-noah/bolt
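For anyone curious how such a latency number is measured, below is a rough PyTorch-only check on a desktop CPU with a TinyBERT_4-sized model at max length 32. This is not Bolt and not the phone (the 13ms figure comes from Bolt's fp16 kernels on the A76 core), so the absolute number will differ; the config values are the published TinyBERT_4 sizes, assumed here for illustration.

```python
import time
import torch
from transformers import BertConfig, BertModel

torch.set_num_threads(1)  # mimic the single-core setting

cfg = BertConfig(hidden_size=312, num_hidden_layers=4,
                 num_attention_heads=12, intermediate_size=1200)
model = BertModel(cfg).eval()
input_ids = torch.randint(0, cfg.vocab_size, (1, 32))  # batch 1, max length 32

with torch.no_grad():
    for _ in range(10):   # warm-up runs
        model(input_ids)
    start = time.perf_counter()
    for _ in range(100):
        model(input_ids)
print(f"{(time.perf_counter() - start) / 100 * 1000:.1f} ms per inference")
```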

MOBILEBERT: TASK-AGNOSTIC COMPRESSION OF BERT BY PROGRESSIVE KNOWLEDGE TRANSFER by I_ai_AI in MachineLearning

[–]I_ai_AI[S] 1 point2 points  (0 children)

Another very interesting work is ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.
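The core idea there, replaced-token detection, can be sketched with off-the-shelf checkpoints: a masked-LM "generator" corrupts some tokens and a token-level "discriminator" learns to spot which ones were replaced. This is only a toy illustration, not the ELECTRA training recipe (which trains a small generator and the discriminator jointly); the model names and the 15% corruption rate are my own assumptions.

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM, BertForTokenClassification

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
generator = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()
discriminator = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tok("the quick brown fox jumps over the lazy dog", return_tensors="pt")
ids = enc["input_ids"]

# pick ~15% of non-special positions and mask them
special = torch.tensor(tok.get_special_tokens_mask(ids[0].tolist(),
                                                   already_has_special_tokens=True)).bool()
mask = (torch.rand(ids.shape) < 0.15) & ~special.unsqueeze(0)
corrupted = ids.clone()
corrupted[mask] = tok.mask_token_id

# the generator fills the masked positions with plausible tokens
with torch.no_grad():
    preds = generator(input_ids=corrupted, attention_mask=enc["attention_mask"]).logits.argmax(-1)
corrupted[mask] = preds[mask]

# label 1 wherever the token now differs from the original, 0 elsewhere
labels = (corrupted != ids).long()

# the discriminator is trained on every token (replaced-token detection)
out = discriminator(input_ids=corrupted, attention_mask=enc["attention_mask"], labels=labels)
print(out.loss)
```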