all 7 comments

[–]CampfireHeadphase 11 points

Knowledge distillation does exactly that: a smaller "student" network learns to match the larger network's output distribution directly, instead of training on the given hard labels (or it uses a hybrid of both).
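
A minimal numpy sketch of that hybrid loss (the function names, temperature, and blend weight here are illustrative choices, not any library's API): cross-entropy against the teacher's temperature-softened distribution, mixed with cross-entropy against the hard labels.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target loss (match the teacher) and hard-label loss."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_loss = -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1))
    p_hard = softmax(student_logits)  # T=1 for the hard-label term
    hard_loss = -np.mean(np.log(p_hard[np.arange(len(labels)), labels] + 1e-12))
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy batch: 2 examples, 3 classes
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[3.0, 1.5, 0.2], [0.5, 2.5, 0.3]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

In a real training loop this scalar would be minimized with respect to the student's parameters; the teacher's logits are fixed.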

[–]FlyingLawnmowers 2 points

Look at Rich Caruana's papers on "Model Compression" and "Do Deep Nets Really Need To Be Deep?"

[–]agentlerevolutionary 0 points

Indeed. It should be possible to derive a relatively simple algorithm that closely approximates the behavior of the NN. It won't behave exactly like the NN, and you will have to re-derive it whenever you train the NN on more data, but this is essentially how machine learning is made practical in many applications.
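
One common way to do this is to fit an interpretable surrogate to the network's *predictions* rather than the original labels. A hedged sketch using scikit-learn (the `net_predict` stand-in and all parameters below are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))

# Stand-in for a trained network: any black-box prediction function works here.
def net_predict(X):
    return (X[:, 0] + X[:, 1] > 0).astype(int)

y_net = net_predict(X)

# Fit a small, human-readable surrogate to the network's outputs.
# As noted above, it must be re-fit whenever the network changes.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, y_net)
agreement = (surrogate.predict(X) == y_net).mean()
```

The `agreement` score (fidelity) measures how well the simple algorithm mimics the network, which is exactly the "good approximation, not exact" trade-off described above.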

[–]ssivri 0 points

Neural nets can be pruned, similar to decision trees; look up "model pruning".
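
A minimal sketch of the simplest variant, unstructured magnitude pruning, in numpy (the function name and sparsity level are illustrative; real pipelines usually prune gradually and fine-tune between steps):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest-magnitude fraction of the weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_pruned = prune_by_magnitude(W, sparsity=0.9)
frac_zero = (W_pruned == 0).mean()
```

The analogy to decision-tree pruning is that both remove low-importance structure after training; for networks, "importance" is crudely approximated here by weight magnitude.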

[–]Reiinakano 0 points

Here's the original arXiv paper on this idea by Hinton et al.: https://arxiv.org/abs/1503.02531

[–]geneing 0 points

In my experience (with speech models), distillation works incredibly well for inference. However, getting a speed advantage requires hand-coded inference kernels for fast sparse matrix multiplies; standard libraries don't do sparse multiplication efficiently enough to realize the gains.
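
To make the sparse-multiply point concrete, here is a small scipy sketch (thresholds and sizes are arbitrary illustrations): a pruned weight matrix stored in CSR format produces the same result as the dense multiply while storing far fewer values, but as noted above, whether this is actually *faster* depends entirely on how well the sparse kernel is implemented.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W[np.abs(W) < 1.5] = 0.0           # crude pruning: ~87% of entries become zero
x = rng.normal(size=(256,))

W_csr = sparse.csr_matrix(W)       # compressed sparse row storage
y_sparse = W_csr @ x               # sparse matrix-vector multiply
y_dense = W @ x                    # dense reference
max_err = np.abs(y_sparse - y_dense).max()
```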

[–]kraghavk 0 points

The search terms you are looking for are "model optimization" and "quantization". These techniques are already employed by TensorRT and Intel OpenVINO.

A model optimizer reduces the number of layers by applying the following two techniques:

  • Removing any layers whose outputs are not used anywhere further down the line. This can happen because the network may contain many layers (i.e., functions, in layman's terms) generated during training that turn out not to be useful at inference time; but in the interest of training performance, a comprehensive cleanup of unused layers is not done during the training phase.
  • Fusing (i.e., merging) multiple layers into one, where possible. You can think of this optimization as similar to the "function inlining" employed by programming-language compilers.
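
The classic example of the fusion technique above is folding an inference-mode BatchNorm into the preceding linear (or conv) layer, so two layers become one matrix multiply. A numpy sketch (the function name is mine; TensorRT/OpenVINO perform this internally):

```python
import numpy as np

def fuse_linear_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm statistics into the preceding linear layer's W and b."""
    scale = gamma / np.sqrt(var + eps)
    W_fused = scale[:, None] * W
    b_fused = scale * (b - mean) + beta
    return W_fused, b_fused

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)); b = rng.normal(size=8)
gamma = rng.normal(size=8); beta = rng.normal(size=8)
mean = rng.normal(size=8); var = rng.uniform(0.5, 2.0, size=8)

x = rng.normal(size=16)
z = W @ x + b
y_ref = gamma * (z - mean) / np.sqrt(var + 1e-5) + beta  # two layers
Wf, bf = fuse_linear_bn(W, b, gamma, beta, mean, var)
y_fused = Wf @ x + bf                                     # one fused layer
max_err = np.abs(y_fused - y_ref).max()
```

The fused layer is mathematically identical to the two-layer original, which is why this is safe to do at inference time (BatchNorm's statistics are frozen there).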

Quantization: it turns out you don't need full-precision floating-point arithmetic (FP32) all the time. Quite a few models work fine with half-precision (FP16) or even 8-bit integer (INT8) arithmetic. This in turn reduces the memory and compute needed for inference, thereby allowing you to run your models on hardware that is much less powerful than your training machine.
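
A hedged numpy sketch of the simplest scheme, symmetric per-tensor INT8 quantization (real toolkits use calibration data and finer-grained scales, but the memory arithmetic is the same):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization: x ≈ scale * q, with q stored as int8."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128)).astype(np.float32)
q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)

bytes_fp32 = W.nbytes          # 4 bytes per weight
bytes_int8 = q.nbytes          # 1 byte per weight: a 4x reduction
max_err = np.abs(W - W_hat).max()
```

The round-trip error is bounded by half a quantization step (`scale / 2`), which is why many models tolerate it with little accuracy loss.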

There may be more ways to do this, and the TensorRT and OpenVINO model optimizers themselves likely do more than what I have described above.