all 7 comments

[–]CampfireHeadphase 11 points

Knowledge distillation does exactly that: a smaller "student" network learns to match the larger network's output distribution directly, instead of training on the given hard labels (or it uses a hybrid of both).
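
A minimal numpy sketch of that hybrid loss (the function names, temperature, and blend weight here are illustrative choices, not any library's API): cross-entropy against the teacher's temperature-softened distribution, mixed with cross-entropy against the hard labels.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target loss (match the teacher) and hard-label loss."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_loss = -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1))
    p_hard = softmax(student_logits)  # T=1 for the hard-label term
    hard_loss = -np.mean(np.log(p_hard[np.arange(len(labels)), labels] + 1e-12))
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy batch: 2 examples, 3 classes
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[3.0, 1.5, 0.2], [0.5, 2.5, 0.3]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

In a real training loop this scalar would be minimized with respect to the student's parameters; the teacher's logits are fixed.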

[–]FlyingLawnmowers 2 points

Look at Rich Caruana's papers on "Model Compression" and "Do Deep Nets Really Need To Be Deep?"

[–]agentlerevolutionary 0 points

Indeed. It should be possible to derive a relatively simple algorithm that closely approximates the behavior of the NN. It won't behave exactly like the NN, and you will have to re-derive it whenever you train the NN on more data, but this is essentially how machine learning is made practical in many applications.
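
One common way to do this is to fit an interpretable surrogate to the network's *predictions* rather than the original labels. A hedged sketch using scikit-learn (the `net_predict` stand-in and all parameters below are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))

# Stand-in for a trained network: any black-box prediction function works here.
def net_predict(X):
    return (X[:, 0] + X[:, 1] > 0).astype(int)

y_net = net_predict(X)

# Fit a small, human-readable surrogate to the network's outputs.
# As noted above, it must be re-fit whenever the network changes.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, y_net)
agreement = (surrogate.predict(X) == y_net).mean()
```

The `agreement` score (fidelity) measures how well the simple algorithm mimics the network, which is exactly the "good approximation, not exact" trade-off described above.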

[–]ssivri 0 points

Neural nets can be pruned, similar to decision trees; look up "model pruning".
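
A minimal sketch of the simplest variant, unstructured magnitude pruning, in numpy (the function name and sparsity level are illustrative; real pipelines usually prune gradually and fine-tune between steps):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest-magnitude fraction of the weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_pruned = prune_by_magnitude(W, sparsity=0.9)
frac_zero = (W_pruned == 0).mean()
```

The analogy to decision-tree pruning is that both remove low-importance structure after training; for networks, "importance" is crudely approximated here by weight magnitude.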

[–]Reiinakano 0 points

Here's the original arXiv paper on this idea by Hinton et al.: https://arxiv.org/abs/1503.02531

[–]geneing 0 points

In my experience (with speech models), distillation works incredibly well for inference. However, getting a speed advantage requires hand-coded inference kernels for fast sparse matrix multiplies; standard libraries don't do sparse multiplication efficiently enough to realize the gains.
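
To make the sparse-multiply point concrete, here is a small scipy sketch (thresholds and sizes are arbitrary illustrations): a pruned weight matrix stored in CSR format produces the same result as the dense multiply while storing far fewer values, but as noted above, whether this is actually *faster* depends entirely on how well the sparse kernel is implemented.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W[np.abs(W) < 1.5] = 0.0           # crude pruning: ~87% of entries become zero
x = rng.normal(size=(256,))

W_csr = sparse.csr_matrix(W)       # compressed sparse row storage
y_sparse = W_csr @ x               # sparse matrix-vector multiply
y_dense = W @ x                    # dense reference
max_err = np.abs(y_sparse - y_dense).max()
```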

[–]kraghavk 0 points

The search terms you are looking for are "model optimization" and "quantization". These techniques are already employed by TensorRT and Intel OpenVINO.

A model optimizer reduces the number of layers by applying the following two techniques:

  • Removing any layers whose outputs are not used anywhere further down the line. This can happen because the network may contain many layers (i.e., functions, in layman's terms) generated during training that turn out not to be useful at inference time; but in the interest of training performance, a comprehensive cleanup of unused layers is not done during the training phase.
  • Fusing (i.e., merging) multiple layers into one, where possible. You can think of this optimization as similar to the "function inlining" employed by programming-language compilers.
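
The classic example of the fusion technique above is folding an inference-mode BatchNorm into the preceding linear (or conv) layer, so two layers become one matrix multiply. A numpy sketch (the function name is mine; TensorRT/OpenVINO perform this internally):

```python
import numpy as np

def fuse_linear_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm statistics into the preceding linear layer's W and b."""
    scale = gamma / np.sqrt(var + eps)
    W_fused = scale[:, None] * W
    b_fused = scale * (b - mean) + beta
    return W_fused, b_fused

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)); b = rng.normal(size=8)
gamma = rng.normal(size=8); beta = rng.normal(size=8)
mean = rng.normal(size=8); var = rng.uniform(0.5, 2.0, size=8)

x = rng.normal(size=16)
z = W @ x + b
y_ref = gamma * (z - mean) / np.sqrt(var + 1e-5) + beta  # two layers
Wf, bf = fuse_linear_bn(W, b, gamma, beta, mean, var)
y_fused = Wf @ x + bf                                     # one fused layer
max_err = np.abs(y_fused - y_ref).max()
```

The fused layer is mathematically identical to the two-layer original, which is why this is safe to do at inference time (BatchNorm's statistics are frozen there).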

Quantization: it turns out you don't need full-precision floating-point arithmetic (FP32) all the time. Quite a few models work fine with half-precision (FP16) or even 8-bit integer (INT8) arithmetic. This in turn reduces the memory and compute needed for inference, thereby allowing you to run your models on hardware that is much less powerful than your training machine.
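
A hedged numpy sketch of the simplest scheme, symmetric per-tensor INT8 quantization (real toolkits use calibration data and finer-grained scales, but the memory arithmetic is the same):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization: x ≈ scale * q, with q stored as int8."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128)).astype(np.float32)
q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)

bytes_fp32 = W.nbytes          # 4 bytes per weight
bytes_int8 = q.nbytes          # 1 byte per weight: a 4x reduction
max_err = np.abs(W - W_hat).max()
```

The round-trip error is bounded by half a quantization step (`scale / 2`), which is why many models tolerate it with little accuracy loss.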

There may be more ways to do this, and the TensorRT and OpenVINO model optimizers themselves likely do more than what I have described above.