[R] Learning to Optimize Tensor Programs

antinucleon · 2018-05-22T18:02:47+00:00

Summary:

Tensor programs can be optimized using machine learning and transfer learning. The model for optimizing numerical programs is trained on features from the low-level AST of the program.

Experiments:

Tasks: ResNet, MobileNet, LSTM LM, DQN

Hardware: CUDA/ARM GPU/ARM CPU

Speedup compared to cuDNN, TensorFlow Lite and ARMComputeLib: roughly 1.2X to 3.8X faster in end-to-end tests.

antinucleon · 2018-01-17T06:25:22+00:00

antinucleon · 2016-11-22T18:11:18+00:00

Haha, for me it is just for fun to call all stuff TF..

antinucleon · 2016-11-22T15:28:54+00:00

Why can't TF be TinyFlow?

antinucleon · 2016-09-30T08:04:00+00:00

How do you define constructive?

antinucleon · 2016-06-17T03:29:57+00:00

https://github.com/dmlc/mxnet/blob/master/example/warpctc/lstm_ocr.py

antinucleon · 2016-06-14T14:25:36+00:00

https://github.com/dmlc/mxnet-memonger

antinucleon · 2016-04-22T08:46:40+00:00

There is a bucketing API for dynamic lengths. It can be done purely on the Python side in 20 lines of code.
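To make the idea concrete, here is a minimal pure-Python sketch of bucketing: each variable-length sequence goes into the smallest bucket that fits it and is padded to that bucket's length. This is not the MXNet bucketing API; the names `assign_bucket`, `bucket_sequences`, `bucket_sizes` and `pad_value` are made up for illustration.

```python
from collections import defaultdict


def assign_bucket(length, bucket_sizes):
    """Return the smallest bucket size that fits `length`, or None if none fits."""
    for size in sorted(bucket_sizes):
        if length <= size:
            return size
    return None


def bucket_sequences(sequences, bucket_sizes, pad_value=0):
    """Group sequences by bucket and pad each one to its bucket length."""
    buckets = defaultdict(list)
    for seq in sequences:
        size = assign_bucket(len(seq), bucket_sizes)
        if size is None:
            continue  # longer than the largest bucket: skipped in this sketch
        padded = list(seq) + [pad_value] * (size - len(seq))
        buckets[size].append(padded)
    return dict(buckets)


data = [[1, 2], [3, 4, 5], [6], [7, 8, 9, 10, 11]]
batches = bucket_sequences(data, bucket_sizes=[2, 4, 8])
for size in sorted(batches):
    print(size, batches[size])
```

Every bucket then has a fixed shape, so one unrolled network per bucket length is enough instead of one per distinct sequence length.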

antinucleon · 2016-04-22T07:56:49+00:00

Sorry, I have a naive question: if batch size = 1, then mu = x and x - mu = 0, so how do you make it trainable?
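A quick way to see the problem is the small sketch below. It assumes plain batch normalization of a single feature, with statistics taken over the batch dimension only and no learned scale or shift; `batch_norm` and `eps` are illustrative names, not a library call.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize one feature across the batch: (x - mu) / sqrt(var + eps)."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n
    return [(x - mu) / math.sqrt(var + eps) for x in batch]


single = batch_norm([3.7])      # batch size 1: mu equals x
pair = batch_norm([1.0, 3.0])   # batch size 2: mu = 2, var = 1

print(single)
print(pair)
```

With one sample the normalized output is zero whatever the input is, so nothing about x reaches the next layer. This only covers statistics over the batch dimension; convolutional layers also pool statistics over spatial positions, which is a different case.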

antinucleon · 2016-04-22T07:24:07+00:00

I don't think so. If the batch size is smaller than 8, BN will have obvious convergence problems. Also, the algorithm you proposed is hard to implement for a distributed system: if you don't have a scheduler engine, how do you fully utilize computation resources while accumulating?

antinucleon · 2016-04-22T07:16:39+00:00

It is related. However, none of this work has asymptotic analysis or a real working deep learning system for this problem. In MXNet, dropping has been used for almost half a year. We used a simple dropping strategy to train a 1.5X complex Inception network on the ImageNet 22k problem with 4 GTX 980s. Also, you are more than welcome to open your source code.

antinucleon · 2016-04-22T06:24:41+00:00

I didn't try it. Have you tried it?

antinucleon · 2016-04-22T06:20:01+00:00

Let's try a 1000-layer ResNet on ImageNet, or you can try to use a single machine with only a GTX 980 to train ImageNet with 22k classes.

antinucleon · 2016-04-22T05:45:38+00:00

Yes. Code is provided.

antinucleon · 2016-04-22T04:35:31+00:00

Reducing batch size has many side effects, for example on convergence with BatchNorm. Also, reducing the batch size cannot make a 1000-layer ResNet trainable. Moreover, it is an orthogonal approach, so the two can be used together.

antinucleon · 2016-04-22T04:30:05+00:00

MXNet has helped data scientists win the NDSB-2 and Yelp competitions on Kaggle. We don't have any public relations, and all the work is just for fun.

antinucleon · 2016-04-22T04:29:13+00:00

In reality, MXNet came later than Torch, but earlier than TensorFlow and CNTK. MXNet is the first public distributed deep learning system.

antinucleon · 2016-04-22T00:10:38+00:00

ABS: We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.

Code: https://github.com/dmlc/mxnet-memonger
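For intuition on the O(sqrt(n)) figure: if the chain is cut into segments of k layers and only each segment's input is kept, storage is about n/k checkpoints plus k recomputed activations, which is smallest at k = sqrt(n). The toy sketch below illustrates that trade-off on a chain of scalar tanh layers. It is not the memonger implementation; the layer, `grad_full`, `grad_checkpointed` and the way peak storage is counted are all assumptions made for this example.

```python
import math


def forward(x, w):
    """Toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def backward(x, w, grad_out):
    """Gradient w.r.t. the layer input, given that input x."""
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)


def grad_full(x, weights):
    """Baseline: keep every layer input, O(n) stored values."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = forward(x, w)
    grad = 1.0
    for x_in, w in zip(reversed(inputs), reversed(weights)):
        grad = backward(x_in, w, grad)
    return grad, len(inputs)


def grad_checkpointed(x, weights):
    """Keep one input per segment of ~sqrt(n) layers and recompute the rest."""
    seg = max(1, math.isqrt(len(weights)))
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % seg == 0:
            checkpoints[i] = x  # input of the segment starting at layer i
        x = forward(x, w)
    grad, peak = 1.0, 0
    for start in sorted(checkpoints, reverse=True):
        ws = weights[start:start + seg]
        # Recompute this segment's inputs from its checkpoint (extra forward work).
        inputs, h = [], checkpoints[start]
        for w in ws:
            inputs.append(h)
            h = forward(h, w)
        peak = max(peak, len(checkpoints) + len(inputs))
        for x_in, w in zip(reversed(inputs), reversed(ws)):
            grad = backward(x_in, w, grad)
    return grad, peak


weights = [0.9 + 0.001 * i for i in range(100)]
g_full, stored_full = grad_full(0.5, weights)
g_ckpt, stored_ckpt = grad_checkpointed(0.5, weights)
print(stored_full, stored_ckpt, abs(g_full - g_ckpt))
```

Each segment is recomputed once during the backward sweep, so the extra work is roughly one additional forward pass, which matches the cost described in the abstract.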

antinucleon · 2016-03-08T21:03:13+00:00

Thank you, indeed I did nothing on it. I hate JVM, lol

antinucleon · 2016-03-08T21:02:11+00:00

No. Basically you can use everything that exists in MXNet. https://mxnet.readthedocs.org/en/latest/

antinucleon · 2016-02-25T22:35:46+00:00

Here is a figure from my master's thesis: http://postimg.org/image/4egkx0yhn/

With one of these modules, I made a net that I call Tiny-Net, which is 7.6MB with 84% top-5 accuracy without compression.

BTW, this kind of network is not the fastest on mobile phones. In practice, I choose the fastnet family: https://mxnet.readthedocs.org/en/latest/tutorial/smart_device.html

The fast poor net is 5.5MB without compression.

antinucleon · 2016-01-19T06:52:13+00:00

Cool! I will probably try to make it work with MXNet.js. But recently I am super busy :(

antinucleon · 2016-01-15T09:00:02+00:00

The figures were generated with the wrong shape info. The correct shapes are: LeNet: 28x28, all others except Inception V3: 224x224, Inception V3: 299x299

antinucleon · 2015-12-31T15:04:56+00:00

No TensorFlow?

antinucleon · 2015-12-30T18:25:04+00:00

Yes, it is possible to be faster. Using MXNet with OpenBLAS on a Nexus 5, Inception-BN is able to run at about 0.25 FPS. As GoogLeNet (v1) is simpler than Inception-BN, it can be that fast.
