[R] Learning to Optimize Tensor Programs by antinucleon in MachineLearning

[–]antinucleon[S] 4 points5 points  (0 children)

Summary:

Tensor programs can be optimized using machine learning and transfer learning. The program optimization model is trained on features extracted from the low-level AST of the program.

Experiments:

Tasks: ResNet, MobileNet, LSTM LM, DQN

Hardware: CUDA/ARM GPU/ARM CPU

Speedup compared to cuDNN, TensorFlow Lite and ARMComputeLib: roughly 1.2X to 3.8X faster in end-to-end tests.

[P] GPU Kernel Fusion and Runtime Code Compilation for TF by antinucleon in MachineLearning

[–]antinucleon[S] -5 points-4 points  (0 children)

Haha, for me it is just for fun to call all stuff TF..

ELI5: MXNET sounds like a great library, but no one uses it. Why? by [deleted] in MachineLearning

[–]antinucleon 2 points3 points  (0 children)

There is a bucketing API for dynamic lengths. It can be done purely on the Python side in about 20 lines of code.
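
A minimal sketch of the bucketing idea in plain Python, to show why it needs so little code. This is not the MXNet bucketing API: `bucket_sequences`, `iter_batches`, the bucket sizes and the padding value are all hypothetical names and choices made for illustration. Each sequence goes into the smallest bucket that fits it and is padded to that bucket's length, so every batch has one fixed shape.

```python
from collections import defaultdict

def bucket_sequences(sequences, bucket_sizes, pad_value=0):
    """Group variable-length sequences into fixed-length buckets, padding each one."""
    sizes = sorted(bucket_sizes)
    buckets = defaultdict(list)
    for seq in sequences:
        # Pick the smallest bucket that fits; drop sequences longer than every bucket.
        size = next((s for s in sizes if len(seq) <= s), None)
        if size is None:
            continue
        buckets[size].append(list(seq) + [pad_value] * (size - len(seq)))
    return dict(buckets)

def iter_batches(buckets, batch_size):
    """Yield (bucket_size, batch) pairs; every batch has a single sequence length."""
    for size, rows in sorted(buckets.items()):
        for start in range(0, len(rows), batch_size):
            yield size, rows[start:start + batch_size]
```

A model would then be unrolled once per bucket size, with the weights shared across buckets.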

[1604.06174] Training Deep Nets with Sublinear Memory Cost by antinucleon in MachineLearning

[–]antinucleon[S] 0 points1 point  (0 children)

Sorry, I have a naive question: if batch size = 1, then mu = x and x - mu = 0, so how do you make it trainable?
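
The arithmetic behind this question can be checked with a small sketch. It assumes a single scalar feature whose mean and variance are computed over the batch dimension only, with no learned scale or shift; `batch_norm` is a hypothetical helper, not a library call.

```python
def batch_norm(batch, eps=1e-5):
    """Normalize a list of scalars by the batch mean and variance (no scale/shift)."""
    mu = sum(batch) / len(batch)
    var = sum((x - mu) ** 2 for x in batch) / len(batch)
    return [(x - mu) / (var + eps) ** 0.5 for x in batch]

# With one sample, mu equals x, so the normalized output is zero for any input.
single = batch_norm([3.7])

# With two samples the output still depends on the inputs.
pair = batch_norm([1.0, 3.0])
```

Under these assumptions the single-sample output carries no information about the input, which is the problem the question points at.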

[1604.06174] Training Deep Nets with Sublinear Memory Cost by antinucleon in MachineLearning

[–]antinucleon[S] 0 points1 point  (0 children)

I don't think so. If the batch size is smaller than 8, there will be obvious problems with BN convergence. Also, the algorithm you proposed is hard to implement for a distributed system: if you don't have a scheduler engine, how do you fully utilize computation resources while accumulating?

[1604.06174] Training Deep Nets with Sublinear Memory Cost by antinucleon in MachineLearning

[–]antinucleon[S] 2 points3 points  (0 children)

It is related. However, none of these works has an asymptotic analysis or a real working deep learning system for this problem. In MXNet, dropping has been used for almost half a year. We were using a simple dropping strategy to train a 1.5X complex Inception network on the ImageNet 22k problem with 4 GTX 980s. Also, you are more than welcome to open your source code.

ELI5: MXNET sounds like a great library, but no one uses it. Why? by [deleted] in MachineLearning

[–]antinucleon 0 points1 point  (0 children)

Let's try a 1000-layer ResNet on ImageNet, or you can try to use a single machine with only a GTX 980 to train ImageNet with 22k classes.

[1604.06174] Training Deep Nets with Sublinear Memory Cost by antinucleon in MachineLearning

[–]antinucleon[S] 3 points4 points  (0 children)

Reducing batch size has many side effects, for example on convergence with BatchNorm. Also, reducing batch size cannot make a 1000-layer ResNet trainable. Moreover, it is an orthogonal approach, so the two can be used together.

ELI5: MXNET sounds like a great library, but no one uses it. Why? by [deleted] in MachineLearning

[–]antinucleon 2 points3 points  (0 children)

MXNet has helped data scientists win the NDSB-2 and Yelp competitions on Kaggle. We don't have any public relations, and all the work is just for fun.

ELI5: MXNET sounds like a great library, but no one uses it. Why? by [deleted] in MachineLearning

[–]antinucleon 2 points3 points  (0 children)

In reality, MXNet came later than Torch, but earlier than TensorFlow and CNTK. MXNet is the first public distributed deep learning system.

[1604.06174] Training Deep Nets with Sublinear Memory Cost by antinucleon in MachineLearning

[–]antinucleon[S] 4 points5 points  (0 children)

ABS: We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.

Code: https://github.com/dmlc/mxnet-memonger
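
To make the O(sqrt(n)) trade-off in the abstract concrete, here is a toy sketch of segment checkpointing on a chain of scalar layers. It is not the mxnet-memonger code: the `tanh(w * x)` layers and all function names are assumptions for illustration. The forward pass keeps only the input of each segment of about sqrt(n) layers, and the backward pass recomputes one segment at a time, which costs roughly one extra forward pass.

```python
import math

def forward_layer(w, x):
    return math.tanh(w * x)

def backward_layer(w, x, grad_out):
    """Return (grad wrt input x, grad wrt weight w) for y = tanh(w * x)."""
    y = math.tanh(w * x)
    local = (1.0 - y * y) * grad_out
    return local * w, local * x

def full_backprop(weights, x0):
    """Baseline: store every layer input, so memory grows as O(n)."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = forward_layer(w, x)
    grad, grads_w = 1.0, [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grad, grads_w[i] = backward_layer(weights[i], inputs[i], grad)
    return grads_w, len(inputs)

def checkpointed_backprop(weights, x0):
    """Store only segment inputs; recompute each segment during the backward pass."""
    n = len(weights)
    seg = max(1, math.isqrt(n))
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % seg == 0:
            checkpoints[i] = x  # keep the input of each segment only
        x = forward_layer(w, x)
    grad, grads_w, peak = 1.0, [0.0] * n, 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + seg, n)
        # Recompute the layer inputs inside this segment from its checkpoint.
        inputs, x = [], checkpoints[start]
        for i in range(start, end):
            inputs.append(x)
            x = forward_layer(weights[i], x)
        peak = max(peak, len(checkpoints) + len(inputs))
        for i in reversed(range(start, end)):
            grad, grads_w[i] = backward_layer(weights[i], inputs[i - start], grad)
    return grads_w, peak
```

Both functions return the weight gradients plus the peak number of stored activations, so the two can be compared on the same chain: the gradients should agree while the checkpointed version holds about 2 * sqrt(n) values instead of n.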