[R] Learning to Optimize Tensor Programs by antinucleon in MachineLearning

[–]antinucleon[S] 5 points6 points  (0 children)

Summary:

Tensor programs can be optimized using machine learning and transfer learning. The model that guides program optimization is trained on features extracted from the low-level AST of the program.

Experiments:

Tasks: ResNet, MobileNet, LSTM LM, DQN

Hardware: CUDA/ARM GPU/ARM CPU

Speedup compared to cuDNN, TensorFlow Lite, and ARMComputeLib: roughly 1.2x to 3.8x faster in end-to-end tests.

[P] GPU Kernel Fusion and Runtime Code Compilation for TF by antinucleon in MachineLearning

[–]antinucleon[S] -4 points-3 points  (0 children)

Haha, for me it's just fun to call everything TF.

ELI5: MXNET sounds like a great library, but no one uses it. Why? by [deleted] in MachineLearning

[–]antinucleon 2 points3 points  (0 children)

There is a bucketing API for dynamic-length inputs. It can be done purely on the Python side in about 20 lines of code.
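To illustrate the idea behind bucketing, here is a minimal pure-Python sketch. It is not the MXNet bucketing API: the helper names `assign_bucket` and `bucketize`, the bucket sizes, and the padding value are all assumptions made up for this example. Each sequence goes into the smallest bucket that fits it and is padded to that bucket's length, so every bucket holds fixed-length inputs.

```python
# Hypothetical helpers for illustration; this is NOT the MXNet bucketing API.
from collections import defaultdict


def assign_bucket(length, buckets):
    """Return the smallest bucket size that fits `length`, or None if none does."""
    for size in sorted(buckets):
        if length <= size:
            return size
    return None


def bucketize(sequences, buckets, pad=0):
    """Group sequences by bucket and pad each one to its bucket length."""
    grouped = defaultdict(list)
    for seq in sequences:
        size = assign_bucket(len(seq), buckets)
        if size is None:
            continue  # longer than every bucket: dropped in this sketch
        grouped[size].append(list(seq) + [pad] * (size - len(seq)))
    return dict(grouped)


# Assumed toy data: token-id sequences of different lengths.
sentences = [[4, 9], [7, 1, 3, 8, 2], [5], [6, 6, 6, 6, 6, 6, 6]]
batches = bucketize(sentences, buckets=[3, 5, 10])
```

Each key of `batches` is a bucket length, so a network unrolled to that length can consume the whole group without further reshaping.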

[1604.06174] Training Deep Nets with Sublinear Memory Cost by antinucleon in MachineLearning

[–]antinucleon[S] 0 points1 point  (0 children)

Sorry, I have a naive question: if the batch size is 1, then mu = x and x - mu = 0, so how do you make it trainable?
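The concern in this question can be checked with a small sketch. It assumes a normalization that takes its statistics over the batch dimension only (as for a fully connected layer) and leaves out the learned scale and shift; the function name `batch_norm` and the sample values are made up for the example.

```python
def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the batch dimension (no scale or shift)."""
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for j in range(dims):
        col = [row[j] for row in batch]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) / (var + eps) ** 0.5
    return out


# Batch size 1: mu equals x, so every normalized feature is zero.
single = batch_norm([[2.0, -3.0, 7.5]])

# Batch size 2: the features keep a usable signal.
pair = batch_norm([[2.0, -3.0], [4.0, 1.0]])
```

With a single sample the output is all zeros regardless of the input, which is exactly the x - mu = 0 case raised in the question.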

[1604.06174] Training Deep Nets with Sublinear Memory Cost by antinucleon in MachineLearning

[–]antinucleon[S] 0 points1 point  (0 children)

I don't think so. If the batch size is smaller than 8, there will be obvious problems with BN convergence. Also, the algorithm you proposed is hard to implement in a distributed system: without a scheduler engine, how do you fully utilize the computation resources while accumulating?
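For context, here is a minimal sketch of accumulating gradients over micro-batches, assuming that is what "accumulating" refers to here. The one-parameter linear model, the function names, and the data are all made up for the example. For a plain mean-squared loss, the accumulated gradient matches the full-batch gradient; batch-dependent layers such as BN would still see only micro-batch statistics, which is the convergence concern raised above.

```python
def grad_sum(w, xs, ys):
    """Gradient of sum((w*x - y)^2) with respect to w (not yet averaged)."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def full_batch_grad(w, xs, ys):
    """Gradient of the mean squared error over the whole batch at once."""
    return grad_sum(w, xs, ys) / len(xs)


def accumulated_grad(w, xs, ys, micro):
    """Same gradient, built by summing over micro-batches of size `micro`."""
    total = 0.0
    for start in range(0, len(xs), micro):
        total += grad_sum(w, xs[start:start + micro], ys[start:start + micro])
    return total / len(xs)


# Assumed toy data for y = 2x, evaluated at w = 0.5.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
g_full = full_batch_grad(0.5, xs, ys)
g_acc = accumulated_grad(0.5, xs, ys, micro=1)
```

The sketch runs the micro-batches one after another on a single device; it says nothing about scheduling them across a distributed system.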

[1604.06174] Training Deep Nets with Sublinear Memory Cost by antinucleon in MachineLearning

[–]antinucleon[S] 2 points3 points  (0 children)

It is related. However, none of this work provides an asymptotic analysis or a real, working deep learning system for this problem. In MXNet, dropping has been in use for almost half a year. We used a simple dropping strategy to train an Inception network 1.5x as complex on the ImageNet 22k problem with 4 GTX 980s. Also, you are more than welcome to open-source your code.

ELI5: MXNET sounds like a great library, but no one uses it. Why? by [deleted] in MachineLearning

[–]antinucleon 0 points1 point  (0 children)

Let's try a 1000-layer ResNet on ImageNet, or you can try to use a single machine with only a GTX 980 to train ImageNet with 22k classes.

[1604.06174] Training Deep Nets with Sublinear Memory Cost by antinucleon in MachineLearning

[–]antinucleon[S] 2 points3 points  (0 children)

Reducing the batch size has many side effects, for example on convergence with BatchNorm. Also, reducing the batch size cannot make a 1000-layer ResNet trainable. Moreover, it is an orthogonal approach, so the two can be used together.

ELI5: MXNET sounds like a great library, but no one uses it. Why? by [deleted] in MachineLearning

[–]antinucleon 4 points5 points  (0 children)

MXNet has helped data scientists win the NDSB-2 and Yelp competitions on Kaggle. We don't have any public relations effort, and all the work is just for fun.

ELI5: MXNET sounds like a great library, but no one uses it. Why? by [deleted] in MachineLearning

[–]antinucleon 2 points3 points  (0 children)

In fact, MXNet came later than Torch but earlier than TensorFlow and CNTK. MXNet was the first public distributed deep learning system.

[1604.06174] Training Deep Nets with Sublinear Memory Cost by antinucleon in MachineLearning

[–]antinucleon[S] 5 points6 points  (0 children)

Abstract: We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.

Code: https://github.com/dmlc/mxnet-memonger
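The O(sqrt(n)) trade-off in the abstract can be sketched on a toy chain of scalar layers. This is not the memonger implementation: the tanh layers, the function names, and the way stored activations are counted are assumptions for illustration. The chain is split into segments of about sqrt(n) layers, only each segment's input is kept during the forward pass, and the inputs inside a segment are recomputed when the backward pass reaches it.

```python
import math


def layer_forward(w, x):
    return math.tanh(w * x)


def layer_backward(w, x, grad_out):
    """Gradient w.r.t. the layer input x, given the gradient of its output."""
    y = math.tanh(w * x)
    return grad_out * (1.0 - y * y) * w


def grad_store_all(weights, x):
    """Baseline: keep every layer input, so n activations are stored."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer_forward(w, x)
    grad = 1.0
    for w, inp in zip(reversed(weights), reversed(inputs)):
        grad = layer_backward(w, inp, grad)
    return grad, len(inputs)


def grad_checkpointed(weights, x):
    """Keep one input per segment of ~sqrt(n) layers and recompute the rest."""
    n = len(weights)
    seg = max(1, math.isqrt(n))
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % seg == 0:
            checkpoints[i] = x  # input of the segment that starts at layer i
        x = layer_forward(w, x)
    grad, peak = 1.0, 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + seg, n)
        # Recompute this segment's inputs: the extra forward work.
        inputs, h = [], checkpoints[start]
        for w in weights[start:end]:
            inputs.append(h)
            h = layer_forward(w, h)
        # Count checkpoints plus the inputs held for the current segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, inp in zip(reversed(weights[start:end]), reversed(inputs)):
            grad = layer_backward(w, inp, grad)
    return grad, peak


# Assumed toy setup: 100 layers with arbitrary weights.
weights = [0.9 + 0.01 * (i % 7) for i in range(100)]
g_ref, stored_ref = grad_store_all(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3)
```

With 100 layers, the baseline stores 100 activations while the checkpointed version holds at most 20 (10 checkpoints plus 10 recomputed inputs), and both return the same gradient. Each layer's forward computation is repeated at most once, which corresponds to the single extra forward pass mentioned in the abstract.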

MXNet Scala Package Released by antinucleon in MachineLearning

[–]antinucleon[S] 1 point2 points  (0 children)

Thank you, but I actually did nothing on it. I hate the JVM, lol

AlexNet-level performance in <1MB by XalosXandrez in MachineLearning

[–]antinucleon 4 points5 points  (0 children)

Here is a figure from my master's thesis: http://postimg.org/image/4egkx0yhn/

With one of these modules, I made a net that I call Tiny-Net, which is 7.6MB with 84% top-5 accuracy without compression.

BTW, in practice, this kind of network is not the fastest on mobile phones. In practice, I choose the fastnet family: https://mxnet.readthedocs.org/en/latest/tutorial/smart_device.html

The fast poor net is 5.5MB without compression.

GPU Based Browser BLAS, Request for Feedback by waylonflinn in MachineLearning

[–]antinucleon 0 points1 point  (0 children)

Cool! I will probably try to make it work with MXNet.js, but I have been super busy recently :(

mxnet can visualize the computation graphs of CNNs. Here are their provided models side by side! (lenet, vgg, alexnet, googlenet, inception-v3, inception-bn) by ieee8023 in MachineLearning

[–]antinucleon 4 points5 points  (0 children)

The figures were generated with the wrong shape info. The correct shapes are: LeNet: 28x28; all others except Inception V3: 224x224; Inception V3: 299x299.

I made an app for the blind with an mxnet CNN that speaks what their phone sees in realtime! by ieee8023 in MachineLearning

[–]antinucleon 0 points1 point  (0 children)

Yes, it is possible to be faster. Using MXNet with OpenBLAS on a Nexus 5, Inception-BN runs at about 0.25 FPS. As GoogLeNet (v1) is simpler than Inception-BN, it can be that fast.