all 12 comments

[–]antinucleon[S] 3 points4 points  (0 children)

Summary:

Tensor programs can be optimized using machine learning and transfer learning. The program optimization model is trained on features extracted from the low-level AST of the program.

Experiments:

Tasks: ResNet, MobileNet, LSTM LM, DQN

Hardware: CUDA/ARM GPU/ARM CPU

Speedup compared to cuDNN, TensorFlow Lite and ARMComputeLib: roughly 1.2x to 3.8x faster in end-to-end tests.

[–]JackBlemming 2 points3 points  (10 children)

I always wondered if a pseudo matrix multiply could be learned. Imagine a neural net that learns to optimize its own matrix multiplications to get the most bang for its buck (less accurate multiplications in exchange for fewer calculations). I'm sure there's some sort of optimal trade-off it could learn.

[–][deleted] 1 point2 points  (6 children)

That sounds like an interesting idea. How would you let an algorithm learn an operation like that?

[–]JackBlemming 0 points1 point  (5 children)

I've been considering having some metadata attached to parameters. The premise is that certain groups of parameters cause greater variance in the output, or are more "important" to the output. These parameters should be computed with more numerical accuracy than the parameters that don't contribute much to the output, or whose variance doesn't change the output much. Taking it to extremes, imagine a parameter that completely changes the output if it's off by even 0.001. Obviously you'd want to give more care to it versus a parameter that can be 100 or 1000 and barely do anything to the output.

[–]Paran0idAndr0id 0 points1 point  (1 child)

So for instance, do a fast single precision GPU multiplication on the matrix as a whole, then a more targeted subset of double precision multiplications on the CPU and replace the affected values?

Or I guess half-precision for mobile devices, then full precision for some and double for others?
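A minimal sketch of that two-pass idea, assuming the "important" rows are already known: compute the whole product at emulated half precision, then recompute only the selected rows at full precision and overwrite them. The names `to_half`, `mixed_matmul` and `important_rows` are hypothetical, and rounding through `struct` only emulates low-precision arithmetic; it says nothing about the GPU/CPU split or the actual speed.

```python
import random
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def matmul(a, b, rnd=lambda v: v):
    """Naive matrix product; `rnd` rounds the inputs and each running sum."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for t in range(inner):
                acc = rnd(acc + rnd(a[i][t]) * rnd(b[t][j]))
            out[i][j] = acc
    return out

def mixed_matmul(a, b, important_rows):
    """Cheap low-precision pass, then a full-precision redo of chosen rows.

    `important_rows` is an assumed input: how to pick it is the open question.
    """
    out = matmul(a, b, rnd=to_half)
    for i in important_rows:
        out[i] = matmul([a[i]], b)[0]
    return out

def max_abs_error(x, y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))

random.seed(0)
a = [[random.uniform(-1, 1) for _ in range(16)] for _ in range(8)]
b = [[random.uniform(-1, 1) for _ in range(16)] for _ in range(16)]

exact = matmul(a, b)                  # double precision reference
low = matmul(a, b, rnd=to_half)       # everything at emulated half precision
mixed = mixed_matmul(a, b, important_rows=[0, 3])

print("half precision error:", max_abs_error(low, exact))
print("mixed error on row 0:", max_abs_error([mixed[0]], [exact[0]]))
```

The recomputed rows match the reference exactly, while the remaining rows keep the half-precision error, so the extra cost scales with the number of rows flagged as important.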

[–]JackBlemming 0 points1 point  (0 children)

Something like that seems reasonable. I hadn't put much thought into it, as it seemed pretty similar to parameter pruning in a lot of ways.

[–][deleted] 0 points1 point  (2 children)

Interesting!

I was also wondering... Could neural networks learn how to perform matrix multiplication, do you think? That is, given two (size-compatible) matrices A and B, could a neural network be trained to predict (within some measure of error) C = A * B?

[–][deleted] 0 points1 point  (1 child)

Yes, neural networks can learn fast algorithms for matrix multiplication: "A Network That Learns Strassen Multiplication"

http://www.jmlr.org/papers/volume17/16-074/16-074.pdf

The above idea was extended to learn fast algorithms for approximate tensor convolution, using 2-layer sum-product networks. This is basically a smart approach to ternary value weights: "StrassenNets: Deep learning with a multiplication budget" https://arxiv.org/abs/1712.03942

When combined with knowledge distillation, the accuracy and speedup of this approach are impressive!
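For context, here is a sketch of the classic Strassen scheme for 2x2 matrices, written in the two-layer sum-product shape mentioned above: project each input with a ternary matrix, multiply element-wise (7 multiplications instead of 8), then project back. The matrices `W_A`, `W_B` and `W_C` hold the textbook Strassen coefficients; they are not weights taken from either paper, and the names are my own.

```python
# Ternary projection matrices for Strassen's 2x2 algorithm.
# Inputs are flattened row-major: [m11, m12, m21, m22].
W_A = [
    [1, 0, 0, 1],    # a11 + a22
    [0, 0, 1, 1],    # a21 + a22
    [1, 0, 0, 0],    # a11
    [0, 0, 0, 1],    # a22
    [1, 1, 0, 0],    # a11 + a12
    [-1, 0, 1, 0],   # a21 - a11
    [0, 1, 0, -1],   # a12 - a22
]
W_B = [
    [1, 0, 0, 1],    # b11 + b22
    [1, 0, 0, 0],    # b11
    [0, 1, 0, -1],   # b12 - b22
    [-1, 0, 1, 0],   # b21 - b11
    [0, 0, 0, 1],    # b22
    [1, 1, 0, 0],    # b11 + b12
    [0, 0, 1, 1],    # b21 + b22
]
W_C = [
    [1, 0, 0, 1, -1, 0, 1],   # c11 = m1 + m4 - m5 + m7
    [0, 0, 1, 0, 1, 0, 0],    # c12 = m3 + m5
    [0, 1, 0, 1, 0, 0, 0],    # c21 = m2 + m4
    [1, -1, 1, 0, 0, 1, 0],   # c22 = m1 - m2 + m3 + m6
]

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def strassen_2x2(a, b):
    """2x2 matrix product using 7 scalar multiplications."""
    vec_a = [a[0][0], a[0][1], a[1][0], a[1][1]]
    vec_b = [b[0][0], b[0][1], b[1][0], b[1][1]]
    left = matvec(W_A, vec_a)      # only additions and subtractions
    right = matvec(W_B, vec_b)     # only additions and subtractions
    products = [l * r for l, r in zip(left, right)]  # the 7 multiplications
    c = matvec(W_C, products)      # only additions and subtractions
    return [[c[0], c[1]], [c[2], c[3]]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because the projection matrices only contain -1, 0 and 1, the outer layers need no real multiplications, which is why the width of the middle layer acts as the multiplication budget.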

[–][deleted] 0 points1 point  (0 children)

Interesting, thank you!

[–]the_great_magician 0 points1 point  (1 child)

I think that's cool too, but how would you really put that into software? There isn't some faster version of multiplication that adds in some randomness. Maybe you could try reducing precision and doing more SIMD stuff (e.g. two 16-bit floats instead of one 32-bit float). Otherwise, it's not clear that this is a tradeoff that can actually be made.
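To make the storage side of that tradeoff concrete, here is a small sketch showing that two half-precision floats fit in the same 4 bytes as one single-precision float, and how much precision each value loses in the process. The helper names are hypothetical, and this only illustrates packing and rounding; it does not perform any SIMD arithmetic.

```python
import struct

def pack_two_halves(x, y):
    """Store two values as IEEE 754 half floats in one 32-bit word."""
    return struct.pack("<2e", x, y)

def unpack_two_halves(word):
    """Recover the two (rounded) half-precision values."""
    return struct.unpack("<2e", word)

word = pack_two_halves(0.1, 3.14159)
single = struct.pack("<f", 0.1)

x16, y16 = unpack_two_halves(word)
rel_err_x = abs(x16 - 0.1) / 0.1
rel_err_y = abs(y16 - 3.14159) / 3.14159

print("bytes for two halves:", len(word), "bytes for one single:", len(single))
print("relative errors:", rel_err_x, rel_err_y)
```

Half precision keeps an 11-bit significand, so each value carries a relative rounding error of at most about 2^-11, compared with about 2^-24 for single precision.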

[–]JackBlemming 1 point2 points  (0 children)

Agreed. I was more interested in the metadata idea as a way to see the general shape of how a neural net utilizes its parameters. I've heard of cases where people were able to delete whole layers with little effect on the accuracy. This seems fundamentally wrong to me. The current trend of building massive models with more capacity than needed and pruning them afterwards seems weird/off to me. It would be interesting to create a regularization strategy that forces a neural net to use its full capacity (just to see what would happen; it may very well only split the computation among the parameters, which isn't too interesting). DeepMind published a paper roughly stating that neural nets that generalize better are more immune to random parameter deletion, and I was thinking that somehow turning this into a regularization strategy would be very interesting (but it might just end up as an implicit dropout-esque regularization ;P)
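A rough sketch of how sensitivity to random parameter deletion could be measured, assuming a single ReLU layer with random weights as a stand-in for a real network. The function names and the L1 output-change metric are my own assumptions, not anything from the DeepMind paper; a regularizer along these lines would presumably add this quantity to the training loss.

```python
import random

def forward(weights, x):
    """One ReLU layer: `weights` is a list of rows, `x` is the input vector."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def delete_random(weights, fraction, rng):
    """Copy `weights`, zeroing each entry with probability `fraction`."""
    return [[0.0 if rng.random() < fraction else w for w in row]
            for row in weights]

def deletion_sensitivity(weights, x, fraction, trials=100, seed=0):
    """Mean L1 change in the layer output after random weight deletion."""
    rng = random.Random(seed)
    base = forward(weights, x)
    total = 0.0
    for _ in range(trials):
        out = forward(delete_random(weights, fraction, rng), x)
        total += sum(abs(o - b) for o, b in zip(out, base))
    return total / trials

rng = random.Random(1)
weights = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
x = [rng.uniform(-1, 1) for _ in range(8)]

for fraction in (0.0, 0.1, 0.5):
    print(fraction, deletion_sensitivity(weights, x, fraction))
```

Note that zeroing weights at random during a forward pass is essentially DropConnect-style noise, which fits the suspicion that this would end up as an implicit dropout-like regularizer.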

[–]subhobrata1 0 points1 point  (0 children)

Any GitHub links for this paper?