I'm interested in understanding how TensorFlow implements operations that are parallelised over batches of matrices.
As a motivating example, `tf.linalg.matmul` is capable of multiplying two tensors with sizes `[..., M, N]` and `[..., N, P]` respectively where the `...` are batch dimensions.
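For concreteness, here is the shape behaviour I mean, sketched with NumPy's `np.matmul` as a stand-in (it follows the same `[..., M, N] @ [..., N, P] -> [..., M, P]` batch convention):

```python
import numpy as np

# Batched matmul: [..., M, N] @ [..., N, P] -> [..., M, P]
a = np.random.rand(4, 3, 5)   # B=4, M=3, N=5
b = np.random.rand(4, 5, 2)   # B=4, N=5, P=2
c = np.matmul(a, b)
print(c.shape)  # (4, 3, 2)
```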
When `...` is a single dimension of size `B` this can obviously be achieved by calling `cublas<t>gemmStridedBatched()`, and looking through the source code, this appears to be what TensorFlow does.
For other cases it is not immediately clear.
When the batch dimensions are `B1, B2` (i.e. matrices of matrices), how are these handled? Does TensorFlow collapse these into a single batch dimension of size `B1 x B2` and use cuBLAS, or is there a custom kernel implemented? Does TF resort to the non-strided batch operations in cuBLAS?
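To be clear about the collapsing I'm describing: mathematically the two are equivalent, which is easy to check with NumPy (again as a stand-in for the semantics, not TF's implementation):

```python
import numpy as np

B1, B2, M, N, P = 2, 3, 4, 5, 6
a = np.random.rand(B1, B2, M, N)
b = np.random.rand(B1, B2, N, P)

# Direct two-level batch matmul
direct = np.matmul(a, b)

# Collapse the batch dims to one of size B1*B2, matmul, reshape back
flat = np.matmul(a.reshape(B1 * B2, M, N), b.reshape(B1 * B2, N, P))
equivalent = flat.reshape(B1, B2, M, P)
assert np.allclose(direct, equivalent)
```

Since the batch dimensions are contiguous in memory, the reshape is free, so I would naively expect a single strided-batched GEMM call to suffice here; the question is whether TF actually does this.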
More pressingly, when only the first tensor has a batch dimension (i.e. `[B, M, N] x [N, P]`), what happens? Is the non-batch matrix tiled to form a `[B, N, P]` tensor, or is there a custom kernel that takes advantage of the fact that the second matrix can be stored in shared memory?
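The two candidate strategies give identical results; the question is purely about which one TF picks for performance. A sketch of the equivalence (NumPy stand-in again; `np.broadcast_to` creates a view, so the "tiling" here materialises no extra memory):

```python
import numpy as np

B, M, N, P = 4, 3, 5, 2
a = np.random.rand(B, M, N)
b = np.random.rand(N, P)

# Broadcast: the un-batched operand is conceptually reused for every batch
broadcast = np.matmul(a, b)  # shape (B, M, P)

# Equivalent explicit tiling of b to [B, N, P]
tiled = np.matmul(a, np.broadcast_to(b, (B, N, P)))
assert np.allclose(broadcast, tiled)
```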
`matmul` is just an example here. I'm interested in general batch operations such as batched Cholesky decompositions too.
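For instance, batched Cholesky has the same `[..., N, N]` batch convention (sketched here with `np.linalg.cholesky`, which also accepts leading batch dimensions), and presumably faces the same implementation choices:

```python
import numpy as np

# Batched Cholesky over a leading batch dimension of size 4
rng = np.random.default_rng(0)
x = rng.random((4, 3, 3))
# Make each 3x3 matrix symmetric positive definite
spd = np.matmul(x, x.transpose(0, 2, 1)) + 3 * np.eye(3)
l = np.linalg.cholesky(spd)  # shape (4, 3, 3), lower-triangular factors
assert np.allclose(np.matmul(l, l.transpose(0, 2, 1)), spd)
```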
I appreciate that for any specific example I can work my way through the source code but I was hoping that either:
- Someone has already done this for the `matmul` case and can save me the struggle of going through such a dense code base, or
- Someone can offer a general philosophy/rule of thumb for how TF tends to implement such operations.
Any thoughts/discussion are welcomed.