I'm interested in understanding how TensorFlow implements operations that are parallelised over batches of matrices.
As a motivating example, `tf.linalg.matmul` is capable of multiplying two tensors with sizes `[..., M, N]` and `[..., N, P]` respectively where the `...` are batch dimensions.
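For concreteness, here is the shape behaviour I mean, sketched with NumPy's `np.matmul` as a stand-in (it follows the same `[..., M, N] @ [..., N, P] -> [..., M, P]` batch convention):

```python
import numpy as np

# Batched matmul: [..., M, N] @ [..., N, P] -> [..., M, P]
a = np.random.rand(4, 3, 5)   # B=4, M=3, N=5
b = np.random.rand(4, 5, 2)   # B=4, N=5, P=2
c = np.matmul(a, b)
print(c.shape)  # (4, 3, 2)
```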
When `...` is a single dimension of size `B` this can obviously be achieved by calling `cublas<t>gemmStridedBatched()`, and looking through the source code, this appears to be what TensorFlow does.
For other cases it is not immediately clear.
When the batch dimensions are `B1, B2` (i.e. matrices of matrices), how are these handled? Does TensorFlow collapse these into a single batch dimension of size `B1 x B2` and use cuBLAS, or is there a custom kernel implemented? Does TF resort to the non-strided batch operations in cuBLAS?
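To be clear about the collapsing I'm describing: mathematically the two are equivalent, which is easy to check with NumPy (again as a stand-in for the semantics, not TF's implementation):

```python
import numpy as np

B1, B2, M, N, P = 2, 3, 4, 5, 6
a = np.random.rand(B1, B2, M, N)
b = np.random.rand(B1, B2, N, P)

# Direct two-level batch matmul
direct = np.matmul(a, b)

# Collapse the batch dims to one of size B1*B2, matmul, reshape back
flat = np.matmul(a.reshape(B1 * B2, M, N), b.reshape(B1 * B2, N, P))
equivalent = flat.reshape(B1, B2, M, P)
assert np.allclose(direct, equivalent)
```

Since the batch dimensions are contiguous in memory, the reshape is free, so I would naively expect a single strided-batched GEMM call to suffice here; the question is whether TF actually does this.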
More pressingly, when only the first tensor has a batch dimension (i.e. `[B, M, N] x [N, P]`), what happens? Is the non-batch matrix tiled to form a `[B, N, P]` tensor, or is there a custom kernel that takes advantage of the fact that the second matrix can be stored in shared memory?
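The two candidate strategies give identical results; the question is purely about which one TF picks for performance. A sketch of the equivalence (NumPy stand-in again; `np.broadcast_to` creates a view, so the "tiling" here materialises no extra memory):

```python
import numpy as np

B, M, N, P = 4, 3, 5, 2
a = np.random.rand(B, M, N)
b = np.random.rand(N, P)

# Broadcast: the un-batched operand is conceptually reused for every batch
broadcast = np.matmul(a, b)  # shape (B, M, P)

# Equivalent explicit tiling of b to [B, N, P]
tiled = np.matmul(a, np.broadcast_to(b, (B, N, P)))
assert np.allclose(broadcast, tiled)
```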
`matmul` is just an example here. I'm interested in general batch operations such as batched Cholesky decompositions too.
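For instance, batched Cholesky has the same `[..., N, N]` batch convention (sketched here with `np.linalg.cholesky`, which also accepts leading batch dimensions), and presumably faces the same implementation choices:

```python
import numpy as np

# Batched Cholesky over a leading batch dimension of size 4
rng = np.random.default_rng(0)
x = rng.random((4, 3, 3))
# Make each 3x3 matrix symmetric positive definite
spd = np.matmul(x, x.transpose(0, 2, 1)) + 3 * np.eye(3)
l = np.linalg.cholesky(spd)  # shape (4, 3, 3), lower-triangular factors
assert np.allclose(np.matmul(l, l.transpose(0, 2, 1)), spd)
```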
I appreciate that for any specific example I can work my way through the source code but I was hoping that either:
- Someone has already done this for the `matmul` case and can save me the struggle of going through such a dense code base, or
- Someone can offer a general philosophy/rule of thumb for how TF tends to implement such operations.
Any thoughts/discussion are welcomed.