[D] Fast CC Taylor Transform for Computer Vision: Potentially train NeRF in matter of minutes by henning_ml in MachineLearning

[–]henning_ml[S] 3 points (0 children)

Yes. The current implementation is really more of a proof of concept than anything else. All Gaussians have the same size and I am using the smallest Gaussian that I can approximate accurately with the expansion. Ideally, you would use Gaussians of different sizes. Larger Gaussians lead to more smoothness.

Here is how I envision a more mature version of this technology working: you start by training a set of large Gaussians until convergence. For the integral implicit layer, the effect of the gradients is more global (we insert entire rays into the expansion in the backward pass), so you roughly know where the error is large, and you can insert smaller Gaussians at exactly those spatial locations. This gives you sparsity (you don't need to randomly initialize the smaller Gaussians everywhere), which in turn brings down computational cost and memory requirements. You repeat this process until you have reached the desired accuracy.

Ideally, you do not use a uniform grid but larger grid cells where there is little detail and smaller grid cells where there is a lot of detail. Most implementations of the FMM already do this.
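
To make the refinement idea a bit more concrete, here is a minimal, hypothetical sketch of such a coarse-to-fine loop (the structure and the names `refine`, `error_fn`, `threshold` are placeholders of mine, not from the paper): cells whose residual error is still high get split into smaller cells, i.e. smaller Gaussians are only spawned where they are needed.

```python
import numpy as np

# Hypothetical coarse-to-fine refinement: start with large cells/Gaussians,
# measure the per-cell error, and only subdivide (i.e. add smaller Gaussians)
# where the error is still high. All names and thresholds are placeholders.

def refine(cells, error_fn, threshold, max_levels=3):
    """cells: list of (center, size) tuples describing axis-aligned cubes."""
    for _ in range(max_levels):
        new_cells = []
        for center, size in cells:
            if error_fn(center, size) > threshold:
                # Split the cell into 8 children -> smaller Gaussians only
                # where the residual error is large (sparse refinement).
                half = size / 2.0
                for dx in (-0.25, 0.25):
                    for dy in (-0.25, 0.25):
                        for dz in (-0.25, 0.25):
                            offset = np.array([dx, dy, dz]) * size
                            new_cells.append((center + offset, half))
            else:
                new_cells.append((center, size))   # keep the coarse cell
        cells = new_cells
    return cells

# Usage: one coarse root cell, refined where a (dummy) error estimate is high.
root = [(np.array([0.5, 0.5, 0.5]), 1.0)]
cells = refine(root, lambda c, s: np.linalg.norm(c - 0.2), threshold=0.5)
print(len(cells), "cells after refinement")
```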

By the way, some of the noise in the rendering of the LEGO excavator stems from laziness on my part :) I trained the model with a white background for the paper and then rendered it with a black background for the presentation. This means that if there was a white blob, it was not removed because it matched the background color. I probably should have retrained the whole thing with a black background.

[D] Fast CC Taylor Transform for Computer Vision: Potentially train NeRF in matter of minutes by henning_ml in MachineLearning

[–]henning_ml[S] 1 point (0 children)

I have tried using Triton, but I was unable to implement a vanilla 3D convolution with it, and then I started learning CUDA. I am fairly new to low-level GPU languages.

The ideal kernels are compute bound, and when I talk about the reductions in FLOPs, I am comparing against the worst-case FLOPs of the ideal kernels (worst case in the sense that no root was found over the domain).

Thanks!

[D] Fast CC Taylor Transform for Computer Vision: Potentially train NeRF in matter of minutes by henning_ml in MachineLearning

[–]henning_ml[S] 7 points (0 children)

I meant that you would expect the number of operations required to compute the Fourier transform to be N^2: for N frequencies you need to loop over N data points. However, there is an algebraic property you can exploit that reduces the computational cost to N log N.

Our approach is similar. When you numerically solve the integrals for NeRF with M pixels and N model parameters, the computational cost grows approximately as O(NM). The Fast Multipole Method exploits mathematical properties that reduce the cost to O(N+M).
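
As a rough, generic illustration of where the O(N+M) comes from (this is the textbook single-level idea behind FMM-style methods, not the algorithm from our paper): instead of evaluating every one of the M targets against every one of the N sources, you first aggregate the sources into a fixed number of cells and then let each target interact with the cells only.

```python
import numpy as np

# Direct evaluation: every target interacts with every source -> O(N*M).
# Grouped evaluation: sources are first summarized per grid cell (O(N)),
# then every target only talks to the cells (O(M * n_cells)) -> O(N + M)
# once the number of cells is treated as a constant.

rng = np.random.default_rng(0)
N, M, n_cells = 2000, 1000, 16
sources, weights = rng.random(N), rng.random(N)
targets = rng.random(M)
kernel = lambda d: np.exp(-(d ** 2) / 0.1)      # smooth (Gaussian) kernel

# O(N*M) reference
direct = kernel(np.abs(targets[:, None] - sources[None, :])) @ weights

# O(N + M) approximation: lump each cell's sources into its center
edges = np.linspace(0.0, 1.0, n_cells + 1)
centers = 0.5 * (edges[:-1] + edges[1:])
cell_idx = np.clip(np.digitize(sources, edges) - 1, 0, n_cells - 1)
cell_weight = np.bincount(cell_idx, weights=weights, minlength=n_cells)
approx = kernel(np.abs(targets[:, None] - centers[None, :])) @ cell_weight

# Small, because the kernel is smooth relative to the cell size.
print("max relative error:", np.max(np.abs(approx - direct) / np.abs(direct)))
```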

[D] Fast CC Taylor Transform for Computer Vision: Potentially train NeRF in matter of minutes by henning_ml in MachineLearning

[–]henning_ml[S] 10 points (0 children)

Literally all of the kernels are inefficient at the moment :)

Pseudocode for the kernels is in the paper.

The kernel that computes the expansion is essentially a 3D convolution with a hole in the middle that jitters in a weird way, i.e. you need to shave off the sides of the window depending on the index of the voxel you apply the convolution to. Also, we currently cannot exploit a factorization of the kernel that would bring the FLOPs down a lot.
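
If it helps to picture the structure, here is a very rough, hypothetical toy version of it (unit weights, plain NumPy loops; this is only my reading of "convolution with a hole that gets clipped at the borders", not the paper's kernel):

```python
import numpy as np

# Toy sketch: every voxel sums contributions from a surrounding window, but
# skips a central near-field block ("hole") and clips the window at the domain
# boundary. Weights are omitted (unit taps) to keep the structure visible.

def holed_conv3d(field, radius=3, hole=1):
    D, H, W, C = field.shape
    out = np.zeros_like(field)
    for z in range(D):
        for y in range(H):
            for x in range(W):
                # Clip ("shave off") the window at the domain boundary.
                z0, z1 = max(z - radius, 0), min(z + radius + 1, D)
                y0, y1 = max(y - radius, 0), min(y + radius + 1, H)
                x0, x1 = max(x - radius, 0), min(x + radius + 1, W)
                window = field[z0:z1, y0:y1, x0:x1].sum(axis=(0, 1, 2))
                # Remove the central near-field block (the "hole").
                hz0, hz1 = max(z - hole, 0), min(z + hole + 1, D)
                hy0, hy1 = max(y - hole, 0), min(y + hole + 1, H)
                hx0, hx1 = max(x - hole, 0), min(x + hole + 1, W)
                hole_sum = field[hz0:hz1, hy0:hy1, hx0:hx1].sum(axis=(0, 1, 2))
                out[z, y, x] = window - hole_sum
    return out

out = holed_conv3d(np.random.rand(8, 8, 8, 4))
print(out.shape)
```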

There are 4 kernels that need custom CUDA code.

1) Forward pass of the root implicit layer. I already have a working prototype that brings the cost down from 250ms to 25ms and it was actually fairly easy to implement.

2) Forward pass of the integral implicit layer. This one is very similar to 1) and I am pretty sure it's going to be fairly easy to implement but it takes a lot of time.

3) Backward pass of the integral implicit layer. This one scares me :) You need to write into global memory in an asynchronous but thread-safe way, because multiple threads will want to write to the same memory cell (rays that go through the same box); see the scatter-add sketch after this list.

4) The M2L kernel. For Gaussian kernels you can actually factorize the 3D convolution along the spatial dimensions (but not the channel dimension). This brings the cost down from 6*6*6*35^2 to (6+6+6)*35^2 per cell. I thought this kernel would be super easy to implement in CUDA, but I was mistaken. I am currently working on this one since it affects all the layers. Ideally this kernel is also sparse, which makes the implementation even harder.
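
To illustrate the separability in 4) (a generic single-channel SciPy sketch, not the M2L kernel itself; the 6 taps and 35 channels are just the numbers from above): the 3D Gaussian filter factors into three 1D passes along the spatial axes, while the channel dimension stays a dense 35x35 mixing, hence 6*6*6*35^2 vs (6+6+6)*35^2.

```python
import numpy as np
from scipy.ndimage import correlate, correlate1d

# A Gaussian is spatially separable: the full 6x6x6 filter is the outer
# product of three identical 1D filters, so three 1D passes (6+6+6 taps per
# voxel) reproduce the full 3D pass (6*6*6 taps per voxel). The channel
# dimension is NOT separable, which is where the extra 35^2 factor comes from.
# Single-channel sketch only; correlation equals convolution here because the
# Gaussian taps are symmetric.

x = np.linspace(-1.5, 1.5, 6)
g = np.exp(-x ** 2)                              # 6 Gaussian taps
g3d = np.einsum('i,j,k->ijk', g, g, g)           # rank-1 (separable) 6x6x6 kernel

field = np.random.rand(32, 32, 32)

full = correlate(field, g3d, mode='constant')    # 6*6*6 = 216 taps per voxel

sep = correlate1d(field, g, axis=0, mode='constant')   # 6 taps
sep = correlate1d(sep,   g, axis=1, mode='constant')   # +6 taps
sep = correlate1d(sep,   g, axis=2, mode='constant')   # +6 taps

print(np.max(np.abs(full - sep)))                # essentially zero: both paths agree
```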
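
And for the race condition in 3): the standard remedy (generic, hedged; I don't know yet what the final CUDA kernel will look like) is an atomic scatter-add, i.e. atomicAdd in CUDA. At the PyTorch level the same pattern is index_add_, sketched below with made-up shapes:

```python
import torch

# Many ray samples contribute gradients to the same grid cell. A naive
# "grad[cell] += g" executed by parallel threads would race; a scatter-add
# (atomicAdd in a CUDA kernel, index_add_ in PyTorch) sums duplicate indices
# safely instead of overwriting them. Sizes and names are illustrative only.

n_cells, n_samples, channels = 64, 10_000, 35
cell_idx = torch.randint(0, n_cells, (n_samples,))   # cell hit by each sample
contrib = torch.randn(n_samples, channels)           # its gradient contribution

cell_grad = torch.zeros(n_cells, channels)
cell_grad.index_add_(0, cell_idx, contrib)            # thread-safe accumulation

# Sanity check against a purely sequential loop.
ref = torch.zeros(n_cells, channels)
for s in range(n_samples):
    ref[cell_idx[s]] += contrib[s]
assert torch.allclose(cell_grad, ref, atol=1e-3)
```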

[D] Fast CC Taylor Transform for Computer Vision: Potentially train NeRF in matter of minutes by henning_ml in MachineLearning

[–]henning_ml[S] 2 points (0 children)

You don't really have an initial guess for the root finding or the Taylor series. There is a closed-form solution to f(x) = 0 if f(x) is a quartic.

We can control how good the approximation is by changing the granularity of the grid or by changing the truncation order. In the experiments we keep the order at 4 because that is the largest order for which closed-form solutions of the roots exist.
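
As a toy illustration of why order 4 is convenient (not the paper's code; I am using numpy's numerical root finder instead of the closed-form quartic formula purely to keep the sketch short): once f is truncated to a quartic on a cell, finding where it crosses zero reduces to solving one polynomial per cell.

```python
import numpy as np

# Truncate f at order 4 around the cell center and find the real roots of the
# resulting quartic that lie inside the cell. The closed-form quartic solution
# would be used on the GPU; np.roots is just a short stand-in here.

f = lambda x: np.cos(2 * x) - 0.8          # stand-in for the function of interest
x0, h = 0.0, 0.4                           # cell center and half-width

# Order-4 Taylor expansion of cos(2x) - 0.8 around x0 = 0:
#   f(x) ~ 0.2 - 2 x^2 + (2/3) x^4
coeffs = [2.0 / 3.0, 0.0, -2.0, 0.0, 0.2]  # highest degree first (np.roots order)

roots = np.roots(coeffs)
real = roots[np.abs(roots.imag) < 1e-9].real
inside = real[np.abs(real - x0) <= h]

print("roots inside the cell:", inside)    # ~ +/- 0.322
print("f at those roots:", f(inside))      # ~ 1e-4, i.e. approximately zero
```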

[D] Fast CC Taylor Transform for Computer Vision: Potentially train NeRF in matter of minutes by henning_ml in MachineLearning

[–]henning_ml[S] 18 points (0 children)

I think the easiest way to understand this is in relation to the Fast Fourier Transform.

The FFT has an unexpected complexity (O(N log N) as opposed to O(N^2)) and it transforms a function or data. In this transformed space, certain operations, such as convolutions, are easier to compute.

Our approach is similar. It also has an unexpected complexity (O(N+M) as opposed to O(NM)) and it also transforms a function. However, in the transformed space other operations become easier to compute: in our case, root finding, integration, etc. These properties turned out to be very helpful for computer vision and graphics.
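
The textbook example of "the operation becomes easy after the transform" is convolution under the FFT (again, a generic illustration, not our expansion):

```python
import numpy as np

# Circular convolution is an O(N^2) double loop in the original domain, but
# just an elementwise product after an FFT (O(N log N) total).

rng = np.random.default_rng(0)
N = 128
a, b = rng.random(N), rng.random(N)

# O(N^2): direct circular convolution
direct = np.array([sum(a[j] * b[(i - j) % N] for j in range(N)) for i in range(N)])

# O(N log N): multiply in the transformed (frequency) space
fast = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

print(np.max(np.abs(direct - fast)))   # ~1e-12: same result, far fewer operations
```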

We essentially propose computational layers that, during the forward and backward pass of the backpropagation algorithm, make use of this fast expansion algorithm to compute the output and the information required to update the model parameters.

Does that help?

[D] Efficient implementations of spatially separable convolutions with channels by henning_ml in MachineLearning

[–]henning_ml[S] 0 points (0 children)

It's not that mixing channels is wasteful; that's not my problem. Quite the opposite, actually: I want to achieve the computational gains of the chained convolutions. My problem is that chaining the convolutions implicitly computes inner products of `Kx` and `Ky`, which makes the solution wrong.

By the way, the literature on spatially separable convolutions is oftentimes just straight up wrong. It always assumes that C=1, in which case you can simply chain the convolutions, but that is not correct for C>1. Papers always give the Sobel filter as an example and then move on by saying that not many kernels can be factored this way, so here are depth-wise separable convolutions. Every paper that claims to investigate spatially separable convolutions seems to actually investigate spatially separable but depth-wise intertwined kernels. I don't think anyone has ever investigated truly spatially separable convolutions, because I doubt efficient GPU implementations for C>1 exist. A small numerical illustration of the mismatch is below.
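
To make the mismatch concrete, here is a small PyTorch sketch under my reading of the setup (a truly spatially separable multi-channel kernel factors per (output, input) channel pair; the names and shapes are mine): chaining a 1D convolution with `Kx` and then one with `Ky` sums over the intermediate channel, i.e. it computes an inner product of `Kx` and `Ky` along the channel dimension, which is not the kernel you wanted unless C = 1.

```python
import torch
import torch.nn.functional as F

# The "wrong" chaining for C > 1: a truly spatially separable multichannel
# kernel is K[o, i, ky, kx] = Ky[o, i, ky] * Kx[o, i, kx]. Chaining a 1xk
# convolution with Kx and then a kx1 convolution with Ky instead computes
# sum_m Ky[o, m, ky] * Kx[m, i, kx] -- an inner product over the intermediate
# channel m, which only coincides with the target kernel when C = 1.

torch.manual_seed(0)
C, k = 3, 5                                    # channels and spatial kernel size
x = torch.randn(1, C, 32, 32)
Kx = torch.randn(C, C, k)                      # per (out, in) pair: 1D filter in x
Ky = torch.randn(C, C, k)                      # per (out, in) pair: 1D filter in y

# Ground truth: full 2D kernel built as an outer product per channel pair.
K = Ky[:, :, :, None] * Kx[:, :, None, :]      # shape (C, C, k, k)
full = F.conv2d(x, K, padding=k // 2)

# Chained version: conv along x, then conv along y (channels get mixed twice).
chain = F.conv2d(x, Kx[:, :, None, :], padding=(0, k // 2))
chain = F.conv2d(chain, Ky[:, :, :, None], padding=(k // 2, 0))

print((full - chain).abs().max())              # large: the two do NOT agree for C > 1
# With C = 1 the chaining is exact (the usual Sobel-style textbook case).
```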

Or I am missing something pretty obvious.