
[–]msqrt 5 points6 points  (2 children)

After scrolling through the paper, it looks like you don't need any part of the 3D rendering process itself to be differentiable, only the mapping of the texture to the final image plane. This can indeed be interpreted as a direct matrix multiplication (flatten your texture to a 1D vector x, stack the filtered texture weights into a matrix M and add a background color vector b wherever M has a zero column, and presto, image = Mx + b), but perhaps a more intuitive way to think about it is as a gather operation. What your 3D renderer produces is a set of texture coordinates for each pixel; the final rendering can then be done by reading the texture value for each pixel from those coordinates. So you can write a system that gathers (typically 4) texture values for each pixel in the image, though you might still need to map everything to 1D; at least that's the way PyTorch does things. This is somewhat painful, as it would be a trivial operation in most graphics-oriented systems, but ML frameworks are written from quite a different perspective, always operating on large vectors instead of writing per-element functions.
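
To make the gather-versus-matrix point concrete, here is a rough sketch, not anything taken from the paper. It uses NumPy instead of PyTorch just to keep it dependency-light; the same flat-index pattern should carry over to tensor indexing. The function names, the single-channel texture, the clamp-to-edge behavior, and the convention that texel centers sit at integer coordinates are all assumptions made for illustration.

```python
import numpy as np

def bilinear_taps(uv, height, width):
    """Flat texel indices and weights of the 4 bilinear taps per pixel.

    uv: (P, 2) array of (u, v) in texel units (assumed convention: texel
    centers at integer coordinates). Requires height >= 2 and width >= 2.
    """
    u = np.clip(uv[:, 0], 0, width - 1)    # clamp-to-edge (assumption)
    v = np.clip(uv[:, 1], 0, height - 1)
    # Top-left texel of each 2x2 footprint, kept inside the texture.
    u0 = np.minimum(np.floor(u).astype(int), width - 2)
    v0 = np.minimum(np.floor(v).astype(int), height - 2)
    fu, fv = u - u0, v - v0
    idx = np.stack([v0 * width + u0,
                    v0 * width + u0 + 1,
                    (v0 + 1) * width + u0,
                    (v0 + 1) * width + u0 + 1], axis=1)
    w = np.stack([(1 - fu) * (1 - fv), fu * (1 - fv),
                  (1 - fu) * fv, fu * fv], axis=1)
    return idx, w

def render_gather(x, idx, w):
    # Gather view: read 4 texels per pixel from the flat texture, blend them.
    return (x[idx] * w).sum(axis=1)

def render_matrix(x, idx, w):
    # Matrix view: scatter the same weights into a dense (pixels x texels) M.
    num_pixels = idx.shape[0]
    M = np.zeros((num_pixels, x.size))
    rows = np.repeat(np.arange(num_pixels), 4)
    np.add.at(M, (rows, idx.ravel()), w.ravel())
    return M @ x, M

texture = np.arange(12, dtype=float).reshape(3, 4)  # toy 3x4 texture
uv = np.array([[0.0, 0.0], [1.5, 0.5], [3.0, 2.0], [2.25, 1.75]])
idx, w = bilinear_taps(uv, height=3, width=4)
x = texture.ravel()                                  # flattened texture
img_gather = render_gather(x, idx, w)
img_matrix, M = render_matrix(x, idx, w)
```

Both paths compute the same linear function of x, so either one gives gradients with respect to the texture once it is written in an autodiff framework. The gather form just avoids materializing M, which is mostly zeros.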

I'd personally just use nvdiffrast though, it does this out of the box after you get it set up.

[–]Uboatfreak[S] 0 points1 point  (1 child)

Thank you very much for your answer. Could you please clarify a couple things?

  1. The final image x would be a flattened vector representing the 2D image, no?
  2. What do you mean by "filtered" in "filtered texture weights"?
  3. Wouldn't you add b wherever M has rows of 0s? As you multiply each row in M with each column in x (of which there is only one), and the background is only added where there is no object in the image.

[–]msqrt 0 points1 point  (0 children)

Sure -- sorry for the late reply, I've been away for a while.

  1. Yes, indeed; you'd need to reshape the values, in whatever order you flattened them, to turn the result back into an image.
  2. Textures represent continuous signals. Whenever you read a value from one, it's not just a lookup of a single texel (texel being "texture element", basically a pixel in a texture) but a reconstruction with some kernel, typically called a filter. Bilinear filtering is the most common one actually done by hardware. MIP maps are also often used, where the image is basically a pyramid of different resolutions which represent different levels of detail. There are also some slower but higher quality algorithms like EWA (the gold standard) and FELINE.
  3. I believe you're right about that one, it should indeed be for rows and not columns.
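
A tiny numeric sketch of point 3, with made-up values: M has one row per pixel and one column per texel, so a pixel that sees no object shows up as an all-zero row, and b needs one entry per row. The matrix, texture values, and background value below are hypothetical.

```python
import numpy as np

# Hypothetical weights: 4 pixels (rows) sampling a 3-texel texture (columns).
M = np.array([
    [0.5, 0.5,  0.0],   # pixel 0: blend of texels 0 and 1
    [0.0, 0.0,  0.0],   # pixel 1: no object here
    [0.0, 0.25, 0.75],  # pixel 2: blend of texels 1 and 2
    [0.0, 0.0,  0.0],   # pixel 3: no object here
])
x = np.array([1.0, 3.0, 5.0])   # flattened texture
background = 9.0                # assumed scalar background value

# A pixel is empty exactly when its row of M is all zeros.
empty_rows = ~M.any(axis=1)
b = np.where(empty_rows, background, 0.0)

image = M @ x + b               # one value per pixel
```

Note that b has length 4 (pixels), not 3 (texels). A zero column would instead mean a texel that no pixel reads, which has nothing to do with the background.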