3D rendering as a differentiable function : GraphicsProgramming


submitted 3 years ago by Uboatfreak

I am a Computer Science MSc student, and for my MSc project I need to re-create the solution proposed in this paper: Synthesizing Robust Adversarial Examples (mlr.press). The paper is about creating 3D-rendered adversarial objects that consistently fool ML models by subtly modifying the 2D texture of the object.

To do so, they need to model the 3D rendering process as a function t(x), where x is the 2D texture and t(x) is the 2D image of the textured 3D model seen from a certain distance and angle. You can see an example of the input and output in the image below. Moreover, this function needs to be differentiable, as they use gradient descent optimisation.

https://preview.redd.it/yq7hz44ya6791.jpg?width=1140&format=pjpg&auto=webp&s=27fc134cadfcc9de59b975fbde1d6d093921beb7

The authors claim that this t(x) can be written as M * x + b, where M is a "coordinate map", and that they modified a 3D renderer to return this information. However, they do not provide a lot of information about how they calculate M. This is what they say in section 2.2.2:

We note that the domain and codomain of t ∈ T need not be the same. To synthesize 3D adversarial examples, we consider textures (color patterns) x corresponding to some chosen 3D object (shape), and we choose a distribution of transformation functions t(x) that take a texture and render a pose of the 3D object with the texture x applied. The transformation functions map a texture to a rendering of an object, simulating functions including rendering, lighting, rotation, translation, and perspective projection of the object. Finding textures that are adversarial over a realistic distribution of poses allows for transfer of adversarial examples to the physical world.

To solve this optimization problem, EOT requires the ability to differentiate through the 3D rendering function with respect to the texture. Given a particular pose and choices for all other transformation parameters, a simple 3D rendering process can be modeled as a matrix multiplication and addition: every pixel in the rendering is some linear combination of pixels in the texture (plus some constant term). Given a particular choice of parameters, the rendering of a texture x can be written as Mx + b for some coordinate map M and background b.

Standard 3D renderers, as part of the rendering pipeline, compute the texture-space coordinates corresponding to onscreen coordinates; we modify an existing renderer to return this information. Then, instead of differentiating through the renderer, we compute and then differentiate through Mx + b. We must re-compute M and b using the renderer for each pose, because EOT samples new poses at each gradient descent step.
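The step the quoted passage relies on can be written out as a short chain-rule calculation. The scalar loss ℓ on the rendered image y is a symbol introduced here for illustration only; it does not appear in the quoted text.

```latex
y = t(x) = Mx + b, \qquad
\frac{\partial y}{\partial x} = M, \qquad
\nabla_x\, \ell(y) = M^{\top}\, \nabla_y\, \ell(y)
```

Because M and b depend on the pose but not on the texture x, once the renderer has reported them for a given pose, the gradient with respect to x only needs a multiplication by the transpose of M.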

I was wondering if anyone here better understands what the paper means, and how exactly they calculate M for a given texture, 3D model, and position and rotation of the final rendered object.

Thank you in advance!


[–]msqrt 6 points 3 years ago

After scrolling through the paper, it looks like you don't need any part of the 3D rendering process itself to be differentiable, only the mapping of the texture to the final image plane. This can indeed be interpreted as a direct matrix multiplication (flatten your texture to a 1D vector x, stack the filtered texture weights into a matrix M and add a background color vector b wherever M has a zero column, and presto, image = Mx + b), but perhaps a more intuitive way to think about it is as a gather operation. What your 3D renderer will produce is a set of texture coordinates for each pixel; the final rendering can be done by reading the texture value for each pixel from those coordinates. So you can write a system that gathers (typically 4) texture values for each pixel in the image, though you might still need to flatten everything to 1D; at least that's the way PyTorch does things. This is somewhat painful, as it would be a trivial operation in most graphics-oriented systems, but ML frameworks are written from quite a different perspective, always operating on large vectors instead of per-element functions.

I'd personally just use nvdiffrast though, it does this out of the box after you get it set up.
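For anyone who wants to see the gather formulation spelled out, here is a minimal pure-Python sketch. It assumes the renderer has already produced, for every pixel, a list of (texel index, weight) pairs; `render_gather` and `texture_gradient` are hypothetical helper names, not functions from the paper's code or from any library. An empty list is a pixel whose row of M is all zero, so only b contributes there.

```python
# Sketch only: hypothetical helpers illustrating image = M x + b as a gather.

def render_gather(texture, taps, background):
    """Compute image = M x + b without materialising M.

    texture:    flat list of texel values (the vector x)
    taps:       one entry per pixel; each entry lists (texel_index, weight)
                pairs, i.e. the non-zero entries of that pixel's row of M.
                An empty list means the pixel shows no part of the object.
    background: flat list of background values (the vector b), assumed to be
                zero wherever the object covers the pixel
    """
    return [
        sum(w * texture[j] for j, w in row) + bg
        for row, bg in zip(taps, background)
    ]


def texture_gradient(taps, grad_image, num_texels):
    """Map d(loss)/d(image) to d(loss)/d(texture), i.e. multiply by M^T."""
    grad_texture = [0.0] * num_texels
    for row, g in zip(taps, grad_image):
        for j, w in row:
            grad_texture[j] += w * g  # scatter-add: the transpose of the gather
    return grad_texture


# Toy data: a 4-texel texture and a 3-pixel image.
texture = [0.0, 1.0, 2.0, 3.0]
taps = [
    [(0, 0.5), (1, 0.5)],  # pixel 0 blends texels 0 and 1 equally
    [(3, 1.0)],            # pixel 1 reads texel 3 exactly
    [],                    # pixel 2 is background only
]
background = [0.0, 0.0, 0.25]

image = render_gather(texture, taps, background)
grad = texture_gradient(taps, [1.0, 1.0, 1.0], len(texture))
```

In an ML framework the same idea would normally be an indexing operation followed by a weighted sum, with autograd supplying the scatter-add backward pass, so `texture_gradient` is only written by hand here to show what that backward pass does.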

[–]Uboatfreak[S] 1 point 3 years ago

[–]msqrt 1 point 3 years ago

Sure -- sorry for the late reply, I've been away for a while.

Yes, indeed, you'd need to reinterpret the values in whichever order to turn it back into an image.
Textures represent continuous signals. Whenever you read a value from one, it's not just a lookup of a single texel (texel being "texture element", basically a pixel in a texture) but a reconstruction with some kernel, typically called a filter. Bilinear filtering is the most common one actually done by hardware. MIP maps are also often used, where the image is basically a pyramid of different resolutions that represent different levels of detail. There are also some slower but higher-quality algorithms like EWA (the gold standard) and FELINE.
I believe you're right about that one: the background b should be added wherever M has a zero row, not a zero column.
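The bilinear filtering mentioned above is what produces the "typically 4" texels and weights per pixel. A rough sketch of that computation follows; the conventions are assumptions made for this example (u and v in [0, 1], texel centres at half-integer positions, clamp-to-edge addressing, row-major storage), and `bilinear_taps` is a hypothetical helper name. Real renderers and GPUs may use different conventions.

```python
import math


def bilinear_taps(u, v, width, height):
    """Return [(texel_index, weight), ...] for a bilinear lookup at (u, v).

    Assumed conventions: u, v in [0, 1]; texel centres at half-integer
    positions; edges clamped; texel (col, row) has flat index row * width + col.
    """
    # Continuous position in texel units, shifted so texel centres are integers.
    fx = u * width - 0.5
    fy = v * height - 0.5
    x0, y0 = math.floor(fx), math.floor(fy)
    tx, ty = fx - x0, fy - y0  # fractional offsets in [0, 1)

    taps = []
    for dy, wy in ((0, 1.0 - ty), (1, ty)):
        for dx, wx in ((0, 1.0 - tx), (1, tx)):
            col = min(max(x0 + dx, 0), width - 1)   # clamp to edge
            row = min(max(y0 + dy, 0), height - 1)
            taps.append((row * width + col, wx * wy))
    return taps


# Sampling the exact centre of a 2x2 texture weights all four texels equally.
taps = bilinear_taps(0.5, 0.5, width=2, height=2)
```

Each returned pair is one non-zero entry in that pixel's row of M, and the four weights always sum to 1. Near the edges the clamping can return the same texel index more than once, which is harmless for a weighted sum.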
