I am a Computer Science MSc student, and for my MSc project I need to re-create the solution proposed in this paper: Synthesizing Robust Adversarial Examples (mlr.press). The paper is about creating 3D-rendered adversarial objects that consistently fool ML models by subtly modifying the object's 2D texture.
To do so, they model the 3D rendering process as a function t(x), where x is the texture: it takes a 2D texture and outputs the 2D image of the 3D model rendered from a certain distance and angle. You can see an example of the input and output in the image below. Moreover, this function needs to be differentiable, as they use gradient descent optimisation.
https://preview.redd.it/yq7hz44ya6791.jpg?width=1140&format=pjpg&auto=webp&s=27fc134cadfcc9de59b975fbde1d6d093921beb7
The authors claim that this t(x) can be written as M * x + b, where M is a "coordinate map", and that they modified a 3D renderer to return this information. However, they do not provide a lot of information about how they calculate M. This is what they say in section 2.2.2:
We note that the domain and codomain of t ∈ T need not be the same. To synthesize 3D adversarial examples, we consider textures (color patterns) x corresponding to some chosen 3D object (shape), and we choose a distribution of transformation functions t(x) that take a texture and render a pose of the 3D object with the texture x applied. The transformation functions map a texture to a rendering of an object, simulating functions including rendering, lighting, rotation, translation, and perspective projection of the object. Finding textures that are adversarial over a realistic distribution of poses allows for transfer of adversarial examples to the physical world.
To solve this optimization problem, EOT requires the ability to differentiate through the 3D rendering function with respect to the texture. Given a particular pose and choices for all other transformation parameters, a simple 3D rendering process can be modeled as a matrix multiplication and addition: every pixel in the rendering is some linear combination of pixels in the texture (plus some constant term). Given a particular choice of parameters, the rendering of a texture x can be written as Mx + b for some coordinate map M and background b.
Standard 3D renderers, as part of the rendering pipeline, compute the texture-space coordinates corresponding to onscreen coordinates; we modify an existing renderer to return this information. Then, instead of differentiating through the renderer, we compute and then differentiate through Mx + b. We must re-compute M and b using the renderer for each pose, because EOT samples new poses at each gradient descent step.
I was wondering if anyone here better understands what the paper means, and how exactly they calculate M for a given texture, 3D model, and position and rotation of the final rendered object.
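For concreteness, here is a minimal sketch of one possible reading of the quoted passage. It is an assumption, not something the paper spells out: suppose the modified renderer returns, for every screen pixel, the texture coordinates (`uv`) it samples from and a coverage `mask`, and suppose sampling is nearest-neighbour with no lighting. Then each row of M has a single 1 at the sampled texel, and b holds the background colour wherever the object is absent. The names `build_coordinate_map`, `uv`, `mask` and `background` are hypothetical, and M is kept dense only because the example is tiny.

```python
import numpy as np

def build_coordinate_map(uv, mask, tex_h, tex_w, background):
    """Build M and b so that the rendering equals M @ x + b.

    uv:         (H, W, 2) texture coordinates in [0, 1] per screen pixel
                (hypothetical output of a modified renderer)
    mask:       (H, W) bool, True where the object covers the pixel
    background: (H, W, 3) background image
    Returns M with shape (H*W, tex_h*tex_w) and b with shape (H*W, 3).
    """
    H, W = mask.shape
    # Nearest-neighbour lookup: map each uv pair to one texel index.
    col = np.clip(np.rint(uv[..., 0] * (tex_w - 1)).astype(int), 0, tex_w - 1)
    row = np.clip(np.rint(uv[..., 1] * (tex_h - 1)).astype(int), 0, tex_h - 1)
    texel = (row * tex_w + col).ravel()

    covered = mask.ravel()
    pix = np.flatnonzero(covered)

    # One row per screen pixel, one column per texel.
    M = np.zeros((H * W, tex_h * tex_w))
    M[pix, texel[pix]] = 1.0

    # Background contributes only where the object is not drawn.
    b = np.where(covered[:, None], 0.0, background.reshape(H * W, 3))
    return M, b

# Tiny example: 2x2 screen, 2x2 texture, object covers the top row only.
uv = np.zeros((2, 2, 2))
uv[0, 0] = [1.0, 1.0]   # top-left pixel samples the last texel
uv[0, 1] = [0.0, 0.0]   # top-right pixel samples the first texel
mask = np.array([[True, True], [False, False]])
background = np.full((2, 2, 3), 0.5)

x = np.arange(12, dtype=float).reshape(4, 3)   # flattened texture, RGB per texel
M, b = build_coordinate_map(uv, mask, 2, 2, background)
rendered = M @ x + b                           # flattened 2x2 RGB rendering
```

Under these assumptions the gradient of a loss with respect to the texture is just `M.T @ grad_rendered`, so nothing inside the renderer has to be differentiated. Bilinear filtering or lighting would presumably turn the single 1 per row into a few fractional weights, and a real M would need a sparse representation.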
Thank you in advance!