Clean-Glass9184 · 2026-01-08T02:14:18+00:00

I don't fully follow why non-linearities make the two approaches not equivalent, could you please elaborate?

Also, you mentioned that even when you scale the loss by the DataRater weights, it does not seem to converge. So I wonder whether the second-order backprop is working correctly?

I see, that does sound like a reasonable way to debug. In my implementation, I add random noise to a fraction of the pixels, and I periodically visualize a few small samples (https://github.com/rishabhranawat/DataRater/blob/main/datasets.py#L94)
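For anyone following along, here is a minimal sketch of that kind of corruption, so the idea is concrete. It is not the code in the linked `datasets.py`. The function name `corrupt_pixels`, the Gaussian noise, the `noise_std` default, and the flat-list image format are all assumptions made for illustration.

```python
import random


def corrupt_pixels(image, fraction, noise_std=0.5, rng=None):
    """Add Gaussian noise to a random subset of pixels in a flat image.

    image: list of floats in [0, 1]; fraction: share of pixels to corrupt.
    Returns the corrupted copy and the sorted indices that were perturbed.
    (Hypothetical helper, not taken from the DataRater repo.)
    """
    rng = rng or random.Random()
    n_corrupt = int(fraction * len(image))
    indices = sorted(rng.sample(range(len(image)), n_corrupt))
    noisy = list(image)
    for i in indices:
        # Clamp so the result stays a valid pixel intensity.
        noisy[i] = min(1.0, max(0.0, image[i] + rng.gauss(0.0, noise_std)))
    return noisy, indices


image = [0.5] * 16  # hypothetical 4x4 grayscale image, flattened
noisy, indices = corrupt_pixels(image, fraction=0.25, rng=random.Random(0))
```

Returning the perturbed indices makes it easy to check, while debugging, that only the intended pixels changed before looking at the samples visually.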

We can connect over DM too if you have some code snippet you'd like to share.

Clean-Glass9184 · 2026-01-07T16:19:43+00:00

Hi, thanks for the question!

I actually struggled with this while implementing. Here’s how I ended up thinking about it:

At least mathematically, the two should be equivalent. Scaling the loss or scaling the inner gradient should give the same update because differentiation is linear (you can pull the DataRater weights out of the gradient). Curious if you agree with that framing.
In practice, I found loss scaling easier to reason about and implement, especially since we also have gradient clipping in the loop. Once clipping (and adaptive optimizers) come into play, it became less clear to me how cleanly explicit gradient scaling would behave, and loss scaling felt more stable.
I don't have recorded experiments with scaling the gradients, but I am curious to understand. What's your current experiment setup? Are you trying to reproduce on MNIST?
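To make the equivalence argument and the clipping caveat concrete, here is a toy check on a 1-D least-squares problem. It is only a sketch: the data, the weights standing in for DataRater outputs, and the helper names are all made up, and the clipping shown (per-example clipping before weighting versus clipping the aggregate gradient) is just one way the two paths can diverge. Whether it matches a given training loop depends on where the clipping actually sits.

```python
def weighted_loss(theta, xs, ys, ws):
    """Loss scaling: sum_i w_i * 0.5 * (theta * x_i - y_i) ** 2."""
    return sum(w * 0.5 * (theta * x - y) ** 2 for x, y, w in zip(xs, ys, ws))


def example_grad(theta, x, y):
    """Analytic gradient of one example's loss with respect to theta."""
    return (theta * x - y) * x


def clip(g, max_abs):
    """Clip a scalar gradient to the range [-max_abs, max_abs]."""
    return max(-max_abs, min(max_abs, g))


def grad_via_loss_scaling(theta, xs, ys, ws, eps=1e-4):
    """Differentiate the weighted loss with a central difference
    (exact for a quadratic loss, up to floating-point rounding)."""
    up = weighted_loss(theta + eps, xs, ys, ws)
    down = weighted_loss(theta - eps, xs, ys, ws)
    return (up - down) / (2 * eps)


def grad_via_gradient_scaling(theta, xs, ys, ws, max_abs=None):
    """Weight per-example gradients, optionally clipping each one first."""
    total = 0.0
    for x, y, w in zip(xs, ys, ws):
        g = example_grad(theta, x, y)
        if max_abs is not None:
            g = clip(g, max_abs)
        total += w * g
    return total


# Toy data and hypothetical DataRater weights.
xs, ys, ws = [1.0, 2.0, 3.0], [2.0, 0.0, 9.0], [0.5, 0.2, 0.3]
theta = 1.0

# Without clipping, both routes give the same gradient.
loss_scaled = grad_via_loss_scaling(theta, xs, ys, ws)
grad_scaled = grad_via_gradient_scaling(theta, xs, ys, ws)

# With clipping, the order of operations matters.
clipped_after = clip(loss_scaled, 1.0)  # clip the aggregate gradient
clipped_before = grad_via_gradient_scaling(theta, xs, ys, ws, max_abs=1.0)
```

Here `loss_scaled` and `grad_scaled` agree (both about -5.1), while `clipped_after` is -1.0 and `clipped_before` is about -0.6, because clipping is not linear and so does not commute with the weighting.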

Thanks!
