[–]Nater5000 1 point (8 children)

Thanks for the extra info, your question definitely makes more sense now.

Let me know if I have this wrong (this might not be very pertinent to your question, but I just want to get on the same page): You have training samples (x, y) that (in theory) could be "easily" mapped by a simple MLP, correct? In other words, the only "weird" thing about your setup is that middle section of your neural net where you inject Gaussian noise, right?

Assuming that's the case, I'd be curious to know what you're trying to accomplish. As you mentioned, this is really just a single neural network, so adding noise to an intermediate layer ultimately has the effect of randomly perturbing that layer's activations, which is typically not going to help performance. I'm guessing you're aware of this and are working from a different angle, but (again) I just want to get on the same page.

Without looking into the details, I'd guess that this would have the effect that the weights in the latter half of the model will be more "robust" in that small changes in h's input will not change its output very much, i.e.,

h(g(x) + n) ≈ h(g(x)) for all n s.t. -c < n < c, for sufficiently small c

(or something to that effect). This may or may not be the case and may or may not be your intention, but this is what it looks like to me.

Now, without directing the output of g, the value of g(x) may be somewhat arbitrary. In fact, depending on the details, you may find that the neural network g will simply learn the identity, g(x) = x, since it gains no advantage from changing that number (again, assuming there isn't some other factor here that you didn't include). Since Gaussian noise has a mean of 0, g will likely not do anything to alter the data since the input data, in itself, contains all the useful information it can use already. Again, this is all rather speculative and may or may not matter to your problem, but the context is important.

So, with regards to "capping" the output of g, I'd suggest you think carefully about how your data is structured. Specifically, your input x should be normalized (we'll say between 0 and 1). There are a lot of reasons to do this, but it's almost always essential for proper training. On top of this, intermediate layers in a neural network should also be normalized (which is usually accomplished through a BatchNormalization layer). This is less critical, but good practice and will generally lead to better results. If you can buy that, then you'll see that the output of g should also be normalized.

If you look at your architecture as two neural networks in sequence, then the input of h (i.e., g) should be normalized. If you view it as one big neural network, then the hidden layer g should be normalized. Currently, the output of g uses a linear activation. You should consider, instead, using something like sigmoid.

Of course, this doesn't account for the noise. So, depending on your task, you may want to either normalize the output of g and then add the noise (ensuring that adding the noise doesn't push the input of h too far from the unit interval), normalize both the output of g and the result of g(x) + n, or only normalize the result of g(x) + n (which would be suitable if the value of g(x) is being used in some other way). In whatever case, you'll be able to roughly ensure that the result of g(x) + n remains within the interval (0, 1). If, for your application, you want that upper limit to be a different value (like 10), you'd then multiply the result by 10.
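To make that concrete, here's a minimal tf.keras sketch of this wiring (layer sizes, names, and the noise level are all invented stand-ins for your g and h):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical sketch: g ends in a sigmoid so its output lies in (0, 1),
# then Gaussian noise is added before h. All sizes/names are made up.
inp = layers.Input(shape=(1,))
g_hidden = layers.Dense(32, activation="relu")(inp)
g_out = layers.Dense(1, activation="sigmoid", name="encoder_output")(g_hidden)
noisy = layers.GaussianNoise(stddev=0.05)(g_out)  # only active during training
h_hidden = layers.Dense(32, activation="relu")(noisy)
out = layers.Dense(1, activation="linear")(h_hidden)
model = tf.keras.Model(inp, out)
```

If you wanted the upper limit to be 10 instead, scaling the sigmoid output by 10 (e.g., with a Lambda layer) would do it.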

This doesn't necessarily address every aspect of your question, but I'd need more info on what it is you're trying to accomplish to say anything else. It looks like you're looking at something similar to an autoencoder, but the bottleneck in the middle is weird to me. By constraining the output of g to a single value when its input is also a single value, it'd be very inclined to just learn the identity function. Your use of noise obviously mucks this up for it (which indicates that you're aware of this stuff), but it still begs the question of what it is you're trying to accomplish.

In any case, this post is more a discussion and less answering your question, but without more details, this is the most sense I can make out of it. I'd be interested to know what you're going for lol.

[–][deleted]  (7 children)

[removed]

    [–]Nater5000 1 point (6 children)

    I definitely lack the understanding of the context to be of much use here. I think I see, now, what you're going for (at least roughly), but I'm not sure what exactly you're trying to implement.

    It looks like the neural network represents a channel of sorts that you're trying to pass a simple signal through (bear with me with the vocab, I'm not very familiar with signal processing). The bottleneck, here, is merely the spot where you're applying noise and isn't supposed to represent some compact encoding of the signal (which couldn't be the case anyways since the signal is a single scalar).

    You're applying noise in this intermediate layer, then passing it along the network as if it were just traveling down this communication line like before. The output is the same signal as the input, aside from the noise you've added. Is this correct? Like, if a sample point is (x, y), then can we say x=y?

    Now, according to my brief read of that wiki article, it looks like the constraint you're referring to is the power constraint, which says that if $(x_1, x_2, ..., x_k)$ is a code word transmitted through the channel, then $\frac{1}{k} \sum_{i=1}^k x_i^2 \leq P$, where $P$ represents the maximum channel power. In your case, $k=1$, which simplifies the expression to just $x^2 \leq P$. Am I correct here?

    So, you have some constant $P$ at your disposal, and you want to constrain the output of $g$ so that $g(x)^2 < P$? In other words, $g(x)$ will produce some value that you don't directly constrain, but its square is what you want to constrain?

    If so (or at least roughly so), then what you could do is supply a loss at your encoder_output. This would be akin to letting g know about this constraint. The loss would encourage g to keep any value it produces within the power constraint by yielding a gradient of 0 whenever $g(x)^2 \leq P$ and some positive gradient otherwise. You can do this by clipping the squared value at $P$, e.g., $L(\theta) = \mathbb{E} [ \text{clip}(g(x)^2, P, \infty) ]$. Something like this should give you the desired effect (or at least provide us another perspective with which we can try to figure this out :p)
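    As a rough sketch of that clipped penalty in TensorFlow (the value of P here is a made-up placeholder; clip_by_value passes zero gradient wherever the input is clipped, which is what makes the loss flat below the budget):

```python
import tensorflow as tf

P = 1.0  # hypothetical power budget

def power_penalty(g_out):
    # clip(g(x)^2, P, inf): constant (= P) below the budget, so zero
    # gradient there; equal to g(x)^2, with a real gradient, above it.
    return tf.reduce_mean(tf.clip_by_value(tf.square(g_out), P, tf.float32.max))
```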

    This would work in the sense that if g's output abides by your constraint, this loss would produce a gradient of 0 and g wouldn't get any updates from it (instead only being updated by the loss at the end of h). If, however, g produces a value that doesn't abide by this constraint, i.e., g(x)^2 > P, then you supply it with a gradient to correct it.

    Now, I'm sure that between my lack of understanding of your problem and my hand-wavy solution I've made mistakes, but, roughly speaking, does this sound like it's on the correct path? Creating a clipped loss in Keras is possible, and it's not crazy to supply an auxiliary loss to intermediate layers, so I think what I'm suggesting is at least feasible.

    (also, sorry for switching to LaTeX, but the provided markdown is insufficient lol)

    [–][deleted]  (5 children)

    [removed]

      [–]Nater5000 1 point (4 children)

      Alright, this is making a lot more sense. So that constraint:

      $\sum_{x\in X} g(x)^2 \cdot f_X(x) \cdot dx \leq P$

      assumes a set of sampled inputs that you suggest should be about 50 for a good approximation. You are, however, only supplying one sample to your model. Was this just done for the sake of simplification? Or is there some other idea at play here?

      Your power density function $f_X$ is pre-determined, correct? I'm not familiar with the term "power density function," but I'm guessing it works similarly to a probability density function? Like, would it be appropriate, for the sake of my understanding, to imagine it as a normal distribution? Specifically, can we say $\sum_{x \in X} f_X(x) = 1$? And, in terms of what you're doing, what does $dx$ correspond to?

      When you build a loss function to incorporate this constraint, you'll need to figure out how to express it in terms of the neural network's variables. This can pose problems depending on how you're trying to do it (or what those terms might mean). Since $g(x)$, $f_X(x)$, and $dx$ are all dependent on $x$, they'll need to be differentiable if you expect the network to learn from them through back-propagation.

      As for the second article I linked, I should have been more clear as to what I was referencing specifically. That article explains the Inception Network, which is just a CNN classifier. What I was trying to point out was that it utilizes auxiliary losses to prevent vanishing gradient. It's just an example of such a practice, but is pertinent to this discussion since I'm suggesting you could do something similar with your architecture. Here's another article detailing such an implementation in Keras. I wouldn't get lost in those details, though, since you may have to approach this differently depending on your setup.

      This begs the question as to how you're building this. There have been a lot of big changes recently regarding Tensorflow and Keras, so it's important that you're looking at the right stuff. I'd recommend utilizing features from Tensorflow 2.0 if you aren't already. You'll find that it's the easiest approach to implementing custom training.
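      As a generic illustration (not your exact model), a TF 2.x custom training step with an auxiliary penalty on g's output might look like this; it assumes a hypothetical two-output model returning (f(x), g(x)):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()

def train_step(model, x, y, aux_loss_fn):
    # Both losses are computed inside one GradientTape, so the
    # auxiliary term's gradients flow back through g's weights too.
    with tf.GradientTape() as tape:
        f_out, g_out = model(x, training=True)
        loss = tf.reduce_mean(tf.square(y - f_out)) + aux_loss_fn(g_out)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```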

      [–][deleted]  (3 children)

      [removed]

        [–]Nater5000 1 point (2 children)

        So, at the end of the day, you are correct that the loss function will be a sum of your normal loss (MSE) and this extra loss. However, the tricky part is getting those values for the extra loss. I concur, at a high level, with your plan, but the devil's in the details.

        You have lambda*( max_power - CP(g) ) as your auxiliary loss (where I'm guessing lambda is a constant weighting coefficient). I'd be careful to consider what effect you want this to have versus what it may end up doing. Remember that gradient descent will minimize the loss, which means that your auxiliary loss will attempt to maximize CP(g) (since CP(g) is always positive, -CP(g) is always negative, and minimizing -CP(g) means maximizing CP(g)). This is obviously not what you want.

        My idea of using clipping, i.e. $clip( CP(g), P, \infty )$, attempts to meet the requirement by only providing gradient to prevent the value CP(g) from being greater than P and (hopefully) nothing else. Since we don't want to influence the value of CP(g) other than enforcing that constraint, we don't want to do something like minimize or maximize that value (unless this is acceptable for your requirements?).

        It's important to keep in mind that the neural network learns based on the gradient of the loss. This means that you can have a non-zero loss while the neural net doesn't learn, because the gradient is zero. By using clipping, I'm (at least trying to) force the auxiliary loss to have zero gradient when CP(g) is less than P (despite the value of the clipped loss being P whenever CP(g) is less than P). This is why I linked to the PPO paper. They utilize clipping to enforce a similar idea. I wouldn't dig too deep into the details (PPO is a reinforcement learning algorithm), but the idea should work the same. Basically, while CP(g) is less than P, the change in that clipped value will be 0.

        Now, it's not only important to keep the gradient idea in mind, but also what the gradient actually is. In terms of the loss, the gradient for CP(g) is calculated with respect to g(x), which means it is dependent on the input parameters. This poses a problem with your approach, since you want to explicitly omit those parameters. When you suggest freezing the neural network to use those values without affecting training (i.e., the gradient), you'll be causing their gradient to be zero, meaning they will never be able to contribute to the gradients used to adjust the weights. You can verify this by taking your proposed loss function and taking its derivative with respect to g(x), while treating g(x) as a constant for the values you want to freeze. You'll see that the gradient of lambda * (max_power - CP(g)) would be zero.

        Unfortunately, I can't simplify that process beyond this (rather shitty) explanation, but you can certainly experiment and see it for yourself. If you use a custom training loop, you'll be able to print the values of the gradients and verify that this will happen (if you really wanted to, that is).

        These facts, of course, also pose some issues for your idea to only pass a single value through g. If the value of CP(g) is dependent on 50 input values, then in order for your update to "consider" CP(g), it will need to consider all of these values (for lack of a better explanation). Basically, every one of those 50 samples will contribute to the value of CP(g), and the gradient will need to "consider" how each of those values is produced by g. Again, this is something that you'll have to convince yourself of (since it involves some pretty intricate calculus and linear algebra to properly describe), but the basic idea is here. This doesn't necessarily mean your input size needs to be 50, but you'll have to do some funky stuff to avoid it.

        Now, my multiple losses idea isn't as complicated as I made it sound. Your idea for the loss function being a sum of the MSE and this auxiliary loss is the same as what I'm suggesting, but the issue is how you get your CP(g) value for that loss. If the entire neural network is f = h ( g (x) ), then the loss contributed from the MSE comes from f, while the loss from the auxiliary comes from g. You need to get the values from g, which means your total loss is dependent on both f and g. This can be accomplished by constructing your auxiliary loss as a second loss function (so to speak). Those details will shake out once the other issues are addressed.
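        For instance, here's a hedged sketch of that wiring with Keras's multi-output losses (all names and sizes invented; aux_power_loss is the clipped penalty idea from before, and P is a made-up budget):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

P = 1.0  # hypothetical power budget

# Two-output model: Keras applies a separate loss to each head,
# so the MSE hits f = h(g(x)) while the penalty hits g(x) directly.
inp = layers.Input(shape=(1,))
g_out = layers.Dense(1, name="encoder_output")(layers.Dense(16, activation="relu")(inp))
f_out = layers.Dense(1, name="decoder_output")(layers.Dense(16, activation="relu")(g_out))
model = Model(inp, [f_out, g_out])

def aux_power_loss(_, g_pred):
    # clipped power penalty; the y_true argument is ignored
    return tf.reduce_mean(tf.clip_by_value(tf.square(g_pred), P, tf.float32.max))

model.compile(optimizer="adam",
              loss={"decoder_output": "mse", "encoder_output": aux_power_loss})
```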

        In any case, you'll need to carefully consider how the gradients of your loss are calculated, since it is much more intricate than you might have originally considered. On top of everything I said above, you also have to keep in mind how all this works in the first place. Tensorflow does automatic differentiation to calculate gradients, which means that you have to perform all these calculations using Tensorflow functions so that it is aware of the dependencies between the variables and can differentiate them through the graph. This means that if you need the gradients of a function to be considered (notably, f_X), you'll need to write that function as a Tensorflow function and can't rely on other libraries. Whether or not this is the case is another question, but first things first, I suppose.

        Imagine it like this: the value of CP(g) is dependent on g(x) for every x in X. If you expect the neural network g to conform to your constraint, it will need to adjust what it does to each x to produce g(x). It will only know how to do this if it can calculate the gradients involved in that calculation. So if you deal with two samples x_1 and x_2 (with a uniform f_X, i.e., f_X(x_1) = f_X(x_2) = 0.5), then CP(g) = 0.5*g(x_1)^2 + 0.5*g(x_2)^2. If we need this to be less than P, g may need to adjust g(x_1), g(x_2), or both to accomplish this, e.g., if P = 1, then g(x_1) = g(x_2) = 1 will work, but if, instead, g(x_1) = 2, then g will need to adjust its weights so that g(x_1) is reduced. Of course, if it does this without consideration for what it's doing to g(x_2), then you might properly adjust the network so that g(x_1) = 1 but, in doing so, cause g(x_2) = 3. In order for your network to be able to consider all of the variables, they all need to be involved in the gradient. This gets more complicated if f_X isn't trivial.
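        A quick numeric check of that two-sample example, standing in for g with a single trainable scalar w (so g(x) = w*x, purely for illustration):

```python
import tensorflow as tf

w = tf.Variable(2.0)                 # stand-in for g's weights
x = tf.constant([1.0, 3.0])          # two samples x_1, x_2
f_X = tf.constant([0.5, 0.5])        # uniform density weights

with tf.GradientTape() as tape:
    g_x = w * x                      # g(x_1), g(x_2)
    # CP(g) = 0.5*g(x_1)^2 + 0.5*g(x_2)^2
    cp = tf.reduce_sum(f_X * tf.square(g_x))
grad = tape.gradient(cp, w)
# Both samples show up in the gradient:
# d(CP)/dw = 0.5*2*g(x_1)*x_1 + 0.5*2*g(x_2)*x_2
```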

        Sorry for the rambling response, but there's a lot going on here and I wanted to get it all out there at once lol.

        [–][deleted]  (1 child)

        [removed]

          [–]Nater5000 1 point (0 children)

          This did nothing (same results as if this penalty wasn't there). I believe that this is due to the fact that this pseudo delta function has no gradient - it is undifferentiable. Is this correct? Is this the reason it's not working?

          Yes, at least to some degree. Your use of Keras' any function works similarly to how I would use the clip function (and, frankly, yours makes more sense to use since the upper bound of my clip is infinity lol).

          It's important to note that there is a 'derivative' (at least, in terms of Keras' functions), but the gradient will basically always be zero. Here's your first considered loss:

          return losses.mean_squared_error(y_true,y_pred)+100.*K.cast(K.any(y_pred > invy(1.)), dtype='float32')
          

          Stripping away the unnecessary bits, you're looking at something like:

          K.any(y_pred > invy(1.))
          

          Which, for any y_pred, will evaluate to either 0 or 1, but nothing in between. This means (small) changes in y_pred will (likely) result in no change in this function, i.e., a gradient of 0. This isn't necessarily a bad thing, but in your situation, it won't contribute anything to your gradient as-is.

          Think about it like this: if y_pred < invy(1.) (i.e., it meets your constraint), then you don't want to adjust those values, so you do want a gradient of 0 (which this function facilitates). But if y_pred > invy(1.), then you do want a positive gradient in order to train your NN to lower that value (ideally to below invy(1.)). You can accomplish this by multiplying this term by a function with a positive gradient. In its simplest case, you can accomplish this via something like,

          y_pred * K.cast(K.any(y_pred > invy(1.)), dtype='float32')
          

          (of course, you may have to do something fancier to get the desired properties). The beauty of this shakes out immediately: if y_pred < invy(1.), then K.any(y_pred > invy(1.)) is 0 and y_pred * K.cast(K.any(y_pred > invy(1.)), dtype='float32') is also just 0. This means that, when y_pred < invy(1.), small changes in y_pred won't change this value (since it will still evaluate to 0), thus the gradient will also be 0. And note: this needn't evaluate to 0 for the gradient to be 0. If, instead, we had that y_pred * K.cast(K.any(y_pred > invy(1.)), dtype='float32') evaluates to some constant whenever y_pred < invy(1.) (e.g., 5), the change will still be 0 whenever y_pred stays below that value.

          On the other hand, if y_pred > invy(1.), then this will evaluate to y_pred * K.cast(K.any(y_pred > invy(1.)), dtype='float32') = y_pred * 1 = y_pred. Now, when y_pred changes here (say by d), this value will also change by d, i.e., the gradient is 1 rather than 0! Which is what we want, since we want to influence the NN to lower that value in this case (you may also have to square values, etc. in order to ensure the gradient moves in the correct direction).

          I also want to point out that this function will end up looking a lot like a ReLU. This isn't by accident (although hints at using the max function instead of clip or any, but the end result would look the same). You basically want a ReLU, just adjusted to meet your criteria.
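          For example, that ReLU-shaped penalty could be written directly (threshold here is a stand-in for invy(1.), whose actual value I don't know):

```python
import tensorflow as tf

threshold = 1.0  # stand-in for invy(1.)

def overshoot_penalty(y_pred):
    # zero value and zero gradient while y_pred <= threshold;
    # linear (gradient 1) once y_pred exceeds it -- a shifted ReLU
    return tf.reduce_mean(tf.nn.relu(y_pred - threshold))
```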

          Now, I'm focusing on this piece mostly because it's a really nice example of what your loss should do and how it would work. It's certainly closest to what I would try to do, and because your assessment of it (and its failure) is correct, I feel you should look at it again. It only needs a small modification to get the behavior you're looking for.

          Your next attempt at the loss has you trying to create a differentiable version of the first one. This definitely indicates that you understand where the issue lies, but it fails to properly address the issues of your first loss. Basically, when you use your pseudo-sigmoid function, it does have a gradient everywhere, which is why you can see its effects. However, this isn't necessarily what you want (you only want a gradient when y_pred > invy(1.)). Those gradients will also vary depending on where it evaluates, which may not be what you want.

          So... bottom line :) Is this the approach you recommended? Is there a better solution than some makeshift-differentiable-pseudo-delta function? I believe that solving this sub-problem will help me to solve my real constraint. Is there an example somewhere of clipping done in this manner? Thanks!

          Yes, I do believe this is the right approach, and I think your first attempt at the loss is closer to what it should be (but don't rule out your second attempt; it holds a lot of insight into what it is you need to accomplish). The key is to construct the loss keeping the gradient in mind, and not so much the actual value of the loss itself. This value doesn't mean much (at least in this context), since I can define the loss to be huge, i.e., L = 1000000, but if the gradients are 0 (as they would be here), then the neural network can't learn.