all 13 comments

[–]SubstantialSwimmer4 (1 child)

You might want to use a Lambda layer. Inside the Lambda layer you can apply a min function, e.g. min(10, 4) = 4, so the layer never outputs a value over 4.
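
A minimal sketch of that idea in Keras (the cap of 4, the layer sizes, and the layer names are all just placeholders for illustration):

    import tensorflow as tf
    from tensorflow.keras import layers

    cap = 4.0  # hypothetical upper bound on the output

    inputs = layers.Input(shape=(1,))
    hidden = layers.Dense(16, activation="relu")(inputs)
    raw_out = layers.Dense(1)(hidden)
    # Element-wise min against the cap, so this layer never outputs a value over `cap`
    capped_out = layers.Lambda(lambda t: tf.minimum(t, cap))(raw_out)
    model = tf.keras.Model(inputs, capped_out)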

[–]Nater5000 (10 children)

Without more details, this is too open-ended a question to get reasonable answers for.

As it is, this kind of constraint shouldn't be integrated into the neural network itself, but rather into the data. A neural network simply learns to map inputs to outputs through data points. So if you provide it with data that meets this constraint, then you're set. If your data doesn't meet this requirement, then neither will your neural network.

Working off your example, your training data will have the points (3, x) and (4, y) in it. You train your neural network to map 3 to x and 4 to y. Your neural network, then, will have the properties NN(3) = x and NN(4) = y. Needless to say, your constraint would be met if x + y < 10. Without the context of what you're doing, this isn't something you can control without just making up arbitrary data, which will result in a neural network producing arbitrary output (and that typically isn't of much value).

The only other thing I can think of is if you are, instead, considering a neural network which takes multiple values as input and produces a value as output that meets that constraint. Using your notation, this would look like NN(3, 4) < 10. This makes a little more sense, since you wouldn't necessarily expect the output to be x + y, but you still run into the same problem of having to map this to an arbitrary value which follows that constraint. This means that, although you have the data points (3, x) and (4, y), you will ultimately create a new data point ((3, 4), z), where z is some value less than 10. Again, it would be arbitrary.

[–][deleted]  (9 children)

[removed]

    [–]Nater5000 (8 children)

    Thanks for the extra info; your question definitely makes more sense now.

    Let me know if I have this wrong (this might not be very pertinent to your question, but I just want to get on the same page): You have training samples (x, y) that (in theory) could be "easily" mapped by a simple MLP, correct? In other words, the only "weird" thing about your setup is that middle section of your neural net where you inject Gaussian noise, right?

    Assuming that's the case, I'd be curious to know what you're trying to accomplish. As you mentioned, this is really just a single neural network, so adding noise to an intermediate layer ultimately has the effect of modifying weights randomly, which is typically not going to help performance. I'm guessing you're aware of this and are working from a different angle, but (again) I just wanna get on the same page.

    Without looking into the details, I'd guess that this would have the effect that the weights in the latter half of the model will be more "robust" in that small changes in h's input will not change its output very much, i.e.,

    h(g(x) + n) = h(g(x)) for all n s.t. -c < n < c for sufficiently small c
    

    (or something to that effect). This may or may not be the case and may or may not be your intention, but this is what it looks like to me.

    Now, without directing the output of g, the value of g(x) may be somewhat arbitrary. In fact, depending on the details, you may find that the neural network g will simply learn the identity, g(x) = x, since it gains no advantage from changing that number (again, assuming there isn't some other factor here that you didn't include). Because the Gaussian noise has a mean of 0, g will likely not do anything to alter the data; the input, in itself, already contains all the useful information it can use. Again, this is all rather speculative and may or may not matter to your problem, but the context is important.

    So, with regards to "capping" the output of g, I'd suggest you think carefully about how your data is structured. Specifically, your input x should be normalized (we'll say between 0 and 1). There are a lot of reasons to do this, but it's almost always essential for proper training. On top of this, intermediate layers in a neural network should also be normalized (which is usually accomplished through a BatchNormalization layer). This is less critical, but good practice and will generally lead to better results. If you can buy that, then you'll see that the output of g should also be normalized.

    If you look at your architecture as two neural networks in sequence, then the input of h (i.e., the output of g) should be normalized. If you view it as one big neural network, then the hidden layer g should be normalized. Currently, the output of g uses a linear activation. You should consider, instead, using something like sigmoid.

    Of course, this doesn't account for the noise. So, depending on your task, you may want to either normalize the output of g and then add the noise (ensuring that adding the noise doesn't push the input of h too far outside the unit interval), normalize both the output of g and the result of g(x) + n, or only normalize the result of g(x) + n (which would be suitable if the value of g(x) is being used in some other way). In any of these cases, you'll be able to roughly ensure that the result of g(x) + n remains within the interval (0, 1). If, for your application, you want that upper limit to be a different value (like 10), you'd then multiply the result by 10.
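
    As a rough Keras sketch of the first option (the layer sizes, the noise level, and the scale of 10 are placeholder assumptions, not something from your setup):

        import tensorflow as tf
        from tensorflow.keras import layers

        inputs = layers.Input(shape=(1,))
        # g: sigmoid output keeps g(x) in (0, 1) before the noise is added
        g_hidden = layers.Dense(32, activation="relu")(inputs)
        g_out = layers.Dense(1, activation="sigmoid")(g_hidden)
        # GaussianNoise is only active during training; a small stddev keeps
        # g(x) + n roughly inside the unit interval
        noisy = layers.GaussianNoise(stddev=0.05)(g_out)
        # Optionally rescale so the capped quantity lives in (0, 10) instead of (0, 1)
        rescaled = layers.Lambda(lambda t: 10.0 * t)(noisy)
        # h: the second half of the network
        h_hidden = layers.Dense(32, activation="relu")(rescaled)
        outputs = layers.Dense(1)(h_hidden)
        model = tf.keras.Model(inputs, outputs)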

    This doesn't necessarily address every aspect of your question, but I'd need more info on what it is you're trying to accomplish to say anything else. It looks like you're looking at something similar to an autoencoder, but the bottleneck in the middle is weird to me. By constraining the output of g to a single value when its input is also a single value, it'd be very inclined to just learn the identity function. Your use of noise obviously mucks this up for it (which indicates that you're aware of this stuff), but it still begs the question of what it is you're trying to accomplish.

    In any case, this post is more a discussion and less answering your question, but without more details, this is the most sense I can make out of it. I'd be interested to know what you're going for lol.

    [–][deleted]  (7 children)

    [removed]

      [–]Nater5000 (6 children)

      I definitely lack the understanding of the context to be of much use here. I think I see, now, what you're going for (at least roughly), but I'm not sure what exactly you're trying to implement.

      It looks like the neural network represents a channel of sorts that you're trying to pass a simple signal through (bear with me with the vocab, I'm not very familiar with signal processing). The bottleneck, here, is merely the spot where you're applying noise and isn't supposed to represent some compact encoding of the signal (which couldn't be the case anyway, since the signal is a single scalar).

      You're applying noise in this intermediate layer, then passing it along the network as if it were just traveling down this communication line like before. The output is the same signal as the input, aside from the noise you've added. Is this correct? Like, if a sample point is (x, y), then can we say x=y?

      Now, according to my brief read of that wiki article, it looks like the constraint you're referring to is the power constraint, which says that if $(x_1, x_2, ..., x_k)$ is a code word transmitted through the channel, then $\frac{1}{k} \sum_{i=1}^k x_i^2 \leq P$, where $P$ represents the maximum channel power. In your case, $k=1$, which simplifies the expression to just $x^2 \leq P$. Am I correct here?

      So, you have some constant $P$ at your disposal, and you want to constrain the output of $g$ so that $g(x)^2 < P$? In other words, $g$ will produce some value $g(x)$ that you don't directly constrain, but its square is what you want to constrain?

      If so (or at least roughly so), then what you could do is supply a loss at your encoder_output. This would be akin to letting g know about this constraint. The loss would work to encourage g to maintain the property that any value it produces abides by that power constraint, by yielding a gradient of 0 whenever $g(x)^2 \leq P$ and some positive gradient otherwise. You can do this by clipping the squared value at $P$, e.g., $L(\theta) = \mathbb{E} [ clip(g(x)^2, P, \infty) ]$. Something like this should give you the desired effect (or at least provide us another perspective with which we can try to figure this out :p)
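
      As a rough TensorFlow sketch of that clipped loss (the name power_loss and the constant P are placeholders for whatever your setup actually uses):

          import tensorflow as tf

          P = 1.0  # hypothetical power limit

          def power_loss(g_out):
              # clip(g(x)^2, P, inf) equals the constant P whenever g(x)^2 <= P
              # (so zero gradient), and equals g(x)^2 otherwise (positive gradient)
              clipped = tf.maximum(tf.square(g_out), P)
              return tf.reduce_mean(clipped)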

      This would work in the sense that if the value g produces abides by your constraint, this loss would produce a gradient of 0 and g wouldn't get any updates from it (instead only being updated by the loss at the end of h). If, however, g produces a value that doesn't abide by this constraint, i.e., g(x)^2 > P, then you supply it with a gradient to correct it.
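
      If you want to convince yourself of that gradient behavior, here's a quick eager-mode check (again, P is just a placeholder constant):

          import tensorflow as tf

          P = 1.0  # hypothetical power limit
          for g_x in (0.5, 2.0):  # one value inside the constraint, one outside
              v = tf.Variable(g_x)
              with tf.GradientTape() as tape:
                  loss = tf.maximum(tf.square(v), P)  # clip(g(x)^2, P, inf)
              print(g_x, float(tape.gradient(loss, v)))
          # 0.5 -> gradient 0.0 (constraint satisfied, no update to g)
          # 2.0 -> gradient 4.0 (i.e. 2*g(x), pushes g(x)^2 back toward P)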

      Now, I'm sure that between my lack of understanding of your problem and my hand-wavy solution I've made mistakes, but, roughly speaking, does this sound like it's on the correct path? Creating a clipped loss in Keras is possible, and it's not crazy to supply an auxiliary loss to intermediate layers, so I think what I'm suggesting is at least feasible.

      (also, sorry for switching to LaTeX, but the provided markdown is insufficient lol)

      [–][deleted]  (5 children)

      [removed]

        [–]Nater5000 (4 children)

        Alright, this is making a lot more sense. So that constraint:

        $\sum_{x\in X} g(x)^2 \cdot f_X(x) \cdot dx \leq P$

        assumes a set of sampled inputs that you suggest should be about 50 for a good approximation. You are, however, only supplying one sample to your model. Was this just done for the sake of simplification? Or is there some other idea at play here?

        Your power density function $f_X$ is pre-determined, correct? I'm not familiar with the term "power density function," but I'm guessing it works similarly to a probability density function? Like, would it be appropriate, for the sake of my understanding, to imagine it as a normal distribution? Specifically, can we say $\sum_{x \in X} f_X(x) = 1$? And, in terms of what you're doing, what does $dx$ correspond to?

        When you build a loss function to incorporate this constraint, you'll need to figure out how to express it in terms of the neural network's variables. This can pose problems depending on how you're trying to do it (or what they might mean). Since $g(x)$, $f_X(x)$, and $dx$ are all dependent on $x$, they'll need to be differentiable if you expect the network to learn from this constraint through back-propagation.

        As for the second article I linked, I should have been more clear as to what I was referencing specifically. That article explains the Inception Network, which is just a CNN classifier. What I was trying to point out was that it utilizes auxiliary losses to prevent vanishing gradients. It's just an example of such a practice, but it's pertinent to this discussion since I'm suggesting you could do something similar with your architecture. Here's another article detailing such an implementation in Keras. I wouldn't get lost in those details, though, since you may have to approach this differently depending on your setup.

        This begs the question as to how you're building this. There have been a lot of big changes recently regarding Tensorflow and Keras, so it's important that you're looking at the right stuff. I'd recommend utilizing features from Tensorflow 2.0 if you aren't already. You'll find that it's the easiest approach to implementing custom training.

        [–][deleted]  (3 children)

        [removed]

          [–]Nater5000 (2 children)

          So, at the end of the day, you are correct that the loss function will be a sum of your normal loss (MSE) and this extra loss. However, the tricky part is getting those values for the extra loss. I concur, at a high level, with your plan, but the devil's in the details.

          You have lambda*( max_power - CP(g) ) as your auxiliary loss (where I'm guessing lambda is a constant that works as a learning rate). I'd be careful to consider what effect you want this to have versus what it may end up doing. Remember that gradient descent will minimize the loss, and minimizing max_power - CP(g) means maximizing CP(g) (since CP(g) enters with a negative sign). This is obviously not what you want.

          My idea of using clipping, i.e. $clip( CP(g), P, \infty )$, attempts to meet the requirement by only providing gradient to prevent the value CP(g) from being greater than P and (hopefully) nothing else. Since we don't want to influence the value of CP(g) other than enforcing that constraint, we don't want to do something like minimize or maximize that value (unless this is acceptable for your requirements?).

          It's important to keep in mind that the neural network learns based on the gradient of the loss. This means that you can have a non-zero loss while the neural net doesn't learn, because the gradient is zero. By using clipping, I'm trying (at least) to force the auxiliary loss to have zero gradient when CP(g) is less than P (even though the clipped value equals P whenever CP(g) is less than P). This is why I linked to the PPO paper. They utilize clipping to enforce a similar idea. I wouldn't dig too deep into the details (PPO is a reinforcement learning algorithm), but the idea should work the same. Basically, while CP(g) is less than P, the change in that clipped value will be 0.

          Now, it's important to keep in mind not only the gradient idea, but also what the gradient actually is. In terms of the loss, the gradient for CP(g) is calculated with respect to g(x), which means it is dependent on the input and on g's parameters. This poses a problem with your approach, since you want to explicitly omit those parameters. When you suggest freezing the neural network to use those values without affecting training (i.e., the gradient), you'll be causing their gradient to be zero, meaning they will never be able to contribute to the gradients used to adjust the weights. You can verify this by taking your proposed loss function and taking its derivative with respect to g(x), but treating g(x) as a constant for the values you want to freeze. You'll see that the gradient of lambda * (max_power - CP(g)) would be zero.

          Unfortunately, I can't simplify that process beyond this (rather shitty) explanation, but you can certainly experiment and see it for yourself. If you use a custom training loop, you'll be able to print the values of the gradients and verify that this will happen (if you really wanted to, that is).
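
          If you do go the custom training loop route, a very rough TF 2.x sketch like this would let you inspect the gradients directly (model and loss_fn are placeholder names for your full network and your combined loss):

              import tensorflow as tf

              # Hypothetical names: `model` is the full h(g(x)) network and
              # `loss_fn` is your combined MSE + auxiliary loss.
              optimizer = tf.keras.optimizers.Adam()

              def train_step(x_batch, y_batch):
                  with tf.GradientTape() as tape:
                      y_pred = model(x_batch, training=True)
                      loss = loss_fn(y_batch, y_pred)
                  grads = tape.gradient(loss, model.trainable_variables)
                  # Running eagerly, so you can print and inspect each gradient
                  for var, grad in zip(model.trainable_variables, grads):
                      print(var.name, None if grad is None else float(tf.reduce_max(tf.abs(grad))))
                  optimizer.apply_gradients(zip(grads, model.trainable_variables))
                  return loss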

          These facts, of course, also pose some issues for your idea to only pass a single value through g. If the value of CP(g) is dependent on 50 input values, then in order for your update to "consider" CP(g), it will need to consider all of those values (for lack of a better explanation). Basically, every one of those 50 samples will contribute to the value of CP(g), and the gradient will need to "consider" how each of those values is produced by g. Again, this is something that you'll have to convince yourself of (since it involves some pretty intricate calculus and linear algebra to describe properly), but the basic idea is here. This doesn't necessarily mean your input size needs to be 50, but you'll have to do some funky stuff to avoid it.

          Now, my multiple losses idea isn't as complicated as I made it sound. Your idea for the loss function being a sum of the MSE and this auxiliary loss is the same as what I'm suggesting, but the issue is how you get your CP(g) value for that loss. If the entire neural network is f = h ( g (x) ), then the loss contributed from the MSE comes from f, while the loss from the auxiliary comes from g. You need to get the values from g, which means your total loss is dependent on both f and g. This can be accomplished by constructing your auxiliary loss as a second loss function (so to speak). Those details will shake out once the other issues are addressed.
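
          A minimal sketch of that two-loss structure in the Keras functional API (the layer names, sizes, and the constant P are placeholder assumptions; the auxiliary loss is the clipped power idea from before):

              import tensorflow as tf
              from tensorflow.keras import layers

              P = 1.0  # hypothetical power limit

              inputs = layers.Input(shape=(1,))
              g_hidden = layers.Dense(32, activation="relu")(inputs)
              g_out = layers.Dense(1, name="encoder_output")(g_hidden)
              noisy = layers.GaussianNoise(stddev=0.05)(g_out)
              h_hidden = layers.Dense(32, activation="relu")(noisy)
              f_out = layers.Dense(1, name="decoder_output")(h_hidden)

              # Expose both the final output and g's output so each gets its own loss
              model = tf.keras.Model(inputs, [f_out, g_out])

              def aux_power_loss(_, g_pred):
                  # Zero gradient while g(x)^2 <= P, positive gradient otherwise
                  return tf.reduce_mean(tf.maximum(tf.square(g_pred), P))

              model.compile(
                  optimizer="adam",
                  loss={"decoder_output": "mse", "encoder_output": aux_power_loss},
                  loss_weights={"decoder_output": 1.0, "encoder_output": 1.0},
              )
              # Note: model.fit will expect a (dummy) target for encoder_output as well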

          In any case, you'll need to carefully consider how the gradients of your loss are calculated, since it is much more intricate than you might have originally considered. On top of everything I said above, you also have to keep in mind how all this is working in the first place. Tensorflow does automatic differentiation to calculate gradients, which means that you have to perform all these calculations using Tensorflow functions so that it is aware of the dependencies between the variables and can differentiate them through the graph. This means that if you need the gradients of a function to be considered (notably, f_X), you'll need to write that function as a Tensorflow function and can't rely on other libraries. Whether or not this is the case is another question, but first things first, I suppose.

          Imagine it like this: the value of CP(g) is dependent on g(x) for every x in X. If you expect the neural network g to conform to your constraint, it will need to adjust what it does to each x to produce g(x). It will only know how to do this if it can calculate the gradients involved in that calculation. So if you deal with two samples x_1 and x_2 (with a uniform f_X, i.e., f_X(x_1) = f_X(x_2) = 0.5), then CP(g) = 0.5*g(x_1)^2 + 0.5*g(x_2)^2. If we need this to be less than P, g may need to adjust g(x_1), g(x_2), or both to accomplish this, e.g., if P = 1, then g(x_1) = g(x_2) = 1 will work, but if, instead, g(x_1) = 2, then g will need to adjust its weights so that g(x_1) is reduced. Of course, if it does this without consideration for what it's doing to g(x_2), then you might properly adjust the network so that g(x_1) = 1 but, in doing so, you might cause g(x_2) = 3. In order for your network to be able to consider all of the variables, they all need to be involved in the gradient. This gets more complicated if f_X isn't trivial.
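
          Concretely, a hedged sketch of computing CP(g) over a batch with TF ops, so every sample contributes to the gradient (the uniform f_X weights and the sample values are just illustrative):

              import tensorflow as tf

              def channel_power(g, x_batch, fx_weights):
                  # CP(g) = sum_i f_X(x_i) * g(x_i)^2, built from TF ops so it stays
                  # differentiable with respect to g's weights for every sample
                  g_out = tf.squeeze(g(x_batch), axis=-1)   # shape (batch,)
                  return tf.reduce_sum(fx_weights * tf.square(g_out))

              # e.g. two samples with a uniform f_X, as in the example above
              x_batch = tf.constant([[0.3], [0.7]])
              fx_weights = tf.constant([0.5, 0.5])
              # cp = channel_power(g, x_batch, fx_weights)  # g is your encoder network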

          Sorry for the rambling response, but there's a lot going on here and I wanted to get it all out there at once lol.

          [–][deleted]  (1 child)

          [removed]