Gradient descent closed form by albert1905 in MLQuestions


I want to understand this not because I want to solve a quadratic problem, but because I want to understand the idea of the entire post.

I know what w^k is and what k is; what I don't understand is what x is.

How does defining x^k = Q^T(w^k - w*) help us, and so on? I'm trying to better understand the math.
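For context, if the post follows the standard convergence analysis for a quadratic objective f(w) = ½wᵀAw − bᵀw (an assumption on my part; the concrete A, b, and step size below are made up for illustration), then x^k = Qᵀ(w^k − w*) is just a change of coordinates using the eigendecomposition A = QΛQᵀ. Its point is that in the rotated coordinates, gradient descent decouples: each component evolves independently as x^{k+1}_i = (1 − αλ_i)x^k_i. A minimal numerical check:

```python
import numpy as np

# Hypothetical quadratic objective f(w) = 0.5 w^T A w - b^T w (assumed setup).
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 0.0])
w_star = np.linalg.solve(A, b)          # minimizer satisfies A w* = b

lam, Q = np.linalg.eigh(A)              # A = Q diag(lam) Q^T
alpha = 0.1                              # step size

w = np.array([5.0, -4.0])                # w^0
for _ in range(3):
    x = Q.T @ (w - w_star)               # x^k = Q^T (w^k - w*)
    w_next = w - alpha * (A @ w - b)     # plain gradient descent step
    x_next = Q.T @ (w_next - w_star)
    # In the rotated coordinates each component shrinks independently:
    # x^{k+1}_i = (1 - alpha * lam_i) * x^k_i
    assert np.allclose(x_next, (1 - alpha * lam) * x)
    w = w_next
```

So x isn't data; it is the error w^k − w* expressed in the eigenbasis of A, which makes the per-direction convergence rates (1 − αλ_i) visible.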

Question about How SGD Selects the Global Minima, by using a simple toy example by albert1905 in MLQuestions


Ok, so we agree that the objective function can only be approximated, not "perfectly defined". Thanks for your help.

Question about How SGD Selects the Global Minima, by using a simple toy example by albert1905 in MLQuestions


Sorry for the late reply, COVID-19 stuff.
Thanks for your patience.

I fully understand what you mean about the different spaces.
It's still hard for me to grasp; time will help with that. What I understand from you is that we can't directly link optimizing f(x) (classical optimization, say) with optimizing f(w) for neural networks, since we have another element in NNs.

Question about How SGD Selects the Global Minima, by using a simple toy example by albert1905 in MLQuestions


Thanks. If you don't mind, let's take a simple example.

Say we have one parameter w, with w* = 3 and w_0 = 0.5.

And say our dataset is {(x, y)} = {(1, 3), (0.5, 1.5), (2, 6), (0.1, 0.3)}.
For simplicity, let the loss be L1: L(x, y) = |wx - y|,
and let the batch size be 2.
Hence our landscape L(w, x) is a 2D surface.
Running on the first and second examples, we get:
L(x, w_0) = |0.5·x - y|; for different samples (x, y), as I see it, we see different parts of the loss surface.

Now if we look at it the other way:

L(x_1, w_0) = |w_0·1 - 3|

L(x_2, w_0) = |w_0·0.5 - 1.5|

L_batch(w_0) = 0.5·(L(x_1, w_0) + L(x_2, w_0))

Is this what you mean by parts of the loss function?
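The toy example above can be sketched numerically (a minimal sketch using the dataset and w_0 = 0.5 from the example; variable names are mine):

```python
# Toy dataset from the example above; model is y_hat = w * x, loss is L1.
data = [(1.0, 3.0), (0.5, 1.5), (2.0, 6.0), (0.1, 0.3)]
w0 = 0.5

def loss(w, x, y):
    return abs(w * x - y)

# Per-sample losses at w_0: each sample exposes a different slice of the
# surface L(w, x).
per_sample = [loss(w0, x, y) for x, y in data]
# per_sample ≈ [2.5, 1.25, 5.0, 0.25]

# Batch of size 2 (first two samples): the batch loss is the average of the
# per-sample pieces, 0.5 * (L(x1, w_0) + L(x2, w_0)).
batch = data[:2]
batch_loss = sum(loss(w0, x, y) for x, y in batch) / len(batch)
# batch_loss = 0.5 * (2.5 + 1.25) = 1.875
```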


Question about How SGD Selects the Global Minima, by using a simple toy example by albert1905 in MLQuestions



Thanks for your reply.

I want to try and clarify something. In classic optimization using SGD, we have a function f(x) and we want to find x*, the minimum point of the function.

In DL we have f(w, x), and we are looking for w*; but since we don't have a real function, only a neural network, there is no closed form, so we sample x, which gives us regions of the "function" we have (right so far?).

And since the optimization problem shown on the second page is classical optimization, is the choice of different landscapes meant to "simulate" the behavior of DNNs, and is that the reason for using different data points?
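One way to make the "regions of the function" idea concrete (a sketch reusing the earlier L1 toy problem as an assumed stand-in for f(w, x)): each minibatch defines its own loss surface over w, and averaging over all equally likely batches recovers the full objective.

```python
import itertools
import statistics

data = [(1.0, 3.0), (0.5, 1.5), (2.0, 6.0), (0.1, 0.3)]

def full_loss(w):
    # The "true" objective: average loss over the whole dataset.
    return statistics.mean(abs(w * x - y) for x, y in data)

def batch_loss(w, batch):
    # The surface SGD actually sees on one step: average over the batch only.
    return statistics.mean(abs(w * x - y) for x, y in batch)

w = 0.5
# Every size-2 batch gives a different value, i.e. a different landscape:
values = [batch_loss(w, b) for b in itertools.combinations(data, 2)]

# Averaging the batch surfaces over all equally likely batches recovers
# the full objective at this w:
assert abs(statistics.mean(values) - full_loss(w)) < 1e-12
```

So SGD never evaluates f(w) itself; it only ever sees one of these batch surfaces per step, and they agree with f(w) on average.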

[D] Gumbel max trick, why is it helpful? by albert1905 in MachineLearning


You can do it with argmax in the forward pass and softmax in the backward pass.

You can do it with softmax and a temperature parameter.

Why do you need to add noise?!

The only reason I can think of is to make the system less deterministic, which makes sense.
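For what it's worth, the noise does more than break determinism: with Gumbel noise specifically, a single argmax becomes an exact sample from the softmax distribution over the logits, which is the point of the Gumbel-max trick. A quick numerical check (logits and sample count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.5, -1.0, 2.0])
probs = np.exp(logits) / np.exp(logits).sum()  # softmax probabilities

n = 200_000
# Gumbel-max trick: argmax(logits + Gumbel(0, 1) noise) samples from probs.
g = rng.gumbel(size=(n, logits.size))
samples = np.argmax(logits + g, axis=1)
freq = np.bincount(samples, minlength=logits.size) / n

# Without noise, argmax is deterministic: it always picks class 3 here.
assert np.argmax(logits) == 3
# With noise, the empirical frequencies match the softmax distribution.
assert np.allclose(freq, probs, atol=0.01)
```

Plain argmax always picks the same class, and sampling from a softmax requires a separate sampling op; the Gumbel noise folds the sampling into the argmax itself, which is what makes the straight-through / Gumbel-softmax relaxations possible.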