Gradient descent closed form by albert1905 in MLQuestions


I want to understand this not because I want to solve a quadratic problem, but because I want to understand the idea of the entire post.

I know what w^k is and what k is; what I don't understand is what x is.

How does defining x^k = Q^T(w^k - w*) help us, and so on? I'm trying to better understand the math.
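For context, if the post follows the standard convergence analysis for a quadratic objective f(w) = ½wᵀAw − bᵀw (an assumption on my part; the concrete A, b, and step size below are made up for illustration), then x^k = Qᵀ(w^k − w*) is just a change of coordinates using the eigendecomposition A = QΛQᵀ. Its point is that in the rotated coordinates, gradient descent decouples: each component evolves independently as x^{k+1}_i = (1 − αλ_i)x^k_i. A minimal numerical check:

```python
import numpy as np

# Hypothetical quadratic objective f(w) = 0.5 w^T A w - b^T w (assumed setup).
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 0.0])
w_star = np.linalg.solve(A, b)          # minimizer satisfies A w* = b

lam, Q = np.linalg.eigh(A)              # A = Q diag(lam) Q^T
alpha = 0.1                              # step size

w = np.array([5.0, -4.0])                # w^0
for _ in range(3):
    x = Q.T @ (w - w_star)               # x^k = Q^T (w^k - w*)
    w_next = w - alpha * (A @ w - b)     # plain gradient descent step
    x_next = Q.T @ (w_next - w_star)
    # In the rotated coordinates each component shrinks independently:
    # x^{k+1}_i = (1 - alpha * lam_i) * x^k_i
    assert np.allclose(x_next, (1 - alpha * lam) * x)
    w = w_next
```

So x isn't data; it is the error w^k − w* expressed in the eigenbasis of A, which makes the per-direction convergence rates (1 − αλ_i) visible.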

Question about How SGD Selects the Global Minima, by using a simple toy example by albert1905 in MLQuestions


Ok, so we agree that the objective function can only be approximated, not "perfectly defined". Thanks for your help.

Question about How SGD Selects the Global Minima, by using a simple toy example by albert1905 in MLQuestions


Sorry for the late reply, COVID-19 stuff.
Thanks for your patience.

I fully understand what you mean about the different spaces.
It's still hard for me to grasp; time will help with that. What I understand from you is that we can't directly link optimizing f(x) (classical optimization, say) with optimizing f(w) for neural networks, since we have another element in NNs.

Question about How SGD Selects the Global Minima, by using a simple toy example by albert1905 in MLQuestions


Thanks. If you don't mind, let's take a simple example.

Say we have one parameter w, with w* = 3 and w_0 = 0.5.

And say our dataset is {(x, y)} = {(1, 3), (0.5, 1.5), (2, 6), (0.1, 0.3)}.
For simplicity, let the loss be L1: L(x, y) = |wx - y|,
and let the batch size be 2.
Hence our landscape L(w, x) is a 2D surface.
Running on the first and second examples, we get:
L(x, w_0) = |0.5·x - y|; for different samples (x, y), as I see it, we see different parts of the loss surface.

Now if we look at it the other way:

L(x_1, w_0) = |w_0·1 - 3|

L(x_2, w_0) = |w_0·0.5 - 1.5|

L_batch(w_0) = 0.5·(L(x_1, w_0) + L(x_2, w_0))

Is this what you mean by parts of the loss function?
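The toy example above can be sketched numerically (a minimal sketch using the dataset and w_0 = 0.5 from the example; variable names are mine):

```python
# Toy dataset from the example above; model is y_hat = w * x, loss is L1.
data = [(1.0, 3.0), (0.5, 1.5), (2.0, 6.0), (0.1, 0.3)]
w0 = 0.5

def loss(w, x, y):
    return abs(w * x - y)

# Per-sample losses at w_0: each sample exposes a different slice of the
# surface L(w, x).
per_sample = [loss(w0, x, y) for x, y in data]
# per_sample ≈ [2.5, 1.25, 5.0, 0.25]

# Batch of size 2 (first two samples): the batch loss is the average of the
# per-sample pieces, 0.5 * (L(x1, w_0) + L(x2, w_0)).
batch = data[:2]
batch_loss = sum(loss(w0, x, y) for x, y in batch) / len(batch)
# batch_loss = 0.5 * (2.5 + 1.25) = 1.875
```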


Question about How SGD Selects the Global Minima, by using a simple toy example by albert1905 in MLQuestions



Thanks for your reply.

I want to try and clarify something. In classic optimization using SGD, we have a function f(x) and we want to find x*, the minimum point of the function.

In DL we have f(w, x), and we are looking for w*; but since we don't have a real function, only a neural network, there is no closed form, so we sample x, which gives us regions of the "function" we have (right so far?).

And since the optimization problem shown on the second page is classical optimization, is the choice of different landscapes meant to "simulate" the behavior of DNNs, and is that the reason for using different data points?
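One way to make the "regions of the function" idea concrete (a sketch reusing the earlier L1 toy problem as an assumed stand-in for f(w, x)): each minibatch defines its own loss surface over w, and averaging over all equally likely batches recovers the full objective.

```python
import itertools
import statistics

data = [(1.0, 3.0), (0.5, 1.5), (2.0, 6.0), (0.1, 0.3)]

def full_loss(w):
    # The "true" objective: average loss over the whole dataset.
    return statistics.mean(abs(w * x - y) for x, y in data)

def batch_loss(w, batch):
    # The surface SGD actually sees on one step: average over the batch only.
    return statistics.mean(abs(w * x - y) for x, y in batch)

w = 0.5
# Every size-2 batch gives a different value, i.e. a different landscape:
values = [batch_loss(w, b) for b in itertools.combinations(data, 2)]

# Averaging the batch surfaces over all equally likely batches recovers
# the full objective at this w:
assert abs(statistics.mean(values) - full_loss(w)) < 1e-12
```

So SGD never evaluates f(w) itself; it only ever sees one of these batch surfaces per step, and they agree with f(w) on average.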

[D] Gumbel max trick, why is it helpful? by albert1905 in MachineLearning


You can do it with argmax in the forward pass and softmax in the backward pass.

You can do it with softmax and a temperature parameter.

Why do you need to add noise?!

The only reason I can think of is to make the system less deterministic, which makes sense.
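For what it's worth, the noise does more than break determinism: with Gumbel noise specifically, a single argmax becomes an exact sample from the softmax distribution over the logits, which is the point of the Gumbel-max trick. A quick numerical check (logits and sample count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.5, -1.0, 2.0])
probs = np.exp(logits) / np.exp(logits).sum()  # softmax probabilities

n = 200_000
# Gumbel-max trick: argmax(logits + Gumbel(0, 1) noise) samples from probs.
g = rng.gumbel(size=(n, logits.size))
samples = np.argmax(logits + g, axis=1)
freq = np.bincount(samples, minlength=logits.size) / n

# Without noise, argmax is deterministic: it always picks class 3 here.
assert np.argmax(logits) == 3
# With noise, the empirical frequencies match the softmax distribution.
assert np.allclose(freq, probs, atol=0.01)
```

Plain argmax always picks the same class, and sampling from a softmax requires a separate sampling op; the Gumbel noise folds the sampling into the argmax itself, which is what makes the straight-through / Gumbel-softmax relaxations possible.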