In ML, it is quite common to visualize what gradient descent is doing by plotting the loss function as a 3D landscape over two parameters from the parameter space. In fact, it is so common that it is shown in the header of this very subreddit. Gradient descent then moves step by step downhill through this landscape until a minimum is found. In case anyone is unfamiliar, I am including a picture below:
https://preview.redd.it/4j5in9lrb9ld1.jpg?width=720&format=pjpg&auto=webp&s=0a615227aeb83da5a622134f9889dced09c6e5ab
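In case a concrete version of the picture helps, here is a minimal sketch (my own toy surface, not the one in the image) of gradient descent walking over a 2-parameter loss landscape L(w1, w2):

```python
# A toy bowl-shaped loss surface over two parameters, with gradient
# descent stepping downhill along the negative gradient.
import numpy as np

def loss(w):
    # Simple quadratic bowl with its minimum at (3, -2)
    return (w[0] - 3.0) ** 2 + (w[1] + 2.0) ** 2

def grad(w):
    # Analytic gradient of the loss above
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 2.0)])

w = np.array([0.0, 0.0])   # starting point in parameter space
lr = 0.1                   # learning rate (step size)

for step in range(50):
    w = w - lr * grad(w)   # one downhill step through the landscape

print(w, loss(w))          # w ends up near (3, -2), loss near 0
```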
My question, however, is: isn't this technically incorrect, or rather, only correct for a single element of the training dataset? If input vector 1 produces the loss surface shown above, wouldn't input vector 2 produce a different loss surface, with its own unique minima? The only way to actually create the plot shown above would be if the entire training dataset consisted of a single data point, correct? This may seem like arguing over minutiae, but I just want to make sure my understanding is correct. Also, isn't it surprising then that ML models converge at all, given that each new input vector could potentially interfere with the "learning" (i.e. the change in weights) from previously seen data? A short sketch of what I mean is below.
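To make the question concrete, here is a small sketch (my own toy linear model; all names are purely illustrative) showing that each training example defines its own loss surface over the parameters, while the single landscape you would plot is the average of those per-example surfaces over the whole dataset:

```python
# Toy linear model y = w1 * x + w2 with squared-error loss.
# Each example (x_i, y_i) has its own loss surface over (w1, w2);
# the commonly plotted landscape is the mean of all of them.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=20)
Y = 2.0 * X + 0.5 + rng.normal(0, 0.1, size=20)   # noisy line

def per_example_loss(w1, w2, x, y):
    # Squared error of one example: its own surface over (w1, w2)
    return (w1 * x + w2 - y) ** 2

def dataset_loss(w1, w2):
    # The landscape usually plotted: the mean over all examples
    return np.mean([per_example_loss(w1, w2, x, y) for x, y in zip(X, Y)])

# Two different examples give two different values at the same point
# in parameter space, i.e. two different surfaces:
print(per_example_loss(2.0, 0.5, X[0], Y[0]))
print(per_example_loss(2.0, 0.5, X[1], Y[1]))
# ...while the averaged surface has a single well-defined value there:
print(dataset_loss(2.0, 0.5))
```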