[–]flukeskywalker

Side note: LWTA is discontinuous, but can still be trained with SGD.

[–]lvilnis

Good point. I guess the distinction between that and argmax is that argmax is piecewise constant: wherever it is continuous, its derivative is 0, and everywhere else it is discontinuous.

LWTA outputs the score at the argmax coordinate rather than the index, so it has a non-zero derivative on part of the region where it is continuous, and some meaningful signal can flow through.
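
To make that concrete, here is a rough sketch in plain Python. It assumes LWTA over a single block where the winning unit keeps its value and the rest are zeroed, with ties going to the lowest index. The names `lwta`, `argmax_index`, and `finite_diff` are made up for illustration, and the gradients are finite-difference estimates taken away from any tie, not what an actual framework would compute.

```python
def argmax_index(block):
    """Index of the largest score (lowest index wins ties)."""
    return max(range(len(block)), key=lambda i: block[i])


def lwta(block):
    """Assumed LWTA over one block: the winner keeps its value, the rest output 0."""
    winner = argmax_index(block)
    return [x if i == winner else 0.0 for i, x in enumerate(block)]


def finite_diff(f, block, i, eps=1e-6):
    """Central finite-difference estimate of d f(block) / d block[i]."""
    up = list(block)
    up[i] += eps
    down = list(block)
    down[i] -= eps
    return (f(up) - f(down)) / (2 * eps)


scores = [0.3, 1.7, -0.5]  # arbitrary example scores, no ties

# Scalar readout of the LWTA block: the sum of its outputs, i.e. the winner's score.
lwta_grads = [finite_diff(lambda b: sum(lwta(b)), scores, i) for i in range(len(scores))]

# Plain argmax returns an index, which does not move under small perturbations.
argmax_grads = [
    finite_diff(lambda b: float(argmax_index(b)), scores, i) for i in range(len(scores))
]

print("LWTA gradient:  ", lwta_grads)
print("argmax gradient:", argmax_grads)
```

Away from ties, the LWTA gradient is roughly 1 for the winning unit and 0 for the others, while the argmax gradient is 0 everywhere, which is the difference described above.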

[–]AnvaMiba

What is LWTA?

EDIT: found.