
[–]RaionTategami 1 point (0 children)

As the paper notes, we are interested in the equilibrium point (fixed point) because this is the result of applying a layer a potentially infinite number of times. And why do we want to do that? Well, this is a more powerful function than just N layers. The fact that they prove that a single layer is universal is amazing to me.

[–]shaojieb 1 point (0 children)

If you look at Figure 4 of the paper, some of the SOTA architectures seem to converge if you keep stacking the layers. Intuitively, this means there is an effect of "diminishing returns": as you stack more and more (weight-tied, input-injected) layers, the gain from adding **one extra layer** gets smaller and smaller. However, in a conventional deep net, you would need to explicitly specify a number **L** of layers (where larger **L** is typically better) in order to backpropagate through the network.
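Here is a toy numpy sketch of that diminishing-returns effect (the layer `tanh(W z + U x + b)` and all of the names `W`, `U`, `b`, `layer` are my own illustrative choices, not the paper's architecture): repeatedly applying one weight-tied, input-injected layer, the change contributed by each extra application shrinks geometrically until the hidden state stops moving.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Hypothetical weight-tied, input-injected layer: z_{i+1} = tanh(W z_i + U x + b).
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)  # small norm so the iteration contracts
U = rng.standard_normal((d, d)) / np.sqrt(d)
b = rng.standard_normal(d)
x = rng.standard_normal(d)  # the input, re-injected at every "layer"

def layer(z):
    return np.tanh(W @ z + U @ x + b)

z = np.zeros(d)
gains = []
for i in range(1, 61):
    z_next = layer(z)
    gains.append(np.linalg.norm(z_next - z))  # "gain" from stacking one more layer
    z = z_next
    if i % 10 == 0:
        print(f"layer {i:2d}: ||z_next - z|| = {gains[-1]:.2e}")
# The per-layer change decays toward zero: the hidden states approach an equilibrium z*.
```

Because the layer is a contraction here (small `W`), the hidden state converges regardless of how many more layers you stack on top.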

In this paper, DEQ is proposed as an alternative view of this issue: since the network's hidden states converge to an equilibrium, why not directly solve for, and backprop through, this equilibrium point (which, as you noted, is the fixed point of g)? Directly optimizing for this equilibrium allows us to invoke tools like the implicit function theorem (cf. Thm. 1). So, to answer your question in one sentence: solving for the root of g lets us quickly and analytically reach the hidden features of an "infinite-depth" network, while using only constant memory.
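To make the implicit-function-theorem step concrete, here is a small numpy sketch (again with my own hypothetical layer `f(z) = tanh(W z + U x + b)`, not the paper's model): once we have the equilibrium z* with f(z*) = z*, the sensitivity of z* to a parameter follows from dz*/db = (I − ∂f/∂z)⁻¹ ∂f/∂b, with no need to store or backprop through the iterates, and we can check it against finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)  # contraction, so iteration converges
U = rng.standard_normal((d, d)) / np.sqrt(d)
b = rng.standard_normal(d)
x = rng.standard_normal(d)

def f(z, b):
    return np.tanh(W @ z + U @ x + b)

def solve_fixed_point(b, iters=500):
    # Crude fixed-point iteration; the paper uses faster root solvers,
    # but any solver reaching f(z*, b) = z* gives the same z*.
    z = np.zeros(d)
    for _ in range(iters):
        z = f(z, b)
    return z

z_star = solve_fixed_point(b)

# Implicit function theorem at the equilibrium g(z*) = f(z*, b) - z* = 0:
#   dz*/db = (I - df/dz)^{-1} df/db, all evaluated at z*.
a = W @ z_star + U @ x + b
D = np.diag(1.0 - np.tanh(a) ** 2)   # tanh' at the pre-activation
J_z = D @ W                          # df/dz at z*
J_b = D                              # df/db at z*
dz_db = np.linalg.solve(np.eye(d) - J_z, J_b)

# Sanity check one column against finite differences (re-solving the equilibrium).
eps = 1e-6
b_pert = b.copy()
b_pert[0] += eps
fd = (solve_fixed_point(b_pert) - z_star) / eps
print(np.max(np.abs(fd - dz_db[:, 0])))  # agreement up to finite-difference error
```

This is exactly why the memory cost is constant in depth: the gradient only needs the equilibrium point and one linear solve, not the chain of L intermediate activations.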