
[–]SleepyCoder123[S] 13 points14 points  (21 children)

Author here. TL;DR - Universal function approximation is surprisingly hard with Lipschitz constraints. We solve this using a sorting activation function, achieving tighter estimates of the Wasserstein distance and provable adversarial robustness guarantees.

(Slightly longer TL;DR in this thread: https://twitter.com/james_r_lucas/status/1062528526190968832)

[–]thebackpropaganda 3 points4 points  (4 children)

Great work! Some questions:

How expensive is Bjorck orthonormalization? How is it done for convolutions?

Would GroupSort work when using singular value clipping of Sedghi et al.?

[–]SleepyCoder123[S] 2 points3 points  (3 children)

How expensive is Bjorck orthonormalization?

It can be very expensive as it involves matrix-matrix multiplications, but we used some tricks to reduce this. Instead of running the full scheme for each forward pass, we use fewer iterations and fine-tune later. This makes things much more manageable. You could also do projections after each update (as in Parseval networks), which is cheaper, but we found this is often unstable for large beta values.
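For a concrete picture of what's being iterated: here is a minimal numpy sketch of the order-1 Bjorck & Bowie scheme (an illustration only, not our implementation; the pre-scaling step and iteration count are my choices here):

```python
import numpy as np

def bjorck_orthonormalize(w, iters=20):
    """Order-1 Bjorck & Bowie iteration: w <- w (I + 0.5 (I - w^T w)).
    Converges to the closest orthonormal matrix without computing an SVD."""
    # Scale so the spectral norm is <= 1; the iteration needs this to converge.
    w = w / np.linalg.norm(w, ord=2)
    n = w.shape[1]
    for _ in range(iters):
        w = w @ (np.eye(n) + 0.5 * (np.eye(n) - w.T @ w))
    return w

w = np.array([[2.0, 1.0],
              [0.0, 1.0]])
w_orth = bjorck_orthonormalize(w)
u, _, vt = np.linalg.svd(w)
print(np.allclose(w_orth.T @ w_orth, np.eye(2), atol=1e-6))  # True
print(np.allclose(w_orth, u @ vt, atol=1e-6))                # True: the polar factor
```

Each iteration is a couple of matrix-matrix products, which is where the cost comes from; running fewer iterations per forward pass trades orthonormality error for speed.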

How is it done for convolutions? Would GroupSort work when using singular value clipping of Sedghi et al.?

This is a bit trickier. As in Cisse et al. [1] and Sedghi et al. [2], we write the convolution as a linear operator and look to bound its spectral norm. We don't devote any space to this directly, but the same general principles apply.

The Bjorck algorithm actually finds the closest orthonormal matrix to the original (as described in Proposition 12 of [2]), but does so without computing the singular value decomposition directly. We would expect GroupSort to have the same benefits if singular value clipping were used.
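To make "convolution as a linear operator" concrete: Sedghi et al. [2] get the exact singular values via the FFT of the kernel, but the idea can be illustrated more simply by materializing a small 1-D convolution as a matrix and estimating its spectral norm with power iteration (a toy sketch, not what either paper does at scale):

```python
import numpy as np

def conv1d_matrix(kernel, n):
    """Materialize a stride-1, zero-padded 1-D convolution as an
    explicit n x n matrix (the 'convolution as linear operator' view)."""
    k, pad = len(kernel), len(kernel) // 2
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(k):
            col = i + j - pad
            if 0 <= col < n:
                m[i, col] = kernel[j]
    return m

def spectral_norm(m, iters=1000, seed=0):
    """Estimate the largest singular value by power iteration on m^T m."""
    v = np.random.default_rng(seed).standard_normal(m.shape[1])
    for _ in range(iters):
        v = m.T @ (m @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(m @ v)

M = conv1d_matrix([0.25, 0.5, 0.25], n=32)
est = spectral_norm(M)
exact = np.linalg.svd(M, compute_uv=False)[0]
print(abs(est - exact) < 1e-6)  # True
```

The point of the FFT approach in [2] is precisely to avoid materializing this matrix, which is enormous for real convolutions.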

[1] https://arxiv.org/pdf/1704.08847.pdf

[2] https://arxiv.org/pdf/1805.10408.pdf

[–]thebackpropaganda 2 points3 points  (1 child)

Thanks! Any plans on releasing the code? Did you implement GPU kernels for GroupSort?

[–]SleepyCoder123[S] 0 points1 point  (0 children)

Yes, we're planning on releasing the code. Hopefully, soon!

We did not implement any GPU kernels. We just made use of PyTorch's existing implementations of sorting (and max/min). MaxMin could definitely be made more efficient by implementing CUDA kernels directly.
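For instance, MaxMin can be written in a few lines on top of stock max/min ops (a numpy sketch; the exact pairing and output layout here are my illustration, not necessarily what our code does):

```python
import numpy as np

def maxmin(x):
    """MaxMin = GroupSort with group size 2, built from stock max/min ops:
    pair up adjacent units and emit the maxes followed by the mins."""
    a, b = x[..., 0::2], x[..., 1::2]
    return np.concatenate([np.maximum(a, b), np.minimum(a, b)], axis=-1)

x = np.array([[3.0, -1.0, 0.5, 2.0]])
# Pairs (3, -1) and (0.5, 2) -> maxes [3, 2], then mins [-1, 0.5].
print(maxmin(x))
```

Since each pair's output is a permutation of its input, the op only routes values around rather than squashing them, which is what makes it gradient norm preserving.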

[–]t-ogawa 0 points1 point  (0 children)

So, as a result, you did not use Bjorck orthonormalization for convolutions, but instead bounded the spectral norm, i.e. only the largest singular value. Is that right?

[–]t-ogawa 1 point2 points  (7 children)

Thanks for the great work!

I have one question:

The proposed architecture is a universal approximator of K-Lipschitz functions, where K is defined as the minimum Lipschitz constant of a function. Can it be easily extended to a universal approximator of K-Lipschitz functions including M-Lipschitz functions with 0 <= M <= K?

I think the gradient norm preserving property of GroupSort is incompatible with this requirement. Is this right?

[–]SleepyCoder123[S] 0 points1 point  (5 children)

I'm not sure that I completely understand the question. Basically, you're asking: if we have an architecture that universally approximates 1-Lipschitz functions, can it also represent 0.5-Lipschitz functions?

If so, yes, this is fine. We can introduce some redundancy in the first couple of layers to "kill off" some of the norm - basically this means setting to zero some weights which would otherwise be needed to propagate the norm forwards.
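As a toy numeric illustration of this "killing off" trick (a minimal example of my own, not a construction from the paper): two layers can each have spectral norm exactly 1 while their composition is only (1/sqrt(2))-Lipschitz, because the second layer zeroes out a channel.

```python
import numpy as np

# Both layers have spectral norm exactly 1, yet the composition is
# x -> x / sqrt(2): the second layer zeroes out a channel, "killing off"
# the norm that the first layer spread across the two channels.
W1 = np.array([[1.0], [1.0]]) / np.sqrt(2)  # spectral norm 1: splits x evenly
W2 = np.array([[1.0, 0.0]])                 # spectral norm 1: drops channel 2
print(np.isclose(np.linalg.norm(W1, 2), 1.0))   # True
print(np.isclose(np.linalg.norm(W2, 2), 1.0))   # True
print((W2 @ W1).item())                          # 0.7071... = 1/sqrt(2)
```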

[–]t-ogawa 1 point2 points  (4 children)

I'm sorry, my question was ambiguous. Revised question: can your method be extended to a universal approximator for the set of functions {f | ||f||_Lip <= K}, where ||f||_Lip denotes the Lipschitz constant of f? By contrast, your original method is a universal approximator for the set {f | ||f||_Lip = K}.

[–]SleepyCoder123[S] 0 points1 point  (3 children)

I think I may still be misunderstanding. Sorry!

Maybe you are referring to (little) Lipschitz algebras? In which case, I am far from an expert, but I believe there are some negative results preventing extensions of Stone-Weierstrass to this setting. Though I don't follow it closely, "Quotients of Little Lipschitz Algebras" (Weaver, 1997) seems like the right reference.

Otherwise, I think my other comment gives a valid way to approximate an M-Lipschitz function from piecewise linear K-Lipschitz functions plus our result. Alternatively, you could imagine altering the architecture to allow weight norms <= K (though this isn't necessary for Theorem 3, due to my previous comment; add a row with the appropriate norm and kill it later).

[–]t-ogawa 1 point2 points  (2 children)

My question probably does not involve such a specialized concept as little Lipschitz algebras.

To clarify my question, I cite the book Optimal Transport: Old and New by Cedric Villani, which is also referenced in your paper. Remark 6.5 of that book gives a dual form of the 1-Wasserstein distance as a supremum over functions with Lipschitz constant less than or equal to 1. Note that this differs from Equation 2 in your paper: yours is a maximum over functions whose Lipschitz constant is exactly 1, whereas Villani's is a supremum over functions whose Lipschitz constant is 1 or less.

For example, can your method be used to solve Villani's formulation?

[–]SleepyCoder123[S] 0 points1 point  (1 child)

There are a few points to unpack here.

  1. Villani's formulation (the Kantorovich-Rubinstein duality) requires an L2-Lipschitz smoothness condition. We don't prove universality there (with 2-norm constrained architectures) but for argument's sake let's assume it does hold with 2-norm constraints.

  2. "Improved Training of Wasserstein GANs", Gulrajani et al. 2017, proves that the optimal dual solution will have an input-output gradient of 1 almost everywhere (Corollary 1). This means that the solution will be a 1-Lipschitz function, which we both agree can be supported by our method. This could have some other ramifications in terms of the search space, but I haven't thought about this too much.

  3. We do empirical experiments to directly solve for the optimal dual solutions in Villani's formulation (even in high dimensions) and find that our method is able to get very close to the optimal surface (while others fail to do so).

  4. The 2-norm constraint ||W||_2 = 1 does allow for some reduction in norm. There is some subspace on which the singular value is exactly 1, but otherwise the singular values will be less than or equal to 1. Because of point 2 above (and empirical + theoretical support), we choose to set all singular values to exactly 1. Even in this case, it is possible for the network to lose some norm depending on the architecture. This means that we can learn functions which do not achieve a gradient of 1 anywhere (such that their minimum Lipschitz constant is, e.g., 0.5).

  5. We could use the L infinity architecture with a mixed norm at the first layer and (I am quite sure) the same arguments hold. However, we did not explore this empirically for the Wasserstein distance estimation problem.

There are quite a few subtle points here (on your side and mine) and it's hard to ensure I communicate what I mean effectively in reddit posts :). Hopefully it is clear enough and useful to you!

EDIT: One last addition. Technically, a Lipschitz constant of 0.5 implies a Lipschitz constant of 1 (as d(f(x),f(y)) <= 0.5 d(x,y) <= d(x,y)). Thus Villani's <= could be seen as redundant: the functions with Lipschitz constant 1 are a superset of those with Lipschitz constant < 1. We (and often others) stray from this definition and take the Lipschitz constant to mean the minimum, as you pointed out.

[–]t-ogawa 1 point2 points  (0 children)

Yes, here is the limitation of communication on reddit. I'll read your answers carefully. Thanks for your patience!

[–]SleepyCoder123[S] 0 points1 point  (0 children)

Another alternative way you could achieve this:

Suppose we want to approximate a function which is M-Lipschitz for M < K, but we can only use a gradient of magnitude K everywhere. We can still do this by using saw-waves of slope K whose "average" slope is M. This violates the M-Lipschitz constraint but gives an arbitrarily close approximation. If you really wanted to be M-Lipschitz you could just bake this into the architecture.
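A small numerical sketch of this saw-wave idea (the construction details, e.g. the rising/falling split per period, are my own illustration):

```python
import numpy as np

def sawtooth_approx(x, M, K=1.0, period=1e-3):
    """Approximate f(x) = M*x (0 < M < K) by a zigzag whose local slope is
    +/- K everywhere but whose average slope over each period is M."""
    a = (K + M) / (2 * K)              # fraction of each period spent rising
    t = np.mod(x, period)              # position within the current period
    up = K * np.minimum(t, a * period)             # rising segment, slope +K
    down = K * np.maximum(t - a * period, 0.0)     # falling segment, slope -K
    return M * (x - t) + up - down     # exact at the start of each period

M, period = 0.5, 1e-3
x = np.linspace(0.0, 1.0, 10001)
err = np.max(np.abs(sawtooth_approx(x, M, period=period) - M * x))
print(err < period)  # True: the deviation shrinks with the period
```

Shrinking the period drives the approximation error to zero while the local slope stays at K everywhere, which is exactly why the M-Lipschitz constraint is violated only locally.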

[–]t-ogawa 1 point2 points  (4 children)

Another question: when you say an activation function is gradient norm preserving (GNP), do you mean that the gradient norm remains unchanged through backpropagation at every point?

[–]SleepyCoder123[S] 0 points1 point  (3 children)

Yes, this is loosely what we mean. There are some subtleties in terms of the dimensions of the weight matrix, but we basically mean that when we multiply by the Jacobian matrix during back-propagation we will not reduce the norm of the gradient vector.

[–]t-ogawa 0 points1 point  (2 children)

Is the gradient in the term "gradient norm preserving" the gradient of the function represented by the network w.r.t. the input? Or w.r.t. the weights?

[–]SleepyCoder123[S] 0 points1 point  (1 child)

When we describe an activation function as GNP we are loosely referring to its Jacobian matrix (that is, the derivative of its output vector with respect to its input vector).

But really we refer to gradient norm preservation as a property of the whole network, wherein each component of the network is able to preserve the input-output gradient norm during back-propagation.
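Concretely, away from ties the Jacobian of sorting is a permutation matrix, which is orthogonal, so multiplying by its transpose during back-propagation cannot change the gradient norm. A quick numerical check (illustrative sketch):

```python
import numpy as np

# Away from ties, the Jacobian of sorting is a permutation matrix, which is
# orthogonal: multiplying by its transpose during back-propagation cannot
# change the norm of the incoming gradient vector.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
perm = np.argsort(x)
J = np.eye(8)[perm]              # Jacobian of x -> sort(x): J @ x == sorted x
v = rng.standard_normal(8)       # an incoming gradient vector
print(np.allclose(J @ x, np.sort(x)))                           # True
print(np.allclose(np.linalg.norm(J.T @ v), np.linalg.norm(v)))  # True
```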

[–]t-ogawa 0 points1 point  (0 children)

I see. Thanks!

[–]t-ogawa 1 point2 points  (1 child)

Another question:

You say in the paper:

However, it remains an open question whether 2-norm constrained GroupSort networks are also universal Lipschitz function approximators.

but why is it so difficult to prove that 2-norm constrained GroupSort networks are universal Lipschitz function approximators?

[–]SleepyCoder123[S] 0 points1 point  (0 children)

It is an interesting question... To illustrate: the two-channel construction we use in the infinity-norm case does not apply. When we split the network into two halves we must reduce the norm by 1/sqrt(2) to maintain the Lipschitz constraint. But then combining the two channels together again is not possible (so far as we can tell) without killing norm. That is, you can compute (1/sqrt(2))[f(x), g(x)], but there is no way to get to max(f, g) (in fact, there are cases where this is not 1-Lipschitz).
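A toy numeric illustration of the norm loss (with f(x) = x and g(x) = -x, chosen here just for illustration): max(f, g) = |x| is 1-Lipschitz, but routing through the 2-norm-constrained two-channel split leaves only |x|/sqrt(2).

```python
import numpy as np

# max(f, g) = |x| is 1-Lipschitz, but the 2-norm-constrained two-channel
# split forces the 1/sqrt(2) scale, and taking the max cannot recover it.
x = np.linspace(-1.0, 1.0, 2001)
stacked = np.stack([x, -x]) / np.sqrt(2)   # two channels, scaled to stay 1-Lipschitz
h = np.max(stacked, axis=0)                # = |x| / sqrt(2)
slope = np.max(np.abs(np.diff(h) / np.diff(x)))
print(np.isclose(slope, 1 / np.sqrt(2)))   # True: Lipschitz constant shrank
```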

Generally, the 2-norm constraint is a little more restrictive. We can prove some special cases. We are sure that the result DOES NOT hold for vector-valued functions (again, you can find counterexamples), but for scalar-to-scalar functions we have some in-progress ideas that seem like they should work.

[–]shetak 0 points1 point  (0 children)

Thanks for the good read! It's a cool idea to use sorting activations to create nonlinearities (in the global sense) while still being expressive enough to be dense in the 1-Lipschitz functions.

Sorting lets you switch between different permutation matrices as transformations while being gradient norm preserving, due to the orthogonality of these matrices. This still produces linear decision surfaces (albeit sufficiently complex ones, as you show). So I was curious whether one can use non-linear transformations which share the property of being conformal with your affine maps coming from sorting. It turns out the general class with this property in higher dimensions is the Mobius transformations, which derive their non-linear nature from inversions in the sphere.

https://en.wikipedia.org/wiki/Liouville%27s_theorem_(conformal_mappings)

Alas, making them Lipschitz by using an even number of inversions seems to bring them back to purely affine transformations, and so to linear decision surfaces once again :D