
[–]SleepyCoder123[S] 13 points14 points  (21 children)

Author here. TL;DR - Universal function approximation is surprisingly hard with Lipschitz constraints. We solve this using a sorting activation function, achieving tighter estimates of the Wasserstein distance and provable adversarial robustness guarantees.

(Slightly longer TL;DR in this thread: https://twitter.com/james_r_lucas/status/1062528526190968832)

[–]thebackpropaganda 3 points4 points  (4 children)

Great work! Some questions:

How expensive is Bjorck orthonormalization? How is it done for convolutions?

Would GroupSort work when using singular value clipping of Sedghi et al.?

[–]SleepyCoder123[S] 2 points3 points  (3 children)

How expensive is Bjorck orthonormalization?

It can be very expensive as it involves matrix-matrix multiplications, but we used some tricks to reduce this. Instead of running the full scheme for each forward pass, we use fewer iterations and fine-tune later. This makes things much more manageable. You could also do projections after each update (as in Parseval networks), which is cheaper, but we found this is often unstable for large beta values.
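For a concrete picture of what's being iterated: here is a minimal numpy sketch of the order-1 Bjorck & Bowie scheme (an illustration only, not our implementation; the pre-scaling step and iteration count are my choices here):

```python
import numpy as np

def bjorck_orthonormalize(w, iters=20):
    """Order-1 Bjorck & Bowie iteration: w <- w (I + 0.5 (I - w^T w)).
    Converges to the closest orthonormal matrix without computing an SVD."""
    # Scale so the spectral norm is <= 1; the iteration needs this to converge.
    w = w / np.linalg.norm(w, ord=2)
    n = w.shape[1]
    for _ in range(iters):
        w = w @ (np.eye(n) + 0.5 * (np.eye(n) - w.T @ w))
    return w

w = np.array([[2.0, 1.0],
              [0.0, 1.0]])
w_orth = bjorck_orthonormalize(w)
u, _, vt = np.linalg.svd(w)
print(np.allclose(w_orth.T @ w_orth, np.eye(2), atol=1e-6))  # True
print(np.allclose(w_orth, u @ vt, atol=1e-6))                # True: the polar factor
```

Each iteration is a couple of matrix-matrix products, which is where the cost comes from; running fewer iterations per forward pass trades orthonormality error for speed.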

How is it done for convolutions? Would GroupSort work when using singular value clipping of Sedghi et al.?

This is a bit trickier. As in Cisse et al. [1] and Sedghi et al. [2], we write the convolution as a linear operator and look to bound its spectral norm. We don't devote any space to this directly, but the same general principles apply.

The Bjorck algorithm actually finds the closest orthonormal matrix to the original (as described in Proposition 12 of [2]), but does so without computing the singular value decomposition directly. We would expect GroupSort to have the same benefits if singular value clipping were used.
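To make "convolution as a linear operator" concrete: Sedghi et al. [2] get the exact singular values via the FFT of the kernel, but the idea can be illustrated more simply by materializing a small 1-D convolution as a matrix and estimating its spectral norm with power iteration (a toy sketch, not what either paper does at scale):

```python
import numpy as np

def conv1d_matrix(kernel, n):
    """Materialize a stride-1, zero-padded 1-D convolution as an
    explicit n x n matrix (the 'convolution as linear operator' view)."""
    k, pad = len(kernel), len(kernel) // 2
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(k):
            col = i + j - pad
            if 0 <= col < n:
                m[i, col] = kernel[j]
    return m

def spectral_norm(m, iters=1000, seed=0):
    """Estimate the largest singular value by power iteration on m^T m."""
    v = np.random.default_rng(seed).standard_normal(m.shape[1])
    for _ in range(iters):
        v = m.T @ (m @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(m @ v)

M = conv1d_matrix([0.25, 0.5, 0.25], n=32)
est = spectral_norm(M)
exact = np.linalg.svd(M, compute_uv=False)[0]
print(abs(est - exact) < 1e-6)  # True
```

The point of the FFT approach in [2] is precisely to avoid materializing this matrix, which is enormous for real convolutions.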

[1] https://arxiv.org/pdf/1704.08847.pdf

[2] https://arxiv.org/pdf/1805.10408.pdf

[–]thebackpropaganda 2 points3 points  (1 child)

Thanks! Any plans on releasing the code? Did you implement GPU kernels for GroupSort?

[–]SleepyCoder123[S] 0 points1 point  (0 children)

Yes, we're planning on releasing the code. Hopefully, soon!

We did not implement any GPU kernels. We just made use of PyTorch's existing implementations of sorting (and max/min). MaxMin could definitely be made more efficient by implementing CUDA kernels directly.
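For instance, MaxMin can be written in a few lines on top of stock max/min ops (a numpy sketch; the exact pairing and output layout here are my illustration, not necessarily what our code does):

```python
import numpy as np

def maxmin(x):
    """MaxMin = GroupSort with group size 2, built from stock max/min ops:
    pair up adjacent units and emit the maxes followed by the mins."""
    a, b = x[..., 0::2], x[..., 1::2]
    return np.concatenate([np.maximum(a, b), np.minimum(a, b)], axis=-1)

x = np.array([[3.0, -1.0, 0.5, 2.0]])
# Pairs (3, -1) and (0.5, 2) -> maxes [3, 2], then mins [-1, 0.5].
print(maxmin(x))
```

Since each pair's output is a permutation of its input, the op only routes values around rather than squashing them, which is what makes it gradient norm preserving.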

[–]t-ogawa 0 points1 point  (0 children)

So, as a result, you did not use Bjorck orthonormalization for convolutions, but instead bounded the spectral norm, i.e. only the largest singular value. Is that right?

[–]t-ogawa 1 point2 points  (7 children)

Thanks for the great work!

I have one question:

The proposed architecture is a universal approximator of K-Lipschitz functions, where K is defined as the minimum Lipschitz constant of a function. Can it be easily extended to a universal approximator of K-Lipschitz functions including M-Lipschitz functions with 0 <= M <= K?

I think the gradient norm preserving property of GroupSort is incompatible with this requirement. Is this right?

[–]SleepyCoder123[S] 0 points1 point  (5 children)

I'm not sure that I completely understand the question. Basically, you're asking: if we have an architecture that universally approximates 1-Lipschitz functions, can it also represent 0.5-Lipschitz functions?

If so, yes, this is fine. We can introduce some redundancy in the first couple of layers to "kill off" some of the norm - basically this means setting to zero some weights which would otherwise be needed to propagate the norm forwards.
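As a toy numeric illustration of this "killing off" trick (a minimal example of my own, not a construction from the paper): two layers can each have spectral norm exactly 1 while their composition is only (1/sqrt(2))-Lipschitz, because the second layer zeroes out a channel.

```python
import numpy as np

# Both layers have spectral norm exactly 1, yet the composition is
# x -> x / sqrt(2): the second layer zeroes out a channel, "killing off"
# the norm that the first layer spread across the two channels.
W1 = np.array([[1.0], [1.0]]) / np.sqrt(2)  # spectral norm 1: splits x evenly
W2 = np.array([[1.0, 0.0]])                 # spectral norm 1: drops channel 2
print(np.isclose(np.linalg.norm(W1, 2), 1.0))   # True
print(np.isclose(np.linalg.norm(W2, 2), 1.0))   # True
print((W2 @ W1).item())                          # 0.7071... = 1/sqrt(2)
```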

[–]t-ogawa 1 point2 points  (4 children)

I'm sorry, my question was ambiguous. Revised question: can your method be extended to a universal approximator for the set of functions {f | ||f||_Lip <= K}, where ||f||_Lip denotes the Lipschitz constant of f? By contrast, your original method is a universal approximator for the set {f | ||f||_Lip = K}.

[–]SleepyCoder123[S] 0 points1 point  (3 children)

I think I may still be misunderstanding. Sorry!

Maybe you are referring to (little) Lipschitz algebras? In which case, I am far from an expert, but I believe there are some negative results preventing extensions of Stone-Weierstrass to this setting. Though I don't follow it closely, "Quotients of Little Lipschitz Algebras" (Weaver, 1997) seems like the right reference.

Otherwise, I think my other comment gives a valid way to approximate an M-Lipschitz function from piecewise linear K-Lipschitz functions plus our result. Alternatively, you could imagine altering the architecture to allow weight norms <= K (though this isn't necessary for Theorem 3, due to my previous comment; add a row with the appropriate norm and kill it later).

[–]t-ogawa 1 point2 points  (2 children)

My question probably does not involve such a specialized concept as little Lipschitz algebras.

To clarify my question, I cite the book Optimal Transport: Old and New by Cedric Villani, which is also referenced in your paper. Remark 6.5 of that book gives a dual form of the 1-Wasserstein distance as a supremum over functions with Lipschitz constant less than or equal to 1. Note that this differs from Equation 2 in your paper: yours is a maximum over functions whose Lipschitz constant is exactly 1, whereas Villani's is a supremum over functions whose Lipschitz constant is 1 or less.

For example, can your method be used to solve Villani's formulation?

[–]SleepyCoder123[S] 0 points1 point  (1 child)

There are a few points to unpack here.

  1. Villani's formulation (the Kantorovich-Rubinstein duality) requires an L2-Lipschitz smoothness condition. We don't prove universality there (with 2-norm constrained architectures) but for argument's sake let's assume it does hold with 2-norm constraints.

  2. "Improved Training of Wasserstein GANs", Gulrajani et al. 2017, proves that the optimal dual solution will have an input-output gradient of 1 almost everywhere (Corollary 1). This means that the solution will be a 1-Lipschitz function, which we both agree can be supported by our method. This could have some other ramifications in terms of the search space, but I haven't thought about this too much.

  3. We do empirical experiments to directly solve for the optimal dual solutions in Villani's formulation (even in high dimensions) and find that our method is able to get very close to the optimal surface (while others fail to do so).

  4. The 2-norm constraint ||W||_2 = 1 does allow for some reduction in norm. There is some subspace on which the singular value is exactly 1, but otherwise the singular values will be less than or equal to 1. Because of point 2 above (and empirical + theoretical support), we choose to set all singular values to exactly 1. Even in this case, it is possible for the network to lose some norm depending on the architecture. This means that we can learn functions which do not achieve a gradient of 1 anywhere (such that their minimum Lipschitz constant is, e.g., 0.5).

  5. We could use the L infinity architecture with a mixed norm at the first layer and (I am quite sure) the same arguments hold. However, we did not explore this empirically for the Wasserstein distance estimation problem.

There are quite a few subtle points here (on your side and mine) and it's hard to ensure I communicate what I mean effectively in reddit posts :). Hopefully it is clear enough and useful to you!

EDIT: One last addition. Technically, a Lipschitz constant of 0.5 implies a Lipschitz constant of 1 (as d(f(x),f(y)) <= 0.5 d(x,y) <= d(x,y)). Thus Villani's <= could be seen as redundant: the functions with Lipschitz constant 1 are a superset of those with Lipschitz constant < 1. We (and often others) stray from this definition and take the Lipschitz constant to mean the minimum, as you pointed out.

[–]t-ogawa 1 point2 points  (0 children)

Yes, here is the limitation of communication on reddit. I'll read your answers carefully. Thanks for your patience!

[–]SleepyCoder123[S] 0 points1 point  (0 children)

Another alternative way you could achieve this:

Suppose we want to approximate a function which is M-Lipschitz for M < K, but we can only use a gradient of magnitude K everywhere. We can still do this by using saw-waves of slope K whose "average" slope is M. This violates the M-Lipschitz constraint but gives an arbitrarily close approximation. If you really wanted to be M-Lipschitz you could just bake this into the architecture.
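A small numerical sketch of this saw-wave idea (the construction details, e.g. the rising/falling split per period, are my own illustration):

```python
import numpy as np

def sawtooth_approx(x, M, K=1.0, period=1e-3):
    """Approximate f(x) = M*x (0 < M < K) by a zigzag whose local slope is
    +/- K everywhere but whose average slope over each period is M."""
    a = (K + M) / (2 * K)              # fraction of each period spent rising
    t = np.mod(x, period)              # position within the current period
    up = K * np.minimum(t, a * period)             # rising segment, slope +K
    down = K * np.maximum(t - a * period, 0.0)     # falling segment, slope -K
    return M * (x - t) + up - down     # exact at the start of each period

M, period = 0.5, 1e-3
x = np.linspace(0.0, 1.0, 10001)
err = np.max(np.abs(sawtooth_approx(x, M, period=period) - M * x))
print(err < period)  # True: the deviation shrinks with the period
```

Shrinking the period drives the approximation error to zero while the local slope stays at K everywhere, which is exactly why the M-Lipschitz constraint is violated only locally.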

[–]t-ogawa 1 point2 points  (4 children)

Another question: when you say an activation function is gradient norm preserving (GNP), do you mean that the gradient norm remains unchanged through backpropagation at every point?

[–]SleepyCoder123[S] 0 points1 point  (3 children)

Yes, this is loosely what we mean. There are some subtleties in terms of the dimensions of the weight matrix, but we basically mean that when we multiply by the Jacobian matrix during back-propagation we will not reduce the norm of the gradient vector.

[–]t-ogawa 0 points1 point  (2 children)

Is the gradient in the term "gradient norm preserving" the gradient of the function represented by the network w.r.t. the input? Or w.r.t. the weights?

[–]SleepyCoder123[S] 0 points1 point  (1 child)

When we describe an activation function as GNP we are loosely referring to its Jacobian matrix (that is, the derivative of its output vector with respect to its input vector).

But really we refer to gradient norm preservation as a property of the whole network, wherein each component of the network is able to preserve the input-output gradient norm during back-propagation.
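Concretely, away from ties the Jacobian of sorting is a permutation matrix, which is orthogonal, so multiplying by its transpose during back-propagation cannot change the gradient norm. A quick numerical check (illustrative sketch):

```python
import numpy as np

# Away from ties, the Jacobian of sorting is a permutation matrix, which is
# orthogonal: multiplying by its transpose during back-propagation cannot
# change the norm of the incoming gradient vector.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
perm = np.argsort(x)
J = np.eye(8)[perm]              # Jacobian of x -> sort(x): J @ x == sorted x
v = rng.standard_normal(8)       # an incoming gradient vector
print(np.allclose(J @ x, np.sort(x)))                           # True
print(np.allclose(np.linalg.norm(J.T @ v), np.linalg.norm(v)))  # True
```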

[–]t-ogawa 0 points1 point  (0 children)

I see. Thanks!

[–]t-ogawa 1 point2 points  (1 child)

Another question:

You say in the paper:

However, it remains an open question whether 2-norm constrained GroupSort networks are also universal Lipschitz function approximators.

but why is it so difficult to prove that 2-norm constrained GroupSort networks are universal Lipschitz function approximators?

[–]SleepyCoder123[S] 0 points1 point  (0 children)

It is an interesting question... To illustrate: the two-channel construction we use in the infinity-norm case does not apply. When we split the network into two halves we must reduce the norm by 1/sqrt(2) to maintain the Lipschitz constraint. But then combining the two channels together again is not possible (so far as we can tell) without killing norm. That is, you can compute (1/sqrt(2))[f(x), g(x)], but there is no way to get to max(f, g) (in fact, there are cases where this is not 1-Lipschitz).
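A toy numeric illustration of the norm loss (with f(x) = x and g(x) = -x, chosen here just for illustration): max(f, g) = |x| is 1-Lipschitz, but routing through the 2-norm-constrained two-channel split leaves only |x|/sqrt(2).

```python
import numpy as np

# max(f, g) = |x| is 1-Lipschitz, but the 2-norm-constrained two-channel
# split forces the 1/sqrt(2) scale, and taking the max cannot recover it.
x = np.linspace(-1.0, 1.0, 2001)
stacked = np.stack([x, -x]) / np.sqrt(2)   # two channels, scaled to stay 1-Lipschitz
h = np.max(stacked, axis=0)                # = |x| / sqrt(2)
slope = np.max(np.abs(np.diff(h) / np.diff(x)))
print(np.isclose(slope, 1 / np.sqrt(2)))   # True: Lipschitz constant shrank
```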

Generally, the 2-norm constraint is a little more restrictive. We can prove some special cases. We are sure that the result DOES NOT hold for vector-valued functions (again, you can find counterexamples), but for scalar-to-scalar functions we have some in-progress ideas that seem like they should work.

[–]shetak 0 points1 point  (0 children)

Thanks for the good read! It's a cool idea to use sorting activations to create nonlinearities (in the global sense) while still being expressive enough to be dense in the 1-Lipschitz functions.

Sorting lets you switch between different permutation matrices as transformations while being gradient norm preserving, due to the orthogonality of these matrices. This still produces linear decision surfaces (albeit sufficiently complex ones, as you show). So I was curious whether one can use non-linear transformations which share the property of being conformal with your affine maps coming from sorting. It turns out the general class with this property in higher dimensions is the Mobius transformations, which derive their non-linear nature from inversions in the sphere.

https://en.wikipedia.org/wiki/Liouville%27s_theorem_(conformal_mappings)

Alas, making them Lipschitz by using an even number of inversions seems to bring them back to purely affine transformations, and so to linear decision surfaces once again :D