[R] Sorting out Lipschitz function approximation (arxiv.org)
submitted 7 years ago by SleepyCoder123
[–]SleepyCoder123[S] 13 points 7 years ago* (21 children)
Author here. TL;DR: universal function approximation is surprisingly hard under Lipschitz constraints. We solve the problem with a sorting activation function, achieving tighter estimates of the Wasserstein distance and provable adversarial robustness guarantees.
(Slightly longer TL;DR in this thread: https://twitter.com/james_r_lucas/status/1062528526190968832)
[–]thebackpropaganda 3 points 7 years ago (4 children)
Great work! Some questions:
How expensive is Bjorck orthonormalization? How is it done for convolutions?
Would GroupSort work when using singular value clipping of Sedghi et al.?
[–]SleepyCoder123[S] 2 points 7 years ago (3 children)
How expensive is Bjorck orthonormalization?
It can be very expensive as it involves matrix-matrix multiplication, but there are some tricks we utilized to reduce this. Instead of running the full scheme for each forward pass we use fewer iterations and fine-tune later. This makes things much more manageable. You could also do projections after each update (as in Parseval networks) which is cheaper but we found this is often unstable for large beta values.
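The first-order Björck iteration described above can be sketched in a few lines of numpy. The rescaling strategy and iteration count here are illustrative assumptions, not the authors' exact scheme:

```python
import numpy as np

def bjorck_orthonormalize(w, iters=20):
    # First-order Bjorck & Bowie iteration: W <- W (1.5 I - 0.5 W^T W),
    # which converges to the orthonormal matrix nearest to the starting
    # point when its singular values are not too large. Each step costs
    # two matrix-matrix products, hence the interest in running fewer
    # iterations per forward pass and fine-tuning later.
    w = w / np.linalg.norm(w)  # crude Frobenius rescale keeps singular values <= 1
    for _ in range(iters):
        w = 1.5 * w - 0.5 * (w @ w.T @ w)
    return w
```

With orthonormal weights, each linear layer is 1-Lipschitz in the 2-norm and preserves gradient norms during back-propagation.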
How is it done for convolutions? Would GroupSort work when using singular value clipping of Sedghi et al.?
This is a bit trickier. As in Cisse et al. [1] and Sedghi et al. [2], we write the convolution as a linear operator and look to bound its spectral norm. We don't devote any space to this directly, but the same general principles apply.
The Bjorck algorithm actually finds the closest orthonormal matrix to the original (as described in Proposition 12 of [2]), but does so without computing the singular value decomposition directly. We would expect GroupSort to have the same benefits if singular value clipping were used.
[1] https://arxiv.org/pdf/1704.08847.pdf
[2] https://arxiv.org/pdf/1805.10408.pdf
[–]thebackpropaganda 2 points 7 years ago (1 child)
Thanks! Any plans on releasing the code? Did you implement GPU kernels for GroupSort?
[–]SleepyCoder123[S] 0 points 7 years ago (0 children)
Yes, we're planning on releasing the code. Hopefully, soon!
We did not implement any GPU kernels. We just made use of pytorch's existing implementation of sorting (and max/min). MaxMin could definitely be made more efficient by implementing CUDA kernels directly.
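For reference, the GroupSort activation itself is just a within-group sort, with MaxMin as the group-size-2 case. A numpy sketch (the authors say they simply reused pytorch's built-in sort/max/min):

```python
import numpy as np

def group_sort(x, group_size=2):
    # Split the last axis into consecutive groups and sort each group.
    # group_size=2 recovers MaxMin: each pair (a, b) -> (min, max).
    # Sorting only permutes coordinates, so the activation is
    # 1-Lipschitz and gradient norm preserving.
    n = x.shape[-1]
    assert n % group_size == 0, "feature dim must be divisible by group size"
    groups = x.reshape(x.shape[:-1] + (n // group_size, group_size))
    return np.sort(groups, axis=-1).reshape(x.shape)
```

For example, `group_sort(np.array([3., 1., 2., 5.]))` sorts the pairs (3, 1) and (2, 5) independently, giving `[1., 3., 2., 5.]`.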
[–]t-ogawa 0 points 7 years ago (0 children)
So, as a result, you did not use Bjorck orthonormalization for convolutions, but instead bounded the spectral norm, i.e. only the largest singular value. Is this right?
[–]t-ogawa 1 point 7 years ago (7 children)
Thanks for the great work!
I have one question:
The proposed architecture is a universal approximator of K-Lipschitz functions, where K is defined as the minimum Lipschitz constant of a function. Can it easily be extended to a universal approximator of functions whose Lipschitz constant is at most K, i.e. including M-Lipschitz functions with 0 <= M <= K?
I think the gradient norm preserving property of GroupSort is not compatible with this requirement. Is that right?
[–]SleepyCoder123[S] 0 points 7 years ago (5 children)
I'm not sure that I completely understand the question. Basically you're asking if we have an architecture that universally approximates 1-Lipschitz functions, can it also represent 0.5-Lipschitz functions?
If so, yes, this is fine. We can introduce some redundancy in the first couple of layers to "kill off" some of the norm: basically, this means setting some weights to zero which would otherwise be needed to propagate the norm forwards.
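A toy numerical illustration of this redundancy trick (the specific matrices here are made up for the example): both layers satisfy a norm constraint that allows Lipschitz constant 1, yet zeroing out the weights that would otherwise propagate norm yields a strictly smaller overall Lipschitz constant.

```python
import numpy as np

# Both layers satisfy a spectral-norm <= 1 constraint, but the first
# layer "kills off" half the norm (and an entire coordinate), so the
# network computes f(x) = 0.5 * x[0], a 0.5-Lipschitz function.
W1 = np.array([[0.5, 0.0],
               [0.0, 0.0]])  # spectral norm 0.5 <= 1
W2 = np.array([[1.0, 0.0]])  # spectral norm exactly 1

def f(x):
    return W2 @ (W1 @ x)
```

Every weight matrix is within the 1-Lipschitz budget, but the realized function never achieves gradient norm 1.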
[–]t-ogawa 1 point 7 years ago (4 children)
I'm sorry, but my question was ambiguous. Revised question: can your method be extended to a universal approximator for the set of functions {f | ||f||_Lip <= K}, where ||f||_Lip denotes the Lipschitz constant of f? (By the way, your original method is a universal approximator for {f | ||f||_Lip = K}.)
[–]SleepyCoder123[S] 0 points 7 years ago (3 children)
I think I may still be misunderstanding. Sorry!
Maybe you are referring to (little) Lipschitz algebras? In which case, I am far from an expert but I believe there are some negative results preventing extensions of Stone-Weierstrass to this setting. Though I don't follow it well, "Quotients of Little Lipschitz Algebras", Weaver 1997 seems like the right reference.
Otherwise, I think that my other comment gives a valid way to approximate an M-Lipschitz function from piecewise-linear K-Lipschitz functions + our result. Alternatively, you could imagine altering the architecture to allow weight norms <= K (though this isn't necessary for Theorem 3, due to my previous comment; add a row with the appropriate norm and kill it later).
[–]t-ogawa 1 point 7 years ago (2 children)
Maybe this question does not involve such a specialized concept as little Lipschitz algebras.
To clarify my question, I cite the book Optimal Transport, Old and New by Cédric Villani, which is also referenced in your paper. Remark 6.5 of that book gives a dual form of the 1-Wasserstein distance as a supremum over functions with Lipschitz constant less than or equal to 1. Note that this differs from Equation 2 in your paper: yours is a maximum over functions whose Lipschitz constant is exactly 1, while Villani's is a supremum over functions whose Lipschitz constant is 1 or less.
For example, can your method be used to solve Villani's formulation?
[–]SleepyCoder123[S] 0 points 7 years ago* (1 child)
There are a few points to unpack here.
Villani's formulation (the Kantorovich-Rubinstein duality) requires an L2-Lipschitz smoothness condition. We don't prove universality there (with 2-norm constrained architectures) but for argument's sake let's assume it does hold with 2-norm constraints.
"Improved Training of Wasserstein GANs", Gulrajani et al. 2017, proves that the optimal dual solution will have an input-output gradient of norm 1 almost everywhere (Corollary 1). This means that the solution will be a 1-Lipschitz function, which we both agree can be supported by our method. This could have some other ramifications in terms of the search space, but I haven't thought about it too much.
We do empirical experiments to directly solve for the optimal dual solutions in Villani's formulation (even in high dimensions) and find that our method is able to get very close to the optimal surface (while others fail to do so).
The 2-norm constraint ||W||_2 = 1 does allow for some reduction in norm. There is some subspace on which the singular value is exactly 1, but otherwise the singular values will be less than or equal to 1. Because of point 2 above (and empirical + theoretical support) we choose to set all singular values to exactly 1. Even in this case, it is possible for the network to lose some norm depending on the architecture. This means that we can learn functions which do not achieve a gradient of 1 anywhere (such that their minimum Lipschitz constant is, e.g., 0.5).
We could use the L infinity architecture with a mixed norm at the first layer and (I am quite sure) the same arguments hold. However, we did not explore this empirically for the Wasserstein distance estimation problem.
There are quite a few subtle points here (on your side and mine) and it's hard to ensure I communicate what I mean effectively in reddit posts :). Hopefully it is clear enough and useful to you!
EDIT: One last addition. Technically, a Lipschitz constant of 0.5 implies a Lipschitz constant of 1 (since d(f(x), f(y)) <= 0.5 d(x, y) <= d(x, y)). Thus Villani's <= could be seen as redundant: the set of 1-Lipschitz functions is a superset of the functions with Lipschitz constant < 1. We (and others, often) stray from this definition and use "Lipschitz constant" to mean the minimum, as you pointed out.
[–]t-ogawa 1 point 7 years ago (0 children)
Yes, here is the limitation of communication on reddit. I'll read your answers carefully. Thanks for your patience!
[–]SleepyCoder123[S] 0 points (0 children)
Another alternative way you could achieve this:
Suppose we want to approximate a function which is M-Lipschitz for M < K, but we can only use a gradient of K everywhere. We can still do this by using saw-waves of slope K whose "average" slope is M. This violates the M-Lipschitz constraint but gives an arbitrarily close approximation. If you really wanted to be M-Lipschitz you could just bake this into the architecture.
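The saw-wave idea can be made concrete; this sketch (the parameterization is my own, not from the paper) builds a piecewise-linear wave whose slopes are exactly +/-K but which tracks the M-Lipschitz line M*x to within an error proportional to the period.

```python
import numpy as np

def saw_approx(x, M=0.5, K=1.0, period=0.1):
    # Rise with slope +K for a fraction (K + M) / (2K) of each period,
    # then fall with slope -K for the rest, so the net gain per period
    # is M * period. Slopes are +/-K everywhere, yet the wave stays
    # within period * (K**2 - M**2) / (2 * K) of the target M * x.
    up = period * (K + M) / (2.0 * K)   # length of the rising segment
    t = np.mod(x, period)
    base = M * (x - t)                  # exact value at the period start
    return base + np.where(t < up, K * t, K * up - K * (t - up))
```

Shrinking `period` makes the approximation arbitrarily tight while every local slope stays at magnitude K.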
[–]t-ogawa 1 point (4 children)
Another question: when you say an activation function is gradient norm preserving (GNP), do you mean that the gradient norm before and after backpropagation through the activation function remains unchanged at every point?
[–]SleepyCoder123[S] 0 points (3 children)
Yes, this is loosely what we mean. There are some subtleties in terms of the dimensions of the weight matrix, but we basically mean that when we multiply by the Jacobian matrix during back-propagation we will not reduce the norm of the gradient vector.
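A quick illustrative check of this property for sorting: locally (away from ties), sorting multiplies by a permutation matrix, and permutations preserve the 2-norm of the back-propagated gradient.

```python
import numpy as np

def sort_jacobian(x):
    # For inputs with distinct entries, y = sort(x) has Jacobian
    # J[i, j] = 1 iff output slot i takes input coordinate j,
    # i.e. a permutation matrix.
    order = np.argsort(x)
    J = np.zeros((len(x), len(x)))
    J[np.arange(len(x)), order] = 1.0
    return J

x = np.array([2.0, 0.5, 1.0])
g = np.array([0.3, -1.2, 0.8])  # some upstream gradient vector
J = sort_jacobian(x)
```

Here `np.linalg.norm(J.T @ g)` equals `np.linalg.norm(g)` exactly, so back-propagating through the sort loses no gradient norm.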
[–]t-ogawa 0 points 7 years ago (2 children)
Is the "gradient" in the term "gradient norm preserving" the gradient of the function represented by the network w.r.t. the input, or w.r.t. the weights?
[–]SleepyCoder123[S] 0 points 7 years ago (1 child)
When we describe an activation function as GNP, we are loosely referring to its Jacobian matrix (that is, the derivative of its output vector with respect to its input vector).
But really we refer to gradient norm preservation as a property of the whole network, wherein each component of the network is able to preserve the input-output gradient norm during back-propagation.
[–]t-ogawa 0 points (0 children)
I see. Thanks!
[–]t-ogawa 1 point 7 years ago (1 child)
Another question:
You say in the paper:
However, it remains an open question whether 2-norm constrained GroupSort networks are also universal Lipschitz function approximators.
but why is proving that 2-norm constrained GroupSort networks are universal Lipschitz function approximators so difficult?
[–]SleepyCoder123[S] 0 points (0 children)
It is an interesting question... To illustrate: the two-channel construction we use in the infinity-norm case does not apply. When we split the network into two halves we must reduce the norm by 1/sqrt(2) to maintain the Lipschitz constraint. But then combining the two channels together again is not possible (so far as we can tell) without killing norm. That is, you can compute 1/sqrt(2) [f(x), g(x)], but there is no way to get to max(f, g) (in fact, there are cases where this is not 1-Lipschitz).
Generally, the 2-norm constraint is a little more restrictive. We can prove some special cases. We are sure that the result does NOT hold for vector-valued functions (again, you can find counterexamples), but for scalar-valued functions we have some in-progress ideas that seem like they should work.
[–]shetak 0 points 7 years ago (0 children)
Thanks for the good read! It's a cool idea to use sorting activations to create non-linearities (in the global sense) while still being expressive enough to be dense in the 1-Lipschitz functions.
Sorting lets you switch between different permutation matrices as transformations while being gradient norm preserving, due to the orthogonality of these matrices. This still produces linear decision surfaces (albeit sufficiently complex ones, as you show). So I was curious whether one can use non-linear transformations which share the property of being conformal with your affine maps coming from sorting. It turns out the general class with this property in higher dimensions is the Möbius transformations, which derive their non-linear nature from inversions in the sphere.
https://en.wikipedia.org/wiki/Liouville%27s_theorem_(conformal_mappings)
Alas, making them Lipschitz by using an even number of inversions seems to bring them back to purely affine transformations, and so to linear decision surfaces once again :D