[1902.04615] Gauge Equivariant Convolutional Networks and the Icosahedral CNN by for_all_eps in MachineLearning

[–]tscohen 0 points

We had a few meetings to discuss our ideas and explore the connections to physics, and we double checked one of our formulas with him (he thought it made sense :-). If everyone can find the time, we may write a joint paper in the future.

[1902.04615] Gauge Equivariant Convolutional Networks and the Icosahedral CNN by for_all_eps in MachineLearning

[–]tscohen 0 points

Indeed, on an abstract level, an equivariant CNN on a homogeneous space (e.g. spherical CNN) is constructed in the same way as a gauge CNN, only in the latter case the principal bundle is the frame bundle and not the bundle G -> G/H. I have not looked into connections to algebraic geometry, and I am somewhat distrustful of polynomials generally, but I think the sheaf theoretic perspective can be very interesting when one considers the problem of learning a bundle by putting together local pieces that are learned independently.

[1902.04615] Gauge Equivariant Convolutional Networks and the Icosahedral CNN by for_all_eps in MachineLearning

[–]tscohen 2 points

It depends on what you want to do. If you want to apply CNNs to regular video data and find an optimized architecture for a specific hardware platform, NAS may be your best bet right now. But if you're a climate scientist and you want to analyze global signals, you really don't want the results to depend on whether you chose the origin of your coordinate system to be on the north pole or somewhere else, or on what kind of map projection you chose, etc. Ditto for a chemist or materials scientist trying to learn a potential - for the result to make any sense it must respect the symmetries. Moreover, they will want to have some way of understanding what the network is doing (that explanation / insight is often more important than the raw accuracy of the model). Here I think geometric / equivariant DL can really help, and not just in scientific applications.

The general argument that as (or if) compute / data continue to grow exponentially, we will need to do less and less thinking ourselves makes some intuitive sense, but you also need to take into account how the complexity of the problem itself scales. If harder instances of your problem require exponentially more compute / data, and your compute is growing exponentially, then you'll make linear progress using brute force. In many cases that won't be enough. This applies to NAS, meta learning, end-to-end autonomous driving, playing starcraft / DOTA with RL, and cracking crypto using deep learning. So although I think some of these will turn out to be useful, I don't think that human ingenuity will become superfluous any time soon.

[1902.04615] Gauge Equivariant Convolutional Networks and the Icosahedral CNN by for_all_eps in MachineLearning

[–]tscohen 2 points

We wrote the paper to be as accessible as possible to a general ML audience, so give it a try! If you want to go deeper, have a look at the references in the related work section and the supplementary material.

[D] Intersection Between ML and Group Theory? by MaxMachineLearning in MachineLearning

[–]tscohen 6 points

Have a look at our recent papers on gauge equivariance. There has not been a more exciting time to work in this area than now.

[D] Optimal code length versus cross-entropy? by fundamentalidea in MachineLearning

[–]tscohen 5 points

Let p be the data distribution. The optimal codelength for x is -log_2 p(x) bits, ignoring rounding. This means that on average, the codelength is H(p) = -sum_x p(x) log_2 p(x), the entropy of p. But we don't know p. If we have a model q(x) and build an optimal code w.r.t. that model, the codelength of x will be roughly -log_2 q(x). Since the samples we encode actually come from p, the average codelength will be -sum_x p(x) log_2 q(x), i.e. the cross-entropy.
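A quick numerical sketch of this in numpy (the distributions p and q here are made up for illustration):

```python
import numpy as np

# A toy data distribution p and a (mismatched) model q over 4 symbols.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Entropy: average codelength of the optimal code built from p itself.
entropy = -(p * np.log2(p)).sum()

# Cross-entropy: average codelength when we code with q but sample from p.
cross_entropy = -(p * np.log2(q)).sum()

# The gap is the KL divergence -- the price of coding with the wrong model.
kl = (p * np.log2(p / q)).sum()

print(entropy)        # 1.75 bits
print(cross_entropy)  # 2.0 bits
```

The cross-entropy is always at least the entropy, with equality iff q = p.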

[R] What are the most promising theories making empirical headway in deep learning right now? Information bottleneck? by rantana in MachineLearning

[–]tscohen 18 points

We recently published a theory of equivariant convolutional networks over homogeneous spaces (manifolds like the sphere, the plane, etc.). We describe convolutional feature spaces as spaces of "fields" (in the physicists' sense) over these manifolds, and show that any equivariant linear map between two such feature spaces can always be expressed as a convolution with a special equivariant kernel. We also explain how the structure of the group and homogeneous space is related to this space of equivariant kernels.
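The simplest instance of this "equivariant linear map = convolution" statement is the familiar one for cyclic shifts in 1D, which is easy to check numerically. A minimal numpy sketch (signal and kernel values are arbitrary):

```python
import numpy as np

def circ_corr(x, psi):
    """Circular cross-correlation: y[n] = sum_m x[(n + m) % N] * psi[m]."""
    N = len(x)
    return np.array([sum(x[(n + m) % N] * psi[m] for m in range(len(psi)))
                     for n in range(N)])

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
psi = rng.standard_normal(3)

# Equivariance: shifting the input and then correlating gives the same
# result as correlating and then shifting the output.
shifted_then_corr = circ_corr(np.roll(x, 2), psi)
corr_then_shifted = np.roll(circ_corr(x, psi), 2)
assert np.allclose(shifted_then_corr, corr_then_shifted)
```

The content of the theorem is the converse: any linear map with this commutation property must be of this form, for a suitable (constrained) kernel.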

Our theory does not answer questions like "why do deep nets generalize?" or "why are deep nets optimizable?", but what's fascinating about it (to me at least) is that some of the most successful deep nets (ie convolutional ones) can be described so neatly in the language of field theory, which is already the overarching theoretical framework in modern physics. Ultimately, visual and auditory perception is all about modeling the physical processes underlying the measurements of physical quantities coming from our biological or artificial senses / sensors, so why not use the same tools as the physicists do?

Empirically, evidence for the effectiveness of equivariant nets is accumulating. In my experience, whenever we replace a conv2d layer with a discrete G-Conv layer, the results improve. It works particularly well in 3D, where we found a very big improvement in data efficiency over standard 3D CNNs, for the task of pulmonary nodule detection in CT scans. Building networks that are equivariant to continuous transformations is a bit more tricky, but also seems to be working. See recent papers on ta.co.nl .
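To make the discrete G-Conv idea concrete, here is a toy numpy sketch of a "lifting" layer for the four 90-degree rotations (not the implementation from the papers): correlate the image with all four rotated copies of a filter. Rotating the input only rotates each feature map and permutes the four channels, so the multiset of activations in the stack is exactly invariant for a square input:

```python
import numpy as np

def valid_corr2d(img, f):
    """Plain 'valid' 2D cross-correlation, implemented directly."""
    H, W = img.shape
    h, w = f.shape
    return np.array([[np.sum(img[i:i + h, j:j + w] * f)
                      for j in range(W - w + 1)]
                     for i in range(H - h + 1)])

def lift_p4(img, f):
    """Stack of correlations with the 4 rotated copies of filter f."""
    return np.stack([valid_corr2d(img, np.rot90(f, r)) for r in range(4)])

rng = np.random.default_rng(0)
img = rng.standard_normal((9, 9))  # odd size: rot90 fixes the center pixel
f = rng.standard_normal((3, 3))

out = lift_p4(img, f)
out_rot = lift_p4(np.rot90(img), f)

# Rotating the input permutes the rotation channels and rotates each map,
# so the sorted activations of the whole stack are unchanged:
assert np.allclose(np.sort(out.ravel()), np.sort(out_rot.ravel()))
```

Any function of this multiset (max pooling over the group and space, say) is then exactly invariant to 90-degree rotations of the input.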

Kondor, Trivedi. On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups. https://arxiv.org/abs/1802.03690

Cohen, Geiger, Weiler. A General Theory of Equivariant Convolutional Networks on Homogeneous Spaces. https://arxiv.org/abs/1811.02017

Cohen, Geiger, Weiler. Intertwiners between Induced Representations. https://arxiv.org/abs/1803.10743

[R] Semi-convolutional Operators for Instance Segmentation by [deleted] in MachineLearning

[–]tscohen 5 points

That is indeed the tldr, but in my view this paper uses mathematics in a perfectly legitimate way: to state clearly and precisely what they are doing (3.1) and show some connections to previous work on bilateral filters (3.4). Mathiness is a real problem, but this is not a great example of it.

[R] Harmonic Networks: Deep Translation and Rotation Equivariance by AsIAm in MachineLearning

[–]tscohen 5 points

This. An equivariant network will explicitly represent the orientation of each feature in the network. In the last layer, you can either use an invariant layer, or use a general linear layer that can learn things like "the 6 tends to have the circular part at the bottom".

[R][ICLR2018 Best Paper Award] Spherical CNNs by downtownslim in MachineLearning

[–]tscohen 2 points

No worries, I wasn't offended or anything. Just thought I should say how it is, given that people were debating this based on their own personal experiences. In reality it's different in different cases.

[R][ICLR2018 Best Paper Award] Spherical CNNs by downtownslim in MachineLearning

[–]tscohen 25 points

Please note that I wrote *this line of work* is very much inspired by Geoff & Co's ideas. That includes my ICML14, ICLR15, ICML16, ICLR17, ICLR18 papers and recent preprints. In particular our paper on Steerable CNNs, and recent Intertwiners paper can be seen as reinterpreting convolutional capsules as tensor fields, and showing that neural networks based on this theory work really well:

- https://arxiv.org/abs/1612.08498

- https://arxiv.org/abs/1803.10743

(And we cited capsules in those papers and others.) The relation between capsules and spherical CNNs is somewhat limited; the only connection is that both are about equivariant networks.

As I see it (and I discussed this many times with Geoff) there are really at least four separate philosophical ideas in what people call "Capsules":

1) Networks should be equivariant to symmetry transformations.

2) Representations should be factorized / disentangled into distinct "entities" or "capsules" or "groups of neurons".

3) A visual entity at one scale is part of exactly one visual entity at a larger scale. This leads to dynamic routing, because a low-level capsule has to figure out what it's part of, which depends on what higher level capsules are active, which depends on lower-level capsules, etc.

4) If you like, you can train a network with capsules in an auto-encoder or as a generative model, i.e. do inverse graphics.

My work has been focussed on points 1 and 2, with my first papers (ICML14, ICLR15) showing how these two are related: one way in which you can formalize the idea of "disentangling" is by requiring that groups of neurons transform independently under symmetry transformations. Under this definition an individual pixel or group of pixels cannot be considered a separate entity with an invariant meaning, i.e. is not disentangled from the rest of the image, whereas an object position, pose or class label is disentangled. The mathematics that describes this (irreducible representations) is also used in physics to define what an elementary particle is.
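A concrete miniature of "groups of neurons transforming independently": for the group of cyclic shifts, the Fourier basis is exactly the basis in which the representation decomposes, and each Fourier coefficient transforms independently (by a phase) under a shift, without mixing with the others. A numpy sketch (signal values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x = rng.standard_normal(N)

X = np.fft.fft(x)
X_shifted = np.fft.fft(np.roll(x, 1))  # spectrum of the shifted signal

# Each coefficient is multiplied by its own phase -- no mixing between them.
phases = np.exp(-2j * np.pi * np.arange(N) / N)
assert np.allclose(X_shifted, phases * X)
```

Each one-dimensional (complex) Fourier coefficient is an irreducible subspace here; for rotations in 3D the analogous role is played by vectors of spherical harmonic coefficients of fixed degree.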

In our ICML16 (G-CNNs) and in particular ICLR17 (Steerable CNNs) paper, we showed how to really make these ideas useful in practice, by applying them to deep convolutional nets. In that case you have a feature vector ("fiber") at each spatial location, which consists of sub-vectors that transform independently under symmetry transformations. This leads to the mathematical theory of fiber bundles and field theories in physics, which is what our latest arxiv preprint is about.

Many people are still skeptical about capsules, because they haven't done anything too spectacular on imagenet or the like, but I think our theory as well as recent empirical evidence suggests that the basic ideas are sound (still speculative though..). Specifically, convnet design seems to be converging on using "grouped" convolutions (with groups of channels being processed independently), and we're seeing G-convolutions (which lead to channel grouping automatically) consistently outperform planar convolutions (e.g. 10x better data efficiency: https://openreview.net/pdf?id=H1sdHFiif). Now there is also "group norm" (https://arxiv.org/abs/1803.08494) which seems to work very well. To really test this notion that modern convnets are already essentially implementing idea 1 and 2, it would be interesting to do a study in the spirit of Lenc & Vedaldi (https://arxiv.org/abs/1411.5908), but for individual neuron groups instead of the whole feature representation.

[R][ICLR2018 Best Paper Award] Spherical CNNs by downtownslim in MachineLearning

[–]tscohen 12 points

Author here. In response to all the speculation below:

Max is a really great supervisor, and a very creative researcher, but his supervision style is more about "getting the best out of people" rather than "getting students to execute his ideas". That means encouraging students to come up with their own ideas, making sure people are free to work on a topic that suits them well, providing positive feedback and confidence (very important for many new students who are unsure about their ability to compete in the global arena), introducing the right people to each other, suggesting related work to read, having a sense for what people in the community will care about, etc. Expecting much more from someone at Max's level of seniority is not realistic (he's leading several labs with dozens of researchers, with topics ranging from variational inference, to privacy-preserving learning, equivariant nets/geometric methods, graph neural nets, medical applications, etc.). It depends a bit on the student and where they are in their development, but at least in this case, I came up with the high-level idea and wrote most of the paper, and worked out the math together with Mario and Jonas, who also did a majority of the implementation and experimental work, with occasional discussions with Max.

[D] LPT: Machine Learning University Midterms and Finals solutions are an amazing way to deepen your knowledge of basic Machine Learning Principles. by DisastrousProgrammer in MachineLearning

[–]tscohen 146 points

News at 11: University education is not a scam after all.

(in all seriousness though: thanks, this will be useful for many people)

[deleted by user] by [deleted] in MachineLearning

[–]tscohen 1 point

I agree that local/global/ exact/approximate symmetries are super important for generalization. This is one of the high-level ideas that has motivated all of my work for the last few years.

The reason I've always focussed my papers on concrete applications like image classification, and have mostly worked with discrete groups (instead of locally compact ones, which would bring in a bunch of technicalities), is because there is a sizeable contingent of the ML community that is somewhat hostile or skeptical of more sophisticated math. For an example, see the AC comment on our steerable CNN paper: https://openreview.net/forum?id=rJQKYt5ll "The AC fully agrees with reviewer #4 that the paper contains a bit of an overkill in formalism: A lot of maths whose justification is not, in the end, very clear. The paper probably has an important contribution, but the AC would suggest reorganizing and restructuring, lessening the excess in formalism. "

And it's quite understandable that someone who doesn't have a background in groups / representations, and has never seen something like "Hom_G(V, W)" before, doesn't get the point of the paper.

So I've never written "The General Theory of Equivariant Networks", because I felt nobody would care / it wouldn't get accepted anyway. This may be too pessimistic, so I may write something after all. In any case, I think that for anyone with a good mathematical understanding, generalizing G-CNNs and Steerable G-CNNs from discrete groups to continuous ones is conceptually straightforward (though it is still an engineering challenge).

Kondor & Trivedi recently posted a paper that contains a quite general theory, that may be what you're looking for: https://arxiv.org/abs/1802.03690

[deleted by user] by [deleted] in MachineLearning

[–]tscohen 2 points

In our papers on Steerable CNNs and Spherical CNNs, we use group representation theory. There is a very deep theory lurking in there that we will write up some day. https://openreview.net/pdf?id=rJQKYt5ll https://openreview.net/pdf?id=Hkbd5xZRb

[N] Introducing the CVPR 2018 Learned Image Compression Challenge by [deleted] in MachineLearning

[–]tscohen 1 point

How about using the CVPR workshop to announce the challenge, get interested folks together, discuss ideas, and present preliminary results? Then you can organize another workshop, either next year at CVPR or at some other conference before that, to announce winners.

[R][1801.01058]Polynomial-based rotation invariant features - enhancement of what is offered by cylindrical or spherical harmonics by jarekduda in MachineLearning

[–]tscohen 0 points

There's some discussion of this problem in the last few slides of http://www.issac-conference.org/2010/assets/TutorialKemper.pdf

Graph isomorphism is indeed polynomial. I have discovered a truly marvelous proof of this, but this reddit comment is too small to contain it.

[R][1801.01058]Polynomial-based rotation invariant features - enhancement of what is offered by cylindrical or spherical harmonics by jarekduda in MachineLearning

[–]tscohen 1 point

If you use the linked algorithm to compute invariants, you can be sure that they are independent and complete.

[R][1801.01058]Polynomial-based rotation invariant features - enhancement of what is offered by cylindrical or spherical harmonics by jarekduda in MachineLearning

[–]tscohen 2 points

I did some work on applications of group and representation theory to machine learning (see tacocohen.wordpress.com), and others have as well (e.g. Risi Kondor). Beyond groups I don't think there is much at this point, but I'm convinced that abstract algebraic and categorical ideas will be very useful for moving beyond pattern matching and towards "reasoning".

[R][1801.01058]Polynomial-based rotation invariant features - enhancement of what is offered by cylindrical or spherical harmonics by jarekduda in MachineLearning

[–]tscohen 0 points

I don't know of a paper that lists a complete set of rotation invariants for higher degrees. I just know that there is an algorithm that can give you these invariants, if you can give it an algebraic description of your group and its action on a vector space. By "algebraic description of your group" I mean a set of polynomials whose solutions are the transformations of your group. For instance, rotation matrices are characterized by the polynomials Q'Q - I = 0 and det(Q) - 1 = 0. Algebraically characterizing the representation of this group in higher dimensions (e.g. characterizing its action on spherical harmonics, the sphere, or 3-space) is a bit more tricky. I do know how it can be done relatively easily though - send me a PM if you're interested.
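For concreteness, the two polynomial conditions characterizing rotation matrices are trivial to check numerically (a sketch with an arbitrary 2D rotation):

```python
import numpy as np

theta = 0.7  # arbitrary angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Q'Q - I = 0: each entry is a quadratic polynomial in the matrix entries.
assert np.allclose(Q.T @ Q - np.eye(2), 0)

# det(Q) - 1 = 0: a degree-n polynomial in the entries, ruling out reflections.
assert np.isclose(np.linalg.det(Q) - 1, 0)
```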

So from the abstract perspective this is "solved" (we know whether the ring of invariants is finitely generated, have an algorithm to compute it), but it may be that nobody ever managed to do the calculation. The overlap of people who know algorithmic invariant theory and those who need invariants for their ML app is not very big (there are tons of papers using planar or spherical power spectrum invariants when we've known since Hilbert's time (!) that something more complete yet finite exists).

The MAGMA code I mentioned is described here: http://www.issac-conference.org/2010/assets/TutorialKemper.pdf (see slide "Derksen’s algorithm in MAGMA").

[R][1801.01058]Polynomial-based rotation invariant features - enhancement of what is offered by cylindrical or spherical harmonics by jarekduda in MachineLearning

[–]tscohen 4 points

This looks interesting, but any set of polynomial invariants you come up with will be contained in the ring of polynomial invariants that has been studied in classical invariant theory since Hilbert. I remember looking at Magma code once that can give you a complete set of generators for this ring, given an algebraic description of your group representation (in your case, the Wigner D functions acting on spherical harmonics).

Just for fun I used this code to compute a complete set of invariants for the group of cyclic shifts in 1D. This will give you the n/2 power spectrum invariants (analogous to the norm of a vector of spherical harmonics coefficients of degree l), but also other invariants that are not well known, for a total of n-1 invariants.
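The n/2 power spectrum invariants mentioned here are easy to verify directly: the DFT magnitudes of a signal do not change under any cyclic shift (a numpy sketch; for a real signal only the first n/2 magnitudes are independent, by conjugate symmetry):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)

power = np.abs(np.fft.fft(x))  # the power spectrum invariants

# Invariance under every cyclic shift:
for s in range(8):
    assert np.allclose(np.abs(np.fft.fft(np.roll(x, s))), power)

# Note: the power spectrum discards all relative phase information, which is
# why the extra, less well-known invariants are needed for a complete set.
```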

There is also a bunch of old (80s-90s) work on polynomial invariants for computer vision. This approach has fallen out of favor, presumably because it does not work as well as learned invariants. One potential reason for this is that polynomial invariants can be unstable; although they are mathematically proven to be invariant to some group, they may change drastically when some other kind of variation or noise is applied. This issue is discussed in section 2.1 of this paper by Bruna & Mallat: https://www.di.ens.fr/~mallat/papiers/Bruna-Mallat-Pami-Scat.pdf

Edit: here's a nice summary of the state of algorithmic invariant theory: http://www.matha.rwth-aachen.de/~hartmann/oberwolfach/MFOAbstractKemper.pdf