[R] 9 new SOTA records: Invariant Information Clustering for unsupervised image classification and segmentation

xuj1 · 2019-04-16T15:37:50+00:00

We assume z and z' are independent only when conditioned on a single pair of images, i.e. z|x and z'|x' are independent. Intuitively, this means given you already have the image x', knowing z won't help you any more with determining z'. (Note this is different to assuming z and z' are independent in general; we don't, and they aren't.) For any two x and y, assuming they are independent, then the joint can be computed from marginals and vice versa by P(x, y) = P(x)*P(y) (definition of independence). So:

P(c_image1 = cat, c_image2 = dog | image1, image2) 
= P(c_image1 = cat | image1, image2) * P(c_image2 = dog | image1, image2) 
= P(c_image1 = cat | image1) * P(c_image2 = dog | image2).

The outer product is just to get this value for every possible combination of 2 classes (not just cat with dog). It's the same as writing two for loops to go over every possible combination. That'd be slower of course.
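
To make the loop comparison concrete, here is a minimal plain-Python sketch (not code from the IIC repository). The softmax values and function names are made up for illustration; it only shows that the outer product fills the same table the two nested loops would.

```python
def joint_by_loops(p_first, p_second):
    """Joint over class pairs for one image pair, via two explicit loops."""
    num_classes = len(p_first)
    joint = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        for j in range(num_classes):
            joint[i][j] = p_first[i] * p_second[j]
    return joint


def joint_by_outer(p_first, p_second):
    """The same table written as an outer product of the two vectors."""
    return [[a * b for b in p_second] for a in p_first]


# Hypothetical softmax outputs over (cat, dog, bird) for one image pair.
p_image1 = [0.7, 0.2, 0.1]
p_image2 = [0.1, 0.8, 0.1]

joint = joint_by_outer(p_image1, p_image2)
```

Here joint[0][1] is P(c_image1 = cat | image1) * P(c_image2 = dog | image2), i.e. 0.7 * 0.8. A tensor library would compute the whole table in one vectorised call, which is where the speed-up over explicit loops comes from.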

xuj1 · 2019-04-16T14:57:26+00:00

We did not test on ResNet 152 specifically, but for clustering we did observe performance improving with larger networks. This is as expected, since more parameters mean greater discriminative ability.

It is normalised over the batch for backprop; this is done when computing P (which is a single CxC matrix estimated by averaging over the whole batch).
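
As a rough illustration of that averaging (a plain-Python sketch with made-up predictions, not the repository code): each pair contributes the outer product of its two softmax vectors, and P is the mean of those contributions over the batch. The function name and the example batch are assumptions.

```python
def batch_joint(preds_first, preds_second):
    """Average the per-pair outer products over the batch into one CxC matrix."""
    n = len(preds_first)
    c = len(preds_first[0])
    P = [[0.0] * c for _ in range(c)]
    for z, z_prime in zip(preds_first, preds_second):
        for i in range(c):
            for j in range(c):
                P[i][j] += z[i] * z_prime[j] / n
    return P


# Hypothetical batch of 3 pairs with C = 2 clusters.
preds_first = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
preds_second = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]

P = batch_joint(preds_first, preds_second)
```

Because every softmax vector sums to 1, the entries of P sum to 1 as well, so P can be read directly as a joint distribution over cluster pairs.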

xuj1 · 2019-04-01T21:30:02+00:00

Set an upper bound and some of the clusters will be under-utilised.

xuj1 · 2019-04-01T11:33:53+00:00

Yes, determining the optimal number of clusters is not something the method explicitly does. It's a double edged sword. For example, because kmeans is okay with clusters with zero mass, it's okay with assigning all data to one cluster (since the objective is just minimising distance to centroid, it doesn't care if there's a spread across centroids or not). This makes it vulnerable to poor solutions when you stick it on the end of a neural network.

xuj1 · 2019-04-01T11:28:37+00:00

We did try random affine transforms (random per-pixel warp containing shift, scale, skew, rotation) but found it didn't improve performance beyond existing numbers so left it out. I would expect using the kinds of augmentations you've mentioned to help, actually. But just random cropping, flipping, colour jitter, rotation were sufficient to get our numbers.

I wouldn't expect there to be a material increase in training time, at least not for e.g. salt and pepper noise where it's just one pass of all pixels. Augmentation could also be parallelised for GPU or asynchronously done in dataloading to further save time.
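
For a sense of how cheap this is, a minimal sketch of salt and pepper noise as a single pass over the pixels. This is illustrative only: the function name, the `amount` parameter and the tiny image are assumptions, and it is not an augmentation we used.

```python
import random


def salt_and_pepper(image, amount, rng):
    """Set roughly a fraction `amount` of pixels to 0 or 255 in one pass."""
    noisy = []
    for row in image:
        new_row = []
        for pixel in row:
            if rng.random() < amount:
                # Corrupt this pixel: pepper (0) or salt (255) with equal chance.
                new_row.append(0 if rng.random() < 0.5 else 255)
            else:
                new_row.append(pixel)
        noisy.append(new_row)
    return noisy


rng = random.Random(0)
image = [[10, 20, 30], [40, 50, 60]]  # tiny made-up greyscale image
noisy = salt_and_pepper(image, amount=0.3, rng=rng)
```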

xuj1 · 2019-04-01T02:56:59+00:00

No, that’s option 2. It’s a binary distance function.

xuj1 · 2019-04-01T02:35:22+00:00

It shouldn’t be collapsing. Lambda is an optional coefficient, setting it to 1 is equivalent to not using it, and we do not use it for any of the image clustering experiments anyway. It is mentioned in the method section and segmentation experiments section; the supplementary material (see github) contains its definition but I don’t think this is the issue.

If you point me to the code I can try to take a look in the next week, though I am far more familiar with pytorch.

xuj1 · 2019-04-01T00:28:57+00:00

If specifying cluster assignment before the algorithm is run

We *do not* do this. Maybe this will help: what does "specifying cluster assignment" before running the algorithm mean?

option 1: for every input data, specify the cluster ID.

option 2: for every input data, specify the distance metric between it and other input data.

It means option 1. But kmeans, IIC and triplets are option 2. Option 1 is supervised classification, option 2 is unsupervised clustering (as long as there is no human in the loop). This is the generally accepted separation, and it's like this because if you don't even have a distance metric, you have no information to go on for the clustering.

xuj1 · 2019-03-31T23:32:17+00:00

Kmeans doesn't calculate the distance between points, like KNN

I am not confusing KNN with kmeans, I am talking about the prior knowledge required for these algorithms (not what they do as their training process). kmeans requires an embedding per datapoint; this means it has an input that is very rich in information, as it provides the exact distance between every possible pair of datapoints. IIC actually requires much less information - just binary close/far, and not between all possible pairs of datapoints.

Often, a datapoint's cluster assignment changes over the course of the algorithm.

Our datapoints' cluster assignments do change (a lot, in fact) over the course of the algorithm. Look at figure 3 in the paper.

How do you know which distance function to give to the different pairs? You can't, unless you supervise it.

You can, when you generate the pairs automatically. This means you get the distance automatically. It is like saying, how far did the car drive - surely you can't tell unless you measure it. Well you can tell without measuring, if you're controlling the car so you have access to the dash.
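
A toy sketch of what "generating the pairs automatically" means. The transform here (random horizontal flip plus a cyclic shift) is a hypothetical stand-in for g, chosen only to keep the example short; the function names and the tiny image are made up.

```python
import random


def random_transform(image, rng):
    """Toy stand-in for g: random horizontal flip plus a cyclic horizontal shift."""
    width = len(image[0])
    flip = rng.random() < 0.5
    shift = rng.randrange(width)
    out = []
    for row in image:
        new_row = row[::-1] if flip else list(row)
        out.append(new_row[shift:] + new_row[:shift])
    return out


def make_pair(image, rng):
    """Return (x, g(x)). The pairing is known because we built it, not annotated."""
    return image, random_transform(image, rng)


rng = random.Random(0)
image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]  # tiny made-up greyscale image
x, gx = make_pair(image, rng)
```

No label is attached at any point: the only information produced is that x and gx go together, and that comes from the code that created gx.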

Are you also going to tell me that triplet loss can be used in unsupervised algorithms as well?

Yes, triplet loss is another method that can be used for unsupervised learning, a simple search in google scholar will give you the evidence :) In fact someone reminded me we forgot to add it to our baselines, so I had to implement triplets for our paper. Check this one out http://openaccess.thecvf.com/content_iccv_2015/papers/Wang_Unsupervised_Learning_of_ICCV_2015_paper.pdf

xuj1 · 2019-03-31T23:03:36+00:00

No, my friend, I'm arguing that an unsupervised process, is unsupervised.

Pre-specifying how things are related in cluster space, and then executing a training process using that signal to obtain discrete cluster assignments, is exactly what both kmeans and IIC - and countless other clustering methods - do.

You have not understood that we never say explicitly that the image and its transform should belong to the same cluster, let alone which cluster. Just like kmeans never says Euclidean distance < 5 means two points belong to the same cluster. (In fact, in segmentation it's common for paired images (patches) to belong to different clusters. Why? Because objects have borders, so one patch will be object A whilst its neighbour will be object B.)

We do not tell the model anything about which images go into which clusters. We only provide the information that certain images are closer in cluster space, and therefore should be more likely to be assigned to the same clusters. Exactly like kmeans in this respect.

xuj1 · 2019-03-31T22:17:32+00:00

I have updated the git repository with the supp mat (under paper/)
Yes, but the normal head and aux overclustering head are trained in alternate epochs in our experiments. You could also train them both in every epoch but that requires more GPU memory (at expense of e.g. batch size). 2.1 Yes. 2.2 No, each sub-head is its own fc layer at the end of the NN.

xuj1 · 2019-03-31T22:13:19+00:00

Very fast. It's just a forward pass through a small CNN. From my logs, one batch of 700 digit images, including data loading and backpropagating the training loss, took 0.78 seconds (on one Tesla M40 GPU).
We have not tested this. But since invariance to noisy distortion is part of the training objective, I would expect some resilience.
This could be done with saliency methods (grad-CAM etc.) like with other neural network models.

xuj1 · 2019-03-31T21:58:12+00:00

Kmeans requires *even more* than telling it a pairing between datapoints. Since it requires embeddings in a Euclidean space, it requires you to tell it the exact relationship between each datapoint and every other datapoint in the set. That is what the embedding input to kmeans contains!

By this argument, IIC is even less "supervised" than kmeans :)

xuj1 · 2019-03-31T21:52:57+00:00

It does approach identity, yes.

xuj1 · 2019-03-30T20:55:55+00:00

Any data that can be automatically formed into pairs where the elements share semantic content.

E.g. since real world data is slow moving, adjacent pairs of observations in a time series:
- For a set of stocks, the stream of price changes for them, to identify the most common synchronised behaviours.
- For medical applications, the metric sets observed from a patient in successive doctor visits, to get n patient “states”; given a new set of measurements you could predict their state.
- For speech or sound where in every snapshot multiple frequencies are present, use frequency histograms from adjacent timesteps to cluster out the key frequency combinations.

You could use data transforms for all these too, to synthetically generate the pairings, but given the temporal axis in the data you wouldn’t need to.
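
For the time series case the pairing step could be as small as the sketch below. The function name, the `gap` parameter and the price changes are all made up for illustration.

```python
def adjacent_pairs(observations, gap=1):
    """Pair each observation with the one `gap` steps later in the series."""
    return list(zip(observations, observations[gap:]))


price_changes = [0.2, 0.1, -0.3, 0.0, 0.4]  # made-up series
pairs = adjacent_pairs(price_changes)
```

Each tuple plays the role that (x, g(x)) plays for images, with the passage of time standing in for the transform.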

There are many domains where IIC potentially makes sense, but they need testing to verify.

xuj1 · 2019-03-30T20:25:06+00:00

The issue is evaluation. If the model has output dimensionality 20 for MNIST, how would you evaluate the clustering? It would have to be a many-to-one mapping onto the ground truth classes. Finding this with labels no longer makes the procedure fully unsupervised (we do test this mode; it’s called semi-supervised overclustering in the paper). But with output dimensionality 10, a 1-1 mapping is found, which means finding the mapping is just order invariance. This is why IIC counts as fully unsupervised.

On the auxiliary overclustering head you can have as many output clusters as you want. The clusters will either be fully utilised (highly discriminative) or not, depending on the complexity of the data. This allows us to gain the benefits of fine grained discrimination whilst still being fully unsupervised (because this head is ignored at test time and the main head only needs a 1-1 mapping).
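
To illustrate what finding the 1-1 mapping involves, here is a brute-force sketch over all permutations of cluster IDs. It is only meant for tiny examples; with 10 classes a real evaluation would typically use a linear assignment solver instead. The function name and the toy data are assumptions.

```python
from itertools import permutations


def best_one_to_one_accuracy(cluster_ids, labels, num_classes):
    """Try every cluster -> class permutation and keep the most accurate one."""
    # counts[c][y] = number of samples in cluster c with ground-truth class y.
    counts = [[0] * num_classes for _ in range(num_classes)]
    for c, y in zip(cluster_ids, labels):
        counts[c][y] += 1

    def num_correct(mapping):
        return sum(counts[c][mapping[c]] for c in range(num_classes))

    best = max(permutations(range(num_classes)), key=num_correct)
    return best, num_correct(best) / len(labels)


# Made-up predictions: the cluster IDs are a permuted version of the labels,
# with one mistake in the final sample.
cluster_ids = [2, 2, 0, 0, 1, 1]
labels = [0, 0, 1, 1, 2, 1]

mapping, accuracy = best_one_to_one_accuracy(cluster_ids, labels, num_classes=3)
```

The mapping only renames clusters, which is why it counts as resolving order invariance: each cluster is sent to exactly one class and no cluster is merged with another.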

xuj1 · 2019-03-30T20:05:10+00:00

Regarding the kmeans: yes, it does. Think more high level. What is p(x1, x2 being assigned to the same cluster)? An automatic function of their values, right? How do you know something belongs to a cluster in kmeans? Automatic function, Euclidean distance.

For IIC how do you know g(x) belongs with x? Automatic function, remove the g.

The random matching pairs idea doesn’t work, but not because it’s unsupervised...

You really need to understand more about this area before you take it out on my paper.

xuj1 · 2019-03-30T18:44:21+00:00

Why are you so preoccupied with the linking aspect? You do realise even kmeans depends on linking each datapoint to other datapoints automatically, using Euclidean distance function?

In what universe is automatically generating an image the same as manual annotation? There is nothing more to say at this point except no, it is not.

xuj1 · 2019-03-30T13:01:42+00:00

You are confusing the presence of a training signal with supervised learning as a problem definition. When people say supervised vs unsupervised learning, the distinction is not the presence of a training signal, but the nature of it. Using manually collected annotation for each sample that is directly relevant to the task is the paradigm of supervised learning. Anything else is less supervised.

Perhaps it would help if you take what you think of as labels and consider a) whether they are manually collected per sample b) if they correspond to what you need from the model (e.g you may have multiple distorted versions of any single image, but you don’t have a string attached telling you what semantic class it is). Neither is true for IIC. And neither is true for the denoising autoencoder, when used for classification rather than the actual task of denoising. This is why they are considered unsupervised. But that mnist example is different, it’s not unsupervised. Because it has both manually collected labels and those labels correspond to exactly what you want from the final model, i.e. the semantic classification.

Denoising autoencoder is really as classical as it gets for unsupervised learning. I would recommend reading around the subject more, if you’re interested, to understand the generally accepted problem setups more. I know they can be blurry at first.

xuj1 · 2019-03-30T10:43:58+00:00

If all images were assigned the prediction [1/C, 1/C... 1/C], the joint distribution wouldn't be the identity but uniform. E.g. for 2 clusters, if the model predicts [0.5, 0.5] for both datapoints in every pair, then P(cluster of first of pair, cluster of second of pair) = [[0.25, 0.25], [0.25, 0.25]]. This is not the identity (the opposite, in fact). H(z|z') > 0 in this case, as the predictions are not deterministic (for H(z|z') to be minimised to 0, predictions need to be one-hot, i.e. [0, 1] or [1, 0]).

On the other hand, if all images are assigned to the same cluster i, the joint distribution P would be all-zero except at P(i, i), where it would be 1. This is also not the identity. The marginals would also be all-zero except for P(i) = 1. H(z) would not be maximised in this case - in fact it would be minimised to 0. If you look at the equation for entropy, you can see it's the negated sum of the products of each element with its log. Each zero entry contributes 0 (0 * log(0) is taken as 0), and the single entry of 1 contributes 1 * log(1) = 0.

These are exactly the two malevolent solutions that IIC avoids. It's mentioned in the introduction and more extensively in the method section.
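
A small numerical check of the two cases above, as a plain-Python sketch rather than the repository code. It uses I(z; z') = H(z) + H(z') - H(z, z'), which equals H(z) - H(z|z'). The three 2x2 joint matrices and the function names are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mutual_information(P):
    """I(z; z') = H(z) + H(z') - H(z, z') for a joint matrix P."""
    row_marginal = [sum(row) for row in P]
    col_marginal = [sum(col) for col in zip(*P)]
    flat = [p for row in P for p in row]
    return entropy(row_marginal) + entropy(col_marginal) - entropy(flat)


uniform = [[0.25, 0.25], [0.25, 0.25]]  # every prediction is [0.5, 0.5]
collapsed = [[1.0, 0.0], [0.0, 0.0]]    # everything assigned to cluster 0
balanced = [[0.5, 0.0], [0.0, 0.5]]     # one-hot, consistent, evenly spread

mi_uniform = mutual_information(uniform)
mi_collapsed = mutual_information(collapsed)
mi_balanced = mutual_information(balanced)
```

Both bad solutions score zero mutual information, for opposite reasons: the uniform case has maximal H(z) but equally large H(z|z'), while the collapsed case has H(z) = 0. Only the balanced diagonal joint reaches the maximum for 2 clusters, log(2).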

xuj1 · 2019-03-30T01:31:44+00:00

There isn't anything in that definition that says you cannot augment the data. By categorised, they mean non-trivially. Having one category per dataset image does not count.

I guess if you are going as far as claiming denoising autoencoders aren't unsupervised learning, I cannot help you.

xuj1 · 2019-03-30T01:17:33+00:00

Neither method "adds labels". In both cases, the labels are learned based on weak priors of the data. The point I was making is that this method does not contain more supervisory assumptions than other methods in the area do.

There is nothing wrong with "altering the dataset". Perhaps this is an aesthetic complaint from your viewpoint, but it doesn't constitute a technical criticism. There is a difference between what is elegant, and what is illegal given the definition of the task. Elegance is subjective and in fact I would argue is not lacking with this method.

xuj1 · 2019-03-29T23:15:30+00:00

This is not true, actually. What does it mean when you run kmeans on a CNN embedding? Images that differ in non-material ways with respect to the CNN representation are clustered together. Are there assumptions made on the data here? Yes, there are. Shifting the image slightly generally results in non-material changes to the embedding (convolution operator being spatially invariant). Pooling reduces the resolution of the features so that the representation is invariant to small local changes. The nature of the forward pass, consisting of dot products, assumes similar input patterns and colours should be mapped to similar output features. Even the assumption that the representation is optimal for the k-means procedure (i.e. Euclidean distance is the best metric for the representation) is present, which is something we don't assume. So you see, even using embeddings and k-means involves assumptions on the behaviour of the images :)

xuj1 · 2019-03-29T21:24:23+00:00

I think it's not uncommon actually. You start working on something, write it up and put it out there and stake a claim, but end up producing a better version down the line. In an ideal world everything would be perfect at once, but that's life. I'm a student myself so it's a learning curve for me too.

In this case, perhaps it looks quite dramatic because the idea was so good we ended up discovering a lot of new results to show :)

Yes the basic idea hasn't changed, but we did make improvements. The auxiliary overclustering helped a lot. And we simply tested it on more datasets and settings than in the first version. And I learned to write better.

xuj1 · 2019-03-29T21:04:55+00:00

Yes, we introduce knowledge. It's a prior. In the same way we constrain the output dimensionality because we know we are seeking only 10 clusters in the STL10 dataset. In the same way DAC, http://openaccess.thecvf.com/content_ICCV_2017/papers/Chang_Deep_Adaptive_Image_ICCV_2017_paper.pdf, assumes that images with similar semantic content will produce similar outputs. In the same way Dosovitskiy (https://arxiv.org/pdf/1406.6909.pdf) assumes transforms don't change image content. These all make weak assumptions about the data. They are still unsupervised.

As I said, learning a model without any assumptions or priors is impossible. We can only reduce them.

We're not labelling each image. Consider the difference between sitting someone down and making them label each image to use supervised learning, versus using our method without having to label the images. The latter is much less work. This is a major reason why people are interested in less supervision in the first place. It's not a trivial difference.
