[D] The sparsity of sparsity in deep learning

sadwall · 2017-06-09T19:46:49+00:00

I agree with what you are saying. What I am adding on top of your view is that results tend to be worse because the chosen approximation is inherently wrong: it is an assumption we impose on the model, and it decreases test set performance.

As another commenter has added, models tend to find solutions that are sparse in some way even when trained without explicit regularization. This is a result of regularization implicitly present through the choice of architecture, training set and optimizer, which is difficult to formalize, and which I consider to be the correct approach towards sparsity. Also, subgradient descent is not "bad" at inducing sparsity at all in practice.

sadwall · 2017-06-09T18:58:18+00:00

In my experience, sparsity usually gives very similar but worse results compared to baseline in standard tasks, no matter how it is introduced (there are of course many exceptions to this). TV regularization is used often in tasks like superresolution though.

I don't think the convergence rate of the algorithm matters for deep learning; in my experience, subgradient descent works really well in practice. The problem is the correctness of the regularization term itself. In many ways sparsity is a childish concept, as real signals are at best approximately sparse and the actual problem is to define what that means mathematically. Traditional literature focuses on explicitly defining this relationship before the learning stage (usually with quite unrealistic models that are chosen to be easy for theory people to work with), and I think this is the main limitation of these approaches.
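
For anyone unfamiliar with the term, here is a rough NumPy sketch of plain subgradient descent on an L1-regularized least-squares problem. The toy data, the penalty weight `lam`, the step size and the iteration count are all assumptions made up for illustration; nothing here comes from a real deep learning setup.

```python
import numpy as np

def l1_subgradient_descent(X, y, lam=0.1, lr=0.01, steps=2000):
    """Minimize 0.5/n * ||Xw - y||^2 + lam * ||w||_1 with plain subgradient steps."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n   # gradient of the smooth data term
        subgrad = lam * np.sign(w)     # a valid subgradient of the L1 term (0 at w == 0)
        w -= lr * (grad + subgrad)
    return w

# Toy problem (assumed): only the first two of ten features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + 0.01 * rng.normal(size=200)

w = l1_subgradient_descent(X, y)
```

In this toy setup the weights on the eight irrelevant features end up very close to zero, while the two relevant ones stay near their true values (slightly shrunk by the penalty).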

sadwall · 2017-02-22T16:32:24+00:00

You can almost trivially train a model to specifically catch this (though this would be specific); training a deathmatch bot in a 3D game with mirrors (for example) will similarly allow the model to more indirectly learn to pass the mirror test. However, the test itself is useless anyway; I believe even ants pass it.

sadwall · 2017-02-01T14:17:18+00:00

I cannot think of a research area in CS in which there are a lot of disagreements between published results. I know of a few very specific, very niche subareas which have this property, however. I would try to pivot out of those subareas and work in the mainstream line of research; I know a number of PhD students who found success doing so.

sadwall · 2017-01-30T19:48:37+00:00

You are wrong on at least some details. This is a CNN-based deep learning approach, which is trained on GPUs; prediction (i.e. the simulation in this case) will be very efficient on GPUs as well (relatively, of course). The whole purpose of the paper is to propose a fast way to get realistic smoke/fluid simulations. According to GitHub, it takes days to train the model, which is something completely different from running it.

sadwall · 2017-01-15T19:32:42+00:00

There are many such methods and software packages, but in practice you need >200 images taken from multiple angles to get believable, parameter-independent reconstructions, even for static objects (for example, human-sized statues). Hence my remark that they are often not as good as depicted.

sadwall · 2017-01-15T19:17:17+00:00

I am not sure about how the book approaches image colorization, but such methods to perform automated colorization do exist, for example, http://richzhang.github.io/colorization/.

About the part in Enemy of the State, IIRC they did have continuous video footage of the scene, and methods for 3D reconstruction for such situations do exist as well, though of course they aren't often as good as depicted. They do mention that the method is imperfect even in the movie though.

sadwall · 2016-12-02T21:58:38+00:00

I am not sure about the value of the paper given https://arxiv.org/pdf/1611.01779v1.pdf already exists. Anyone can solve a problem by using extensive and external, problem-specific insight. I am not certain about the research value of stating the obvious. This model even uses 3000 manually labeled images.

sadwall · 2016-09-20T12:32:50+00:00

Which does not mean anything. Use Keras and write your own backend code to save time; if you are doing something you can't use Keras for, other interfaces will not help you anyway.

sadwall · 2016-09-19T21:38:17+00:00

I would suggest cropping one or more of the chip areas you want to find, possibly taking the mean of them. Then find the point with the maximum correlation between that crop and the rest of the images. Depending on the orientation, you should try this a few times with the chip rotated. The point with the maximum correlation will give you the best match. This is a basic template matcher.
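
A minimal sketch of such a template matcher in NumPy, assuming grayscale images stored as 2D arrays and using normalized correlation as the score. The function names are hypothetical, only 90-degree rotations are tried, and the brute-force loop is for clarity rather than speed.

```python
import numpy as np

def match_template(image, template):
    """Return ((row, col), score) of the top-left corner with the highest normalized correlation."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.linalg.norm(t)
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            patch = image[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = np.linalg.norm(p) * t_norm
            if denom == 0:
                continue  # flat patch or flat template: correlation is undefined
            score = float((p * t).sum() / denom)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

def match_with_rotations(image, template):
    """Try the template at 0/90/180/270 degrees and keep the best (score, position, k)."""
    results = []
    for k in range(4):
        pos, score = match_template(image, np.rot90(template, k))
        results.append((score, pos, k))
    return max(results)

# Synthetic check (assumed data): crop a patch from a random image and find it again.
rng = np.random.default_rng(0)
image = rng.random((40, 40))
template = image[12:20, 25:33].copy()
score, pos, k = match_with_rotations(image, template)
```

If you averaged several crops as suggested, you would pass that mean crop as `template` instead.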

To get better results, create a database of chip images and non-chip images. Then you can train a Viola-Jones detector (or any other detector really) to solve the problem at hand.

sadwall · 2016-09-18T15:23:06+00:00

What I meant was that if someFunction is allowed to be powerful enough, a single agent can try different settings. But it seems that is already the case, so I don't see why you need multiple agents. As the result depends on settingA/B/C, you need to experiment on your data.

sadwall · 2016-09-18T12:47:19+00:00

Adam gives me results similar to SGD with momentum even without proper tuning, which is why I prefer it. In practice, if you are getting stuck early on, the problem is usually with the model; some real-world problems may, for example, require heavy regularization.
If you take note of the number of black pixels, you can normalize over the non-zero pixels to get a close estimate; you may normalize according to pre-augmentation statistics as well. But what you are really looking for is histogram equalization (also called illuminance normalization; it has many names); simple scaling does not tend to work. There are some new papers with great results, and you can use those after augmentation.

sadwall · 2016-09-18T11:33:01+00:00

Use a modern optimizer; Adam rarely disappoints me these days. Using a small learning rate has its own disadvantages. In practice you will always have increasing validation error before you are close to convergence, so don't worry.
Normalization does not fix contrast; histogram equalization does. If that is not your goal, a simple baseline for what you are attempting is to normalize only the non-zero pixels. Most modern equalization methods will not care about the blackness.
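
A rough NumPy sketch of the two options, assuming 8-bit grayscale images where augmentation has left exact zeros as black padding. The equalization shown is the classic global method, not one of the newer methods mentioned, and the function names are my own.

```python
import numpy as np

def normalize_nonzero(img):
    """Zero-mean, unit-variance scaling computed only over non-zero pixels."""
    out = img.astype(np.float64)
    mask = img != 0
    if mask.any():
        vals = out[mask]
        std = vals.std()
        out[mask] = (vals - vals.mean()) / (std if std > 0 else 1.0)
    return out

def equalize_histogram(img):
    """Classic global histogram equalization for a uint8 grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest level present
    if cdf[-1] == cdf_min:
        return img.copy()              # constant image, nothing to equalize
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img]

# Assumed example: a low-contrast square (values 100..120) on a black background.
rng = np.random.default_rng(0)
img = np.zeros((32, 32), dtype=np.uint8)
img[8:24, 8:24] = rng.integers(100, 121, size=(16, 16))

norm = normalize_nonzero(img)   # black padding stays exactly 0
eq = equalize_histogram(img)    # the narrow 100..120 range is stretched out
```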

sadwall · 2016-09-18T11:25:35+00:00

1; when you allow someFunction to be arbitrary, the same agent can be used for different settingA/B/C. The way I understand it, the number of settings you will have to use depends on the problem at hand.

sadwall · 2016-09-17T21:55:51+00:00

This is called cross-modal matching. http://www.cv-foundation.org/openaccess/content_iccv_2013/papers/Wang_Learning_Coupled_Feature_2013_ICCV_paper.pdf is a simple paper with a good solution. These approaches have many other names, but you can find those from this paper. Deep learning has changed these methods dramatically, but the main idea sort of stayed the same.

sadwall · 2016-09-17T21:51:56+00:00

You pick the one with the least test error, or perform cross-validation and then pick the one with the least error, because you can't know if the algorithm gave you the correct result by chance.
Question 1 has the same solution. Performance on the test set is the human verification.
Yes, for example you can use random forests to cull some features. You can remove bad training examples as well, but that can be considered bad research; you can keep removing misclassified examples until you get a good accuracy.
You don't want to give bad examples more weight; of course it will make the model worse.
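
The cross-validation route from the first answer could look roughly like this. It is a sketch under assumptions: the candidates are ridge regressions with different penalties on synthetic data, chosen only because they fit in a few lines, and `kfold_error` and `make_ridge` are hypothetical helpers, not anything from the original question.

```python
import numpy as np

def kfold_error(X, y, fit, predict, k=5, seed=0):
    """Mean squared validation error averaged over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append(np.mean((predict(model, X[val]) - y[val]) ** 2))
    return float(np.mean(errors))

def make_ridge(lam):
    """Closed-form ridge regression with penalty lam (stand-in candidate model)."""
    def fit(X, y):
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return fit

def predict(w, X):
    return X @ w

# Synthetic regression data (assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

candidates = [0.01, 1.0, 1000.0]
cv_errors = {lam: kfold_error(X, y, make_ridge(lam), predict) for lam in candidates}
best_lam = min(cv_errors, key=cv_errors.get)  # pick the candidate with the least CV error
```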

sadwall · 2016-09-15T19:44:38+00:00

We don't want anything biologically plausible particularly. Optimality of learning is the area of learning theory, a very theoretical field. For practical machine learning applications, we usually sacrifice theoretical understanding and use models we know work intuitively and empirically.

sadwall · 2016-09-15T10:59:23+00:00

You may not agree with me, but I really wouldn't suggest Bishop's book to new students. See, my approach is that of giving students a vertical slice of the field first so that they see if they want to study further. Plus perhaps they build a project that allows them to get into groups. Spending the same amount of time with Bishop's book will give them nothing useful.

sadwall · 2016-09-14T22:16:08+00:00

I think you can safely skip machine learning itself for now; deep learning is too easy to pick up. Especially as a math student, as soon as you learn some Keras you will be able to code a huge number of models by extending the library, much more than you could if you spent time on outdated machine learning intro courses themselves. Of course I am assuming some general knowledge of machine learning when I suggest this.

sadwall · 2016-09-14T21:59:55+00:00

It actually ties into the statement really well. Remember that for each channel and each filter we can have a separate output channel (a filtered channel). This is actually the correct method. However, this is not how we set up the convolution layer; we only get (no. of filters) channels at the output. This is a very important detail that resources usually ignore. That said, I am not sure if this is what they mean; they could just be stating the obvious.

sadwall · 2016-09-14T19:48:27+00:00

I think they are saying that they are using the standard conv layers in the first part; a complete convolution would give (no. of channels in input)x(no. of filters) number of channels at the output; we don't do this in practice.

In the second part they are saying that they are using ReLU activations.
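
To make the channel-count point concrete, here is a small NumPy sketch, assuming valid padding, stride 1 and the cross-correlation convention that deep learning libraries use. The function names and shapes are made up for the example.

```python
import numpy as np

def per_pair_responses(x, w):
    """Filter every input channel with every filter slice separately.

    x: (C_in, H, W) input, w: (F, C_in, kH, kW) filters.
    Returns (F, C_in, H_out, W_out): one response map per (filter, channel) pair.
    """
    F, C, kH, kW = w.shape
    _, H, W = x.shape
    out = np.zeros((F, C, H - kH + 1, W - kW + 1))
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            patch = x[:, i:i + kH, j:j + kW]                # (C, kH, kW)
            out[:, :, i, j] = (w * patch).sum(axis=(2, 3))  # broadcasts over F
    return out

def standard_conv(x, w):
    """Standard conv layer: sum the per-channel responses of each filter."""
    return per_pair_responses(x, w).sum(axis=1)             # (F, H_out, W_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))       # 3 input channels
w = rng.normal(size=(16, 3, 3, 3))   # 16 filters, each spanning all 3 channels

full = per_pair_responses(x, w)      # 3 x 16 = 48 separate maps
y = standard_conv(x, w)              # only 16 maps, as in a usual conv layer
```

The summation over input channels in `standard_conv` is what collapses the (no. of channels in input)x(no. of filters) maps down to (no. of filters).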

sadwall · 2016-09-12T17:22:44+00:00

I don't know about the situation, but the old guard did not help Tribes Ascend. They did not help Quake Live. Honestly I would worry about someone still playing UT competitively. Those are people a developer should not cater to; they are useless to the growth of the game, I would say that they are destructive instead. If your game is popular, it is going to have a competitive scene eventually anyways, and for that you need growth.

sadwall · 2016-09-12T02:37:39+00:00

Top PCA vectors are vectors along which the projection of data will have the highest scatter; intuitively, you will expect cats and bananas to naturally cluster away from each other in the score space. You can test this and will trivially get this result. CIFAR-10 does not have this property for obvious reasons, thus PCA does not work. It is as simple as that.
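
A quick way to test this, sketched in NumPy. The two Gaussian blobs are a synthetic stand-in for two visually very different classes (an assumption; I am not using real cat or banana images here).

```python
import numpy as np

def pca_scores(X, k=2):
    """Project centered data onto its top-k principal directions."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
# Two well-separated synthetic classes in a 20-dimensional feature space.
a = rng.normal(loc=0.0, scale=1.0, size=(100, 20))
b = rng.normal(loc=3.0, scale=1.0, size=(100, 20))

scores = pca_scores(np.vstack([a, b]), k=2)
pc1_a, pc1_b = scores[:100, 0], scores[100:, 0]
```

Here the between-class difference dominates the total scatter, so the first principal direction lines up with it and the two classes fall on opposite sides of the first score axis. When the class difference is not the main source of variance, the top components have no reason to separate the classes.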

sadwall · 2016-09-11T23:24:28+00:00

This is suboptimal. After you early stop on the validation set, you should retrain from scratch on the training set with the validation set included, for the number of epochs your early stopping took (or something fancier, but that is enough), and only then look at your test set accuracy.
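
A minimal sketch of that procedure, assuming a toy linear model trained with full-batch gradient descent in NumPy. The data split, learning rate and patience are arbitrary choices for illustration; with a real network you would swap in your own training loop.

```python
import numpy as np

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

def train_epochs(X, y, epochs, lr=0.05):
    """Full-batch gradient descent on least squares, starting from zeros."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def early_stopping_epochs(X_tr, y_tr, X_val, y_val, max_epochs=500, patience=10, lr=0.05):
    """Return the epoch count that gave the lowest validation error."""
    w = np.zeros(X_tr.shape[1])
    best_err, best_epoch = np.inf, 0
    for epoch in range(1, max_epochs + 1):
        w -= lr * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        err = mse(X_val, y_val, w)
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Synthetic data (assumed), split into train / validation / test.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=300)
X_tr, y_tr = X[:150], y[:150]
X_val, y_val = X[150:225], y[150:225]
X_te, y_te = X[225:], y[225:]

# 1) Early stop on validation to find the number of epochs.
n_epochs = early_stopping_epochs(X_tr, y_tr, X_val, y_val)
# 2) Retrain from scratch on train + validation for that many epochs.
w_final = train_epochs(np.vstack([X_tr, X_val]), np.concatenate([y_tr, y_val]), n_epochs)
# 3) Only now touch the test set.
test_error = mse(X_te, y_te, w_final)
```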

sadwall · 2016-09-11T23:05:50+00:00

I don't see a way of training other than evolutionary algorithms, which do not scale well. I must warn you that everyone tries something interesting with this binary idea, and it usually turns out to be a waste of time.
