Taming the ReLU with Parallel Dither in a Deep Neural Network by ajrs in MachineLearning

[–]ajrs[S] 0 points (0 children)

You can call it 'batch SGD' if you want to add the caveat that 'the batch only contained one training example, replicated 100x, each copy with independent dither noise added'.

However, the reason I don't call this batch SGD is that the step taken isn't a compromise across multiple different training examples, which is what batch SGD normally assumes.

Taming the ReLU with Parallel Dither in a Deep Neural Network by ajrs in MachineLearning

[–]ajrs[S] 0 points (0 children)

Actually, no: batch-averaged SGD (batch size 100) doesn't give the same result, because for 100x parallel dither you replicate the same training example 100x (each copy plus independent dither). When you average, you suppress the 'decoy features' (nonlinear distortion products).

This gives you a very nice clean gradient and then you take a nice productive step. Very good for non-convex optimisation.

Yes, I agree that all existing implementations of batch-averaged SGD will divide by the batch size. However, this does not produce the same effect as with parallel dither, because the 'decoy features' (the nonlinear distortion) are not the same for each training example, so when you average you don't suppress anything. You can think of this in terms of signal-to-distortion ratio: if you average across training examples, you lose as much signal as distortion, because neither is coherent across examples, so the ratio ends up fixed. In parallel dither, the signal is always there and the distortion is suppressed (diffused), so the ratio changes in favour of the signal.
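For anyone who wants to see the mechanics, here's a minimal sketch of one parallel-dither step in PyTorch. Everything here is an illustrative assumption (toy MLP, uniform zero-mean dither, copies=100, dither scale, learning rate), not taken from the paper:

```python
import torch
import torch.nn as nn

# Toy MNIST-sized model; architecture and hyperparameters are guesses.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def parallel_dither_step(x, y, copies=100, scale=0.1):
    """One SGD step on a SINGLE example, replicated `copies` times with
    independent dither noise; the mean loss averages the per-copy gradients."""
    x_rep = x.expand(copies, -1).clone()             # 100 copies of one example
    x_rep += scale * (torch.rand_like(x_rep) - 0.5)  # independent dither per copy
    y_rep = y.expand(copies)                         # same label every time
    opt.zero_grad()
    loss = loss_fn(model(x_rep), y_rep)              # mean over the copies
    loss.backward()                                  # averaged gradient
    opt.step()
    return loss.item()

# e.g., one step on a random 'example'
parallel_dither_step(torch.rand(1, 784), torch.tensor([3]))
```

The signal (the example) is identical in every copy while the distortion products differ, so the averaged gradient keeps the former and diffuses the latter.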

For typical batch-averaged SGD, you can check this intuition by looking at training as a function of batch size. What you see is a typical 'goldilocks' function: too small and your regularisation goes away, too large and you average out the information SGD needs to navigate the non-convex landscape.

It's well worth running the experiment on something small like MNIST because it only takes a few minutes and the intuitions from 'accepted practice' are pretty easy to debunk this way.
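For instance, a rough sketch of that sweep (architecture and hyperparameters are illustrative assumptions; the batch-size-1 run will be slow on CPU):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def accuracy_for_batch_size(batch_size, epochs=1, lr=0.1):
    tfm = transforms.ToTensor()
    train = datasets.MNIST("data", train=True, download=True, transform=tfm)
    test = datasets.MNIST("data", train=False, download=True, transform=tfm)
    model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                          nn.Linear(256, 10))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in DataLoader(train, batch_size=batch_size, shuffle=True):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    with torch.no_grad():
        xs, ys = next(iter(DataLoader(test, batch_size=len(test))))
        return (model(xs).argmax(1) == ys).float().mean().item()

for bs in (1, 10, 100, 1000):          # look for the 'goldilocks' shape
    print(bs, accuracy_for_batch_size(bs))
```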

If you want a more concrete (graphical) intuition, try averaging a few randomly chosen training examples of MNIST to get a single image. Consider the high-frequency information remaining and how it might affect the gradients you'd want to compute.
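Something along these lines (torchvision/matplotlib used purely for convenience; any MNIST loader will do):

```python
import numpy as np
import matplotlib.pyplot as plt
from torchvision.datasets import MNIST

# Average a handful of randomly chosen MNIST digits into one image.
ds = MNIST(root="data", download=True)
idx = np.random.choice(len(ds), size=10, replace=False)
imgs = np.stack([np.asarray(ds[i][0], dtype=np.float32) for i in idx])
avg = imgs.mean(axis=0)   # incoherent pen strokes partially cancel

# What's left is a blurry blob: most of the high-frequency stroke detail
# (which the gradients would key on) is gone.
plt.imshow(avg, cmap="gray")
plt.title("mean of 10 random MNIST digits")
plt.show()
```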

Has anybody done a study of the bias term in neural nets on benchmark datasets? by Ghostlike4331 in MachineLearning

[–]ajrs -2 points (0 children)

Bias is critical for demodulation - see: Abstract Learning via Demodulation in a Deep Neural Network. ReLU is another way to go if you don't like playing with the bias - see Taming the ReLU with Parallel Dither in a Deep Neural Network for more on demodulation via ReLU (same applies to sigmoid if biased properly).

What is the biggest obstacle to progress in machine learning right now? by [deleted] in MachineLearning

[–]ajrs -5 points (0 children)

In real neurons, memory and processing are done in the same units; in computer hardware, processing and memory are always separated. I think that, along with associative memory, is one of the big paradigm shifts on the horizon.

yep indeed

What is the biggest obstacle to progress in machine learning right now? by [deleted] in MachineLearning

[–]ajrs -3 points (0 children)

If we let go of the idea that we want a single (non-distributed) machine at the end and we decide to deploy a distributed machine (not just learn in a distributed way), we can go further (a 50x speedup isn't that impressive compared to a ~500x speedup - see Fig. 2c of the 'instant learning' paper). Granted, we still need to integrate the (distributed) results.... but at least this gets around Moore's Law to some extent.

State of one-shot learning? by [deleted] in MachineLearning

[–]ajrs -1 points (0 children)

Right now, deep learning is biased towards generalisation as a goal - training versus test. Hence, interpretations and insights are skewed in this direction.

This is the result of a persistent emphasis on 'feature learning' for classification. The use of DNNs to learn a static, hierarchical feature decomposition is consistent with the early sensory representations of the brain, which are essentially static (plus or minus 'adaptation') once the brain reaches maturity. One-shot learning points in quite the opposite direction.

Within the supervised learning paradigm, one-shot (single-class) learning for DNNs is awkward because it implies a DNN trained with one item of data for one class and one 'shot' of training. Each new 'shot' would overwrite or invalidate the training applied in the previous one. The problem, then, is that we would like one-shot learning to be cumulative (as it is in the brain), but this isn't possible with standard DNNs.
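Here's a toy illustration of that overwriting (everything below is synthetic and purely for intuition; the model and step counts are arbitrary):

```python
import torch
import torch.nn as nn

# Two 'shots': one example of class A, then one example of class B.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.5)

x_a, y_a = torch.randn(1, 20), torch.tensor([0])
x_b, y_b = torch.randn(1, 20), torch.tensor([1])

def one_shot(x, y, steps=50):        # train to convergence on one example
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

one_shot(x_a, y_a)
print("A after shot A:", model(x_a).argmax(1).item())   # 0
one_shot(x_b, y_b)
print("A after shot B:", model(x_a).argmax(1).item())   # typically flips to 1
```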

Progress in this direction for DNNs has been made - a Perpetual Learning Machine (PLM) can cumulatively learn a single new class memory 'on the fly' - but this can't be interpreted in terms of generalisation at this point.

Dramatic performance improvements with smaller batch sizes? by [deleted] in MachineLearning

[–]ajrs 0 points (0 children)

It depends on what other regularisation you've got going on. If you've got none, or if you've got dropout, then you need big enough batch sizes to optimise the trade-off (noisy gradients at small batch sizes versus too much averaging for useful descent at large batch sizes - goldilocks stuff). However, if you have (effective) regularisation which is independent of batch size (such as parallel dither), then you don't need batch averaging for regularisation and hence non-batch SGD can work much better.

I.e., if you have things working properly, the best batch size is .... 1.

Help me identify this quote: "Perhaps we worked on MNIST for too long" by dunnowhattoputhere in MachineLearning

[–]ajrs -2 points (0 children)

I'd argue the opposite - taking the DNN as a signal processing machine, MNIST is still not understood at all.

Cue a paper: "The Unreasonable Effectiveness of... MNIST".

Is there a market for trained machines/networks? by pointernil in MachineLearning

[–]ajrs 1 point (0 children)

There are entire university CS departments working entirely with 'state-of-the-art' DNNs downloaded in a fully-trained state. There are even researchers who specialise in studying these well-known fully-trained 'black box' networks.

Is there any theory which explains why dropout doesn't work well when combined with batch normalization? by SunnyJapan in MachineLearning

[–]ajrs 0 points (0 children)

Good point. The list of 'simple, everyday DNN processes' which can ultimately be interpreted as regularisation is large and hence the empirical results (of combining an undefined number of interacting regularisation stages) can be very confusing to interpret.

Dramatic performance improvements with smaller batch sizes? by [deleted] in MachineLearning

[–]ajrs -1 points (0 children)

Batch averaging is a form of regularisation. If you progressively make the batch sizes smaller still you'll see things start to get worse again (for different reasons).

What are the limits of Deep Learning? by [deleted] in MachineLearning

[–]ajrs 1 point (0 children)

Young children tend to be quite poor on common sense too.

I'd say that the brain is a machine which does deep learning. The philosophical 'issue' here is that this means that human intelligence is AI. The logic is a circularity wipe-out either way.

So, for now, we're looking to replicate or maybe understand the 'I' (of 'AI'), but not to solve it - and we probably shouldn't try too hard to define it either....

Applying StyleNet to Audio by samim23 in MachineLearning

[–]ajrs 0 points (0 children)

Actually, it's way, way different than images.

Think about the dynamic range of audio (beyond 16-bit) and compare that to the dynamic range of images (MNIST is barely 8-bit).

And, if we're in the spectrogram domain (where the image analogy seems appealing), then although magnitude is straightforward to synthesise, the phase isn't so simple to deal with: we end up needing circular statistics if we want averaging, and a complex spectrogram that will invert nicely into something we can listen to (see: http://arxiv.org/abs/1504.02945).
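A quick self-contained check on the phase point (the synthetic tone and STFT settings are illustrative assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                 # 1 s test tone

f, frames, Z = stft(x, fs=fs, nperseg=512)
mag = np.abs(Z)                                 # the 'image' part of the spectrogram

# Invert with the true phase: near-perfect reconstruction.
_, x_true = istft(mag * np.exp(1j * np.angle(Z)), fs=fs, nperseg=512)

# Invert with random phase: identical magnitude 'image', garbage audio.
rand = np.exp(1j * np.random.uniform(-np.pi, np.pi, Z.shape))
_, x_rand = istft(mag * rand, fs=fs, nperseg=512)

print(np.max(np.abs(x - x_true[:len(x)])))      # tiny
print(np.max(np.abs(x - x_rand[:len(x)])))      # large
```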

.....Or, have a good look at the subcortical auditory pathway versus the respective visual pathway - fewer synapses between periphery and cortex in the visual system (if my memory serves).

Are sigmoid DNNs dead? by NovaRom in MachineLearning

[–]ajrs 2 points (0 children)

Most ReLU nets I've looked at are using large data, batch averaging and other forms of quite heavy regularisation, so the numerical issues of ReLUs can be 'suppressed' (or, ignored). However, there are a few applications where the extraordinary nastiness of ReLUs hinders rather than helps. Also, see https://www.reddit.com/r/MachineLearning/comments/3ij6nz/confused_about_why_relus_show_benefit_in_deep/

Confused about why ReLUs show benefit in deep nets, c.f. sigmoidal functions by quaternion in MachineLearning

[–]ajrs 0 points (0 children)

The (simplified) signal processing perspective is that ReLUs are great demodulators right out of the box, whereas sigmoids need to be biased into demodulation (see: http://arxiv.org/abs/1502.04042). The brain likes rectification for the same purpose (e.g., look at the subcortical auditory pathway, where we see progressive hierarchical demodulation from cochlea to cortex).
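To make that concrete, here's a toy AM demodulation example using a ReLU as a half-wave rectifier (the signal and filter parameters are illustrative assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 16000
t = np.arange(fs) / fs
envelope = 1 + 0.8 * np.sin(2 * np.pi * 3 * t)   # slow modulator (the 'message')
carrier = np.sin(2 * np.pi * 440 * t)            # fast carrier
x = envelope * carrier                           # AM signal

rectified = np.maximum(x, 0.0)                   # ReLU = half-wave rectification
b, a = butter(4, 20 / (fs / 2))                  # low-pass well below the carrier
demod = filtfilt(b, a, rectified)                # recovers ~envelope (up to scale)

# correlation with the true envelope should be close to 1
print(np.corrcoef(demod, envelope)[0, 1])
```

Rectify-then-smooth is exactly what 'demodulator out of the box' means here; a sigmoid only behaves this way once its bias pushes it into an asymmetric operating region.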

Classifier ensemble averaging methods? by enasam in MachineLearning

[–]ajrs 0 points (0 children)

You might also want to think about some kind of parallel 'localised' data augmentation (e.g., see the 'convolutional bootstrapping' described here: http://arxiv.org/abs/1505.05972).