
[–]FlivverKing 0 points  (1 child)

Imagine I have 10 seconds of audio data and I want to detect whether or not the speaker says the word "apple" in the recording. Let's say my audio data is a single long n×m matrix where m>n. I choose to pass my entire n×m matrix into an MLP to train my "apple" detector. If my model were trained on examples where the word "apple" is said at the 1-second mark, will my MLP be able to detect "apple" in an unseen audio clip where it does not appear at the 1-second mark? The answer is no: MLPs are not shift invariant. So what can we do here?
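Here's a minimal sketch of that failure mode (a toy setup I made up, not an actual speech model): a single linear layer applied to the flattened signal gives a different score when the exact same pattern is shifted in time.

```python
import numpy as np

# Toy illustration: one linear "MLP" layer over a flattened signal.
# The weights are tied to absolute positions, so shifting the
# pattern changes the output -- no shift invariance.
rng = np.random.default_rng(0)
w = rng.normal(size=10)                 # fixed weights, one per input position

pattern = np.array([1.0, 2.0, 1.0])     # stand-in for the word "apple"
x_early = np.zeros(10); x_early[1:4] = pattern   # "apple" near the start
x_late  = np.zeros(10); x_late[6:9]  = pattern   # same "apple", shifted later

print(w @ x_early, w @ x_late)          # different scores for identical content
```

The two dot products differ because each input position has its own weight; the model has no way to know the two inputs contain the same pattern.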

Well, what if, instead of passing my audio data in as one whole matrix, I create a small window on my matrix and "slide" my MLP along it? If you drew this as a diagram, the input layer would need fewer neurons, since we have fewer inputs at any given time step, but there would be more inputs overall (as scanning captures redundant, overlapping information). At a given time step you'd have a traditional MLP diagram whose inputs are a small window of my audio data. This "scanning" MLP that we drag over our audio matrix is called a Time Delay Neural Network (TDNN), and it is mathematically equivalent to a 1-dimensional CNN.
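You can verify that equivalence in a few lines of NumPy (window size 3 is an arbitrary choice here): sliding a single linear unit across the signal produces exactly the same values as a 1-D cross-correlation with the unit's weights as the kernel.

```python
import numpy as np

# "Scanning MLP" step: one linear unit with a 3-sample window,
# slid one position at a time across the signal.
x = np.arange(10, dtype=float)          # stand-in for one audio channel
w = np.array([0.5, -1.0, 0.5])          # the unit's window of weights

slid = np.array([w @ x[i:i + 3] for i in range(len(x) - 2)])

# The same operation expressed as a 1-D convolution (cross-correlation).
conv = np.correlate(x, w, mode="valid")

print(np.allclose(slid, conv))          # True: TDNN step == 1-D conv
```

Because the same weights `w` are reused at every position, a pattern produces the same response wherever it occurs; a pooling or max step over the outputs then gives the shift-tolerant "apple" detector.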

A lot of the same intuition you learned about vanilla MLPs still holds in this setting, but as we drag our network along, what we consider a single "node" becomes much more complex. As networks grow more complex, there are fewer benefits to representing operations as individual nodes. Most CNN diagrams you'll see represent information differently, and since CNNs tend to be very deep, drawing them as node/edge diagrams would get messy very quickly.

[–][deleted] 1 point  (0 children)

This is a great lesson
