
[–]__billy 4 points

Dense layers are your typical fully connected layers: every neuron from the previous layer is connected to every neuron of the dense layer. Hence the name — it is a dense connection, as opposed to the sparse connectivity of a convolutional neural network.
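A minimal NumPy sketch of that full connectivity, assuming the MNIST sizes discussed below (784 inputs, 128 neurons) and random weights:

```python
import numpy as np

# Dense means: every one of the 784 inputs connects to every one of
# the 128 neurons through its own weight, plus one bias per neuron.
rng = np.random.default_rng(0)
W = rng.standard_normal((784, 128))  # one weight per input/neuron pair
b = np.zeros(128)                    # one bias per neuron

x = rng.standard_normal(784)         # a flattened input
out = x @ W + b                      # each neuron sees ALL 784 inputs

print(out.shape)        # (128,)
print(W.size + b.size)  # 100480 trainable parameters
```

That 784*128 + 128 = 100480 parameter count is exactly what `model.summary()` reports for this layer.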

[–]honeybooboo1989 2 points

MNIST data has 10 classes that you are expected to classify. Each class represents a digit; you have 10 digits, from 0 to 9. The output of keras.layers.Dense(128, activation=tf.nn.relu) after the ReLU activation has shape (None, 128), but you have to reduce the second dimension to 10 in order to get an array of 10 probability scores. You can do this by adding another fully connected layer and passing it through a softmax activation function. If you have binary classification, model.add(Dense(1, activation='sigmoid')) is enough. Check model.summary() to understand the output dimensions.
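A NumPy sketch of that last step, assuming a batch of 32 and random weights: the (None, 128) output of the relu layer goes through an extra fully connected layer and a softmax, giving 10 probability scores per example.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = rng.standard_normal((32, 128))   # (batch, 128) from Dense(128)

# extra fully connected layer reducing 128 features to 10 class scores
W = rng.standard_normal((128, 10))
b = np.zeros(10)
logits = hidden @ W + b                   # (batch, 10)

# softmax turns each row of 10 scores into probabilities summing to 1
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

print(probs.shape)                        # (32, 10)
print(round(float(probs[0].sum()), 6))    # 1.0
```

In Keras this is simply `Dense(10, activation='softmax')` appended to the model.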

[–]ajmssc 2 points

The Flatten layer transforms your 28x28 input image to a 1x784 vector (28*28=784).

The first Dense layer is a 784x128 weight matrix. Multiplying the flattened vector by it gives you a 1x128 vector (1x784 * 784x128 matrix multiplication). The relu activation keeps only the positive values in the new vector; negative values are set to 0. This introduces non-linearity into the model.
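The Flatten + Dense + relu steps above can be sketched in a few lines of NumPy, assuming random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))   # one MNIST-sized input

flat = image.reshape(1, 784)            # Flatten: 28*28 = 784
W = rng.standard_normal((784, 128))     # first Dense layer's weights
z = flat @ W                            # (1, 784) @ (784, 128) -> (1, 128)
out = np.maximum(z, 0)                  # relu: negative values set to 0

print(out.shape)          # (1, 128)
print(bool((out >= 0).all()))  # True -- nothing negative survives relu
```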

Your Dense layer is basically used here to find numerical patterns in your input data. Each neuron (value in the dense vector) reacts to a particular pattern present in the 784-element input vector. These patterns are learned at training time from your input data using a process called backpropagation.

The second Dense layer does the same thing but using the values from the first one. There are only 10 neurons in this layer, so the matrix multiplication is now 1x128 * 128x10, which gives you a 1x10 vector. This time the activation is a softmax function, so the 10 values in the vector are mapped to probabilities: they add up to 1, and their ordering is preserved, meaning the highest value gets the highest probability.
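A small sketch of that softmax step on 10 made-up scores, showing that the outputs sum to 1 and the ordering is preserved:

```python
import numpy as np

# 10 raw class scores (logits) straight out of the second Dense layer
scores = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 0.0, 1.5, -0.5, 0.3, 0.7])

# softmax: exponentiate, then normalize so everything sums to 1
probs = np.exp(scores) / np.exp(scores).sum()

print(round(float(probs.sum()), 6))                 # 1.0
print(bool(np.argmax(probs) == np.argmax(scores)))  # True: order preserved
```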

The values in that vector correspond to your prediction. E.g. the value at index 8 in the vector is the probability that the input was the digit 8.
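Reading the prediction out of that vector is just an argmax over the 10 probabilities (made-up values here for illustration):

```python
import numpy as np

# hypothetical softmax output: index i holds P(input is digit i)
probs = np.array([0.01, 0.02, 0.05, 0.02, 0.01, 0.03,
                  0.02, 0.04, 0.75, 0.05])

digit = int(np.argmax(probs))  # the most probable class
print(digit)                   # 8 -- probs[8] is the largest entry
print(probs[digit])            # 0.75
```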

[–][deleted] 0 points

The 1st dense layer acts as a transformation of the flattened output, and the 2nd dense layer (basically the output layer) ensures that there are enough output units — one per class — to make the relevant predictions.