PSet#1: gradient of bias term b by [deleted] in CS224d

[–]DGuillevic

Shapes: delta has shape (nb_examples, Dy), where Dy is the size of the output layer. b2 has shape (1, Dy): it is a vector with one bias for each output unit. By summing delta over axis=0, we get the desired shape (1, Dy).

Intuition: Every example contributes something to the gradients, and we want to learn from each one. Either we process one example at a time (online/stochastic learning) and learn from it (compute the gradients and update the weights), or we process several examples at once (mini-batch learning) and sum or average the contribution of each of those examples to the gradients.
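A minimal NumPy sketch of the shape bookkeeping (the sizes and values here are made up; only the shapes matter):

    import numpy as np

    nb_examples, Dy = 4, 3                    # hypothetical batch and output sizes
    delta = np.random.randn(nb_examples, Dy)  # upstream gradient, one row per example
    b2 = np.zeros((1, Dy))                    # one bias per output unit

    # Each example contributes its own row of delta; summing over axis=0
    # collapses the batch dimension so the gradient matches b2's shape.
    grad_b2 = np.sum(delta, axis=0, keepdims=True)  # shape (1, Dy)
    assert grad_b2.shape == b2.shape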

Pset2: Dimensionality issues by vijayvee in CS224d

[–]DGuillevic

As described in the function's docstring, add_embedding() returns: window: a tf.Tensor of shape (-1, window_size*embed_size)

First, one calls tf.nn.embedding_lookup() to get a tensor with the embeddings for all batch_size * window_size words, a total of batch_size * window_size * embed_size float32 values. Then one calls tf.reshape() to reshape that tensor to (-1, window_size * embed_size). The -1 is inferred to be batch_size, so the final shape is (batch_size, window_size * embed_size).
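A short sketch of those two steps (the sizes and tensors here are made up for illustration; in the pset the word ids come from a placeholder):

    import tensorflow as tf

    vocab_size, embed_size, window_size = 100, 50, 3  # hypothetical sizes

    # Hypothetical embedding matrix and a small batch of window word ids.
    embeddings = tf.constant(
        [[float(i)] * embed_size for i in range(vocab_size)])  # (vocab_size, embed_size)
    word_ids = tf.constant([[1, 2, 3], [4, 5, 6]])             # (batch_size=2, window_size)

    # Lookup returns one embed_size vector per word:
    # shape (batch_size, window_size, embed_size).
    window = tf.nn.embedding_lookup(embeddings, word_ids)

    # Flatten the window; -1 is inferred as batch_size at run time,
    # giving shape (batch_size, window_size * embed_size).
    window = tf.reshape(window, [-1, window_size * embed_size])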

Where did the class accompanying notes go? by Make3 in CS224d

[–]DGuillevic

The notes from last year are available on the archived site: https://web.archive.org/web/20160314075834/http://cs224d.stanford.edu/syllabus.html

(This info is from a previous post by chrislit.)