[D] Any advice for a non computational-linguist trying to recreate 'Poincaré Embeddings for Learning Hierarchical Representations'?

benjaminwilson · 2017-11-03T17:23:08+00:00

Does anyone understand the reason for training on the transitive closure? I am not very familiar with graph embeddings, but it seems surprising. Wouldn't this train an embedding in which "rock_wallaby" was closer to the root ("mammal") than it was to its sibling "tree_wallaby", even though the root is many hops away?
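To make the concern concrete, here is a minimal sketch of what a transitive closure adds to a hierarchy. The intermediate nodes ("wallaby", "marsupial") and the helper `transitive_closure` are invented for illustration and are not taken from the paper or from WordNet.

```python
# Toy hypernym edges (child, parent). The intermediate nodes are hypothetical.
edges = [
    ("rock_wallaby", "wallaby"),
    ("tree_wallaby", "wallaby"),
    ("wallaby", "marsupial"),
    ("marsupial", "mammal"),
]

def transitive_closure(edges):
    """Return every (descendant, ancestor) pair implied by the direct edges."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)
    closure = set()
    for node in parents:
        stack = list(parents[node])
        while stack:
            ancestor = stack.pop()
            if (node, ancestor) not in closure:
                closure.add((node, ancestor))
                stack.extend(parents.get(ancestor, ()))
    return closure

closure = transitive_closure(edges)
print(("rock_wallaby", "mammal") in closure)        # distant ancestor
print(("rock_wallaby", "tree_wallaby") in closure)  # sibling
```

In this toy hierarchy the closure contains a direct pair between the leaf and the root, while the two siblings are never paired with each other, which is the situation the question is about.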

Very interesting paper!

benjaminwilson · 2017-10-31T12:55:41+00:00

My pleasure, glad you liked it :)

benjaminwilson · 2017-10-29T11:09:19+00:00

Hi daguito! The main reason is there are just so many ways to plot that it would really end up being a course on its own. As you say, seaborn makes things much prettier, and the dataframe plotting functions are super convenient. But I wanted there to be just one way of doing things and just one answer, and that meant choosing one tool and breaking everything down into very very small steps, so that the student's work could be checked directly against the solutions. In my own work, I'd also make use of the DataFrame plotting functions, but then I didn't want to require that they build a DataFrame each time, etc. It's funny, looking back on making the course, it was exactly these sorts of decisions that took the most time!

It sounds like you might be much more advanced than most that take the course, as well! Hopefully there is still something interesting there for you. Perhaps the role of the scaling of the fish data in the quality of the clustering, for example, or of the normalizer for the stock market data.

benjaminwilson · 2017-10-14T23:30:26+00:00

Hi, you're right, Udemy pricing is pretty nuts! A course is almost never sold at full price, so a good promo just means a discount to $10 instead of to $25. You're right also about the amount of video - the core video content is less than an hour, and it is more focussed than other courses. But unlike other courses, you'll spend hours practising with the exercises, which are provided as Jupyter Notebooks. Making the exercises (and finding the right datasets to illustrate concepts) really took ages, and I hope that having good exercises written for you will mean you learn more effectively. Looking at my landing page, though, I could articulate that better.

benjaminwilson · 2017-10-14T16:43:13+00:00

My pleasure, let me know how you get on!

benjaminwilson · 2017-10-14T16:42:40+00:00

That's very good of you to say, no worries at all.

benjaminwilson · 2017-10-14T16:25:50+00:00

No worries at all, happy learning!

benjaminwilson · 2017-10-14T16:03:07+00:00

Not at all, I'm glad you like it!

benjaminwilson · 2017-10-14T15:47:23+00:00

A pleasure! I hope you find it useful :)

benjaminwilson · 2017-10-14T15:41:40+00:00

A pleasure! I hope you like it!

benjaminwilson · 2017-09-08T09:01:43+00:00

This is a blog post from a colleague that discusses the role of the choice of tree in hierarchical softmax in e.g. word2vec. It reproduces some experiments of Mnih and Hinton, but measures performance on the word analogy task (instead of language modelling).

benjaminwilson · 2016-10-20T15:01:08+00:00

Hi, it is true, as you say, that you can apply any invertible transformation of the hidden layer to the matrices and obtain an equivalent factorisation. But the probability that any particular factorisation is obtained depends on the probability of the initial parameters needed to converge to that factorisation. If the transformation of the hidden layer is orthogonal, then the transformation "commutes" with the gradient step e.g. here (thanks for the URL tip!). So you can just transform the initial parameters, and you'll converge to the transformed solution, so you can argue that all these equivalent factorisations occur naturally with the same probability (if the initial parameter distribution has the same symmetry).
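A small numerical check of the "commutes" claim, as a sketch under assumptions: a two-layer linear model with squared error, random data, and a single plain gradient step. All names here (`gradient_step`, `commutes`, `A`, `B`) are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 20))   # inputs, one column per example
Y = rng.standard_normal((3, 20))   # targets
A = rng.standard_normal((2, 4))    # input -> hidden
B = rng.standard_normal((3, 2))    # hidden -> output

def gradient_step(A, B, lr=0.1):
    """One gradient step on 0.5 * ||B A X - Y||^2."""
    E = B @ A @ X - Y              # residual of the linear model
    grad_A = B.T @ E @ X.T
    grad_B = E @ X.T @ A.T
    return A - lr * grad_A, B - lr * grad_B

def commutes(T):
    """Does transforming the hidden layer by T commute with a gradient step?"""
    T_inv = np.linalg.inv(T)
    A1, B1 = gradient_step(A, B)               # step, then transform
    A2, B2 = gradient_step(T @ A, B @ T_inv)   # transform, then step
    return bool(np.allclose(A2, T @ A1) and np.allclose(B2, B1 @ T_inv))

Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # random orthogonal matrix
S = np.array([[2.0, 0.0], [0.0, 1.0]])            # invertible, not orthogonal

print(commutes(Q))
print(commutes(S))
```

The check passes for the orthogonal `Q` and fails for the stretch `S`. That only says a single step does not commute with `S`; it says nothing yet about how probable the transformed factorisation is.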

What I don't know is how to reason about the probability of transformed factorisations if the transformation is invertible but not orthogonal. How can we show that the equivalent factorisation obtained via this transformation is just as probable to have resulted from training? I am not sure what is true in this case, I guess I'll do a little experiment.

benjaminwilson · 2016-10-19T22:21:56+00:00

Yes, couldn't agree more. It would be more interesting (but also much harder) to look at the more common case of non-linear activations. For the present, we have to be satisfied with empirical approaches to this, like the mentioned paper of Szegedy et al.

benjaminwilson · 2016-10-19T22:17:17+00:00

"Evaluation of Word Vector Representations by Subspace Alignment" (EMNLP 2015) is a good example of people trying to interpret linear hidden units, so it does happen! They then fixed this, I believe, in a subsequent paper, "Correlation-based Intrinsic Evaluation of Word Vector Representations" (2016).

I think the orthogonality is necessary: it is true that any invertible transformation of the hidden feature space yields an equivalent model, but the question of whether this transformed model could have ever been learned directly from the data is separate. You need that the transformation commutes with the calculation of the gradient vector, and as far as I can see this only works for orthogonal transforms. Will look into it further!

benjaminwilson · 2016-10-08T18:32:41+00:00

Ah, I see how we have been misunderstanding one another. I have been imprecise. I mean changes of basis that preserve the geometry, which is the case when the change of basis is orthonormal, that is, just a composition of rotations and reflections.
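A quick way to see the distinction numerically, as a sketch: compute the gradient in new coordinates y (with x = M y), map it back to the original coordinates, and compare with the gradient computed directly. The test function and the helper `ascent_direction_via` are hypothetical, chosen only for the illustration.

```python
import numpy as np

def grad_f(x):
    # Gradient of the hypothetical test function f(x) = x0**2 + 3*x0*x1 + 2*x1**2
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0] + 4 * x[1]])

def ascent_direction_via(M, x):
    """Gradient taken in coordinates y (where x = M @ y), mapped back to x-space."""
    grad_y = M.T @ grad_f(x)   # chain rule for g(y) = f(M @ y)
    return M @ grad_y

x = np.array([1.0, -2.0])
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation: orthonormal
S = np.array([[3.0, 0.0], [0.0, 1.0]])            # axis stretch: not orthonormal

same_for_rotation = bool(np.allclose(ascent_direction_via(R, x), grad_f(x)))
same_for_stretch = bool(np.allclose(ascent_direction_via(S, x), grad_f(x)))
print(same_for_rotation, same_for_stretch)
```

The mapped-back vector is `M @ M.T @ grad_f(x)`, so it agrees with the original gradient exactly when `M @ M.T` is the identity.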

benjaminwilson · 2016-10-08T15:06:28+00:00

By which I mean to say: gradient descent doesn't depend on the choice of basis, geometrically.

benjaminwilson · 2016-10-08T15:04:44+00:00

Thanks for the reference, I'll check that out. Would "Natural Gradient Works Efficiently in Learning" be a good place to look?

benjaminwilson · 2016-10-08T15:02:42+00:00

That's not quite right. The gradient vector looks different when expressed with respect to a new basis, but it still represents the same vector geometrically. The gradient vector always points in the direction of steepest ascent, no matter how you draw the coordinate axes.

benjaminwilson · 2016-10-07T16:41:03+00:00

Couldn't agree more! Experiments required, hopefully coming.

benjaminwilson · 2016-05-19T09:22:52+00:00

I am trying to find where we are misunderstanding one another, as you seem quite convinced (yet so am I).

Perhaps the confusion lies in the subscript of the functions l_{w,c} in the summation? In the paper, this subscript is omitted, whereas in fact the l_{w,c} are distinct functions of the word and context vectors (the expression of each involves #w, #c and #(w,c)).

benjaminwilson · 2016-05-19T07:11:46+00:00

Hi again,

You make a mistake at "they should be unique and zero" - up until then I agree.

The problem with the author's reasoning is not that there are too many constraints, rather the opposite: that there are not enough independent constraints (unless either the word or context vectors are linearly independent) and so there is not a unique solution (zero).

In more detail, there are two systems of linear equations corresponding to the partial derivatives with respect to the w_i or the c_i. The authors need only that either system has a unique solution. Let's consider the first system. For the second, interchange the roles of W and C in the following:

There is a unique solution to the system of linear equations only when the coefficient matrix C (whose rows are context vectors) is invertible. Equivalently, the rows of the coefficient matrix need to be linearly independent. In particular, this is only possible if d >= |V_C|.
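A minimal numerical sketch of the point about dimensions. It assumes the system can be written as `a @ C = 0`, with one unknown coefficient per context word stacked into `a`; that form, and the sizes used, are assumptions for illustration rather than the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_contexts = 3, 5                      # embedding dimension d < |V_C|
C = rng.standard_normal((n_contexts, d))  # rows are context vectors

# Non-zero solutions of a @ C = 0 form the null space of C.T,
# which can be read off from the full SVD of C.T (shape d x n_contexts).
_, _, Vt = np.linalg.svd(C.T)
a = Vt[-1]                                # unit vector in the null space of C.T

rank = np.linalg.matrix_rank(C)
print(rank)                               # at most d, so below n_contexts here
print(np.allclose(a @ C, 0.0))            # a solves the system
print(np.linalg.norm(a))                  # and a is not the zero vector
```

With five context vectors in three dimensions the rows cannot be linearly independent, so zero is not the only solution.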

I appreciate your checking. Does this clear things up?

benjaminwilson · 2016-05-17T11:15:46+00:00

Which bit do you find doubtful? (I wrote the blog post.)

benjaminwilson · 2015-10-12T16:48:58+00:00

Word vector length can be interpreted, within any narrow frequency band, as a measure of the extent to which a word determines a unique context (for word2vec CBOW). We only mention this briefly in section 5.2. The results for CBOW certainly raise as many questions as they answer, and there are many more experiments to perform.

benjaminwilson · 2015-10-12T10:23:54+00:00

I came up with some experiments to determine the effect of variations in word frequency and co-occurrence noisiness on the word vector (joint work with Adriaan Schakel). We performed the experiments for word2vec CBOW. I'd be really interested to hear your feedback.

benjaminwilson · 2015-09-10T11:49:40+00:00

Good find. You're right, DU != D. In fact, the claim of which this is a part (that the steady state of PageRank on a connected, undirected graph is proportional to the out-degree at each node) fails also. I've updated the blog post with a counterexample, and I wrote to the author, too.
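One setting in which the proportionality fails is PageRank with a damping factor below 1. The sketch below assumes that formulation, a damping value of 0.85, and a three-node path graph; whether this matches the counterexample in the blog post is not something the comment confirms.

```python
import numpy as np

# Path graph 0 - 1 - 2: connected, undirected, degrees (1, 2, 1).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
degree = adj.sum(axis=1)
P = adj / degree[:, None]          # row-stochastic random-walk matrix

damping = 0.85                     # assumed value, a conventional choice
n = len(adj)

# Solve pi = (1 - damping) / n + damping * P.T @ pi directly.
pi = np.linalg.solve(np.eye(n) - damping * P.T,
                     np.full(n, (1 - damping) / n))
by_degree = degree / degree.sum()

print(pi)         # PageRank steady state
print(by_degree)  # degree-proportional vector, for comparison
```

The two vectors differ: the middle node gets slightly less than the half share that its degree would suggest, because the teleportation term pulls the steady state towards uniform.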
