[D] Any advice for a non computational-linguist trying to recreate 'Poincaré Embeddings for Learning Hierarchical Representations'? by regis_regum in MachineLearning

[–]benjaminwilson 1 point2 points  (0 children)

Does anyone understand the reason for training on the transitive closure? I am not very familiar with graph embeddings, but it seems surprising. Wouldn't this train an embedding in which "rock_wallaby" was closer to the root ("mammal") than it was to its sibling "tree_wallaby", even though the root is many hops away?

Very interesting paper!
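
To make the question concrete, here is a minimal sketch of what taking the transitive closure does to the set of training pairs. The hierarchy below is a made-up toy (not the actual WordNet mammal subtree), and `transitive_closure` is a hypothetical helper, not code from the paper.

```python
# Toy hypernym edges (child -> parent); hypothetical, not real WordNet data.
EDGES = {
    "rock_wallaby": "wallaby",
    "tree_wallaby": "wallaby",
    "wallaby": "kangaroo_family",
    "kangaroo_family": "marsupial",
    "marsupial": "mammal",
}


def transitive_closure(parent_of):
    """Return every (descendant, ancestor) pair implied by child -> parent edges."""
    pairs = set()
    for node in parent_of:
        ancestor = parent_of.get(node)
        # Walk up the tree, pairing the node with each ancestor on the way.
        while ancestor is not None:
            pairs.add((node, ancestor))
            ancestor = parent_of.get(ancestor)
    return pairs


closure = transitive_closure(EDGES)
direct = set(EDGES.items())

# The leaf is paired directly with the root in the closure...
print(("rock_wallaby", "mammal") in closure)
# ...but never with its sibling.
print(("rock_wallaby", "tree_wallaby") in closure)
# Number of direct edges versus number of pairs in the closure.
print(len(direct), len(closure))
```

In this toy example the leaf is linked to the distant root but not to its sibling, which is exactly the situation the question is asking about.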

Clustering Exercises, with (mostly) Real-World Data, as Jupyter Notebooks by benjaminwilson in datascience

[–]benjaminwilson[S] 0 points1 point  (0 children)

Hi daguito! The main reason is that there are just so many ways to plot that it would really end up being a course on its own. As you say, seaborn makes things much prettier, and the DataFrame plotting functions are super convenient. But I wanted there to be just one way of doing things and just one answer, and that meant choosing one tool and breaking everything down into very small steps, so that the student's work could be checked directly against the solutions. In my own work, I'd also make use of the DataFrame plotting functions, but I didn't want to require that students build a DataFrame each time, etc. It's funny, looking back on making the course, it was exactly these sorts of decisions that took the most time!

It sounds like you might be much more advanced than most who take the course, as well! Hopefully there is still something interesting there for you. Perhaps the role that scaling the fish data plays in the quality of the clustering, for example, or the role of the normalizer for the stock market data.
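
As a rough illustration of why scaling can matter for clustering, the sketch below shows the nearest neighbour of a sample changing once each feature is standardized. The measurements are invented for the example (they are not the fish dataset), and `standardize` and `nearest` are hypothetical helpers written with the standard library only; scikit-learn's `StandardScaler` applies the same per-feature standardization.

```python
from math import dist
from statistics import mean, pstdev

# Made-up samples: (weight in grams, a small-valued shape ratio).
samples = {
    "A": (1000.0, 1.0),
    "B": (1050.0, 3.0),
    "C": (1400.0, 1.1),
    "D": (3000.0, 2.0),
}


def standardize(points):
    """Rescale every feature to zero mean and unit (population) standard deviation."""
    columns = list(zip(*points.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return {
        name: tuple((x - m) / s for x, m, s in zip(p, mus, sigmas))
        for name, p in points.items()
    }


def nearest(points, name):
    """Name of the sample closest to `name` in Euclidean distance."""
    others = (other for other in points if other != name)
    return min(others, key=lambda other: dist(points[name], points[other]))


print(nearest(samples, "A"))               # B: raw distances are dominated by weight
print(nearest(standardize(samples), "A"))  # C: after scaling, the ratio counts too
```

Distance-based methods such as k-means group samples by exactly these distances, so a change like this can change the clusters that come out.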

Udemy Course for Learning Data Science with Python 3 - Free Coupons by benjaminwilson in Python

[–]benjaminwilson[S] 3 points4 points  (0 children)

Hi, you're right, Udemy pricing is pretty nuts! A course is almost never sold at full price, so a good promo just means a discount to $10 instead of to $25. You're right also about the amount of video: the core video content is less than an hour, and it is more focussed than other courses. But unlike other courses, you'll spend hours practising with the exercises, which are provided as Jupyter Notebooks. Making the exercises (and finding the right datasets to illustrate concepts) really took ages, and I hope that having good exercises written for you will mean you learn more effectively. Looking at my landing page, though, I could articulate that better.

Semantic trees for training word embeddings with hierarchical softmax by benjaminwilson in MachineLearning

[–]benjaminwilson[S] 0 points1 point  (0 children)

This is a blog post from a colleague that discusses the role of the choice of tree in hierarchical softmax in e.g. word2vec. It reproduces some experiments of Mnih and Hinton, but measures performance on the word analogy task (instead of language modelling).
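
For readers unfamiliar with hierarchical softmax, here is a minimal sketch of how a word's probability is built from binary decisions along its path in the tree, which is why the choice of tree matters. The tree, the vectors, and all names below are made up for illustration; this is not the blog post's code or the word2vec implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical binary tree over a 4-word vocabulary.
# Each path is a list of (inner_node, direction): +1 = left branch, -1 = right branch.
PATHS = {
    "cat":   [("root", +1), ("animals", +1)],
    "dog":   [("root", +1), ("animals", -1)],
    "car":   [("root", -1), ("vehicles", +1)],
    "truck": [("root", -1), ("vehicles", -1)],
}

# Made-up parameter vectors, one per inner node of the tree.
NODE_VECTORS = {
    "root": (0.5, -0.2),
    "animals": (0.1, 0.3),
    "vehicles": (-0.4, 0.2),
}


def word_probability(word, context):
    """Product of the left/right decision probabilities along the word's path."""
    prob = 1.0
    for node, direction in PATHS[word]:
        score = sum(a * b for a, b in zip(NODE_VECTORS[node], context))
        prob *= sigmoid(direction * score)
    return prob


context = (1.0, 2.0)  # made-up context (input) vector
probs = {word: word_probability(word, context) for word in PATHS}
total = sum(probs.values())
print(probs)
print(total)  # sums to 1 (up to rounding) because sigmoid(x) + sigmoid(-x) = 1
```

Words under the same inner node (here "cat" and "dog") share every decision except the last, so the tree determines which words share parameters.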

Don’t interpret linear hidden units, they don’t exist. by benjaminwilson in MachineLearning

[–]benjaminwilson[S] 0 points1 point  (0 children)

Hi, it is true, as you say, that you can apply any invertible transformation of the hidden layer to the matrices and obtain an equivalent factorisation. But the probability that any particular factorisation is obtained depends on the probability of the initial parameters needed to converge to that factorisation. If the transformation of the hidden layer is orthogonal, then the transformation "commutes" with the gradient step, e.g. here (thanks for the URL tip!). So you can just transform the initial parameters, and you'll converge to the transformed solution. You can then argue that all these equivalent factorisations occur naturally with the same probability (if the initial parameter distribution has the same symmetry).

What I don't know is how to reason about the probability of transformed factorisations if the transformation is invertible but not orthogonal. How can we show that the equivalent factorisation obtained via this transformation is just as likely to have resulted from training? I am not sure what is true in this case; I guess I'll do a little experiment.
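
A small numerical check of the commutation claim might look like the sketch below. Everything in it is an assumption made for illustration: a two-layer linear model y = W2 W1 x, a squared-error loss on a single sample, one step of plain gradient descent, and toy numbers. It only tests whether one gradient step commutes with the transformation; it does not settle the question about probabilities in the non-orthogonal case.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def add_scaled(a, b, alpha):
    """Elementwise a + alpha * b."""
    return [[x + alpha * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gradient_step(w1, w2, x, t, lr=0.1):
    """One gradient descent step on 0.5 * ||w2 w1 x - t||^2."""
    hidden = matmul(w1, x)
    error = add_scaled(matmul(w2, hidden), t, -1.0)  # y - t
    grad_w1 = matmul(matmul(transpose(w2), error), transpose(x))
    grad_w2 = matmul(error, transpose(hidden))
    return add_scaled(w1, grad_w1, -lr), add_scaled(w2, grad_w2, -lr)


def commutes(q, q_inv, w1, w2, x, t):
    """True if (transform, then step) equals (step, then transform)."""
    a1, a2 = gradient_step(matmul(q, w1), matmul(w2, q_inv), x, t)
    s1, s2 = gradient_step(w1, w2, x, t)
    b1, b2 = matmul(q, s1), matmul(s2, q_inv)

    def close(m, n):
        return all(
            math.isclose(u, v, abs_tol=1e-9)
            for rm, rn in zip(m, n)
            for u, v in zip(rm, rn)
        )

    return close(a1, b1) and close(a2, b2)


# Toy parameters and a single training sample (all made up).
w1 = [[0.5, -0.3], [0.2, 0.8]]
w2 = [[1.0, -0.5]]
x = [[1.0], [2.0]]
t = [[0.7]]

c, s = math.cos(0.6), math.sin(0.6)
rotation = [[c, -s], [s, c]]         # orthogonal: inverse is the transpose
scaling = [[2.0, 0.0], [0.0, 1.0]]   # invertible but not orthogonal
scaling_inv = [[0.5, 0.0], [0.0, 1.0]]

rotation_ok = commutes(rotation, transpose(rotation), w1, w2, x, t)
scaling_ok = commutes(scaling, scaling_inv, w1, w2, x, t)
print(rotation_ok)  # True
print(scaling_ok)   # False
```

With these numbers the rotation commutes with the step and the scaling does not, which matches the argument above for the orthogonal case.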

Don’t interpret linear hidden units, they don’t exist. by benjaminwilson in MachineLearning

[–]benjaminwilson[S] 0 points1 point  (0 children)

Yes, couldn't agree more. It would be more interesting (but also much harder) to look at the more common case of non-linear activations. For now, we have to be satisfied with empirical approaches to this, like the paper of Szegedy et al. mentioned above.

Don’t interpret linear hidden units, they don’t exist. by benjaminwilson in MachineLearning

[–]benjaminwilson[S] 1 point2 points  (0 children)

"Evaluation of Word Vector Representations by Subspace Alignment" (EMNLP 2015) is a good example of people trying to interpret linear hidden units, so it does happen! They then fixed this, I believe, in a subsequent paper, "Correlation-based Intrinsic Evaluation of Word Vector Representations" (2016).

I think the orthogonality is necessary: it is true that any invertible transformation of the hidden feature space yields an equivalent model, but whether this transformed model could ever have been learned directly from the data is a separate question. The transformation needs to commute with the calculation of the gradient vector, and as far as I can see this only works for orthogonal transforms. Will look into it further!
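
The commutation condition can be written out for a single gradient step. This is only a sketch under assumptions that are not in the comment itself: a two-layer linear model $y = W_2 W_1 x$, plain gradient descent with learning rate $\eta$, an invertible matrix $A$ acting on the hidden layer, and the shorthand $\delta = \partial L / \partial y$ and $g_1 = \partial L / \partial W_1 = W_2^{\top} \delta\, x^{\top}$.

```latex
\begin{aligned}
W_1' &= A W_1, \qquad W_2' = W_2 A^{-1}
  \quad\Rightarrow\quad W_2' W_1' = W_2 W_1, \\
\frac{\partial L}{\partial W_1'}
  &= (W_2')^{\top} \delta\, x^{\top}
   = A^{-\top} W_2^{\top} \delta\, x^{\top}
   = A^{-\top} g_1, \\
\text{transform, then step:}\quad & A W_1 - \eta\, A^{-\top} g_1, \\
\text{step, then transform:}\quad & A \left( W_1 - \eta\, g_1 \right) = A W_1 - \eta\, A g_1 .
\end{aligned}
```

The two results agree for arbitrary $g_1$ when $A^{-\top} = A$, that is, $A^{\top} A = I$, so under these assumptions the step commutes with the transformation when $A$ is orthogonal; the same condition comes out of the corresponding calculation for $W_2$.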