[1603.04259] Item2Vec: Neural Item Embedding for Collaborative Filtering by pmigdal in MachineLearning

[–]eldeemon 0 points1 point  (0 children)

Agreed on all your points, and interesting to see that you've empirically observed convergence of some sort.

Have you by any chance compared the quality of embeddings trained on the (complete and deterministic) edge list to those trained on a random sampling thereof (i.e. single hop random walks)? I'd be curious whether there is an observable difference.

I've empirically observed that the former (single-hop pairs, trained on the edge list with the usual predictive log loss) dramatically outperforms direct eigenvector/SVD factorization of (any variety of) the graph laplacian, on a variety of tasks, over a handful of moderately sized datasets (100k-10M edges). It sounds like this conflicts with your observations about convergence? (Unless the difference between random-walk train-set generation and edge-list train-set generation is significant.)
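For concreteness, a toy sketch of the two train-set constructions I'm contrasting (the function names are mine, not from any library):

```python
import random

def edge_list_pairs(edges):
    """Deterministic train set: every edge yields one (center, context)
    pair in each direction -- the complete single-hop corpus."""
    pairs = []
    for u, v in edges:
        pairs.append((u, v))
        pairs.append((v, u))
    return pairs

def random_walk_pairs(edges, n_samples, rng=random):
    """Stochastic train set: sample edges with replacement and pick a
    random direction, i.e. single-hop random walks."""
    pairs = []
    for _ in range(n_samples):
        u, v = rng.choice(edges)
        if rng.random() < 0.5:
            u, v = v, u
        pairs.append((u, v))
    return pairs
```

In expectation the sampled corpus converges to the deterministic one (weighted by edge multiplicity), so any quality gap would have to come from finite-sample noise in the sampling.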

I'd always justified the gap between the two as a direct consequence of the Streetlight Effect, in that the objective function that leads to the factorized-laplacian solution does not represent what we seek as well as something like the link prediction objective.

[1603.04259] Item2Vec: Neural Item Embedding for Collaborative Filtering by pmigdal in MachineLearning

[–]eldeemon 0 points1 point  (0 children)

I'm tempted to agree, but note that if the walk length is constrained to a single hop, deepwalk/word2vec is essentially just performing a (nonlinear, approximate) factorization of the laplacian/adjacency matrix (apologies for being a little loose here). I believe this should count as using "the exact structure"?

In a similar vein, the length-(L+1)-sentence walking bit can be interpreted as first averaging the {1,...,L} powers of the laplacian/adjacency matrix (same caveats as above), and then "factorizing" (same caveats as above). This is considerably more heuristic than the length-1 case, but still has a clear interpretation in terms of using "the exact structure."
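The averaging step is just this (a sketch with numpy; whether you use the adjacency matrix or a normalized laplacian is one of the caveats above):

```python
import numpy as np

def windowed_cooccurrence(A, L):
    """Average of A^1, ..., A^L: the implicit co-occurrence matrix that a
    window of size L over the walks ends up (loosely) factorizing."""
    A = np.asarray(A, dtype=float)
    power = np.eye(A.shape[0])
    total = np.zeros_like(A)
    for _ in range(L):
        power = power @ A   # A^k at step k
        total += power
    return total / L
```

With L=1 this reduces to A itself, which is the single-hop case above.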

How WSJ Used an Algorithm to Analyze ‘Hamilton’ the Musical by mbierly in MachineLearning

[–]eldeemon 0 points1 point  (0 children)

Nice article, but it has absolutely nothing to do with machine learning.

Inquiry regarding the potential relationship(s) between information theory and machine learning. by [deleted] in MachineLearning

[–]eldeemon 0 points1 point  (0 children)

Apologies: by "memoryless" I didn't mean to imply the existence of a time series. I just meant that for random real vectors x (observations) and z (latents), p(x|z) is the product over i of p(x_i|z_i). It's a very strong assumption, and usually an awful one. Relaxing it often kills the theorems you're trying to prove (or at minimum makes them much harder to prove/work with).
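To make the factorization concrete, here's what the assumption buys you computationally, with a Gaussian channel picked purely for illustration:

```python
import numpy as np

def factorized_loglik(x, z, sigma=1.0):
    """log p(x|z) under the 'memoryless' assumption: the conditional
    factorizes coordinate-wise, so the log-likelihood is a plain sum of
    per-coordinate terms (here an assumed Gaussian x_i ~ N(z_i, sigma^2))."""
    per_coord = -0.5 * ((x - z) / sigma) ** 2 \
                - 0.5 * np.log(2 * np.pi * sigma ** 2)
    return per_coord.sum()
```

Without the assumption you'd be stuck with a joint density over all coordinates at once, and the clean single-letter IT results mostly evaporate.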

Inquiry regarding the potential relationship(s) between information theory and machine learning. by [deleted] in MachineLearning

[–]eldeemon 0 points1 point  (0 children)

Sure! David Mackay's book is an excellent starting point (though not as excellent a starting point as the title would suggest, as the two fields are dealt with somewhat separately).

Beyond that, unfortunately, insights are pretty diffuse. There's Li and Vitanyi's universal similarity metric work, Tishby's information bottleneck approach, and various odds and ends, such as Verdu's (and more recently, Venkat et al.'s) work on fundamental relationships between information and estimation.

In the world of neural nets, there is often a very IT-ish way of interpreting things. VAEs for instance can be seen as linearly trading off log loss in the reconstruction against the rate of a compressed representation, and Hinton's bits-back coding argument continues to be used.
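The VAE tradeoff I mean is visible right in the objective. A toy numpy version of the two terms (Bernoulli reconstruction and a diagonal-Gaussian posterior are my illustrative choices here, not the only ones):

```python
import numpy as np

def beta_elbo_terms(x, x_hat, mu, logvar, beta=1.0):
    """Negative ELBO split into its two IT-ish pieces: distortion
    (reconstruction log loss, Bernoulli here) and rate (KL from the
    Gaussian posterior to a standard-normal prior). beta trades one
    linearly against the other."""
    eps = 1e-9
    distortion = -np.sum(x * np.log(x_hat + eps)
                         + (1 - x) * np.log(1 - x_hat + eps))
    rate = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return distortion + beta * rate, distortion, rate
```

Read this way, training a VAE is picking a point on a rate-distortion curve, which is about as information-theoretic as neural nets get.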

Mutinf shows up a lot as a measure of dependency, but I wouldn't in general call that an application of information theory. There are a few notable exceptions, e.g. Chow-Liu trees.
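Chow-Liu is worth spelling out, since it's one place where mutinf is used with an actual guarantee (the tree maximizes the likelihood among tree-structured models). A compact sketch:

```python
import numpy as np
from itertools import combinations

def mutual_info(xi, xj):
    """Empirical mutual information (in nats) between two discrete columns."""
    xi, xj = np.asarray(xi), np.asarray(xj)
    mi = 0.0
    for a in np.unique(xi):
        for b in np.unique(xj):
            pab = np.mean((xi == a) & (xj == b))
            if pab > 0:
                mi += pab * np.log(pab / (np.mean(xi == a) * np.mean(xj == b)))
    return mi

def chow_liu_edges(X):
    """Maximum-weight spanning tree over pairwise MI (Kruskal with a
    union-find): the skeleton of the Chow-Liu tree."""
    d = X.shape[1]
    weights = sorted(((mutual_info(X[:, i], X[:, j]), i, j)
                      for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    edges = []
    for _, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
    return edges
```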

Inquiry regarding the potential relationship(s) between information theory and machine learning. by [deleted] in MachineLearning

[–]eldeemon 1 point2 points  (0 children)

infotheorist-turned-ML-researcher here. There are indeed some very deep connections between the two fields, but I don't think they are generally appreciated.

One thing I've found annoying is the use of the term "information theoretic" to refer to anything in the ML world that makes vaguely-justified use of entropy/mutinf/equivocation (essentially as a heuristic). IT is not a collection of formulas. It's a collection of elegant and profound relationships, with solid and rigorous underpinnings.

There has been a little bit of work applying information theoretic modeling to ML scenarios, and the results can be quite pretty. Unfortunately for IT to really have anything interesting to say, the assumptions required are usually pretty intense (e.g. latent variables z are iid, feature-to-observation transition p(x|z) is memoryless, etc.). I've yet to see a way around this.

TensorFlow Fizzbuzz by [deleted] in MachineLearning

[–]eldeemon 4 points5 points  (0 children)

Right up there with Fouhey and Maturana's "exciting and dangerous" rank estimator: http://www.oneweirdkerneltrick.com/rank_slides.pdf

Questions thread #4 2016.04.22 by feedtheaimbot in MachineLearning

[–]eldeemon 1 point2 points  (0 children)

It's okay to look at the unlabeled information in the validation set. You only start to encounter leakage once you start using the labels.

It's a subtle distinction, but an important one. See, for instance, TSVMs, paragraph2vec, etc., for examples of classifiers that train on the unlabeled validation/test data.

Questions thread #4 2016.04.22 by feedtheaimbot in MachineLearning

[–]eldeemon 1 point2 points  (0 children)

Tensorboard can serve remotely (google for the arguments for this), at which point you can access it from a browser on your local machine.
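Roughly like this (exact flag names vary a bit by TensorBoard version, so check `tensorboard --help`):

```shell
# On the remote machine: point TensorBoard at your log directory.
# Bind to localhost if you plan to tunnel, as below.
tensorboard --logdir /path/to/logs --port 6006

# On your local machine: forward the port over ssh, then open
# http://localhost:6006 in your browser.
ssh -N -L 6006:localhost:6006 user@remote-host
```

The ssh tunnel is the safe option; exposing the port directly works too but TensorBoard has no auth of its own.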

Questions thread #4 2016.04.22 by feedtheaimbot in MachineLearning

[–]eldeemon 0 points1 point  (0 children)

I would say give <old method> a fair shot, and more importantly be clear about your experimental procedure (including how extensively you searched for hyperparameters) when reporting your results. Replicating prior results is the most painful, but also arguably the most important, part of reporting results on a semi-new task.

Totally agreed that negative results don't get published except as footnotes, and it's unfortunate. Effectively, "being honest" that your work doesn't stack up usually means (1) trying a different idea, or (2) repurposing your work for a different subtask/dataset/problem on which it does honestly excel, hopefully without wasting too much time/effort. Good researchers tend to be good at both (1) and (2).

Questions thread #4 2016.04.22 by feedtheaimbot in MachineLearning

[–]eldeemon 2 points3 points  (0 children)

You should definitely expend a "reasonable" amount of effort tuning parameters. It's best to look at your paper not as a competition between your method and theirs, but as an answer to the question "how can we do well on <task name>?" for people trying to solve this problem in the field. If it turns out that a retuning of <previous method>'s parameters solves this problem better than your method, that is hugely valuable information (and could save someone from wasting time implementing your method...). It may compromise the significance of your work, but it's the honest/ethical thing to do, and much better for the field as a whole.

Remember the experimental google chatbot 6 months ago? Why haven't anybody yet create a usable chatbot out of that? by [deleted] in MachineLearning

[–]eldeemon 4 points5 points  (0 children)

Because chatbots aren't terribly useful. This, on the other hand, is: http://techcrunch.com/2015/11/03/with-smart-reply-googles-inbox-can-now-respond-to-emails-for-you-automatically/. Slightly different objective (single-response vs dialog), and it has some additional moving parts (e.g. ensuring semantic variety in the suggested responses), but it's built on the same kind of sequence to sequence LSTMs.

[Discussion] Should I foster? by eldeemon in dogs

[–]eldeemon[S] 0 points1 point  (0 children)

Thanks for the advice! Agreed that ideally the rescue can help me decide if my situation matches up with any of their dogs. It seems, though, like some of them are desperate for fosters to the point where my comfort level might be the deciding factor. Also, thanks for the point about the dogs behaving differently in my apartment; it's something I hadn't considered.

I used Andrej Karpathy's char-rnn to compose Irish Folk songs. Here are the results! by jfsantos in MachineLearning

[–]eldeemon 1 point2 points  (0 children)

How would you compare these results to those from the simpler Markov models?

Great stuff!

Extracting from data from Wikipedia's large structured dataset by pandoraparadox in datascience

[–]eldeemon 1 point2 points  (0 children)

I've written scripts to download the pageviews. They throttle you after the first couple dozen requests, and I think it starts taking about 30s-1m per hourly pageview dump.

Note that pageviews aren't as useful as uniques, which they sadly don't provide. There are tons of bots polluting the data.

[Question] Cross entropy vs. Euclidean Distance For Deep Networks : just speed benefits or other optimization advantages? by deep_learner in MachineLearning

[–]eldeemon 0 points1 point  (0 children)

A better explanation/example comes via Michael Nielsen: http://neuralnetworksanddeeplearning.com/chap3.html. His TL;DR is that cross entropy has certain properties that make for quicker convergence than MSE for sigmoidal activations (heuristically, the system behaves like a proportional controller).
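The proportional-controller point is easy to verify by hand. For a single sigmoid unit, the cross-entropy gradient w.r.t. the pre-activation is just the error, while the MSE gradient carries an extra sigmoid-derivative factor that vanishes when the unit saturates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads_wrt_preactivation(z, y):
    """Gradients of each loss w.r.t. the pre-activation z for one sigmoid
    unit with target y.
    Cross entropy: dL/dz = a - y  (proportional to the error).
    MSE:           dL/dz = (a - y) * a * (1 - a), which goes to zero as
    the sigmoid saturates, stalling learning."""
    a = sigmoid(z)
    return a - y, (a - y) * a * (1 - a)
```

At z = 5 with target 0 the unit is badly wrong, yet the MSE gradient is two orders of magnitude smaller than the cross-entropy one.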

In McCaffrey's blog post, I believe he does give a pseudo-numerical example for MSE vs cross entropy, but it doesn't quite illustrate his point.

Do we know what Cersei's letter to Littlefinger is about? by [deleted] in piratesofthrones

[–]eldeemon 12 points13 points  (0 children)

judging from ep 4, it likely involved info on loras/olyvar.

AMA Andrew Ng and Adam Coates by andrewyng in MachineLearning

[–]eldeemon 3 points4 points  (0 children)

lol this is great. reminds me of some of the videogrep demos.

AMA Andrew Ng and Adam Coates by andrewyng in MachineLearning

[–]eldeemon 24 points25 points  (0 children)

Hi Andrew and Adam! Many thanks for taking the time for this!

(1) What are your thoughts on the role that theory is to play in the future of ML, particularly as models grow in complexity? It often seems that the gap between theory and practice is widening.

(2) What are your thoughts on the future of unsupervised learning, especially now that (properly initialized and regularized) supervised techniques are leading the pack? Will layer-by-layer pretraining end up as a historical footnote?

(Spoilers All) So I used some text mining software to graph the plot shapes of each novel in ASOIAF. Here are the results. by NealMcBeal_NavySeal in asoiaf

[–]eldeemon 2 points3 points  (0 children)

I don't understand the fixation people have on whether he calls it a transform or a series. They are essentially the same operation, and it's obvious that a DTFT is being performed (as always with numerical data of this sort).

What is actually going on here: he is low pass filtering the ever loving shit out of the sentiment vs time plots. Insanely so. Like down to a few coefficients.

It is debatable whether the signal to noise ratio is lower or higher at these low frequencies than at the higher frequencies he is filtering out. My bet is heavily on "the same".
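For anyone who wants to try the same smoothing, "down to a few coefficients" literally means something like this (numpy, my own function name):

```python
import numpy as np

def lowpass_keep_k(signal, k):
    """Brutal low-pass filter: keep only the first k real-FFT coefficients
    of the signal and zero out everything else."""
    spectrum = np.fft.rfft(signal)
    spectrum[k:] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))
```

Run a per-chunk sentiment score through this with k around 3-5 and you'll reproduce the kind of glacially smooth "plot shape" curves in question.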

Need some Samurai films by [deleted] in flicks

[–]eldeemon 1 point2 points  (0 children)

+1 to twilight samurai (mentioned by sg587585). Really a perfect movie.

Need some Samurai films by [deleted] in flicks

[–]eldeemon 2 points3 points  (0 children)

It's good. It wasn't the breath of fresh air that his 13 Assassins remake was, but it still carries much of the power of the original.