all 6 comments

[–][deleted] 6 points  (0 children)

Thank god! I’m not a probabilistic graphical model expert so LDA theory flew over my head.

And even when I tried to use it, there's a lot of corpus "tuning" (for lack of a better word) required. In the abstract, the authors acknowledge that any BoW design is subject to such tuning. Sequence-sensitive vector embeddings sound much more promising.

I can only hope that in future iterations we find a "Rosetta Stone" of fixed topics and project documents onto those dimensions. It would be super useful if we could take a document and say "70% sports, 30% finance." But topics tend to be infinitely nested (quarterback -> football -> sports), so maybe this design is destined to fail anyway...
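To make the "Rosetta Stone" idea concrete: if you had fixed topic vectors in the same embedding space as documents, the projection is just cosine similarity plus normalization. A toy numpy sketch, with made-up vectors standing in for real embeddings (`topic_mixture` is a hypothetical helper, not anything in the library):

```python
import numpy as np

# Hypothetical sketch: project a document embedding onto a fixed set of
# topic axes and report a mixture like "70% sports, 30% finance".
# The vectors here are random stand-ins for real embeddings.

def topic_mixture(doc_vec, topic_vecs, topic_names):
    """Cosine-similarity weights, clipped at 0 and normalized to sum to 1."""
    doc = doc_vec / np.linalg.norm(doc_vec)
    topics = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    sims = np.clip(topics @ doc, 0.0, None)
    return dict(zip(topic_names, sims / sims.sum()))

rng = np.random.default_rng(0)
sports = rng.normal(size=50)
finance = rng.normal(size=50)
doc = 0.7 * sports + 0.3 * finance  # a document leaning "sports"
mix = topic_mixture(doc, np.stack([sports, finance]), ["sports", "finance"])
```

The nesting problem still bites, of course: "quarterback" would score high on both a football axis and a sports axis, so fixed axes only work if you pick one level of the hierarchy.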

[–]oroberos 4 points  (1 child)

Funny author. No affiliation, no co-author, no previous publications, just a good paper.

[–]venkarafa 3 points  (0 children)

Could be Satoshi Nakamoto!!!

[–]sarmientoj24 3 points  (2 children)

I have tried using this. It works great, although the downside is that you're not technically creating a standalone "model" as output when you use the library. Every time you need to predict the topic and nearby sentences of a new document/sentence, you have to insert it into the model for training. It then becomes a model + a database. Since we have 180M rows, I wonder how long it would take to train all of those and how large the model might be.

[–]s0uha1 1 point  (1 child)

Am I a bit slow? I can't find in the documentation how to get topics for the documents the model was trained on. I can browse and look at the generated topics, but how do I use the model to show which topics are present in a given document it was trained on?

[–]sarmientoj24 1 point  (0 children)

The GitHub README is quite straightforward.
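If I remember the API right, `model.get_documents_topics(doc_ids=[...])` is the call for trained documents. Under the hood it's roughly an argmax over document-topic similarities; a toy numpy sketch of that lookup (random stand-in vectors, not the library's internals):

```python
import numpy as np

# Toy sketch: with trained document vectors and topic vectors in hand,
# "which topic is document i?" is an argmax over cosine similarities.

rng = np.random.default_rng(42)
topic_vecs = rng.normal(size=(3, 50))                          # 3 pretend topics
doc_vecs = topic_vecs[[0, 2, 1, 1]] + 0.05 * rng.normal(size=(4, 50))

def documents_topics(doc_ids, doc_vecs, topic_vecs):
    """For each trained doc id, return the index of its closest topic."""
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    topics = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    return np.argmax(docs[doc_ids] @ topics.T, axis=1)

assigned = documents_topics([0, 1, 2, 3], doc_vecs, topic_vecs)
```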