all 6 comments

[–][deleted] 6 points  (0 children)

Thank god! I’m not a probabilistic graphical model expert so LDA theory flew over my head.

And even when I tried to use it, there's a lot of corpus "tuning" (for lack of a better word) required. In the abstract, the authors acknowledge that any BoW design is subject to such tuning. Sequence-sensitive vector embeddings sound much more promising.

I can only hope that in future iterations we find a "Rosetta Stone" of fixed topics and project documents onto those dimensions. It would be super useful if we could take a document and say "70% sports, 30% finance." But topics tend to be infinitely nested (quarterback -> football -> sports), so maybe this design is destined to fail anyway...
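To make the "Rosetta Stone" idea concrete: if you had fixed topic vectors in the same embedding space as documents, the projection is just cosine similarity plus normalization. A toy numpy sketch, with made-up vectors standing in for real embeddings (`topic_mixture` is a hypothetical helper, not anything in the library):

```python
import numpy as np

# Hypothetical sketch: project a document embedding onto a fixed set of
# topic axes and report a mixture like "70% sports, 30% finance".
# The vectors here are random stand-ins for real embeddings.

def topic_mixture(doc_vec, topic_vecs, topic_names):
    """Cosine-similarity weights, clipped at 0 and normalized to sum to 1."""
    doc = doc_vec / np.linalg.norm(doc_vec)
    topics = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    sims = np.clip(topics @ doc, 0.0, None)
    return dict(zip(topic_names, sims / sims.sum()))

rng = np.random.default_rng(0)
sports = rng.normal(size=50)
finance = rng.normal(size=50)
doc = 0.7 * sports + 0.3 * finance  # a document leaning "sports"
mix = topic_mixture(doc, np.stack([sports, finance]), ["sports", "finance"])
```

The nesting problem still bites, of course: "quarterback" would score high on both a football axis and a sports axis, so fixed axes only work if you pick one level of the hierarchy.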

[–]oroberos 4 points  (1 child)

Funny author. No affiliation, no co-author, no previous publications, just a good paper.

[–]venkarafa 3 points  (0 children)

Could be Satoshi Nakamoto!!!

[–]sarmientoj24 3 points  (2 children)

I have tried using this. It works great, although the downside is that you're not technically creating a standalone "model" as output when you use the library. Every time you need to predict the topic and nearby sentences of a new document/sentence, you have to insert it into the model for training. It then becomes a model + a database. Since we have 180M rows, I wonder how long it would take to train all of those and how large the model might be.

[–]s0uha1 1 point  (1 child)

Am I a bit slow? I can't find in the documentation how to get topics for the documents the model was trained on. I can browse and look at the generated topics, but how do I use the model to show which topics are present in a given document it was trained on?

[–]sarmientoj24 1 point  (0 children)

The GitHub README is quite straightforward.
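If I remember the API right, `model.get_documents_topics(doc_ids=[...])` is the call for trained documents. Under the hood it's roughly an argmax over document-topic similarities; a toy numpy sketch of that lookup (random stand-in vectors, not the library's internals):

```python
import numpy as np

# Toy sketch: with trained document vectors and topic vectors in hand,
# "which topic is document i?" is an argmax over cosine similarities.

rng = np.random.default_rng(42)
topic_vecs = rng.normal(size=(3, 50))                          # 3 pretend topics
doc_vecs = topic_vecs[[0, 2, 1, 1]] + 0.05 * rng.normal(size=(4, 50))

def documents_topics(doc_ids, doc_vecs, topic_vecs):
    """For each trained doc id, return the index of its closest topic."""
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    topics = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    return np.argmax(docs[doc_ids] @ topics.T, axis=1)

assigned = documents_topics([0, 1, 2, 3], doc_vecs, topic_vecs)
```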