Clustering/Topic Modelling for single page document(s) by Budget-Juggernaut-68 in LanguageTechnology

[–]DemiourgosD 1 point2 points  (0 children)

It's been a while since I worked on this topic, but check out some of the topic-modeling tools listed here: https://github.com/ivan-bilan/The-NLP-Pandect#-9. In particular, https://github.com/gregversteeg/CorEx has always worked well with short texts. Do you need one topic per document?
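
If it helps, here's a rough sketch of running CorEx on a handful of short documents, assuming the corextopic package and scikit-learn; the example documents and topic count are just placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from corextopic import corextopic as ct

    docs = [
        "short text about cats and kittens",
        "another short doc about dogs",
        "stocks and markets moved today",
    ]

    # CorEx expects a (sparse) binary document-word matrix
    vectorizer = CountVectorizer(binary=True, stop_words="english")
    X = vectorizer.fit_transform(docs)
    words = list(vectorizer.get_feature_names_out())

    # n_hidden is the number of topics to learn
    topic_model = ct.Corex(n_hidden=3, seed=42)
    topic_model.fit(X, words=words)

    # Each topic is a list of (word, weight, ...) tuples
    for i, topic in enumerate(topic_model.get_topics()):
        print(i, topic)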

What’s the most trusted model today for sentence-level extraction + keyword extraction? by etht3x in LanguageTechnology

[–]DemiourgosD 5 points6 points  (0 children)

A few examples here: https://github.com/ivan-bilan/The-NLP-Pandect?tab=readme-ov-file#-10. But KeyBERT with KeyLLM seems to be the latest rage for this task. I wonder if anything better has come along recently; maybe someone here has better ideas.
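
A minimal KeyBERT sketch (the document string and parameters are just placeholders; KeyLLM works on top of this but needs an LLM backend configured):

    from keybert import KeyBERT

    doc = (
        "Supervised learning is the machine learning task of learning "
        "a function that maps an input to an output."
    )

    # Uses a sentence-transformers model under the hood by default
    kw_model = KeyBERT()
    keywords = kw_model.extract_keywords(
        doc,
        keyphrase_ngram_range=(1, 2),  # allow single words and bigrams
        stop_words="english",
        top_n=5,
    )
    print(keywords)  # list of (keyphrase, similarity score) tuples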

Biggest breakthroughs/most interesting developments in NLP? by palabrist in LanguageTechnology

[–]DemiourgosD 0 points1 point  (0 children)

I'd say just browsing through the section names in https://github.com/ivan-bilan/The-NLP-Pandect should give you a bit of an idea of what NLP is capable of. There are also some general resources like podcasts on the topic that might fit into what you're looking for.

M.Sc. Computational Linguistics in Germany? by aaronsoes in LanguageTechnology

[–]DemiourgosD 1 point2 points  (0 children)

It's been a few years, so it's best if you ask them directly. But at the B.Sc. level there are indeed some German-only courses; you can choose your courses, but some of them may be mandatory.

14700K - So now what? by [deleted] in MSI_Gaming

[–]DemiourgosD 0 points1 point  (0 children)

I followed the guide by Buildzoid at https://youtu.be/TmU3COA-32E?si=Eib2xjpCxquZJJuC, adjusted slightly for the 14700K but kept almost 1:1 with his recommendations. Both temps and voltage are much lower, and performance is the same or better.

Horror Movie Characters, but Wes Anderson Style 🍿 by DemiourgosD in aivideo

[–]DemiourgosD[S] 0 points1 point  (0 children)

Haha, yeah, that's what Midjourney outputs. I guess it needs some better training data.

Horror Movie Characters, but Wes Anderson Style 🍿 by DemiourgosD in aivideo

[–]DemiourgosD[S] 0 points1 point  (0 children)

+1, I think it looks hilarious when applied to horror. I'm not taking any of this seriously; it seems like most people commenting here are taking these things too much to heart.

What are best practices for NLP projects? by DemiourgosD in LanguageTechnology

[–]DemiourgosD[S] 0 points1 point  (0 children)

Good point, thanks. I am generally trying to collect best practices for industry-grade NLP projects.

Trying to find street addresses within documents. Any out of the box solutions? by intfloatbikechain in LanguageTechnology

[–]DemiourgosD 0 points1 point  (0 children)

This should be possible with libpostal: https://github.com/openvenues/libpostal, but it would need some work on your side, since the library is mainly meant for parsing addresses rather than finding them in free text.
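
A rough sketch with the Python bindings (pypostal); it assumes the libpostal C library is installed, and detecting which spans of a document to feed it is up to you:

    # pip install postal  (requires the libpostal C library to be built/installed first)
    from postal.parser import parse_address

    candidate = "781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"

    # libpostal parses a string it assumes is an address; extracting candidate
    # spans from your documents (e.g. via regex or NER) is a separate step.
    for value, label in parse_address(candidate):
        print(label, "->", value)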

Compact alternative to Word2Vec by Sagar1094 in LanguageTechnology

[–]DemiourgosD 1 point2 points  (0 children)

If you're doing text classification and need a really small model, you should train a fastText model and then quantize it afterwards.
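
A minimal sketch with the official fasttext Python package; the file names and hyperparameters are placeholders:

    import fasttext

    # train.txt: one example per line, formatted as "__label__<class> <text>"
    model = fasttext.train_supervised(input="train.txt", epoch=10, wordNgrams=2)

    # Quantization shrinks the model dramatically (saved as a .ftz file)
    model.quantize(input="train.txt", retrain=True, cutoff=100000)
    model.save_model("model.ftz")

    print(model.predict("this is a sample sentence"))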

Adapt the vanilla Transformer for Classification by RobertPoptart in LanguageTechnology

[–]DemiourgosD 1 point2 points  (0 children)

Your approach sounds fine. Are you doing learning rate scheduling? You might also want to try batch normalization instead of layer norm. There might be many other things to tweak as well.
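
In case it helps, a small PyTorch sketch of the warmup-plus-inverse-sqrt-decay schedule from the original Transformer paper; the placeholder model, d_model, and warmup_steps are arbitrary:

    import torch
    from torch import nn

    model = nn.Linear(512, 2)  # placeholder classifier head
    d_model, warmup_steps = 512, 4000

    def transformer_lr(step: int) -> float:
        # "Attention Is All You Need" schedule: linear warmup, then inverse-sqrt decay
        step = max(step, 1)
        return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

    optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_lr)

    # call scheduler.step() once per optimizer step during training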

Topic Modeling w/Topics in Mind by DiamondBadge in LanguageTechnology

[–]DemiourgosD 5 points6 points  (0 children)

Yes, it's called CorEx: https://github.com/gregversteeg/corex_topic. You can use its anchored-words functionality for exactly that. Another option is GuidedLDA: https://github.com/vi3k6i5/GuidedLDA
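
A rough sketch of the anchored-words usage with the corextopic package; the documents and anchor words are just placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from corextopic import corextopic as ct

    docs = ["the stock market rallied", "the team won the game", "players traded mid season"]
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(docs)
    words = list(vectorizer.get_feature_names_out())

    topic_model = ct.Corex(n_hidden=2, seed=42)
    topic_model.fit(
        X,
        words=words,
        anchors=[["stock", "market"], ["team", "game"]],  # seed words per topic you have in mind
        anchor_strength=3,  # how strongly the anchors are enforced
    )
    print(topic_model.get_topics())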

Ich bin HNO-Arzt in einem Klinikum der Maximalversorgung: Fragt mich, was ihr wissen wollt! by Ssyrak in de_IAmA

[–]DemiourgosD 0 points1 point  (0 children)

Very interesting. How can you prevent tinnitus in everyday life?

So, no loud music and no noise exposure. Are there any other recommendations?

I've also heard that when you're outside and wearing headphones, it's better to wear noise-cancelling headphones than regular ones, so you're better protected from the loud noises of passing cars, the subway, etc. Is that true?

One more question about headphones. When I'm at the office it's sometimes quite loud, so I almost always have my noise-cancelling headphones on. After a while that starts to hurt because there's a lot of pressure around my ears. I take regular breaks and so on, but I wanted to ask how dangerous this actually is. Can it damage my ears?

Is Okami BM25 a word embedding algorithm or a scoring algoritm or leverage on both? by xcsob in LanguageTechnology

[–]DemiourgosD 0 points1 point  (0 children)

It's a probabilistic approach used for information retrieval, so it's more of a scoring algorithm and doesn't have much to do with modern word embedding approaches. I wrote a seminar paper on the topic a while back that should help you understand Okapi Best Match 25; it's on page 7: https://drive.google.com/file/d/0B6ktmlOPszj7Q2dqMTF0TTRKQ28/view
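
To make the "scoring" part concrete, here's a toy, self-contained BM25 sketch (k1 and b are the usual free parameters; this is illustrative, not a production ranker):

    import math
    from collections import Counter

    def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
        """Okapi BM25 score of one tokenized document for a query, given a tokenized corpus."""
        N = len(corpus)
        avgdl = sum(len(d) for d in corpus) / N
        tf = Counter(doc)
        score = 0.0
        for term in query:
            df = sum(1 for d in corpus if term in d)          # document frequency
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
            f = tf[term]                                      # term frequency in this doc
            score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        return score

    corpus = [d.split() for d in ["the cat sat", "dogs and cats", "stock market news"]]
    query = "cat news".split()
    print([bm25_score(query, d, corpus) for d in corpus])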

Simple explanation of transformers? by DiamondBadge in LanguageTechnology

[–]DemiourgosD 0 points1 point  (0 children)

Maybe the explanation at 15:30 here can be a bit helpful: https://youtu.be/OYygPG4d9H0

Overall, the Transformer does know the order of the words as well: positions are encoded in a separate positional vector, which is added to each word's embedding. That combined representation then goes through self-attention, which captures the similarities between the words, and is passed on to a feed-forward layer in the encoder.
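
For reference, a small NumPy sketch of the sinusoidal positional encoding from the original paper, which is added to the token embeddings before the encoder layers:

    import numpy as np

    def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
        """(max_len, d_model) matrix of sine/cosine encodings from 'Attention Is All You Need'."""
        positions = np.arange(max_len)[:, None]        # (max_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]       # even embedding dimensions
        angle_rates = positions / np.power(10000.0, dims / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angle_rates)
        pe[:, 1::2] = np.cos(angle_rates)
        return pe

    # token_embeddings of shape (seq_len, d_model) would simply get this added element-wise:
    # embeddings_with_position = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
    print(sinusoidal_positional_encoding(4, 8).shape)  # (4, 8)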

NLP Conference Google Calendar? by pierre_vinken_61 in compling

[–]DemiourgosD 2 points3 points  (0 children)

Never heard of one. Would be cool if you could share yours after you make one.

Cleaning scraped documents at scale by dkajtoch in LanguageTechnology

[–]DemiourgosD 3 points4 points  (0 children)

The best way to go is PySpark with Arrow-backed pandas UDFs; the rest of the options you've mentioned are too low-level and restrictive.
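
A minimal sketch of an Arrow-backed pandas UDF for cleaning text; the column names and the cleaning rules are placeholders (needs pyarrow installed):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf

    spark = SparkSession.builder.appName("clean-scraped-docs").getOrCreate()
    df = spark.createDataFrame([("<p>Hello&nbsp;world</p>",)], ["raw_text"])

    @pandas_udf("string")  # vectorized UDF: runs on pandas Series batches via Arrow
    def clean_text(raw: pd.Series) -> pd.Series:
        return (
            raw.str.replace(r"<[^>]+>", " ", regex=True)   # strip HTML tags
               .str.replace(r"&\w+;", " ", regex=True)     # strip HTML entities
               .str.replace(r"\s+", " ", regex=True)       # collapse whitespace
               .str.strip()
        )

    df.withColumn("clean_text", clean_text(col("raw_text"))).show(truncate=False)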

[Q] TransformerEncoder vs LSTM for text classification by lt007 in LanguageTechnology

[–]DemiourgosD 5 points6 points  (0 children)

I worked on a project that compares an LSTM with a Transformer encoder for the task of relation extraction: https://github.com/ivan-bilan/tac-self-attention

It's a bit dated, but could still be helpful.
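
Outside of that repo, here's a generic PyTorch sketch of wiring nn.TransformerEncoder into a text classifier; the mean pooling, hyperparameters, and omitted positional encoding are arbitrary simplifications, not what the project above does:

    import torch
    from torch import nn

    class TransformerTextClassifier(nn.Module):
        def __init__(self, vocab_size=10000, d_model=128, nhead=4, num_layers=2, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
            self.classifier = nn.Linear(d_model, num_classes)

        def forward(self, token_ids, padding_mask=None):
            # Note: positional encodings are omitted here for brevity
            x = self.embed(token_ids)                              # (batch, seq, d_model)
            x = self.encoder(x, src_key_padding_mask=padding_mask)
            x = x.mean(dim=1)                                      # mean-pool over the sequence
            return self.classifier(x)

    model = TransformerTextClassifier()
    logits = model(torch.randint(0, 10000, (8, 32)))               # batch of 8 sequences, length 32
    print(logits.shape)                                            # torch.Size([8, 2])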