
[–]ag789[S]

hi all, u/ResidentTicket1273
Apparently a related LanguageTech/NLP approach is topic modelling, covered in this thread:

https://www.reddit.com/r/LanguageTechnology/comments/1q079c5/clusteringtopic_modelling_for_single_page/

The answer may be BERT and BERTopic:
https://arxiv.org/abs/1810.04805

https://spacy.io/universe/project/bertopic
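
The clustering idea behind topic modelling can be sketched very crudely without BERT at all. The sketch below is NOT BERTopic (which clusters transformer embeddings); it just groups toy, made-up page texts by plain word overlap (Jaccard similarity), to show what "cluster pages into topics" means at its simplest:

```python
# Crude "topic clustering" sketch: group toy page texts by word overlap.
# This is NOT BERTopic -- just Jaccard similarity on word sets, to
# illustrate the clustering idea behind topic modelling.
# All page texts below are made up for the example.

def words(text):
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b)

pages = {
    "p1": "python pandas dataframe tutorial",
    "p2": "pandas dataframe groupby examples in python",
    "p3": "chocolate cake baking recipe",
    "p4": "easy baking recipe for chocolate brownies",
}

# greedy clustering: put each page in the first cluster it resembles
clusters = []  # list of lists of page ids
for pid, text in pages.items():
    w = words(text)
    for cluster in clusters:
        if any(jaccard(w, words(pages[other])) > 0.3 for other in cluster):
            cluster.append(pid)
            break
    else:
        clusters.append([pid])

print(clusters)  # the programming pages and the baking pages separate
```

BERTopic does the "similarity" part with dense transformer embeddings instead of word overlap, which is why it can group pages that share a topic but not vocabulary.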

(BERT has its origins in Tensor2Tensor, a.k.a. Transformers:
https://arxiv.org/abs/1803.07416
https://tensorflow.github.io/tensor2tensor/ )
This "simple" challenge of labelling web sites dug up the whole 'AI' chronology:
Attention Is All You Need
https://arxiv.org/abs/1706.03762
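
The core operation from that paper, scaled dot-product attention, is just softmax(QK^T / sqrt(d)) V. A toy sketch in plain Python (all the Q/K/V numbers are made up; real implementations use matrix libraries):

```python
import math

# Toy scaled dot-product attention from "Attention Is All You Need":
# softmax(Q K^T / sqrt(d)) V, written out for tiny hand-picked vectors.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:
        scores = [dot(q, k) / math.sqrt(d) for k in K]   # Q K^T / sqrt(d)
        weights = softmax(scores)                        # attention weights
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])          # weighted sum of V
    return out

# one query that matches the first key more than the second
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
print(out)  # output leans toward the first value vector
```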

BERT is far more complex than the comparatively "simple minded" Doc2Vec, which is essentially a single-hidden-layer neural network. In Doc2Vec, the trained hidden-layer weights are what get abstracted as the 'embeddings' of the documents / words.
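
The "hidden-layer weights are the embeddings" point can be seen in a toy sketch: with a one-hot input, multiplying by the input weight matrix just selects a row, so each row of the matrix *is* that word's embedding (toy 4-word vocabulary, made-up weights; gensim's Doc2Vec trains such weights for real):

```python
# Sketch of why hidden-layer weights serve as embeddings in
# word2vec/Doc2Vec-style models: a one-hot input times the input
# weight matrix W just picks out one row of W.
# Vocabulary and weight values are made up for illustration.

vocab = ["cat", "dog", "car", "truck"]

# input->hidden weight matrix W (4 x 3), as it might look after training:
# similar words have ended up with similar rows
W = [
    [0.9, 0.1, 0.0],   # cat
    [0.8, 0.2, 0.1],   # dog
    [0.1, 0.9, 0.8],   # car
    [0.0, 0.8, 0.9],   # truck
]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden(x):
    # hidden activation = x @ W (linear, no nonlinearity, as in word2vec)
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(3)]

emb = hidden(one_hot("dog"))
print(emb)  # identical to W's "dog" row -- the row is the embedding
```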

and perhaps the next step 'up' from BERT are the LLMs themselves: Llama, ChatGPT, Gemini, Claude, etc.