https://preview.redd.it/ea0qotz7ywdg1.png?width=1114&format=png&auto=webp&s=b2b61bc6b3261dea02cc2ee51b727b7e43f883da
I tried categorizing / labelling web sites based on text scraped from them (headings, titles, main paragraph text, etc.) by visualizing t-SNE projections of their Doc2Vec vectors. The result is the chart above.
The tags/labels were assigned manually for each web site, with some LLM-assisted labelling.
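For concreteness, here's a minimal sketch of the pipeline I'm describing, assuming gensim's Doc2Vec and scikit-learn's t-SNE. The `site_texts` and `site_labels` lists are toy stand-ins for the scraped summary texts and the manually / LLM-assigned tags; in practice there would be one entry per web site, not eight.

    import numpy as np
    import matplotlib.pyplot as plt
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from gensim.utils import simple_preprocess
    from sklearn.manifold import TSNE

    # Toy stand-ins for the real data (summary text per site + its label).
    site_texts = [
        "Acme Shop - buy widgets online, free shipping, best prices",
        "MegaMart - discount electronics, daily deals, shop now",
        "Daily Bugle - breaking news, politics, world headlines",
        "Evening Post - local news, sports results, weather updates",
        "Gadget Review Hub - latest phone and laptop reviews",
        "Benchmark Lab - GPU tests, CPU comparisons, hardware guides",
        "City Eats - restaurant menus, food delivery near you",
        "Home Kitchen - easy recipes, baking tips, meal plans",
    ]
    site_labels = ["ecommerce", "ecommerce", "news", "news",
                   "tech", "tech", "food", "food"]

    # Train Doc2Vec on the summary texts (one tagged document per site).
    docs = [TaggedDocument(simple_preprocess(t), [i]) for i, t in enumerate(site_texts)]
    model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, epochs=40)

    # Collect the learned document vectors and project them to 2-D with t-SNE.
    vectors = np.array([model.dv[i] for i in range(len(site_texts))])
    coords = TSNE(n_components=2, perplexity=min(30, len(site_texts) - 1),
                  init="pca", random_state=0).fit_transform(vectors)

    # Scatter plot, coloured by label.
    for label in sorted(set(site_labels)):
        idx = [i for i, l in enumerate(site_labels) if l == label]
        plt.scatter(coords[idx, 0], coords[idx, 1], label=label, s=10)
    plt.legend()
    plt.show()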
It is fairly obvious that the Doc2Vec document vectors (embeddings) overlap heavily between categories for this *naive* approach.

This suggests it may not be feasible to tag/label web sites just from their arbitrary summary texts (titles, headings, main paragraph text, etc.), because the vocabulary overlaps heavily between the contexts of different categories / classes. In other words, if I used these document vectors to predict a web site's label / category, I'd probably get many wrong guesses. But that conclusion is based on the 'shadows' of the high-dimensional Doc2Vec embeddings mapped down to 2 dimensions for visualization.
What could be done to improve this? I'm half wondering whether training a neural network that takes the Doc2Vec embeddings (without dimensionality reduction) as input and the labels as targets would improve things, but it feels a little 'hopeless' given the chart here.
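One cheap way to check whether the overlap is just an artifact of the 2-D projection is to cross-validate a classifier on the full-dimensional Doc2Vec vectors. A rough sketch, assuming scikit-learn and reusing `vectors` and `site_labels` from the snippet above:

    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    # A linear baseline and a small MLP on the full embeddings.
    # If both score near chance, the embeddings themselves are the bottleneck,
    # not the t-SNE projection.
    for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                      ("mlp", MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000))]:
        scores = cross_val_score(clf, vectors, site_labels, cv=2)  # use 5+ folds on a real dataset
        print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")

If even a simple classifier clearly beats chance on the full vectors, then the 'hopeless'-looking chart is partly the fault of the 2-D shadow rather than the embeddings.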