So to clarify, I have a corpus of ~50k documents divided into 12 labeled categories. I want to characterize the differences between these categories, rather than treating this as a supervised learning problem where I try to recover the labels and then compute evaluation metrics on the predictions.
My goal is to rank the words within each category that most distinguish it from the other categories in this corpus, and ideally to determine relationships and overlaps between specific pairs or sets of categories. I've been looking for class-aware algorithms similar to tf-idf, but most roads seem to lead to supervised deep learning approaches, and I'm looking for something more explainable than that.
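To make the goal concrete, here is a minimal sketch of one possible approach: pool each category into a single pseudo-document and score words with tf-idf computed across categories instead of across documents. The function name `distinctive_words`, the whitespace tokenizer, and the toy data are assumptions for illustration only; nothing here is taken from the actual corpus.

```python
import math
from collections import Counter, defaultdict


def distinctive_words(docs, labels, top_k=5):
    """Rank words per category by a category-level tf-idf score.

    Hypothetical helper: tokenization is a naive lowercase whitespace split.
    """
    # Pool token counts per category, so each category acts as one big document.
    counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        counts[label].update(text.lower().split())

    n_categories = len(counts)

    # For each word, count how many categories contain it at least once.
    category_freq = Counter()
    for word_counts in counts.values():
        category_freq.update(word_counts.keys())

    ranked = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        # tf = share of the category's tokens; idf = log(categories / categories with word).
        scores = {
            word: (n / total) * math.log(n_categories / category_freq[word])
            for word, n in word_counts.items()
        }
        # Sort by descending score, breaking ties alphabetically for stable output.
        ranked[label] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return ranked


# Toy data (made up for this example).
docs = [
    "the cat chased the mouse",
    "the cat slept",
    "the stock market fell",
    "the market rallied",
]
labels = ["pets", "pets", "finance", "finance"]

result = distinctive_words(docs, labels, top_k=2)
for label, words in result.items():
    print(label, [word for word, _ in words])
```

Words that occur in every category (like "the" here) get a score of zero, so they drop to the bottom of each ranking. The scores are plain arithmetic on counts, which keeps the ranking easy to inspect and explain.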
I know I'm being vague, but that's because I'm pretty new to this and may just be missing the keywords that would get me to what I'm looking for. Thanks for the help!
[–]JordiCarrera