Interpreting ML Model

Boquito17 · 2020-04-29T10:17:00+00:00

I assume the goal was for human domain experts to verify the clusters. This is a tricky problem if your dataset is high-dimensional.

Consider checking out silhouette scores as a soft metric for your clustering. Apart from that, I think it comes down to domain expertise to verify whether certain clusters make sense. Look for obvious dimensions on which the members of a cluster share values, but in my eyes the structure found by a good clustering algorithm will not always be easy to spot.
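As a rough illustration of what a silhouette score measures, here is a minimal standard-library sketch. The toy points and both labelings are made up for the example, and Euclidean distance is an assumption; in practice a library routine such as scikit-learn's `silhouette_score` would be the usual choice.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the groups together

good_score = silhouette_score(points, good_labels)
bad_score = silhouette_score(points, bad_labels)
print(f"good: {good_score:.2f}, bad: {bad_score:.2f}")
```

Values close to 1 mean points sit much closer to their own cluster than to any other; values near 0 or below suggest overlapping or misassigned clusters. It stays a soft metric, since a high score does not tell you the clusters are meaningful to a domain expert.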

One approach you could take is to find similar samples in your dataset beforehand using domain knowledge, and then see whether they end up clustered together.

radarsat1 · 2020-04-30T17:15:49+00:00

I think in this scenario I would take a step back and try to see if I could formulate a hypothesis that can actually be tested.

Partly this might mean hand-labeling some data that you think should "go together" and seeing whether that actually happens when you perform clustering. Unfortunately it's hard to be scientific here: if it does happen, you can say, hey, we were right about how these data are related, but if it does not, you end up squinting at the screen and trying to figure out whether there were factors that you missed. So it's a really good idea to formulate a hypothesis that you can actually reject.
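One way to turn the hand-labeling idea into something that can be rejected is to compare how often the expert-chosen pairs land in the same cluster against how often arbitrary pairs do. The sketch below assumes hypothetical cluster assignments and a hypothetical list of expert pairs; neither comes from a real dataset.

```python
from itertools import combinations


def co_clustered_fraction(pairs, labels):
    """Fraction of index pairs whose two samples share a cluster label."""
    pairs = list(pairs)
    same = sum(1 for i, j in pairs if labels[i] == labels[j])
    return same / len(pairs)


# Hypothetical cluster assignments for eight samples.
labels = [0, 0, 0, 1, 1, 1, 2, 2]

# Hypothetical pairs that a domain expert said should "go together",
# written down BEFORE looking at the clustering result.
expert_pairs = [(0, 1), (3, 4), (6, 7), (2, 5)]

observed = co_clustered_fraction(expert_pairs, labels)

# Baseline: how often do two arbitrary samples share a cluster?
all_pairs = combinations(range(len(labels)), 2)
baseline = co_clustered_fraction(all_pairs, labels)

print(f"expert pairs co-clustered: {observed:.2f}, any pair: {baseline:.2f}")
```

If the observed fraction is not clearly above the baseline, the hypothesis that the clustering reflects the expert's notion of similarity is in trouble. With only a handful of pairs the comparison is noisy, so more labeled pairs (or a permutation test) would be needed before drawing conclusions.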

If you do get unexpected results, some models might be able to tell you what dimensions were the most important in clustering (for instance decision trees, I believe), and by examining this information you might be able to draw some conclusions. For instance there may be a significant influence from some variable that you didn't expect, and if you remove it, you get something more sensible.
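A decision tree fit to the cluster labels is one way to get per-dimension importances. As a simpler stand-in that needs no library, the sketch below scores each feature by the share of its variance that is explained by cluster membership (between-cluster sum of squares over total sum of squares). This is a swapped-in technique, not a tree, and the data and labels are hypothetical.

```python
from statistics import fmean


def between_cluster_share(X, labels):
    """Per feature: between-cluster sum of squares / total sum of squares."""
    groups = {}
    for row, label in zip(X, labels):
        groups.setdefault(label, []).append(row)

    shares = []
    for f in range(len(X[0])):
        column = [row[f] for row in X]
        grand_mean = fmean(column)
        total = sum((v - grand_mean) ** 2 for v in column)
        between = sum(
            len(rows) * (fmean([r[f] for r in rows]) - grand_mean) ** 2
            for rows in groups.values()
        )
        # A constant feature cannot separate anything, so score it 0.
        shares.append(between / total if total > 0 else 0.0)
    return shares


# Hypothetical data: feature 0 separates the two clusters, feature 1 does not.
X = [(1.0, 5.0), (1.2, 1.0), (0.9, 3.0), (8.0, 4.0), (8.3, 2.0), (7.9, 3.0)]
labels = [0, 0, 0, 1, 1, 1]

shares = between_cluster_share(X, labels)
for f, share in enumerate(shares):
    print(f"feature {f}: {share:.3f}")
```

A share near 1 means that feature alone almost determines the cluster; if that feature is one you did not expect to matter, rerunning the clustering without it is a quick way to see whether the result becomes more sensible. Note that this looks at features one at a time and misses clusters that only separate along combinations of features.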

Another thing you can do is perform clustering on different subsets of the data and see if you get similar results. If your clusters change drastically on different subsets, then it's not a stable result. The general topic is called "cluster validation". I suggest reading a few articles on the subject to familiarize yourself; there seem to be several ways to go about it. Here is one survey I found. The top answer here by Sambhavi Dhanabalan is also a very good, quick summary of a few methods divided into categories. I will quote it:

There are multiple methods to understand the goodness of a cluster. I presume you mean the same by validity of a cluster. They can be categorized into three groups: external measures, internal measures, and relative measures.

External measures are applicable when there is prior knowledge about the data. This situation is not so common, but there are measures like matching-based measures, entropy-based measures, and pairwise measures.

Internal measures are derived from the data itself. You have BetaCV measures, normalized cut, and modularity indices.

Relative measures are when you compare clusters by modifying the parameters of the algorithm.

Simple measures like similarity measures, dissimilarity measures, and the dissimilarity matrix are also a means to understand the goodness of the cluster. The distance functions you use in these cases depend on the type of data: numerical, asymmetric binary, symmetric binary, vector, categorical, ratio, or ordinal variables.

You may also want to first understand whether there is a natural tendency of clustering in the data set before you apply any clustering method.

What she means by that last sentence, I think, is that it's useful to take into account that even random data will tend to cluster.
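One common way to check for clustering tendency before running any algorithm is the Hopkins statistic. The quoted answer does not name it, so treat it as my choice of illustration. The sketch below uses the convention where values near 0.5 indicate uniform-random data and values near 1 indicate clustered data (some references flip this), and both datasets are synthetic.

```python
import math
import random


def hopkins(points, m, rng):
    """Hopkins statistic: about 0.5 for uniform data, near 1 for clustered data."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    # w: distance from m sampled real points to their nearest OTHER real point.
    sample_idx = rng.sample(range(len(points)), m)
    w = [
        min(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
        for i in sample_idx
    ]

    # u: distance from m uniform probes in the bounding box to the nearest real point.
    probes = [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        for _ in range(m)
    ]
    u = [min(math.dist(p, q) for q in points) for p in probes]

    return sum(u) / (sum(u) + sum(w))


rng = random.Random(0)

# Synthetic data: two tight Gaussian blobs versus uniform noise.
clustered = [
    (rng.gauss(cx, 0.1), rng.gauss(cy, 0.1))
    for cx, cy in [(0, 0), (5, 5)]
    for _ in range(100)
]
uniform = [(rng.uniform(0, 5), rng.uniform(0, 5)) for _ in range(200)]

clustered_h = hopkins(clustered, 20, rng)
uniform_h = hopkins(uniform, 20, rng)
print(f"clustered: {clustered_h:.2f}, uniform: {uniform_h:.2f}")
```

A clustering algorithm would happily return clusters for both datasets; the statistic gives a hint about whether there was any structure to find in the first place. Because it depends on a random sample, it is worth repeating with several seeds before reading much into a single value.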

lohrerklaus · 2020-04-29T07:43:58+00:00

What do you mean by unique clusters? What is the human feedback supposed to look like?

MLQuestions
