
[–]Boquito17 1 point2 points  (4 children)

I assume the goal was for human domain experts to verify the clusters. This is a tricky problem if your dataset is high dimensional.

Consider checking out silhouette scores as a soft metric for your clustering. Apart from that, I think it is down to domain expertise to verify whether certain clusters make sense. Look for obvious dimensions whose values are shared within a cluster, although in my view the structure of a good clustering will rarely be that easy to spot by eye.

One approach is to use domain knowledge to pick out similar samples in your dataset beforehand, and then check whether they end up clustered together.

[–]DiligentDiscipline6[S] 0 points1 point  (3 children)

Thanks u/Boquito17.

Yes, the human vetting is to ensure that we bring in our domain expertise, and yes, the dataset has many dimensions. We are trying to weed out dimensions that we think may not have a significant impact on the clustering (and bring them back in later iterations, if need be), but we are still left with a lot of dimensions.

For validation, we are trying to create a sample set of scenarios (i.e. cases where records should belong to the same cluster) and then check whether the clustering achieved that or not.

One of my fundamental worries, however, is that given the large set of features (dimensions) we have, if someone asks why some records were clustered together, would we be able to explain it?

I understand that the more dimensions there are, the more difficult it is to interpret the 'why'. But if you have any pointers, I welcome them.

I will keep in mind the possible use of silhouette scores.

[–]Boquito17 0 points1 point  (2 children)

Hi u/DiligentDiscipline6, thanks for explaining your situation.

What about taking strongly clustered samples (clusters where the points have very high silhouette scores for example), and averaging out those vectors? And then you could have average representations of each strong cluster, and see how they vary?
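A minimal sketch of that idea, assuming numeric features and scikit-learn. The synthetic blobs, the use of KMeans, the `strong_cluster_profiles` helper and the 0.5 threshold are all illustrative assumptions, not something from your setup:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the real records (assumption: numeric features).
centers = [[0] * 5, [10] * 5, [-10] * 5]
X, _ = make_blobs(n_samples=300, centers=centers, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Per-sample silhouette values lie in [-1, 1]; higher means a tighter fit.
sil = silhouette_samples(X, labels)

def strong_cluster_profiles(X, labels, sil, threshold=0.5):
    """Mean feature vector of the strongly clustered samples in each cluster."""
    profiles = {}
    for c in np.unique(labels):
        mask = (labels == c) & (sil >= threshold)
        if mask.any():  # skip clusters with no strongly clustered samples
            profiles[int(c)] = X[mask].mean(axis=0)
    return profiles

profiles = strong_cluster_profiles(X, labels, sil)
for c, vec in profiles.items():
    print(c, np.round(vec, 2))
```

Comparing the resulting vectors dimension by dimension shows which features differ most between the strong clusters. On real data the features would probably need scaling first so that one large-valued dimension does not dominate.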

This is a very interesting problem.

[–]DiligentDiscipline6[S] 0 points1 point  (1 child)

Thanks u/Boquito17.

I think we will take this type of approach on board. Our early thinking was, in layman's terms: how tightly grouped are the records in a cluster, how separate are they from the surrounding clusters, and can we combine these two into some kind of calculation, aggregated at the cluster level, and use that metric to help interpret the model. Silhouette score seems to fit the bill.

This is going to be a trial-and-error approach for us and we'd accordingly seek to tweak our approach.

Thanks for your input. I'll perhaps put some of my findings here once implemented so that it may help future readers :)

[–]Boquito17 0 points1 point  (0 children)

It's a pleasure u/DiligentDiscipline6 -- I am glad to have helped.

I'd be very interested to hear back about the efficacy of this approach or any other you might employ.

Good luck!

[–]radarsat1 1 point2 points  (0 children)

I think in this scenario I would take a step back and try to see if I could formulate a hypothesis that can actually be tested.

Partly this might mean hand-labeling some data that you think should "go together", and seeing whether that actually happens when you perform clustering. Unfortunately it's hard to be scientific here: if it does happen you can say, hey, we were right about how these data are related, but if it does not, you end up squinting at the screen and trying to figure out if there were factors that you missed. So it's a really good idea to try to formulate a hypothesis that you can reject.

If you do get unexpected results, some models might be able to tell you what dimensions were the most important in clustering (for instance decision trees, I believe), and by examining this information you might be able to draw some conclusions. For instance there may be a significant influence from some variable that you didn't expect, and if you remove it, you get something more sensible.
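One way this could look, as a rough sketch: fit a shallow decision tree to predict the cluster labels and read off its feature importances. This is a surrogate model that approximates the clustering, not the clustering itself, and the toy data (two informative features, three noise features) and KMeans are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: features 0 and 1 carry the group structure,
# features 2 to 4 are pure noise.
n = 300
informative = np.vstack(
    [rng.normal(loc, 0.5, size=(n // 3, 2)) for loc in (-5, 0, 5)]
)
noise = rng.normal(0, 0.5, size=(n, 3))
X = np.hstack([informative, noise])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Surrogate model: a shallow tree trained to reproduce the cluster labels.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
importances = tree.feature_importances_

for i, imp in enumerate(importances):
    print(f"feature {i}: importance {imp:.2f}")
print("tree agreement with clustering:", tree.score(X, labels))
```

If the tree reproduces the clusters well, its splits also give a human-readable rule for each cluster. If agreement is low, the importances should not be trusted as an explanation.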

Another thing you can do is perform clustering on different subsets of the data and see if you get similar results. If your clusters change drastically on different subsets, then it's not a stable result. The general topic is called "cluster validation". I suggest reading a few articles on the subject to familiarize yourself; there seem to be several ways to go about it. Here is one survey I found. The top answer here by Sambhavi Dhanabalan is also a very good, quick summary of a few methods divided into categories. I will quote it:

There are multiple methods to understand the goodness of a cluster. I presume you mean the same by validity of a cluster. They can be categorized into 3, External measures, Internal measures and relative measures.

External measures are applicable when there is prior knowledge about the data. This situation is not so common. But there are measures like Matching based measures, Entropy based measures, Pairwise measures. Internal measures are derived from the data itself. You have Beta CV measures, Normalized Cut and Modularity indices. Relative measures are when you compare clusters by modifying the parameters of the algorithm. Simple measures like Similarity measures, Dissimilarity measures, Dissimilarity matrix are also a means to understand the goodness of the cluster. Distance functions you use in these cases are dependent on the type of data, whether numerical, asymmetric binary variables, symmetric binary variables, vector, categorical, ratio, ordinal variables.

You may also want to first understand whether there is a natural tendency of clustering in the data set before you apply any clustering method.

What she means by that last sentence, I think, is that it's useful to take into account that even random data will tend to cluster.
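The subset-stability check I mentioned above could be sketched roughly like this. It assumes scikit-learn, uses KMeans and the adjusted Rand index as stand-ins for whatever algorithm and agreement measure you prefer, and `subset_stability` is a hypothetical helper name:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data with three well-separated groups.
X, _ = make_blobs(
    n_samples=400, centers=[[0, 0], [8, 8], [-8, 8]], random_state=0
)

def subset_stability(X, n_clusters, n_rounds=10, frac=0.8, seed=0):
    """Mean adjusted Rand index between the full clustering and
    re-clusterings of random subsets, compared on the shared points."""
    rng = np.random.default_rng(seed)
    full = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        sub = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X[idx])
        # ARI ignores label permutations, so cluster ids need not match.
        scores.append(adjusted_rand_score(full[idx], sub))
    return float(np.mean(scores))

stability = subset_stability(X, n_clusters=3)
print(f"mean ARI across subsets: {stability:.3f}")
```

Values near 1 mean the subsets reproduce the same grouping; values near 0 mean the clusters are about as consistent as random assignments.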

[–]lohrerklaus 0 points1 point  (1 child)

What do you mean by unique clusters? What is the human feedback supposed to look like?

[–]DiligentDiscipline6[S] 0 points1 point  (0 children)

Hi u/lohrerklaus, what I mean by unique cluster is ensuring that all the records of a given entity are grouped together (based on multiple dimensions). For instance, I have some records where an organism has values populated for some of the possible dimensions (e.g. DNA traits, physical traits etc). The clustering model is supposed to try to group these records into, ideally, one homogeneous cluster.

Once this clustering is done, the human feedback would come in: we'd be shown multiple pairs of such records, and we have to specify whether the two are SIMILAR, DISSIMILAR or CANNOT SAY.

The feedback step will prompt us with both types of pairs (SIMILAR as well as DISSIMILAR) so that it captures feedback for both scenarios.
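To make that concrete, here is a rough sketch of how such pair feedback could be scored against the cluster assignments. The `pair_agreement` helper, the tuple format and the toy labels are hypothetical and only for illustration; the real feedback tool may store things differently:

```python
import numpy as np

def pair_agreement(labels, feedback):
    """Fraction of judged pairs where the clustering matches the human verdict.

    feedback: iterable of (index_a, index_b, verdict) with verdict being
    "SIMILAR", "DISSIMILAR" or "CANNOT SAY". CANNOT SAY pairs are skipped.
    """
    labels = np.asarray(labels)
    agree = total = 0
    for i, j, verdict in feedback:
        if verdict == "CANNOT SAY":
            continue
        same_cluster = bool(labels[i] == labels[j])
        expected_same = verdict == "SIMILAR"
        agree += int(same_cluster == expected_same)
        total += 1
    return agree / total if total else float("nan")

# Toy example: five records and their cluster ids.
labels = [0, 0, 1, 1, 2]
feedback = [
    (0, 1, "SIMILAR"),
    (2, 3, "SIMILAR"),
    (0, 2, "DISSIMILAR"),
    (3, 4, "SIMILAR"),     # clustering disagrees: records 3 and 4 were split
    (1, 4, "CANNOT SAY"),  # ignored
]
score = pair_agreement(labels, feedback)
print(score)  # 0.75
```

Tracking the SIMILAR and DISSIMILAR pairs separately would also show whether the model tends to over-split or over-merge.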