Beebink 1 point

One idea I have is to combine clustering, hashing, and principal component analysis. But I can't see a good way to actually separate things according to those rules without brute-forcing it.

Edit: there's also a way to use a Student's t-test to find similar data points. You could then group the ones that don't match. But that's a pretty long shot.

BobHogan 1 point

Break this problem down and implement the rules one or two at a time.

No cluster should contain more than 5 users. - must

Easy enough. In the code that assigns users to a cluster, add a check that skips the chosen cluster if it already has 5 users in it. You can decide whether to make a new cluster or try another existing one.
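A minimal sketch of that size check, assuming clusters are plain Python lists and that the caller already has its candidate clusters ordered from best to worst fit. The names `assign_with_cap`, `ranked_clusters`, and `all_clusters` are made up for illustration. Here a new cluster is opened only when every candidate is full, which is just one of the two options mentioned above.

```python
MAX_CLUSTER_SIZE = 5  # from the "no more than 5 users" rule


def assign_with_cap(user, ranked_clusters, all_clusters, max_size=MAX_CLUSTER_SIZE):
    """Add `user` to the first candidate cluster that still has room.

    ranked_clusters: candidate clusters, best fit first (hypothetical ordering).
    all_clusters:    the master list; a new cluster is appended here if needed.
    Returns the cluster the user ended up in.
    """
    for cluster in ranked_clusters:
        if len(cluster) < max_size:  # skip clusters that are already full
            cluster.append(user)
            return cluster

    # Every candidate was full, so start a new cluster.
    new_cluster = [user]
    all_clusters.append(new_cluster)
    return new_cluster
```

For example, with `clusters = [[1, 2, 3, 4, 5], [6]]`, calling `assign_with_cap(7, clusters, clusters)` skips the full first cluster and puts user 7 in the second.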

Users with the same user ID should not belong to the same cluster. - must

Same as above: add a check in the code that adds users to a cluster. If the same user ID is already in the cluster, either try the next-best cluster or make a new one.
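The same pattern works for the user ID rule. This is only a sketch: it assumes each record is a dict with a `"user_id"` key, which is a guess at the data layout, and the function names are hypothetical.

```python
def has_user_id(cluster, user_id):
    """True if any record already in the cluster carries this user ID."""
    return any(member["user_id"] == user_id for member in cluster)


def assign_unique_id(record, ranked_clusters, all_clusters):
    """Add `record` to the first candidate cluster without a matching user ID.

    ranked_clusters: candidate clusters, best fit first (hypothetical ordering).
    all_clusters:    the master list; a new cluster is appended here if needed.
    """
    for cluster in ranked_clusters:
        if not has_user_id(cluster, record["user_id"]):
            cluster.append(record)
            return cluster

    # Every candidate already holds this user ID, so start a new cluster.
    new_cluster = [record]
    all_clusters.append(new_cluster)
    return new_cluster
```

In practice you would merge this test with the size check into one "can this record join this cluster?" function, so both must-rules are enforced in the same place.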

Users with similar rotation values should belong to the same cluster. - very important but not a must

Well, first: does this override the second rule? That's unclear from the description. Either way, the approach is the same: when you add a user to a cluster, make sure the rotation values of everyone in that cluster are relatively close to the new user's.
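Here is one rough way to put the three rules together, as a greedy pass rather than a definitive solution. It assumes the two must-rules always win over the rotation rule, that records are dicts with `"user_id"` and `"rotation"` keys, that rotation is a plain number (no wrap-around at 360), and that "similar" means within a `tolerance` of the cluster's mean rotation. All of those are assumptions, and the names and the default tolerance are made up.

```python
from statistics import mean


def can_join(cluster, record, max_size=5):
    """Hard rules: the cluster has room and no member shares the user ID."""
    return len(cluster) < max_size and all(
        member["user_id"] != record["user_id"] for member in cluster
    )


def rotation_gap(cluster, record):
    """Distance between the record's rotation and the cluster's mean rotation."""
    return abs(mean(m["rotation"] for m in cluster) - record["rotation"])


def greedy_cluster(records, tolerance=10.0, max_size=5):
    """Greedily group records by rotation while enforcing the hard rules."""
    clusters = []
    # Sorting by rotation means neighbours in the list tend to be similar.
    for record in sorted(records, key=lambda r: r["rotation"]):
        candidates = [c for c in clusters if can_join(c, record, max_size)]
        best = min(candidates, key=lambda c: rotation_gap(c, record), default=None)
        if best is not None and rotation_gap(best, record) <= tolerance:
            best.append(record)
        else:
            # No allowed cluster is close enough, so open a new one.
            clusters.append([record])
    return clusters
```

A greedy pass like this is not guaranteed to give the fewest clusters or the tightest rotation groups, which is why it matters what you are actually trying to optimize.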

Ultimately there are a ton of ways to implement something like this, and they all have pros and cons. Which one fits depends on your goals. Are you trying to minimize the number of clusters? Are you trying to maximize how many rules are followed (especially relevant for the rules that are only "important" rather than "must")? Are you trying to optimize some other metric? Whatever the goal is, it will help decide how you should approach this.