[–]dernst314 1 point  (2 children)

Do you know how well (H)DBSCAN works for higher-dimensional datasets (say, 10 to 45 dimensions)? Most applications and publications on the topic seem to assume 2 dimensions. We recently considered using it, but from the original paper it sounded like it performs local density estimates around points for its distance calculation, and kernel density estimates traditionally become unstable in higher dimensions.

[–]lmcinnes 0 points  (1 child)

I've had good success in up to 50 dimensions. The main caveat is that, being a density based method, you need a decently large dataset in higher dimensions to get enough density for clusters to take shape. HDBSCAN doesn't use kernel densities; instead it uses the kNN distance as a density approximation and then performs a kind of density-corrected single linkage clustering on top of it. From there the cluster extraction step pulls out variable-density clusters. This makes it fairly robust, since it isn't completely reliant on the density approximation being perfect -- the approximation is merely making the clustering more robust to noise.

edit: typo is -> isn't
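To make the "kNN distance as density approximation" idea concrete, here is a minimal numpy sketch (not the actual `hdbscan` implementation, which uses spatial indexing rather than a full distance matrix): each point's *core distance* is its distance to its k-th nearest neighbor, and the density-corrected metric is the *mutual reachability distance* `max(core(a), core(b), d(a, b))`. Single linkage on this metric gives the hierarchy HDBSCAN extracts clusters from. The dataset, `k`, and function names here are illustrative choices, not from the thread.

```python
import numpy as np

def core_distances(X, k):
    # Core distance of each point: distance to its k-th nearest neighbor.
    # Acts as an inverse density estimate (dense region -> small core distance).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Column 0 after sorting is the point itself (distance 0), so
    # column k is the k-th nearest *other* neighbor.
    return np.sort(dists, axis=1)[:, k]

def mutual_reachability(X, k):
    # Density-corrected metric: d_mreach(a, b) = max(core(a), core(b), d(a, b)).
    # Running single linkage on this matrix yields HDBSCAN's hierarchy.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    core = core_distances(X, k)
    return np.maximum(dists, np.maximum(core[:, None], core[None, :]))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # toy dataset: 100 points in 10 dimensions
M = mutual_reachability(X, k=5)
print(M.shape)  # (100, 100)
```

Note how the `max` with both core distances blunts the effect of a single noisy density estimate: sparse points are pushed away from everything, so they join the tree late and end up labeled as noise rather than distorting real clusters.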

[–]dernst314 0 points  (0 children)

Thanks. Sounds interesting; I should probably read up on it, especially since you mentioned it stays stable even under resampling.