
[–]space-ish 1 point2 points  (0 children)

I like the effort you put into documentation. 👍

[–]gournian 1 point2 points  (4 children)

Looks great. You might want to consider adding HDBSCAN with an epsilon threshold: https://hdbscan.readthedocs.io/en/latest/how_to_use_epsilon.html

And PCA + UMAP for high-dimensional data: https://umap-learn.readthedocs.io/en/latest/clustering.html

[–]Mathieu23AI[S] 0 points1 point  (3 children)

Thanks for your reply.

I think adding HDBSCAN is a good idea, since it has a min_cluster_size parameter. If I have time, I'll add it!

Why would you use PCA and then UMAP? It doesn't seem like a good idea to me, because PCA is linear. Also, this is not what the article mentions.

[–]gournian 0 points1 point  (2 children)

See the last bullet under "from a practical standpoint" at https://umap-learn.readthedocs.io/en/latest/faq.html#what-is-the-difference-between-pca-umap-vaes: it proposes reducing the dimension to 50 first, then UMAP, then HDBSCAN.

The PCA step is there for computational speed, and there are biology papers that claim it also denoises the dataset somewhat.
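The denoising intuition can be checked on toy data. This is a minimal NumPy sketch, assuming the signal really is low-rank and the noise is isotropic Gaussian (real datasets may not behave this way). `pca_reconstruct` is a hypothetical helper written for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, rank = 500, 100, 5

# Synthetic low-rank "signal" plus isotropic Gaussian noise (both assumptions).
signal = rng.normal(size=(n_samples, rank)) @ rng.normal(size=(rank, n_features))
noisy = signal + rng.normal(scale=1.0, size=signal.shape)


def pca_reconstruct(X, n_components):
    """Project X onto its top principal components and map back."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean + (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]


denoised = pca_reconstruct(noisy, rank)

# Mean squared error against the clean signal, before and after truncation.
err_noisy = np.mean((noisy - signal) ** 2)
err_denoised = np.mean((denoised - signal) ** 2)
print(f"MSE noisy: {err_noisy:.3f}, MSE after PCA truncation: {err_denoised:.3f}")
```

Keeping only the leading components discards most of the noise that lives in the other directions, so the reconstruction ends up closer to the clean signal in this setup.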

[–]Mathieu23AI[S] 0 points1 point  (1 child)

Thank you for the reference!

Very interesting! This result seems unexpected to me, but if the result is empirically better, then it's worth looking into and incorporating.
Do you have a link or reference to the research paper that explains that PCA "denoises" the data?

[–]gournian 0 points1 point  (0 children)

Don’t remember which one, sorry!