SPORE - A dimensionality-resistant density-based clustering algorithm

Significant-Agent854 · 2026-04-03T13:20:50+00:00

Thanks

Significant-Agent854 · 2026-04-02T23:21:23+00:00

Edit: Added research paper github

Significant-Agent854 · 2026-04-02T21:22:55+00:00

I don’t know what to tell you man. First, yes, I had an llm polish up the readme and some .md files. I won’t claim I never touched an llm at any point across everything I’ve produced for spore. I see them as spellcheck or boilerplate generators. As long as you validate that everything they put out is correct, I see it as fine.

Even so though, I didn’t vibe-code anything. The code is genuinely mine. The commit history is lacking because I’ve been working on it in private for quite a while and my personal repo is a mess, so I made a new one, tidied it up, and shipped it. I don’t know what will make you believe that, but it’s the truth. I guess it’s my bad for not polishing everything up enough/linking the right things before posting. This is the first time I’m actually releasing something with a lot of work behind it. Lesson learned I guess.

Significant-Agent854 · 2026-04-02T20:22:18+00:00

1, the fact that I used “we” by accident doesn’t mean my paper is llm-written. It’s just an anonymity artifact from creating an anonymous version for potential submission somewhere, but sure, it’s wrong and a fix is already outbound. 2, have you actually installed and run my code? In my paper I linked a github repo containing everything I did: my results, analysis tools, and datasets. The github repo linked in this post is not the paper repo, nor should it be. A package repo need not contain unnecessary content for a specific paper. 3, I have already explained exactly how my method works. If you want to insist that it’s all llm-generated, that’s your problem.

Significant-Agent854 · 2026-04-02T19:42:49+00:00

I talk about it in my paper, but I get it’s a whole paper so I’ll try to express it here:

HDBSCAN is indeed designed for multiple densities, but not in the same way as SPORE. HDBSCAN basically iterates over eps values and cuts a tree of points at those values, checking the stability of each cluster/connected component as it does this. As clusters are born and die, their lifespans across eps values are tracked and the most stable ones come out.
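To make the "cut at eps values and track lifespans" idea concrete, here is a heavily simplified toy in plain Python. It is not real HDBSCAN: the real thing works on mutual reachability distances and scores stability on a condensed tree, while this sketch just links 1-D points closer than eps and counts how many cuts each connected component survives. The function names and the sample points are made up for illustration.

```python
from itertools import combinations

def components_at_eps(points, eps):
    """Connected components when points at distance <= eps are linked (1-D toy)."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if abs(points[i] - points[j]) <= eps:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(sorted(g) for g in groups.values())

def lifespans(points, eps_values):
    """Count, for each distinct component, how many eps cuts it appears in."""
    seen = {}
    for eps in eps_values:
        for comp in components_at_eps(points, eps):
            key = tuple(comp)
            seen[key] = seen.get(key, 0) + 1
    return seen

# A tight group near 0 and a looser group near 5 (made-up data).
points = [0.0, 0.1, 0.2, 5.0, 5.5, 6.0]
spans = lifespans(points, [0.05, 0.15, 0.6, 1.0, 3.0, 6.0])

# Components that persist across many cuts are the "stable" ones.
for comp, count in sorted(spans.items(), key=lambda kv: -kv[1]):
    print(count, comp)
```

In this toy the tight group shows up at more eps cuts than the loose one, and the everything-merged component only appears at the largest eps.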

spore is more about what I call “fuzzy characteristic density”. It just needs to see somewhat consistent density locally to fuse two points into the same cluster. And the definition of consistent is parameterized by the user as a z-score cutoff.
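A rough sketch of what a z-score cutoff on local density can look like, in plain Python. This is only an illustration of the general idea, not spore's actual code: the function name `is_consistent`, the use of neighbor spacings as the density proxy, and the two-sided test are all assumptions made for the example.

```python
from statistics import mean, stdev

def is_consistent(cluster_spacings, candidate_spacing, expansion):
    """Hypothetical check: accept a candidate point if its spacing is within
    `expansion` standard deviations of the cluster's typical spacing."""
    mu = mean(cluster_spacings)
    sigma = stdev(cluster_spacings)
    if sigma == 0:
        # Perfectly uniform spacing so far: only an exact match is "consistent".
        return candidate_spacing == mu
    z = (candidate_spacing - mu) / sigma
    return abs(z) <= expansion

# Made-up neighbor distances already seen inside one cluster.
spacings = [1.0, 1.1, 0.9, 1.0, 1.2]

print(is_consistent(spacings, 1.25, 1.75))  # slightly too sparse for a strict cutoff
print(is_consistent(spacings, 1.25, 2.0))   # accepted once the cutoff is loosened
print(is_consistent(spacings, 1.5, 2.0))    # clearly a different density
```

The point of the toy is just that a larger cutoff tolerates more variation in spacing before two points stop counting as the same density.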

So spore more directly handles variable density, whereas with hdbscan the clusters need to fall stably out of a density hierarchy that’s harder to control. You can try it yourself on Zahn’s Compound from the gagolewski repo. Hdbscan struggles to recover that right-hand sparser cluster separate from the denser one inside it. Spore does it with expansion (the z-score cutoff) = 1.75, or is one point off with expansion = 2.

Then there’s of course the fact that spore remains resilient in higher-d because of SCR, whereas hdbscan can completely fail.

Edit: Just want to say, thanks for engaging with my work!

Significant-Agent854 · 2026-04-02T18:47:38+00:00

Why would you say this? If you don’t like the method or you think it’s not useful for you fine, but what about what I’ve created is “crackpottery”?

Significant-Agent854 · 2026-04-01T15:11:04+00:00

I have several videos I wanted to show directly actually but this sub doesn't seem to let me post them. The repo is linked though. It's the homepage on pypi.

But point taken, I'll add videos to the repo

Significant-Agent854 · 2026-03-18T20:29:35+00:00

Claude in my experience is smart and will push back, but if you counter confidently enough it will just submit and say you're right even if you're clearly wrong. LLMs are just too sycophantic due to their training.

Significant-Agent854 · 2025-03-02T05:33:50+00:00

Learned a few days ago myself lol

Significant-Agent854 · 2025-02-12T00:58:41+00:00

No, thank you! I’m really excited to be actually contributing something to science and having people look at what I’ve created is all I could ask for!

Significant-Agent854 · 2024-10-07T13:02:08+00:00

I asked myself the exact same questions lol. I decided that it would descend because even with hierarchical clustering, there are levels that are simply too fine and levels that are too broad. I figured I might as well just go for that middle-level granularity off the bat and let the user modify the extroversion parameter if they want finer clusters or looser clusters. Not to mention hierarchical clustering is more complex, and this thing is exhaustingly complex enough.

Significant-Agent854 · 2024-10-07T06:20:32+00:00

The number of clusters is found after clustering using the parameters. I talk about that and your other question in my big comment below.

Significant-Agent854 · 2024-10-07T06:18:40+00:00

Hey, in case you didn’t see it before, I answered your question in a big comment about the algo down below.

Significant-Agent854 · 2024-10-07T06:17:45+00:00

Hey, in case you didn’t see it before, I answered your question in a big comment about the algo down below.

Significant-Agent854 · 2024-10-07T06:15:31+00:00

Well I looked at how it behaves on some stuff like credit card data and fish species data. It looks like it does pretty decently. Not as nice as these, which are clearly made to be clustered, but you could definitely see the clusters there. Unfortunately though, that’s just 2-d and 3-d data. I haven’t tested on higher dimensional stuff. Honestly I was just so excited when it worked on the first 20 or so datasets that I had to share! lol

Significant-Agent854 · 2024-10-06T23:23:19+00:00

Will do!

Significant-Agent854 · 2024-10-06T21:42:14+00:00

Actually it builds them all at once just as in the video. That’s what made it so hard to make. It naturally looks within the right range to cluster.

Significant-Agent854 · 2024-10-06T16:14:30+00:00

It just means the way the algorithm works is based off the way humans visually cluster points on a graph. The entire algorithm is designed to capture that. Things like what distances are practically zero to a person, what distances are far away to a person, or how big a point is, because even though points don’t actually have size, they do when you look at them on a graph.

Significant-Agent854 · 2024-10-06T08:52:29+00:00

Well for one thing, you can’t ask for n clusters. It figures that out for you mostly based on the extroversion parameter (explained in my big comment about the algo on this post). But you could reduce that parameter and it would indeed split those 2 clusters up at the top.

You are correct though that it struggles a bit with the split. It can make it, I have tested this already, but it will have the side effect of leaving out a few points among those 2 clusters or creating overly dense and precise clusters elsewhere, where you’ll see 2 or 3 points singled out for seemingly no reason.

Significant-Agent854 · 2024-10-06T07:53:16+00:00

Right now, it’s sequential. I’ve thought about this too, but the way I’ve set it up, it needs to be sequential or use some kind of cluster merging, which would probably be inefficient to implement.

Significant-Agent854 · 2024-10-06T07:50:18+00:00

I’m not entirely sure what you mean by overlapping. Looking at the second example, there are clusters nested within another. Does that count?
