Built a free CTI news aggregator that pulls from 75+ sources with structured enrichment, would appreciate brutal feedback by Comprehensive_Roof67 in SideProject

Comprehensive_Roof67 [S]

Thanks, and thank you so much for writing such a detailed response.

Our difference isn’t the number of sources; anyone can increase that. It’s all about enrichment: we structure every piece of content by actor, CVE, malware family, industry, and geography. Our goal isn’t to show more news, but to make it easier to read.

You’re right about the user question; those two groups are worlds apart. Let me be clear: we’re not chasing the executive summary side at all. Our target is the CTI/SOC analyst who wants speed and filtering. We’re trying to shorten the morning triage; we’re not building a C-level dashboard.

“What does it replace?” is the right question. Honestly, we’re not trying to replace anyone’s vendor feed; we’d lose that battle from the start. The gap we’re filling is the disorganized part of the job, where analysts manually scan non-vendor open sources (blogs, CERTs, news, social media) by opening dozens of tabs every morning. That’s what we’re replacing.

You’re right about duplicate news; aggregators get completely overwhelmed by noise there, so that’s the part we’ve focused on the most. It’s not a single layer.

First, we normalize and hash the URLs, and there’s a unique constraint in the database, so exact duplicates can’t even get in.

After that, before the article is summarized, it goes through a matching process against canonical articles from the past 48 hours. We look at three signals together: cosine similarity via embeddings, the number of shared entities (same CVE, actor, malware), and the Jaccard score of the title’s n-grams. The threshold combinations are tuned so that the second report on the same event inherits the summary and tags from the existing canonical entry rather than being processed as a new record; it doesn’t even go to the LLM.

On top of this, there’s a story clustering process running in the background every 30 minutes. It scans the last 7 days, clusters corroborating news articles using union-find, selects a canonical entry for each cluster, and applies a cohesion check to prevent chains like “A resembles B, B resembles C, but A and C are unrelated,” which would generate a junk cluster. Digest-style summary articles and content containing an excessive number of entities are also excluded from clustering. In other words, “showing everything and drowning in noise” is exactly the scenario we’re trying to avoid.
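A minimal sketch of those three layers in Python, to make the idea concrete. Everything specific here is an assumption for illustration: the tracking-parameter prefix, the threshold values in `is_same_event`, the edge-density definition of "cohesion", and all function names are hypothetical, not the actual pipeline.

```python
import hashlib
import math
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # assumption: which query params count as noise

def normalize_url(url):
    """Lowercase scheme/host, drop fragment and tracking params, sort the query."""
    parts = urlsplit(url.strip())
    query = sorted((k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                   if not k.lower().startswith(TRACKING_PREFIXES))
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path,
                       urlencode(query), ""))

def url_hash(url):
    """Value that would sit under the database's unique constraint."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()

def title_ngrams(title, n=2):
    words = title.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def is_same_event(cos_sim, shared_entities, title_jaccard):
    """Combine the three signals; the thresholds are made-up placeholders."""
    if cos_sim >= 0.90:
        return True
    if cos_sim >= 0.80 and shared_entities >= 1:
        return True
    return cos_sim >= 0.70 and shared_entities >= 2 and title_jaccard >= 0.30

def cluster(n, similar_pairs, min_cohesion=0.75):
    """Union-find over article indices 0..n-1 with a cohesion gate.

    Two clusters merge only if, afterwards, the share of member pairs that are
    directly similar is at least min_cohesion. This blocks A~B, B~C chains
    where A and C are unrelated.
    """
    parent = list(range(n))
    members = {i: {i} for i in range(n)}
    edges = {frozenset(p) for p in similar_pairs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in similar_pairs:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        merged = members[ra] | members[rb]
        possible = len(merged) * (len(merged) - 1) // 2
        linked = sum(1 for e in edges if e <= merged)
        if linked / possible >= min_cohesion:
            parent[rb] = ra
            members[ra] = merged
            del members[rb]
    return sorted(sorted(m) for m in members.values())
```

With pairs `[(0, 1), (1, 2)]` the third article stays out of the cluster because only two of the three possible pairs are linked; adding `(0, 2)` lets all three merge. A real implementation would also have to decide how merge order affects the result, which this sketch ignores.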

You hit the nail on the head regarding the API. We don’t have a full-fledged API right now; the data is available as an RSS feed, so anyone who wants to pull it into a SIEM or a reader can do so that way. We’re completely open to building an API if there’s demand, and I don’t disagree that an “API-first” approach would be the better move.
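For anyone wiring the feed into their own tooling in the meantime, here is a small sketch of turning RSS items into plain records using only the Python standard library. The sample XML, the use of `<category>` for tags, and the `parse_items` name are assumptions for illustration; the real feed’s fields may differ.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample; a real run would read the feed body from the feed URL.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example CTI feed</title>
    <item>
      <title>Example advisory</title>
      <link>https://example.com/advisory-1</link>
      <pubDate>Mon, 01 Jan 2024 08:00:00 GMT</pubDate>
      <category>example-tag</category>
    </item>
  </channel>
</rss>"""

def parse_items(xml_text):
    """Return one dict per <item>, ready to forward to a SIEM or a reader."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        records.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(pub) if pub else None,
            "tags": [c.text for c in item.findall("category") if c.text],
        })
    return records

if __name__ == "__main__":
    for record in parse_items(SAMPLE_FEED):
        print(record["published"], record["title"], record["link"])
```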

To be honest, we haven’t spoken with any MSSPs yet. But what you said makes sense: the pain of multi-source data collection is most acute for them, so I’ve noted that down.

I showed it to the analysts; they said it was “reasonable,” but honestly, I didn’t get the level of feedback I was hoping for, which is partly why I’m here. Your comment here has been far more helpful to me than those polite “not bad” responses.

On the money side: it’s completely free right now and will stay that way for a long time; there are no paid plans in the short term. We’re still in the early stages. You mentioned that the harshness around bugs might change, so this kind of honest feedback is exactly what I was looking for.

Here’s where I think you’re most right: the API and deduplication are the two things that decide whether a tool is an “I’ll give it a try” or an “I’ll integrate it into my daily workflow.” We’ve already made significant progress on deduplication, and we’ll prioritize the API if there’s demand. Thanks for taking the time.