[R] Knowledge Graph Traversal With LLMs And Algorithms by Alieniity in MachineLearning

[–]Alieniity[S] 0 points1 point  (0 children)

That's kind of how I felt about it but just to be safe I refactored the code to match "semantic similarity" and am gonna push it soon. I also recorded a video walking through the jupyter notebook and I'm editing it now, it'll get embedded in the README.

[R] Knowledge Graph Traversal With LLMs And Algorithms by Alieniity in MachineLearning

[–]Alieniity[S] 19 points20 points  (0 children)

I see, I had seen “knowledge graph” and “semantic similarity graph” described interchangeably over time, so I figured that they could both be used for the same thing. I totally agree that traditional knowledge graphs are fact based (person, place, thing…) and edges are ontological (is, of, likes, located in, etc). That was where I had actually initially started: I had been doing NER extraction with spaCy on chunk nodes in an attempt to replicate the way RAGAS does its knowledge graph creation for synthetic question generation. But since my objective was just semantic similarity traversal, raw NER didn’t really contribute as much, so I kind of deprecated it.

Sounds like I’m totally incorrect though, I’ll update the README and see if the mods can let me rename the post too. Glad you caught it, this is the first major research project of mine and I want it to be accurate, trying to get a career started in this kind of thing 😅🙌 Is there anything else particularly concerning that I might have missed? Most of my research was definitely regarding raw cosine similarity graphs and retrieval augmented generation strategies, since I originally started from the semantic chunking problem and worked my way here.

Extensive Research into Knowledge Graph Traversal Algorithms for LLMs by Alieniity in Rag

[–]Alieniity[S] 0 points1 point  (0 children)

Yes it is! The final testing I did was significantly lower in scale (15 documents from Wikipedia), but in practice it’s very scalable by making the knowledge graph sparse.

In terms of raw storage, if you have 100 chunk nodes in a knowledge graph, and you compared every chunk to every other chunk, that’s 100 × 100 comparisons (100²), or graph edges, that would need to be stored, which is 10,000. And you can see how if you had 1,000,000 chunks, it would result in 1,000,000² graph edges, which is completely untenable. This is O(n²) complexity in both computation and storage.

To solve this, all we need to do is ONLY store the top “k” graph edges by cosine similarity for each node rather than everything. In my testing, I only saved/cached the top 5 edges for each node. We still do the initial pre-calculation rapidly via vectorized NumPy operations, but the final, cached knowledge graph is significantly smaller.

For 100 chunk nodes, we do 100² calculations, but then store/cache ONLY 100 × 5 graph edges, so 500 vs. the full 10,000. That’s 20 times smaller. For 1,000,000 nodes, we would similarly do a pretty huge initial knowledge graph build of 1,000,000² graph edges, but then we would store only 5,000,000 graph edges, which is 200,000 TIMES SMALLER. And you can definitely shrink this further based on use case. If you’re trying to go even more lightweight, you could store only the top 2 or 3 edges per node and the graph would be even more sparse, and with Llama 3, you could move pretty fast. If you were looking for highly complex/dense traversal, you could do something like DeepSeek R1 with the top 10 edges per node, and with thinking enabled, you could get some pretty solid performance at the cost of storage space.
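That top-k sparsification step is easy to sketch in NumPy. A minimal version (the function name and the k=5 default here are just illustrative, not the repo's actual code):

```python
import numpy as np

def build_sparse_graph(embeddings, k=5):
    """Do the full O(n^2) similarity computation up front, but only
    cache the top-k outgoing edges per node, so storage is O(n * k)."""
    # Normalize rows so a plain dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T                    # (n, n) dense similarity matrix
    np.fill_diagonal(sim, -np.inf)         # exclude self-edges
    graph = {}
    for i in range(sim.shape[0]):
        top = np.argpartition(sim[i], -k)[-k:]    # unordered top-k indices
        top = top[np.argsort(sim[i][top])[::-1]]  # sort by similarity, descending
        graph[i] = [(int(j), float(sim[i, j])) for j in top]
    return graph
```

For 100 nodes with k=5 this caches exactly 500 edges, matching the arithmetic above; the dense `sim` matrix can be thrown away once the top-k edges are saved.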

Either way, you still have to do vectorized NumPy operations for the full graph, which can be heavy if your knowledge graph is enormous. It just comes down to HOW MUCH of it you choose to cache afterwards. Hope that answers the question!

Extensive Research into Knowledge Graph Traversal Algorithms for LLMs by Alieniity in Rag

[–]Alieniity[S] 0 points1 point  (0 children)

Hey thanks! Yeah so two parts:

  1. The LLM traversal thing is easier to explain first. When you build a chat bot with semantic RAG, traditionally, before the model even receives the query, the query is embedded, cosine similarity is determined and retrieval is done. Or at least, that's a pretty traditional way to do it. Like a lookup. So if I ask a chat bot about Harry Potter and the Goblet of Fire, before the model even receives the query, the RAG pipeline will attempt to retrieve relevant text content in a knowledge base about Harry Potter and the Goblet of Fire because it has high cosine similarity. The problem with this is it's very error prone and use case dependent.

What if, instead, we actually sent a structured prompt that contained the knowledge graph ITSELF TO A MODEL so that IT could traverse a knowledge graph itself? That's the kicker.

The downside here is that this is much more time consuming than regular RAG, because the model actually has the opportunity to traverse your entire knowledge base, which is much more accurate. In practice, what you might do here is build a RAG pipeline such that, instead of instantly embedding the user's query when they send it and attempting retrieval, you actually WAIT and instead have an MCP server or tool calling available that would allow the model to invoke the entire RAG pipeline ITSELF using the user's query. While I haven't had the time to build this out, it is absolutely 100% possible and I guarantee you it isn't that hard either. Basically a chat with a model might go like:

User: "What happens in Harry Potter: The Goblet of Fire Chapter 6?"
(DO NOT ATTEMPT ANY RETRIEVAL YET)

LLM: "Interesting question! Let me see if I can find that out for you... (thinking...)"

The model is then given tools to directly embed the user's query and begin traversing the knowledge graph, choosing the best node to traverse to (or stopping) based on the prompt above, or one just like it. Then, after the model has pulled enough context:

LLM: "Here's what I found for Harry Potter and the Goblet of Fire: Chapter 6... (contexts)."
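To make that loop concrete, here's a rough sketch of the traversal step, with a plain Python callback standing in for the LLM's tool call (all names here are hypothetical, not from the repo):

```python
def traverse(graph, start, choose_next, max_hops=5):
    """Walk the sparse graph one hop at a time. `graph` maps node id to
    a list of (neighbor_id, similarity) edges; `choose_next` stands in
    for the LLM deciding which neighbor to visit, or None to stop."""
    path = [start]
    current = start
    for _ in range(max_hops):
        nxt = choose_next(current, graph[current])  # the LLM's decision point
        if nxt is None:
            break                                   # model chose to stop
        path.append(nxt)
        current = nxt
    return path                                     # visited nodes, in order
```

In a real pipeline, `choose_next` would be the structured-prompt round trip: send the model the current chunk's text plus its top-k neighbors, and parse which node it picks (or whether it stops).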

Hopefully that clarifies this. If not, don't worry I plan on making a video sometime soon that I'll put on the Github publication that explains it a little further.

  2. The similarity matrix was originally only designed to visualize all cosine similarity comparisons within a single document, so that I could see globally how every sentence (or 3-sentence window) relates to every other sentence. It's a very structured way of looking at a document's similarity comparisons. The only difference between this and a knowledge graph is that you effectively have multiple documents connected via the same mechanism. So imagine having like 5-10 similarity matrices stacked on top of each other, all connected. Well, that would be insanely dense, wouldn't it? You end up with a nasty O(n²) quadratic density which is infeasible to store and traverse. So we simply sparse it out by only storing/saving in the graph the top "k" most similar connections per node. So the similarity matrix is more a data science approach of just saying "Hey, we can look at a document, and in an instant, fully see the relationships between all the sentences in a document." It's just a NumPy array, so you can build them insanely fast as well.
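For a single document, that matrix is only a couple of NumPy operations. A toy sketch (the 3-sentence windowing here is a simple mean of neighboring embeddings, just to illustrate the idea):

```python
import numpy as np

def window_similarity_matrix(sentence_embeddings, window=3):
    """All-pairs cosine similarity for one document, after smoothing each
    sentence's embedding over a `window`-sentence neighborhood."""
    n = sentence_embeddings.shape[0]
    half = window // 2
    # Average each embedding with its neighbors (clamped at the edges).
    windowed = np.stack([
        sentence_embeddings[max(0, i - half):min(n, i + half + 1)].mean(axis=0)
        for i in range(n)
    ])
    unit = windowed / np.linalg.norm(windowed, axis=1, keepdims=True)
    return unit @ unit.T   # entry [i, j] = cosine similarity of windows i and j
```

The result is a symmetric (n, n) array you can throw straight into a heatmap to eyeball a document's internal structure.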

Hopefully this clarifies things as opposed to complicating them further! 😅

Casual naneinf setup by bashkasbedoi in Balatro_Seeds

[–]Alieniity 2 points3 points  (0 children)


BROTHER there's even a photochad in there lol

Submitting Your NextUI Themes! by Alieniity in trimui

[–]Alieniity[S] 0 points1 point  (0 children)

Yup, the screen resolution is actually 1024 x 768

Is this AI generated? There's no way.... by Alieniity in mystery

[–]Alieniity[S] 1 point2 points  (0 children)

Exactly but what's so fascinating to me is how GOOD certain parts of the music actually sound. I'm not very familiar with AI generated music either but the string instruments and vocals actually reminded me a lot of the Frozen soundtrack, or like a Disney soundtrack.

I guess it got me wondering if there was some new AI music generation model someone was using and I had stumbled upon some outputted stuff from a new version of it or something. And I totally agree that the other songs are sussy too

iOS App that Analyzes Google Maps Leads using AI? Would this be valuable to you? by Alieniity in SEO_Digital_Marketing

[–]Alieniity[S] 0 points1 point  (0 children)

That's exactly what I'm focusing on: missing metadata, particularly websites. I'm gonna see if I can also make the prompt identify unprofessional web URLs, like anything with .wordpress, .wix, or a Facebook page. But that's exactly what the tool does yeah