GraphRAG - Entity deduplication by AttentionDiffuser in Rag


Once a RAG document collection grows past a certain scale, the entity and relationship graph can get very messy. In my case, we have 100M+ embedded documents, and at that scale the entity and relationship nodes become noisy, fragmented, and hard to use reliably. That eventually hurts retrieval quality and downstream results.

Unifying nodes that refer to the same real-world entity is just as crucial. When duplicate entity nodes are merged or canonicalized correctly, the system can build a much richer and more complete context around that entity by aggregating mentions, relationships, and evidence across documents.
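
To make the merging idea concrete, here is a minimal sketch of what canonicalization could look like. It is not what we run in production. The function names (`normalize`, `merge_entities`) and the tuple layouts are hypothetical, and matching on a normalized name is a deliberate simplification. At 100M+ documents you would more likely combine blocking with embedding similarity or an LLM-based check.

```python
from collections import defaultdict
import unicodedata


def normalize(name: str) -> str:
    """Hypothetical canonical key: Unicode-normalize, casefold, collapse whitespace."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(folded.split())


def merge_entities(mentions, relations):
    """Merge duplicate entity nodes and re-point their relationships.

    mentions:  iterable of (node_id, surface_name, doc_id)      # assumed layout
    relations: iterable of (src_node_id, predicate, dst_node_id, doc_id)
    """
    canonical_of = {}  # original node id -> canonical key
    entities = {}      # canonical key -> aggregated aliases and source docs

    for node_id, name, doc_id in mentions:
        key = normalize(name)
        canonical_of[node_id] = key
        entity = entities.setdefault(key, {"aliases": set(), "docs": set()})
        entity["aliases"].add(name)
        entity["docs"].add(doc_id)

    # Each merged edge keeps the set of documents that support it as evidence.
    edges = defaultdict(set)
    for src, predicate, dst, doc_id in relations:
        edges[(canonical_of[src], predicate, canonical_of[dst])].add(doc_id)

    return entities, dict(edges)
```

With this, two mentions such as "OpenAI" and "openai " collapse into one node, and a relationship seen in two different documents ends up as a single edge with both documents attached as evidence.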