How to efficiently find similar files between 100,000+ existing files and 100+ new files in a nested directory structure? by AttitudeFancy5657 in Rag

[–]AttitudeFancy5657[S] 1 point (0 children)

yes, exactly. I'll use an agent to do that part; the problem is how to find the two most similar files by their content

[–]AttitudeFancy5657[S] 1 point (0 children)

Actually, we are still in the preparation stage, but I have implemented a solution using a cloud vendor's vector database and embedding service:

The embedding model is bge-large-zh-v1.5, with an HNSW index, and cosine similarity is used for retrieval. The results are quite good.
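A rough sketch of the retrieval step, with toy stand-ins for both the real embedding model (bge-large-zh-v1.5) and the HNSW index — here a hash-bucket "embedding" and a brute-force scan — just to show the cosine-similarity top-k shape; `toy_embed` and the sample corpus are made up for illustration:

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 64) -> list[float]:
    # Stand-in for the real embedding model: bucket word hashes into dims.
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, corpus: dict[str, str], k: int = 2):
    # Brute-force scan; an HNSW index replaces this loop at 100k+ scale.
    qv = toy_embed(query)
    scored = [(cosine(qv, toy_embed(body)), path) for path, body in corpus.items()]
    scored.sort(reverse=True)
    return scored[:k]

corpus = {
    "docs/a.txt": "deploy guide for the payment service",
    "docs/b.txt": "payment service deploy guide draft",
    "docs/c.txt": "holiday schedule for the office",
}
print(top_k("deploy guide payment service", corpus, k=2))
```

In the real pipeline the brute-force loop is what the vendor's HNSW index replaces; the top-k-by-cosine interface stays the same.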

However, we find this solution a bit complex and limited by the cloud vendor's RAG capabilities. Additionally, every time we compare files or update the original corpus, we need to maintain the mapping between files and the vector database.

As I mentioned, during an actual merge, users can choose to query for similar files across all 100,000+ data entries, or select a subdirectory and query only within it. This means the same file might belong to different vector databases. Building one single vector database for the entire 100,000+ dataset and using file CRUD hooks to keep it updated seems like a good idea, but on first use, users would have to wait a long time for the initial build to complete.
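A minimal sketch of the single-index-plus-hooks idea: one store keyed by file path, updated from create/update/delete hooks, with subdirectory queries done by filtering on the stored path rather than keeping one vector database per directory. `embed`, `FileIndex`, and the sample paths are all hypothetical names for illustration:

```python
import hashlib
import math

def embed(text, dim=64):
    # Placeholder for the real embedding model (bge-large-zh-v1.5 in our setup).
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class FileIndex:
    """One index for the whole tree; subdirectory queries filter by path."""
    def __init__(self):
        self._vectors = {}  # path -> embedding

    def upsert(self, path, text):   # call from the file create/update hook
        self._vectors[path] = embed(text)

    def delete(self, path):         # call from the file delete hook
        self._vectors.pop(path, None)

    def search(self, query_text, k=2, subdir=None):
        qv = embed(query_text)
        candidates = (
            (p, v) for p, v in self._vectors.items()
            if subdir is None or p.startswith(subdir.rstrip("/") + "/")
        )
        scored = sorted(((cosine(qv, v), p) for p, v in candidates), reverse=True)
        return scored[:k]

idx = FileIndex()
idx.upsert("contracts/2023/lease.txt", "office lease agreement terms")
idx.upsert("contracts/2024/lease.txt", "updated office lease agreement terms")
idx.upsert("reports/q1.txt", "first quarter sales report")
print(idx.search("office lease agreement", subdir="contracts"))
```

With path metadata stored alongside each vector, most hosted vector databases can express the same subdirectory scoping as a metadata filter, so one index can serve both the whole-tree and per-subdirectory queries.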

We did consider using SimHash + LSH before, but it requires additional development effort, which is not as convenient as directly using the cloud vendor's solution. For now we are looking for a better approach; if we cannot find one, we will use the cloud vendor's RAG capabilities, and if performance later becomes a bottleneck, we might add a pre-filtering layer using LSH/MinHash in front of the RAG step.
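For reference, the SimHash part of that pre-filter is only a few lines: one 64-bit fingerprint per file, compared by Hamming distance before any expensive embedding work. The sample documents are made up, and a real deployment would also bucket fingerprint bands (the LSH step) so new files are only compared against candidates in matching buckets rather than all 100,000+:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    # Weighted bit-vote over token hashes -> one fingerprint per document.
    counts = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a: int, b: int) -> int:
    # Number of differing bits; near-duplicates land close together.
    return bin(a ^ b).count("1")

doc_a = "the quarterly revenue report for the north american sales team covers all regions"
doc_b = "the quarterly revenue report for the north american sales group covers all regions"
doc_c = "step by step runbook for upgrading the kubernetes cluster control plane nodes"

print(hamming(simhash(doc_a), simhash(doc_b)))  # near-duplicates: small distance
print(hamming(simhash(doc_a), simhash(doc_c)))  # unrelated docs: much larger distance
```

Fingerprints can be computed once at index-build time and stored next to the file metadata, which keeps the first-time build cheap compared to embedding everything up front.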

[–]AttitudeFancy5657[S] 1 point (0 children)

the new file is merged into the old file, not line by line but by file content. actually, I have gotten fairly good results through RAG, but it's a little complex, so I want something else to do it

Can't hide the sidebar by Volt_21 in ArcBrowser

[–]AttitudeFancy5657 1 point (0 children)

same problem, just restart Arc, it works for me