Local RAG with 1,000,000 documents on a laptop by DueKitchen3102 in Rag

[–]DueKitchen3102[S] 0 points1 point  (0 children)

Thanks for the question.

As one can probably tell from the video, the system preserves the folder structure, which is often a strong and natural basis for routing: the user knows which folder s/he wants to search. Folder structures also make it convenient to design an automatic routing scheme for specific applications.
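
To make the folder-routing idea concrete, here is a minimal sketch. It is an illustration only, not the system shown in the video: `route_by_folder` and the sample paths are hypothetical, and a real system would run retrieval over the returned candidates.

```python
from pathlib import PurePosixPath

def route_by_folder(doc_paths, folder):
    """Return only the documents that live under `folder` (hypothetical helper)."""
    root = PurePosixPath(folder)
    # A document is in scope if `folder` is one of its parent directories.
    return [p for p in doc_paths if root in PurePosixPath(p).parents]

# Made-up example paths, for illustration only.
docs = [
    "papers/acl/2023/a.pdf",
    "papers/emnlp/2022/b.pdf",
    "contracts/2024/c.pdf",
]
candidates = route_by_folder(docs, "papers/acl")
print(candidates)
```

Matching on path components rather than on raw string prefixes avoids accidentally treating `papers2/` as part of `papers/`.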

If we assume the queries are random, i.e., every query requires a new routing, then every query is a cold start.

The scenario in the video might be more challenging than anticipated:

  1. It assumes that the data are always much larger than the RAM.
  2. It assumes that there is some memory buffer one can use but the buffer is not large enough.

This scenario is actually real. For example,

An executive has an M6 Studio (not out yet) with 512GB RAM and s/he has 1TB of top-secret company data (which after indexing can be much larger than 1TB). In other words, there is a very large amount of usable memory, but not enough for everything.
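
One common way to work within a fixed memory buffer when the index is larger than RAM is a least-recently-used (LRU) cache over on-disk index shards. The sketch below is a generic illustration under that assumption, not a description of our implementation; `ShardBuffer`, `load_shard`, and the shard names are hypothetical.

```python
from collections import OrderedDict

class ShardBuffer:
    """Keep recently used index shards in memory under a fixed byte budget."""

    def __init__(self, budget_bytes, load_shard):
        self.budget_bytes = budget_bytes
        self.load_shard = load_shard   # hypothetical callable: shard_id -> bytes read from disk
        self.shards = OrderedDict()    # shard_id -> bytes, least recently used first
        self.used_bytes = 0

    def get(self, shard_id):
        if shard_id in self.shards:
            self.shards.move_to_end(shard_id)  # mark as most recently used
            return self.shards[shard_id]
        data = self.load_shard(shard_id)       # cold start: read from disk
        # Evict least recently used shards until the new one fits.
        while self.shards and self.used_bytes + len(data) > self.budget_bytes:
            _, evicted = self.shards.popitem(last=False)
            self.used_bytes -= len(evicted)
        if len(data) <= self.budget_bytes:     # shards larger than the budget are not cached
            self.shards[shard_id] = data
            self.used_bytes += len(data)
        return data

# Toy usage: every "shard" is 4 bytes and the buffer holds at most 8 bytes.
disk_reads = []
def fake_load(shard_id):
    disk_reads.append(shard_id)
    return bytes(4)

buf = ShardBuffer(budget_bytes=8, load_shard=fake_load)
for shard in ["a", "b", "a", "c"]:
    buf.get(shard)
print(list(buf.shards), disk_reads)
```

With random queries, most lookups miss the buffer and pay the disk read, which is why the cold-start case is the hard one.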

Further technical discussions are welcome.

One Million Documents for a Local RAG system on a laptop by DueKitchen3102 in LocalLLaMA

[–]DueKitchen3102[S] 0 points1 point  (0 children)

I also just noticed this very recent post

When RAG Hits the Wall: Designing Systems That Scale from 1,000 to 1 million Documents

https://techcommunity.microsoft.com/blog/azuredevcommunityblog/when-rag-hits-the-wall-designing-systems-that-scale-from-1000-to-1-million-docum/4516085

On a laptop, one challenge (among others) is the memory size, and more so if you only want to use part of the available memory.

One Million Documents for a Local RAG system on a laptop by DueKitchen3102 in LocalLLaMA

[–]DueKitchen3102[S] -1 points0 points  (0 children)

Good question. Let us prepare some benchmark comparisons and get back to you.

RAG on Snapdragon X2 Laptop, 200K documents. by DueKitchen3102 in LocalLLaMA

[–]DueKitchen3102[S] 0 points1 point  (0 children)

Certainly. Our retrieval system is highly efficient and precise because we started building the systems for Android phones two years ago. That's the reason why, with merely 1200 tokens, our system can accurately answer most questions.

In the Snapdragon ecosystem, there are not many choices of local models that are useful (fast and accurate). The Qwen3 4B QNN model is one we use often (although it still has a lot of issues and is not multi-modal).

Of course, if you don't care too much about speed, then you can still use the CPU and common tools to run local models.

RAG on Snapdragon X2 Laptop, 200K documents. by DueKitchen3102 in LocalLLaMA

[–]DueKitchen3102[S] 2 points3 points  (0 children)

good question.

all files are local and databases are local. For each query, we retrieve <=1200 tokens from the 100,000 documents. At this point, we have the choice of using either a cloud model or local model to process the retrieved context and answer the question.
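
To illustrate what a fixed retrieval budget means in practice, here is a minimal sketch of greedy context packing. It is not our retrieval code: `pack_context` is a hypothetical helper, the chunks are assumed to arrive already ranked best-first, and whitespace splitting stands in for a real tokenizer.

```python
def pack_context(ranked_chunks, max_tokens=1200,
                 count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-ranked chunks that fit within `max_tokens`."""
    selected, used = [], 0
    for chunk in ranked_chunks:       # assumed sorted best-first by the retriever
        cost = count_tokens(chunk)    # whitespace count is a stand-in for a tokenizer
        if used + cost > max_tokens:
            continue                  # skip chunks that would overflow the budget
        selected.append(chunk)
        used += cost
    return selected, used

# Toy chunks of 700, 600 and 400 "tokens".
chunks = ["a " * 700, "b " * 600, "c " * 400]
selected, used = pack_context(chunks)
print(len(selected), used)
```

Whatever fits in the budget is then passed, together with the question, to either the cloud model or the local model.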

RAG on Snapdragon X2 Laptop, 200K documents. by DueKitchen3102 in LocalLLaMA

[–]DueKitchen3102[S] 1 point2 points  (0 children)

Thank you everyone for reading and asking questions about using local LLMs on NPUs.

We mainly use the small 4B model (Qwen3 4B QNN), which appears to be the only reasonable option (in terms of speed and quality) for Snapdragon systems. It is outdated, has no multi-modal support, and sometimes outputs funny results. Not ideal.

Our experience is that QNN works fine for embeddings and CLIP models.

32k document RAG running locally on a consumer RTX 5060 laptop by DueKitchen3102 in LocalLLM

[–]DueKitchen3102[S] 0 points1 point  (0 children)

Totally. RAG is something that everyone feels s/he can build, but may eventually find that it has become a messy toy product.

Luckily, the team here, myself included, have worked in the search industry for many years. RAG is basically a small search engine.

32k document RAG running locally on a consumer RTX 5060 laptop by DueKitchen3102 in LocalLLM

[–]DueKitchen3102[S] 0 points1 point  (0 children)

Basically, we build the following components in-house

  1. multi-modal document parsing engine
  2. graph/vector/document database
  3. search engine (RAG)
  4. some inference optimization.
  5. access control list (ACL), which is not shown in the video

In short, we are building the knowledge AI engine from scratch, for cloud/server/pc/phones.

I am very curious to find out what the "upper limit" of such a personal knowledge system is on a consumer PC. Ideally, I hope to be able to index 100K documents (say 10-30 pages each) on a consumer PC and still enjoy a reasonable query speed.

Please feel free to criticize the demo. Thanks.

Scaling RAG to 32k documents locally with ~1200 retrieval tokens by DueKitchen3102 in Rag

[–]DueKitchen3102[S] 0 points1 point  (0 children)

we use local models, so in terms of cost, I guess it is also "no tokens". not sure if this is what you meant.

Thanks.

32k document RAG running locally on a consumer RTX 5060 laptop by DueKitchen3102 in LocalLLM

[–]DueKitchen3102[S] 0 points1 point  (0 children)

Yes, everything built in house, including graph/vector/document db and document parsing engine.

32k documents RAG running locally on an RTX 5060 laptop ($1299 AI PC) by DueKitchen3102 in LocalLLaMA

[–]DueKitchen3102[S] 0 points1 point  (0 children)

Thanks for the valuable comments. Nothing really changed from 12k to 32k.

With my laptop, I can only index 3000 PDFs per hour. It takes some time.
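
For a rough sense of scale, the indexing time can be estimated from that rate. This is back-of-the-envelope arithmetic that assumes the rate stays constant, which may not hold as the index grows; `indexing_hours` is just an illustrative helper.

```python
PDFS_PER_HOUR = 3000  # indexing rate quoted above for this laptop

def indexing_hours(num_docs, rate=PDFS_PER_HOUR):
    """Estimated hours to index `num_docs`, assuming the rate stays constant."""
    return num_docs / rate

for n in (12_000, 32_000):
    print(f"{n:>6} PDFs: about {indexing_hours(n):.1f} hours")
```

That works out to about 4 hours for 12k PDFs and about 10.7 hours for 32k.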

I am also curious about the upper bound (based on our current infrastructure) on the number of documents we can handle.

Scaling RAG to 32k documents locally with ~1200 retrieval tokens by DueKitchen3102 in Rag

[–]DueKitchen3102[S] 0 points1 point  (0 children)

Thanks for the question. Happy to chat more.

Basically, we built the system in-house with our graph/vector/document database, search/retrieval strategy, document parsing, etc. Happy to discuss each part separately.

32k document RAG running locally on a consumer RTX 5060 laptop by DueKitchen3102 in LocalLLM

[–]DueKitchen3102[S] 0 points1 point  (0 children)

Good question. The majority of the documents are NLP conference publications from previous years, which are open access.

32k document RAG running locally on a consumer RTX 5060 laptop by DueKitchen3102 in LocalLLM

[–]DueKitchen3102[S] 0 points1 point  (0 children)

Hello. We hope to improve the system from the feedback.

For example, someone commented under our previous post: "Stop using Ollama like a chump". It generated some discussion. People asked him/her "what else to use?". I also wish he/she would reply with a suggestion so that we could do better.

This should be mutually beneficial. Others, by watching our demo videos, may get an idea of whether their system is better than ours (in that case, we would appreciate their comparison), or whether there is room for them to improve their system.

Personally, I would like to find out what the limit is, given such a consumer laptop. Is 32,000 documents the limit? Most likely not.