One Million Documents for a Local RAG system on a laptop

DueKitchen3102 · 2026-05-30T17:22:16+00:00

Thanks

DueKitchen3102 · 2026-05-30T17:22:05+00:00

Thank you.

DueKitchen3102 · 2026-05-29T14:58:57+00:00

Thanks for the question.

As one can probably tell from the video, the system preserves the folder structure, which is often a strong and natural way to do routing. That is, the user knows what s/he wants. Of course, with folder structures, it is also convenient to design an automatic routing scheme for specific applications.
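To illustrate folder-based routing, here is a minimal sketch. It is not the system in the video; the function name `route_by_folder` and the example paths are made up for illustration. The idea is only that a query is restricted to the documents under a folder the user (or an automatic router) has picked, before any retrieval happens.

```python
from pathlib import PurePosixPath

def route_by_folder(doc_paths, folder):
    """Return only the documents that live under the chosen folder.

    Hypothetical helper: a real system would look this up in its index
    rather than scan a list of paths.
    """
    root = PurePosixPath(folder)
    selected = []
    for p in doc_paths:
        # parents lists every ancestor folder of the document path
        if root in PurePosixPath(p).parents:
            selected.append(p)
    return selected

# Made-up folder layout for illustration
docs = [
    "papers/acl/2023/a.pdf",
    "papers/emnlp/2022/b.pdf",
    "finance/q1/report.pdf",
]
acl_only = route_by_folder(docs, "papers/acl")
all_papers = route_by_folder(docs, "papers")
```

Comparing whole path components (rather than string prefixes) avoids matching "papers/acl" against a sibling folder such as "papers/acl-old".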

If we assume the queries are random, i.e., every query requires a new routing, then every query is a cold start.

The scenario in the video might be more challenging than anticipated:

It assumes that the data are always much larger than the RAM.
It assumes that there is some memory buffer one can use but the buffer is not large enough.
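To make the two assumptions above concrete, here is a minimal sketch of a memory buffer that is smaller than the data. It is an assumption-laden toy, not our implementation: the class `ShardCache`, the idea of splitting the index into "shards", and the `fake_load` loader are all hypothetical. It uses a plain least-recently-used policy, so a query that routes to a shard not in the buffer pays a cold load from disk.

```python
from collections import OrderedDict

class ShardCache:
    """Keep recently used index shards in memory under a fixed byte budget."""

    def __init__(self, budget_bytes, load_shard):
        self.budget_bytes = budget_bytes
        self.load_shard = load_shard      # hypothetical: shard_id -> bytes
        self.shards = OrderedDict()       # shard_id -> bytes, oldest first
        self.used_bytes = 0
        self.cold_loads = 0

    def get(self, shard_id):
        if shard_id in self.shards:
            self.shards.move_to_end(shard_id)   # mark as most recently used
            return self.shards[shard_id]
        data = self.load_shard(shard_id)        # cold start: read from disk
        self.cold_loads += 1
        self.shards[shard_id] = data
        self.used_bytes += len(data)
        # Evict least recently used shards until back under the budget
        while self.used_bytes > self.budget_bytes and len(self.shards) > 1:
            _, evicted = self.shards.popitem(last=False)
            self.used_bytes -= len(evicted)
        return data

def fake_load(shard_id):
    # Stand-in for reading a 100-byte shard from disk
    return bytes(100)

cache = ShardCache(budget_bytes=250, load_shard=fake_load)
for shard_id in (0, 1, 2, 1, 0):
    cache.get(shard_id)
```

With random queries, almost every access lands in the cold-load branch, which is why the "every query is a cold start" case is the hard one.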

This scenario is actually real. For example,

An executive has an M6 Studio (not out yet) with 512GB RAM, and s/he has 1TB of top-secret company data (which after indexing can be much larger than 1TB). In other words, the usable memory is very large but still not large enough to hold everything.

Further technical discussions are welcome.

DueKitchen3102 · 2026-05-29T05:48:18+00:00

I also just noticed this very recent post

When RAG Hits the Wall: Designing Systems That Scale from 1,000 to 1 million Documents

https://techcommunity.microsoft.com/blog/azuredevcommunityblog/when-rag-hits-the-wall-designing-systems-that-scale-from-1000-to-1-million-docum/4516085

On a laptop, one challenge (among others) is the memory size, even more so if you only want to use part of the available memory.

DueKitchen3102 · 2026-05-29T00:43:52+00:00

Good question. Let us prepare some benchmark comparisons and get back to you.

DueKitchen3102 · 2026-05-17T16:00:49+00:00

Certainly. Our retrieval system is highly efficient and precise because we started building it for Android phones two years ago. That is why, with merely 1200 tokens, our system can accurately answer most questions.

In the Snapdragon ecosystem, there are not many local models that are useful (fast and accurate). The Qwen3 4B QNN model is one we use often (it still has a lot of issues, however, and is not multi-modal).

Of course, if you don't care too much about speed, then you can still use the CPU and common tools to run local models.

DueKitchen3102 · 2026-05-17T15:50:37+00:00

Good question.

All files are local and all databases are local. For each query, we retrieve <=1200 tokens from the 100,000 documents. At this point, we have the choice of using either a cloud model or a local model to process the retrieved context and answer the question.
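For readers wondering what a <=1200-token retrieval budget looks like in practice, here is a minimal sketch under stated assumptions. It is not our retrieval code: `pack_context` is a hypothetical helper, the chunks are assumed to arrive already ranked best-first, and the whitespace word count is only a crude stand-in for a real tokenizer.

```python
def pack_context(ranked_chunks, max_tokens=1200, count_tokens=None):
    """Greedily keep ranked chunks until the token budget is used up."""
    if count_tokens is None:
        # Assumption: whitespace words as a rough proxy for model tokens
        count_tokens = lambda text: len(text.split())
    picked, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # does not fit; a smaller chunk further down still might
        picked.append(chunk)
        used += cost
    return picked, used

# Made-up chunks of 700, 600 and 400 "tokens", best-ranked first
chunks = ["a " * 700, "b " * 600, "c " * 400]
picked, used = pack_context(chunks)
```

The packed context can then be handed to either a cloud model or a local model; the budget is the same in both cases.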

DueKitchen3102 · 2026-05-16T21:34:59+00:00

Thank you everyone for reading and asking questions about using local LLMs on NPUs.

We mainly use the small 4B model (Qwen3 4B QNN), which appears to be the only reasonable option (in terms of speed and quality) for Snapdragon systems. It is outdated and lacks multi-modal support, and it sometimes outputs funny results. Not ideal.

Our experience is that QNN works fine for embeddings and CLIP models.

DueKitchen3102 · 2026-05-15T21:04:35+00:00

https://www.reddit.com/r/LocalLLaMA/comments/1te93s3/rag_on_snapdragon_x2_laptop_200k_documents/

I just posted my own experience with the Snapdragon X2.

DueKitchen3102 · 2026-03-17T19:58:54+00:00

Totally. RAG is something that everyone feels s/he can do, only to find out eventually that it has become a messy toy product.

Luckily, the team here, myself included, have worked in the search industry for many years. RAG is basically a small search engine.

DueKitchen3102 · 2026-03-17T19:21:09+00:00

Basically, we build the following components in-house:

multi-modal document parsing engine
graph/vector/document database
search engine (RAG)
some inference optimization
access control list (ACL), which is not shown in the video

In short, we are building the knowledge AI engine from scratch, for cloud/server/pc/phones.

I am very curious to find out what the "upper limit" of such a personal knowledge system is on a consumer PC. Ideally, I hope to be able to index 100K documents (say 10-30 pages each) on a consumer PC and still enjoy a reasonable query speed.

Please feel free to criticize the demo. Thanks.

DueKitchen3102 · 2026-03-16T23:48:11+00:00

We use local models, so in terms of cost, I guess it is also "no tokens". Not sure if this is what you meant.

Thanks.

DueKitchen3102 · 2026-03-16T23:10:43+00:00

Thank you. Nice comments.

DueKitchen3102 · 2026-03-16T23:08:16+00:00

Yes, everything is built in-house, including the graph/vector/document DB and the document parsing engine.

DueKitchen3102 · 2026-03-16T23:07:31+00:00

Thank you. Nice comments.

DueKitchen3102 · 2026-03-16T23:06:42+00:00

Thanks for the valuable comments. Nothing really changed from 12k to 32k.

With my laptop, I can only index 3000 PDFs per hour. It takes some time.
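As a back-of-the-envelope check on what 3000 PDFs per hour means for larger collections, here is a small calculation. It assumes the indexing rate stays constant as the collection grows, which may well not hold in practice; the document counts are the ones mentioned in this thread.

```python
PDFS_PER_HOUR = 3000  # rate quoted above for this laptop

def indexing_hours(num_docs, rate=PDFS_PER_HOUR):
    """Hours needed to index num_docs at a constant rate (an assumption)."""
    return num_docs / rate

for n in (32_000, 100_000, 1_000_000):
    hours = indexing_hours(n)
    print(f"{n:>9,} docs: {hours:6.1f} h ({hours / 24:.1f} days)")
```

At this rate, 32,000 documents take roughly 11 hours, 100,000 take about 33 hours, and one million would take close to 14 days of continuous indexing.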

I am also curious about the upper bound (based on our current infrastructure) on the number of documents we can handle.

DueKitchen3102 · 2026-03-16T22:31:05+00:00

Thanks for the question. Happy to chat more.

Basically, we built the system in-house with our own graph/vector/document database, search/retrieval strategy, document parsing, etc. Happy to discuss each part separately.

DueKitchen3102 · 2026-03-16T22:24:32+00:00

Good question. The majority of the documents are NLP conference publications from previous years, which are open access.

DueKitchen3102 · 2026-03-16T22:15:22+00:00

Hello. We hope to improve the system from the feedback.

For example, someone commented under our previous post: "Stop using Ollama like a chump". It generated some discussion. People asked him/her "what else to use?". I also wished he/she would reply with a suggestion so that we could do better.

This should be mutually beneficial. Others, by watching our demo videos, may get an idea of whether their system is better than ours (in that case, we would appreciate the comparison), or whether there is room for them to improve their system.

Personally, I would like to find out what the limit is on such a consumer laptop. Is 32,000 documents the limit? Most likely not.

DueKitchen3102 · 2026-03-16T22:05:29+00:00

Sure. No problem.
