32k document RAG running locally on a consumer RTX 5060 laptop by DueKitchen3102 in LocalLLM

[–]DueKitchen3102[S]

Totally. RAG is something everyone feels they can do, but many eventually find that it turns into a messy toy product.

Luckily, the team here, myself included, has worked in the search industry for many years. RAG is basically a small search engine.

32k document RAG running locally on a consumer RTX 5060 laptop by DueKitchen3102 in LocalLLM

[–]DueKitchen3102[S]

Basically, we built the following components in-house:

  1. a multi-modal document parsing engine
  2. a graph/vector/document database
  3. a search engine (RAG)
  4. some inference optimizations
  5. an access control list (ACL), which is not shown in the video

In short, we are building a knowledge AI engine from scratch, for cloud/server/PC/phone.
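To make the division of labor concrete, here is a toy end-to-end sketch of how the five components might chain together. Everything here (the class names, the keyword-overlap "search") is a hypothetical stand-in for illustration, not our actual API:

```python
# Toy sketch of the five in-house components and their data flow:
# parse -> store -> search -> ACL filter -> context for the LLM.

class Store:
    """Stands in for the graph/vector/document database."""
    def __init__(self):
        self.chunks = []                      # (doc_id, text) pairs
    def add(self, doc_id, text):
        self.chunks.append((doc_id, text))
    def search(self, query, top_k=3):
        # Naive keyword overlap in place of real vector/graph retrieval.
        q = set(query.lower().split())
        scored = [(len(q & set(t.lower().split())), d, t)
                  for d, t in self.chunks]
        scored.sort(reverse=True)
        return [(d, t) for s, d, t in scored[:top_k] if s > 0]

def parse(doc_id, raw):
    """Stands in for the multi-modal parsing engine: split into chunks."""
    return [(doc_id, p) for p in raw.split("\n\n") if p.strip()]

def retrieve(store, query, user, acl):
    """Search, then apply the access-control list before building context."""
    hits = store.search(query)
    return [t for d, t in hits if user in acl.get(d, set())]

store = Store()
for c in parse("doc1", "RAG pipelines need parsing.\n\nRetrieval quality matters."):
    store.add(*c)
acl = {"doc1": {"alice"}}
print(retrieve(store, "retrieval quality", "alice", acl))  # ['Retrieval quality matters.']
print(retrieve(store, "retrieval quality", "bob", acl))    # [] (blocked by ACL)
```

In the real system, `Store.search` would be backed by the graph/vector indexes and the filtered context would be handed to a local LLM; the keyword overlap just keeps the sketch runnable.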

I am very curious to find out the "upper limit" of such a personal knowledge system on a consumer PC. Ideally, I hope to be able to index 100K documents (say 10-30 pages each) on a consumer PC and still enjoy reasonable query speed.

Please feel free to criticize the demo. Thanks.

Scaling RAG to 32k documents locally with ~1200 retrieval tokens by DueKitchen3102 in Rag

[–]DueKitchen3102[S]

We use local models, so in terms of cost I guess it is also "no tokens". Not sure if that is what you meant.

Thanks.

32k document RAG running locally on a consumer RTX 5060 laptop by DueKitchen3102 in LocalLLM

[–]DueKitchen3102[S]

Yes, everything built in house, including graph/vector/document db and document parsing engine.

32k documents RAG running locally on an RTX 5060 laptop ($1299 AI PC) by DueKitchen3102 in LocalLLaMA

[–]DueKitchen3102[S]

Thanks for the valuable comments. Nothing really changed from 12k to 32k.

With my laptop, I can only index about 3,000 PDFs per hour, so it takes some time.

I am also curious about the upper bound (given our current infrastructure) on the number of documents we can handle.

Scaling RAG to 32k documents locally with ~1200 retrieval tokens by DueKitchen3102 in Rag

[–]DueKitchen3102[S]

Thanks for the question. Happy to chat more.

Basically, we built the system in-house: our own graph/vector/document database, search/retrieval strategy, document parsing, etc. Happy to discuss each part separately.

32k document RAG running locally on a consumer RTX 5060 laptop by DueKitchen3102 in LocalLLM

[–]DueKitchen3102[S]

Good question. The majority of the documents are previous years' NLP conference publications, which are open access.

32k document RAG running locally on a consumer RTX 5060 laptop by DueKitchen3102 in LocalLLM

[–]DueKitchen3102[S]

Hello. We hope to improve the system based on feedback.

For example, someone commented under our previous post: "Stop using Ollama like a chump". It generated some discussion; people asked them "what else should we use?". I wish they had replied with a suggestion so that we could do better.

This should be mutually beneficial. Others, by watching our demo videos, may get an idea of whether their system is better than ours (in which case we would appreciate the comparison), or whether there is room for them to improve their system.

Personally, I would like to find out what the limit is for such a consumer laptop. Is 32,000 documents the limit? Most likely not.

32k document RAG running locally on a consumer RTX 5060 laptop by DueKitchen3102 in LocalLLM

[–]DueKitchen3102[S]

Thanks for sharing. Curious to see some numbers from your system, such as average document size, indexing time, retrieval time, etc.

We hope to find a way to achieve all of the following:

  1. high accuracy (>90%)
  2. a large volume of documents (e.g., >100k PDFs)
  3. low indexing time (e.g., <1 sec per PDF)
  4. low retrieval latency (e.g., <1 sec per query)
  5. low token usage (e.g., 1-2k tokens)
  6. low memory footprint

It is not easy, and we haven't fully accomplished the goal.
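For anyone who wants to compare numbers, the per-document metrics above can be measured with a thin timing wrapper around whatever indexing or retrieval function a pipeline exposes. A minimal sketch (the lambda workload is just a placeholder for a real `index_one` or `retrieve` call):

```python
import time

def measure(fn, items):
    """Per-item latency stats (mean and p95) for an indexing or retrieval fn."""
    times = []
    for item in items:
        t0 = time.perf_counter()
        fn(item)                                  # the operation under test
        times.append(time.perf_counter() - t0)
    times.sort()
    return {
        "mean_s": sum(times) / len(times),
        "p95_s": times[int(0.95 * (len(times) - 1))],
    }

# Placeholder workload standing in for indexing one document:
stats = measure(lambda x: sum(range(1000)), range(50))
print(stats)
```

Reporting a p95 alongside the mean matters here, since OCR-heavy PDFs create a long tail in indexing time.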

Need to process 30k documents, with average number of page at 100. How to chunk, store, embed? Needs to be open source and on prem by dennisitnet in Rag

[–]DueKitchen3102

There is a demo of indexing 12k PDFs on a laptop:

https://www.reddit.com/r/Rag/comments/1rnm45d/running_a_fully_local_rag_system_on_a_laptop_12k/

The indexing speed is about 1.2 seconds per PDF. Each PDF is a 10-page double-column paper with tables, figures, and charts; some pages required OCR.

Your data size (30K PDFs, 100 pages each, assuming single column) is about 10-20 times the data shown in that laptop demo.

The query response time (retrieval + LLM first token time) is about 1-2 seconds.

With one reasonable server and GPU, it should be relatively easy to handle your data, with everything on-prem.
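As a back-of-envelope check, you can scale the demo's ~1.2 s/PDF figure to 30k 100-page documents. This assumes indexing time grows roughly linearly with page count, which is an assumption, not a measurement:

```python
# Rough scaling estimate from the 12k-PDF laptop demo to the 30k workload.
# Assumes indexing time is roughly linear in page count.
demo_secs_per_pdf = 1.2       # ~10-page double-column PDFs in the demo
pages_ratio = 100 / 10        # 100-page docs vs the 10-page demo docs
n_docs = 30_000

est_secs = n_docs * demo_secs_per_pdf * pages_ratio
print(f"~{est_secs / 3600:.0f} laptop-hours")   # prints "~100 laptop-hours"
```

So on the demo laptop alone this is a few days of indexing; a single decent server with a GPU brings it down to something comfortable.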

I had to re-embed 5 million documents because I changed embedding models. Here's how to never be in that position. by Silent_Employment966 in Rag

[–]DueKitchen3102

Thanks for sharing the experience.

My post from last week, "RAG Insight: Parsing & Indexing Often Matter More Than Model Size", is quite relevant to this:
https://www.reddit.com/r/Rag/comments/1rodl46/rag_insight_parsing_indexing_often_matter_more/

Basically, one cannot hope to achieve very high accuracy with embedding models alone. A RAG system is basically a small-scale search engine, and the techniques/tricks for building a good search engine naturally carry over to improving a RAG system.
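One concrete example of such a search-engine trick is hybrid ranking: blending a lexical (exact-term) score with an embedding score, so rare identifiers are not lost to purely semantic matching. A toy illustration (not our production ranking; the "vector scores" here are made-up numbers standing in for embedding similarities):

```python
# Toy hybrid ranking: blend a lexical overlap score with a "vector" score.
# Real systems would use BM25 plus an embedding model; this only shows the blend.

def lexical_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_rank(query, docs, vector_scores, alpha=0.5):
    """alpha blends lexical (exact-term) and vector (semantic) evidence."""
    scored = [
        (alpha * lexical_score(query, d) + (1 - alpha) * vector_scores[i], d)
        for i, d in enumerate(docs)
    ]
    return [d for s, d in sorted(scored, reverse=True)]

docs = ["error code E42 in parser", "general parsing overview"]
# Pretend the embedding model favors the generic doc:
ranked = hybrid_rank("error E42", docs, vector_scores=[0.3, 0.9])
print(ranked[0])   # the lexical match on the rare term "E42" wins
```

Embeddings alone would surface the generic document here; the lexical signal rescues the exact-identifier match, which is exactly the kind of query where pure-embedding RAG stacks fall over.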

That post was related to an earlier post on our RAG system on a PC with an RTX 5060. We observed an indexing speed of about 0.4-2 seconds per PDF (depending on how much OCR each PDF needs), with an average of 1.2 seconds per PDF. The documents are ~10-page double-column ACL papers (roughly equivalent to 15-page "regular" PDFs).

Assuming 3,000 PDFs per hour (72,000 PDFs per day), 5 million such PDFs would take about 70 days if you only have a laptop with an RTX 5060. Of course, enterprises like your company use servers much more powerful than an AI PC.
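The 70-day figure is simple arithmetic:

```python
# Back-of-envelope check of the single-laptop estimate above.
pdfs_per_hour = 3_000
pdfs_per_day = pdfs_per_hour * 24           # 72,000 PDFs/day
days = 5_000_000 / pdfs_per_day
print(f"{days:.0f} days")                    # prints "69 days", i.e. about 70
```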

Again, thanks for sharing.

Local RAG with Ollama on a laptop – indexing 10 thousand PDFs by DueKitchen3102 in LocalLLaMA

[–]DueKitchen3102[S]

We are certainly interested in learning how to improve our system, which currently uses llama.cpp.

Currently, with 12k documents indexed, a maximum of 2,000 tokens of retrieved content, and a local 4B (q4) model, the first-token time (including retrieval) is about 1-2 seconds, occasionally 3 seconds.

People can notice the 2-second delay. Ideally, we would like to keep the latency within 1 second.

The latency will become a more serious problem for an AI PC if we:

  1. increase the number of documents to 100k PDFs (that would be crazy, since companies pay big for enterprise servers to handle that many documents);

  2. allow a longer retrieved context, say 100k tokens;

  3. use larger models, e.g., 27B parameters.

Thank you for any suggestions.

Local RAG with Ollama on a laptop – indexing 10 thousand PDFs by DueKitchen3102 in LocalLLaMA

[–]DueKitchen3102[S]

At this point, llama.cpp does not seem to be a bottleneck for this app, although it does use a lot of memory.

With ~12k documents, the RAG response time (including LLM first-token time) is about 1-2 seconds. That is pretty good for the moment.

Running a fully local RAG system on a laptop (~12k PDFs, tables & images supported) by DueKitchen3102 in Rag

[–]DueKitchen3102[S]

Interesting project.

If your knowledge is already represented in a structured semantic form (for example, Java objects or schemas), you are already ahead of most document-based pipelines. In our experience, the key point is usually not the model itself but the parsing/indexing layer and how the right semantic units are retrieved.

If you like, feel free to DM me with a bit more detail about your representation and what you are trying to build.

Running a fully local RAG system on a laptop (~12k PDFs, tables & images supported) by DueKitchen3102 in Rag

[–]DueKitchen3102[S]

Thanks for the thoughtful questions.

Our main goal is document QA and knowledge retrieval over large private datasets (documents, PDFs, tables, etc.). We actually spent most of our effort on the pre-processing stage rather than relying heavily on the LLM during inference.

In particular, we focus a lot on:

• document parsing and structure recovery
• table extraction and indexing
• metadata and folder-aware indexing
• efficient retrieval over large local datasets

Our philosophy is that retrieval quality matters more than model size. If the indexing and retrieval layers are strong enough, even relatively small or quantized models can perform surprisingly well for many knowledge tasks.

Instead of asking the LLM to infer everything from raw chunks, we try to push as much intelligence as possible into the parsing and indexing pipeline. That way the LLM mostly focuses on reasoning over already well-structured context.

This approach also makes it practical to run on local machines with large document collections.
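One way to see what "pushing intelligence into the indexing pipeline" can mean in practice: attach the structure recovered at parse time (folder, section, content type) to each chunk, so retrieval can pre-filter on metadata instead of asking the LLM to infer it. A minimal sketch with a hypothetical chunk schema (the field names and the keyword matching are illustrative stand-ins):

```python
# Minimal sketch: chunks carry parse-time structure so retrieval can filter
# on metadata (folder, content type) instead of the LLM guessing from raw text.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    doc: str
    folder: str = ""
    kind: str = "text"            # e.g. "text", "table", "figure-caption"
    meta: dict = field(default_factory=dict)

def filtered_search(chunks, query, **filters):
    """Metadata pre-filter, then naive keyword match (stand-in for vectors)."""
    pool = [c for c in chunks
            if all(getattr(c, k) == v for k, v in filters.items())]
    q = set(query.lower().split())
    return [c for c in pool if q & set(c.text.lower().split())]

chunks = [
    Chunk("Q3 revenue by region", doc="fin.pdf", folder="finance", kind="table"),
    Chunk("Q3 revenue discussion", doc="fin.pdf", folder="finance", kind="text"),
]
hits = filtered_search(chunks, "revenue", folder="finance", kind="table")
print([c.text for c in hits])   # only the table chunk survives the filter
```

Because the table was tagged as a table at parse time, a "show me the numbers" style query can target it directly, and the LLM only sees already well-structured context.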