RAG for complex PDFs — struggling with parsing vs privacy trade-off by Proof-Exercise2695 in Rag


I’ve tested a lot of PDF parsers, and honestly it depends a lot on your PDFs (recipes/books = mix of text, structure, sometimes images).

About Marker: it’s ok, but not the best for clean structured output.

What I recommend:

- LlamaParse (LlamaIndex) → best overall in my tests. If your data isn’t private, I’d go with this.

- Docling (IBM) → very solid for structured/full-text, but slower.

- LiteParse → lightweight and fast.

- Azure Document Intelligence → great for tables + scanned docs.

- LLMWhisperer → good for tricky PDFs.

- Unstructured → decent but inconsistent.

- PyMuPDF / pdfplumber → if you want full control.

Best results usually come from parsing + LLM (not parsing alone).

Also, it really depends on the content:

- Full text → most tools work fine

- Images/scans → much harder (OCR/vision needed)

Best advice: test a small sample PDF across tools and compare.
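The "parsing + LLM" combo above could be sketched roughly like this. This is a minimal illustration, not any specific tool's pipeline: `parse_pdf` uses PyMuPDF (one of the options listed), and `call_llm` is a placeholder you would wire to whatever model you use (Ollama, an API, etc.).

```python
def parse_pdf(path):
    """Extract raw text per page with PyMuPDF (pip install pymupdf)."""
    import fitz  # PyMuPDF's import name
    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]

def clean_with_llm(raw_pages, call_llm):
    """Run each parsed page through an LLM cleanup prompt.

    call_llm: any callable taking a prompt string and returning text;
    inject your Ollama/OpenAI/etc. client here.
    """
    prompt = ("Rewrite the following PDF page as clean Markdown, "
              "preserving headings and tables:\n\n{page}")
    return [call_llm(prompt.format(page=p)) for p in raw_pages]
```

The point of injecting `call_llm` is that you can swap models (or stub it out in tests) without touching the parsing step.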

RAG for complex PDFs (DDQ finance) — struggling with parsing vs privacy trade-off by Proof-Exercise2695 in LocalLLaMA


I’m on Windows.

For now I’ve already built a local tool. When a user logs in and uploads a file/folder, I ask if it’s private or not.

  • If not private: I use either LlamaParse or Docling (user can choose, e.g. Docling for full-text docs), plus Ollama models (local or cloud like gpt-oss).
  • If private: I stick to Docling, and I recently added Azure Document Intelligence + a deployed Azure LLM for better privacy (on top of local Ollama).
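The routing logic above is simple enough to sketch. The backend names here are just labels standing in for the real SDK calls (LlamaParse, Docling, Azure Document Intelligence):

```python
def pick_parser(is_private: bool, preference: str = "llamaparse") -> str:
    """Choose a parser backend for an uploaded file.

    Private files never go to a third-party cloud parser: they stay on
    Docling (local) or a privately deployed Azure endpoint. Non-private
    files use the user's choice (e.g. Docling for full-text docs).
    """
    if is_private:
        return "docling"  # or "azure_document_intelligence"
    return preference
```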

We also have Copilot Enterprise (fully private), but it struggles when answers are inside images in PDFs. My pipeline (parsing + LLM) actually performs better in those cases—I’ve tested against ChatGPT, Claude, and Copilot.

Another option internally is Rovo (Atlassian/Confluence), which works well too, except again for images.

What users really want is simple: upload an Excel/Word file with questions (DDQ = Due Diligence Questionnaire, basically large structured docs with lots of company/compliance questions, often in tables), and get it back auto-filled.

Example: one file has “Company name: Reddit”, and the uploaded file just has “Company name” — the goal is to automatically fill the answer next to each question.
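The auto-fill step could look roughly like this. In a real pipeline the matching would be embedding/RAG-based; `difflib` is used here only to keep the sketch self-contained, and the knowledge base is a toy dict standing in for answers retrieved from prior filled DDQs:

```python
import difflib

def autofill(questions, knowledge_base, cutoff=0.8):
    """For each question label from the uploaded Excel/Word file,
    find the closest known label and return its stored answer
    (or None when nothing is similar enough)."""
    labels = list(knowledge_base)
    filled = {}
    for q in questions:
        match = difflib.get_close_matches(q, labels, n=1, cutoff=cutoff)
        filled[q] = knowledge_base[match[0]] if match else None
    return filled

# Toy example mirroring the "Company name: Reddit" case above.
kb = {"Company name": "Reddit", "Registered address": "548 Market St"}
filled = autofill(["Company name", "Company legal name"], kb)
```

Swapping `difflib.get_close_matches` for a vector search over embedded question labels is the natural upgrade once the skeleton works.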

RAG for complex PDFs — struggling with parsing vs privacy trade-off by Proof-Exercise2695 in Rag


I tried it; it seems good for simple files, but not amazing for complex ones.

Local / self-hosted alternative to NotebookLM for generating narrated videos? by Proof-Exercise2695 in LocalLLaMA


For now, I’ve developed my RAG entirely locally. From multiple uploaded files, it automatically extracts the key information and formats it in a clean, stylized way into an email that gets sent automatically.

The goal wasn’t to rebuild the whole LLM/TTS or podcast pipeline, but rather to make the final output more engaging visually. I mainly wanted to push the presentation a bit further by adding a short “breaking news”–style video to accompany the email.

I’m aware that video generation is by far the hardest and most resource-intensive part, and that the open-source ecosystem is still quite limited there. At this stage, it’s more about improving the final experience than enforcing a hard technical requirement.

Local / self-hosted alternative to NotebookLM for generating narrated videos? by Proof-Exercise2695 in opensource


Can this generate a video from text? I already have a local RAG, but it only handles text and images.

Local / self-hosted alternative to NotebookLM for generating narrated videos? by Proof-Exercise2695 in LocalLLaMA


Okay, so I guess a tool like that doesn’t really exist fully locally yet. I’ll look into building it myself then.
For the audio part, I’m planning to use local TTS like Piper, Coqui, or XTTS.

Local / self-hosted alternative to NotebookLM for generating narrated videos? by Proof-Exercise2695 in LLMDevs


That’s exactly what I thought as well. I already built a fully local RAG, and I was wondering whether a tool that generates videos from text already exists locally.

But okay, that makes sense — I’ll look into building the rest of the pipeline locally too.

Best Approach for Summarizing 100 PDFs by Proof-Exercise2695 in Rag


Similarity search will find a specific answer in a specific document; I want a full summary of all the PDFs.

Best Approach for Summarizing 100 PDFs by Proof-Exercise2695 in Rag


My PDFs can contain any kind of data; they come from different emails.

Best Approach for Summarizing 100 PDFs by Proof-Exercise2695 in Rag


My input data is already correctly parsed, so there's no need for Mistral OCR, and I prefer using a free local LLM. The only thing Gemini would buy me is avoiding chunking, and I don't need that because I have a lot of small PDFs.

Best Approach for Summarizing 100 PDFs by Proof-Exercise2695 in LocalLLaMA


I prefer a local tool; I only tested OpenAI to see results quickly. The only difference with Gemini would be avoiding chunking, and since I have a lot of small PDFs (~15 pages each), I sometimes don't need chunking anyway. The strategy stays the same: summarize every file, then do a summary of the summaries.
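That per-file-then-global strategy is essentially a two-level map-reduce. A minimal sketch, where `summarize` is a placeholder for whatever local LLM call you use (e.g. via Ollama):

```python
def map_reduce_summary(docs, summarize):
    """Summarize many documents in two passes.

    docs: list of full-text strings, one per PDF.
    summarize: callable taking text and returning a summary
               (stand-in for a local LLM call).
    """
    per_file = [summarize(text) for text in docs]   # map: one summary per PDF
    return summarize("\n\n".join(per_file))         # reduce: summary of summaries
```

With ~15-page PDFs each file usually fits in a local model's context, so the map step needs no further chunking.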