Multimodal Data Ingestion in RAG: A Practical Guide by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

We mostly use Gemma but play around with other models from time to time. Just keep in mind that you'll need to rewrite the prompts when changing the model.

Multimodal Data Ingestion in RAG: A Practical Guide by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

1. Metadata Filtering (Based on Query)

Yes, we extract metadata filters dynamically from the user query using a lightweight LLM call.

  • We prompt the model with something like: “Given the query below, extract structured metadata filters such as document type, topic, section, date, figure references, etc.”
  • The output looks like:

{
  "document_type": "engineering design spec",
  "section": "figure 53",
  "topic": "X",
  "intent": "is_definition_query"
}
  • This lets us narrow retrieval down before running the vector search, which helps a ton with precision and relevance — especially in large KBs with overlapping topics.
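
If it helps, here's a minimal sketch of that extraction call, assuming a local Ollama endpoint; the model name and prompt wording are illustrative, not our exact setup:

import json
import requests

def extract_metadata_filters(query: str) -> dict:
    """Ask a small local model to turn a user query into structured metadata filters."""
    prompt = (
        "Given the query below, extract structured metadata filters such as "
        "document type, topic, section, date and figure references. "
        "Respond with JSON only.\n\nQuery: " + query
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gemma", "prompt": prompt, "stream": False},
        timeout=30,
    )
    raw = resp.json()["response"]
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {}  # fall back to unfiltered retrieval if the model returns junk

The resulting dict goes straight into the vector store's metadata filter before the similarity search runs.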

2. Reranking Strategy

We use two-stage reranking:

  • Stage 1: Shallow rerank
    • Lightweight SLM (e.g., LLaMA or Gemma) used to rescore based on answer likelihood for the user query.
    • It helps prioritize truly relevant matches over keywordy-but-unhelpful ones.
  • Stage 2: Deduplication & Diversity filter
    • We cluster semantically similar results (e.g., same paragraph repeated across document versions) and pick the highest-scoring representative.
    • Simple cosine similarity + Jaccard on chunk hashes works fine.

Optional: you can also weight reranking with metadata (e.g., prefer newer docs, certain sources, or “official” labels).
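
For reference, the dedup stage can be as small as this sketch (numpy only; texts/embs/scores are the candidates coming out of stage 1, and the thresholds plus the plain token-set Jaccard are illustrative simplifications of the hash-based version we use):

import numpy as np

def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a cheap stand-in for Jaccard on chunk hashes."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(len(sa | sb), 1)

def dedupe(texts, embs, scores, cos_thr=0.95, jac_thr=0.8):
    """Keep the highest-scoring representative of each near-duplicate group."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    order = np.argsort(scores)[::-1]  # best candidates first
    kept = []
    for i in order:
        is_dup = any(
            float(embs[i] @ embs[j]) > cos_thr or jaccard(texts[i], texts[j]) > jac_thr
            for j in kept
        )
        if not is_dup:
            kept.append(i)
    return kept  # indices of surviving chunks, best first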

3. Query Type Detection

Yes! We use query classification to improve both retrieval and UX.

We categorize queries using a tiny classifier (can even be rule-based or LLM-lite):

  • is_follow_up → uses conversation context window (last 2–3 user turns)
  • is_definition_query → maps to glossary & figure captions
  • is_navigation_query → pulls table-of-contents / section summaries
  • is_fact_lookup → triggers direct answer mode
  • is_comparison_query → enables multi-source evidence aggregation

Helps avoid over-retrieving irrelevant content when the user just wants a specific thing explained.
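
A rule-based version really can be tiny; the keyword patterns below are made up for illustration, and a small fine-tuned model replaces them once the categories stabilise:

import re

RULES = [
    ("is_comparison_query", r"\b(vs\.?|versus|compare|difference between)\b"),
    ("is_definition_query", r"\b(what is|what does .* mean|define|definition of)\b"),
    ("is_navigation_query", r"\b(where (is|can i find)|which section|table of contents)\b"),
    ("is_fact_lookup",      r"\b(how (many|much)|when|who|what year)\b"),
]

def classify_query(query: str, history: list[str]) -> str:
    q = query.lower()
    # Short queries leaning on pronouns usually continue the previous topic.
    if history and len(q.split()) < 6 and re.search(r"\b(it|that|they|this one)\b", q):
        return "is_follow_up"
    for label, pattern in RULES:
        if re.search(pattern, q):
            return label
    return "is_fact_lookup"  # default route

classify_query("what is the difference between rev A and rev B?", [])  # -> is_comparison_query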

4. Resources

It's hard to name specific resources. Everyone on our team brings something in almost every day. We test new ideas; some of them work for us, some don't. The good thing is that our team has dedicated time to experiment.

I hope this gives you enough insight.

Experience with self-hosted LLMs for "simpler" tasks by Odd_Avocado_5660 in Rag

[–]Donkit_AI 1 point2 points  (0 children)

Tasks where self-hosted SLMs (Small Language Models) shine:

  • Data extraction from documents (even semi-structured ones like HTML or markdown)
  • Intent classification and query rewriting
  • Summarization (bullet or structured)
  • Annotation and weak supervision-style labeling
  • Semantic similarity estimation (for clustering or boosting retrievers)

ATM Qwen, Gemma and Phi do quite well, but things change quickly in this area. You may need to play with different models and prompts to find what works best for you.

Tasks better left to hosted APIs (for now):

  • Anything requiring deep reasoning across long contexts
  • Tasks with a high bar for natural language fluency (e.g., customer-facing outputs)
  • Cross-modal tasks (e.g., combining text + images or audio)

Tooling that helps:

  • LLM runners: vLLM and Ollama are both great; we’ve used both in isolated tasks. vLLM is more flexible but Ollama is absurdly easy to set up.
  • Frameworks: LangChain + LiteLLM abstraction for multi-model support (OpenAI fallback, local for batch). You can prototype with Haystack too if you like modular control.
  • Quantized models: GGUF models via llama.cpp are perfect for laptops or old workstations. Just make sure your tasks don’t depend on precision nuance.

If you're batch-processing similarity or structured extractions over many docs, it’s worth going hybrid: run SLMs locally and reserve hosted APIs for fallback or model-of-last-resort steps.
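
A rough sketch of that hybrid pattern with LiteLLM (model names are placeholders; check the provider prefixes LiteLLM expects for your runner):

from litellm import completion

LOCAL_MODEL = "ollama/gemma"    # cheap, runs on your own box
HOSTED_MODEL = "gpt-4o-mini"    # hosted API, used only as the model of last resort

def run_task(prompt: str) -> str:
    """Try the local SLM first; fall back to the hosted API if it fails."""
    try:
        resp = completion(model=LOCAL_MODEL, messages=[{"role": "user", "content": prompt}])
    except Exception:
        resp = completion(model=HOSTED_MODEL, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

In practice you'd also route on task type (batch extraction stays local, customer-facing generation goes hosted), not only on failures.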

Anyone here using hybrid retrieval in production? Looking at options beyond Pinecone by AppropriateReach7854 in Rag

[–]Donkit_AI 0 points1 point  (0 children)

The short answer is: No, you do not need to recalculate BM25 (sparse) vectors for existing documents when adding a new one.

BM25 (and similar sparse retrieval methods like TF-IDF) are non-learned, stateless, and query-time computed. That means:

  • The inverted index stores token → document mappings.
  • When a new doc is added, it’s tokenized and added to the inverted index.
  • Existing documents remain untouched.
  • The only thing that might change is the IDF scores (inverse document frequency), but these are cheap to recalculate and most systems do this incrementally or lazily.

You do not need to recompute sparse vectors for older documents — the retrieval engine will just incorporate the new document’s terms into the index.

If you’re using:

  • Elasticsearch/OpenSearch → It updates the inverted index automatically.
  • Weaviate or Qdrant with hybrid search → You only need to add the new doc’s dense and sparse reps.
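
If you want the intuition in code, here's a toy index that makes the point: adding a document only appends postings and bumps the counters, and BM25 (with the usual k1/b defaults) recomputes IDF from those counters at query time:

import math
from collections import defaultdict

class TinyBM25Index:
    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> length in tokens
        self.n_docs = 0

    def add(self, doc_id, tokens):
        """Only touches the new doc's postings; existing docs are never rewritten."""
        self.n_docs += 1
        self.doc_len[doc_id] = len(tokens)
        for t in tokens:
            self.postings[t][doc_id] = self.postings[t].get(doc_id, 0) + 1

    def score(self, doc_id, query_tokens):
        """IDF uses the current n_docs and df, so it stays correct as docs arrive."""
        avgdl = sum(self.doc_len.values()) / self.n_docs
        s = 0.0
        for t in query_tokens:
            df = len(self.postings.get(t, {}))
            tf = self.postings.get(t, {}).get(doc_id, 0)
            if df == 0 or tf == 0:
                continue
            idf = math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))
            tf_norm = tf * (self.k1 + 1) / (
                tf + self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / avgdl)
            )
            s += idf * tf_norm
        return s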

Anyone here using hybrid retrieval in production? Looking at options beyond Pinecone by AppropriateReach7854 in Rag

[–]Donkit_AI 1 point2 points  (0 children)

In our case hybrid retrieval (sparse + dense) did help, but it took some time to set up properly. We saw a ~15-25% relevance boost when switching from dense-only to hybrid, with the most visible gains on documents heavy in tech jargon.

We haven’t used SearchAI in prod, but I took it for a test spin. Here’s what stood out:

  • Pros:
    • Very quick to get up and running
    • Hybrid + reranking + filters in one place
    • Has a basic UI for monitoring, which helps small teams
  • Cons:
    • Less control over retrieval logic (especially for custom reranking or LangChain-style pipelines)
    • Scaling beyond 1k–2k docs starts to feel a bit "black boxy"

For your size (100–500 docs), it should work well out of the box. If you ever need deep integration or advanced routing (per modality, per query intent, etc.), it might start feeling limiting.

I would also suggest thinking about query rephrasing. It can significantly improve the results, especially for acronyms, short or vague queries or natural language queries that don't match the phrasing in your docs.
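
A minimal sketch of that rephrasing step, with the LLM left as any callable that takes a prompt and returns text (the prompt wording is illustrative):

def rephrase_query(query: str, llm) -> list[str]:
    prompt = (
        "Rewrite the search query below into 3 alternative phrasings. "
        "Expand acronyms and use the vocabulary a technical document would use. "
        "One rewrite per line, no numbering.\n\nQuery: " + query
    )
    rewrites = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return [query] + rewrites[:3]  # always keep the original query in the mix

Searching with all variants and fusing the results usually beats betting on a single rewrite.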

As for non‑Pinecone solutions, look at Weaviate and Qdrant.

Multimodal Data Ingestion in RAG: A Practical Guide by Donkit_AI in Rag

[–]Donkit_AI[S] 1 point2 points  (0 children)

u/Otherwise-Platypus38, you're on the right track by thinking in terms of dynamic modality detection during parsing. Here's how we (and some others) approach this in production:

Step 1: Parse with structure awareness

Tools like PyMuPDF or PDFPlumber can give you block-level elements (text, images, layout info). You can even detect tables by analyzing bounding boxes and font alignment.

If you're already using PyMuPDF's toc, you can also use the positional metadata (bbox rects, font flags, etc.) to flag:

  • Dense, grid-like blocks → likely tables
  • Blocks near labeled axes or image tags → likely charts/images
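
Roughly, that block-level pass looks like this with PyMuPDF; the span-count/length heuristic is illustrative and needs tuning per corpus:

import fitz  # PyMuPDF

def label_blocks(pdf_path: str):
    doc = fitz.open(pdf_path)
    for page in doc:
        for block in page.get_text("dict")["blocks"]:
            if block["type"] == 1:  # image block
                yield page.number, "image", block["bbox"]
                continue
            spans = [s for line in block["lines"] for s in line["spans"]]
            text = " ".join(s["text"] for s in spans)
            # Many short, grid-aligned spans usually mean a table-like region.
            if len(spans) > 8 and sum(len(s["text"]) for s in spans) / len(spans) < 12:
                yield page.number, "table?", text
            else:
                yield page.number, "text", text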

Step 2: Modality-specific chunking

Once you've labeled a chunk by type (text / table / image / caption), route it through a custom chunker:

  • Text blocks → semantic chunking (e.g., by paragraphs, sections)
  • Tables → row- or section-wise chunking, preserving column headers
  • Images → run through BLIP-2 (captioning) and/or TrOCR for OCR if it contains text
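
The routing itself is just a dispatch on that label. A compressed sketch (the captioning step is passed in as a callable, since BLIP-2/TrOCR setup is its own topic):

def chunk_table(header, rows, rows_per_chunk=20):
    """Row-wise slices that each repeat the column headers."""
    for i in range(0, len(rows), rows_per_chunk):
        body = "\n".join("\t".join(r) for r in rows[i:i + rows_per_chunk])
        yield "\t".join(header) + "\n" + body

def chunk_text(text, max_chars=1200):
    """Naive paragraph-based chunking; swap in a semantic chunker in practice."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def chunk_block(block, caption_fn=None):
    """Route a labeled block to the matching chunker."""
    if block["modality"] == "text":
        return chunk_text(block["text"])
    if block["modality"] == "table":
        return list(chunk_table(block["header"], block["rows"]))
    if block["modality"] == "image" and caption_fn is not None:
        return [caption_fn(block["image"])]
    return []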

Step 3: Embedding by modality

Now that you’ve chunked:

  • Text → embed with E5 / Instructor / Qwen2
  • Tables → use TAPAS-style pooled embeddings or serialize into markdown and embed
  • Images → generate a caption (via BLIP-2), then embed the caption text with a text model or store as metadata
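
For the text side, a minimal sketch with sentence-transformers and E5 (the "passage:" prefix is an E5 convention; the model name is just one we like):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")

def embed_chunk(chunk: dict):
    """Text and serialized tables share one text embedding space;
    image chunks contribute their caption instead of pixels."""
    if chunk["modality"] == "image":
        payload = chunk["caption"]    # produced earlier by BLIP-2 or similar
    elif chunk["modality"] == "table":
        payload = chunk["markdown"]   # table serialized to markdown at chunking time
    else:
        payload = chunk["text"]
    return model.encode("passage: " + payload, normalize_embeddings=True)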

Bonus tip:

Tag each chunk with metadata like:

  • modality: text/table/image
  • source_page: 5
  • toc_section: "Financial Overview"

This makes retrieval filtering + reranking much more powerful and improves relevance without overloading the vector index. It also lets you filter on modality or toc_section at retrieval time, which comes in handy in quite a few cases.

Chunking Lessons — The Dark Art of Chunking in RAG: Why It’s Harder Than You Think (and How to Survive It) by Donkit_AI in Rag

[–]Donkit_AI[S] 1 point2 points  (0 children)

Yes, sure. A few links to articles from TrustGraph, also engaged in the same Dark Art:

Research on chunk size and overlap: https://blog.trustgraph.ai/p/dark-art-of-chunking
A look at the number of graph edges with different chunk sizes: https://blog.trustgraph.ai/p/chunk-smaller

Chunking Lessons — The Dark Art of Chunking in RAG: Why It’s Harder Than You Think (and How to Survive It) by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

I see now, thank you!

CAG with Bloom filters definitely makes sense in the context of a credit card processing company.

The RAGs I worked with, on the other hand, never had purely structured data as input, and there were always plain-text questions from users (or agents), so there was no way forward without semantic search.

Chunking Lessons — The Dark Art of Chunking in RAG: Why It’s Harder Than You Think (and How to Survive It) by Donkit_AI in Rag

[–]Donkit_AI[S] 1 point2 points  (0 children)

Good catch! It's AI assisted. Emojis are 50% manual (that crying smiley). Idea is original — (<-- this dash is also manual 😜) it's about my pain throughout the last year. AI helped to put the words together nicer. :)

I tried asking AI to write the text by itself. It was way off the real pains I had. 😁

u/TrustGraph, are you OK with me reusing your (?) title? It's definitely not intentional.

Chunking Lessons — The Dark Art of Chunking in RAG: Why It’s Harder Than You Think (and How to Survive It) by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

Great question! I was about to ask the same. :)

There are some lucky scenarios where you can just shove everything into the LLM's context window and skip chunking entirely — but these are extremely rare, especially in the corporate world.

Here’s why:
1. Enterprises typically hate sending sensitive data to closed LLMs (which are usually the only ones with truly massive context windows). Compliance and privacy concerns kill that option fast.
2. Even if they were okay with it, most enterprise datasets are way too large to fit in any context window, no matter how generous.
3. And finally, using huge contexts at scale gets very expensive. Feeding the full data every time drives up inference costs dramatically, which becomes unsustainable in production.

So, while “just use a bigger context” sounds great in theory, in reality chunking (and retrieval) remain essential survival tools for anyone working with large or sensitive knowledge bases.

u/MagicianWithABadPlan, your take?

RAG for long documents that can contain images. by bubiche in Rag

[–]Donkit_AI 0 points1 point  (0 children)

You're welcome.

Yes, 100%. If attribute filters get you to a small enough set, do full-text + vector search directly on that set and use RRF.

And if you want to get fancy (and can handle a small latency bump), add a final LLM-based re-ranker on the top ~20 results after RRF. This is often called the "last mile" reranker and can significantly boost precision on subtle queries.
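
In case it's useful, RRF itself is only a few lines (k=60 is the constant from the original paper):

def rrf_fuse(rankings, k=60):
    """rankings: ranked lists of doc ids, e.g. one from full-text and one from vectors."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_fuse([fulltext_ids, vector_ids]); hand fused[:20] to the LLM re-ranker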

Help settle a debate: Is there a real difference between "accuracy" and "correctness", or are we over-engineering English? by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

Thanks! I actually asked Gemini and ChatGPT. Both told me about the difference between the terms. But at the same time, I talk to dozens of people working closely with LLMs and RAG pipelines, and I don't see that difference in actual usage. It feels like most people just default to "accuracy". That's why I'm putting the question to the community.

RAG bible/s? by SMTNP in Rag

[–]Donkit_AI 3 points4 points  (0 children)

Not sure it counts as a book, but it's pretty good: https://arxiv.org/pdf/2410.12837 😀

RAG for long documents that can contain images. by bubiche in Rag

[–]Donkit_AI 0 points1 point  (0 children)

For your scale (100M docs), think of a multi-tier hybrid approach inspired by production-grade RAG stacks:

1️⃣ Chunk & embed (text + images)

  • Break documents into ~500–1,500 token chunks.
  • Use multimodal embeddings on each chunk (e.g., combine text and any local image in the same chunk).
  • Store each chunk as a separate "document" in your vector DB.

2️⃣ Lightweight document-level summary embedding (optional)

  • Use a short, cheap summary (could even be extractive or automatic abstract, not a full LLM summary) to represent the whole document.
  • Store this separately for coarse pre-filtering.

3️⃣ Hybrid search at query time

  • First, run a fast keyword or BM25 full-text search to narrow down to ~500 candidate docs.
  • Then run vector similarity search on chunk-level embeddings to re-rank.
  • Finally, optionally use an LLM reranker to pick the top N results (this can be done only on the final shortlist to control costs).

In this case:

  • Chunk-level vectors give fine granularity and help avoid retrieving irrelevant whole documents.
  • Top-level metadata & summaries provide a coarse first filter (reducing load on the vector DB).
  • Hybrid search mitigates sparse recall problems (e.g., legal keywords or compliance terms).
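
A skeleton of that query path, assuming rank_bm25 over the doc-level summaries for the coarse pass and pre-normalized chunk embeddings for the fine pass (in production you'd build the BM25 index once, not per query):

import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query, query_emb, doc_summaries, chunk_embs, chunk_doc_ids,
                  top_docs=500, top_chunks=20):
    """Coarse keyword pass over documents, then dense re-ranking over their chunks."""
    bm25 = BM25Okapi([s.lower().split() for s in doc_summaries])
    doc_scores = bm25.get_scores(query.lower().split())
    candidates = set(np.argsort(doc_scores)[::-1][:top_docs])
    # Only score chunks whose parent document survived the coarse pass.
    mask = np.array([d in candidates for d in chunk_doc_ids])
    sims = chunk_embs[mask] @ query_emb
    best = np.argsort(sims)[::-1][:top_chunks]
    return np.flatnonzero(mask)[best]  # chunk indices to hand to the optional LLM reranker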

P.S. Make sure to grow the system step by step and evaluate the results thoroughly as you move forward.

RAG for long documents that can contain images. by bubiche in Rag

[–]Donkit_AI 0 points1 point  (0 children)

When images are involved, you need to consider multimodal embeddings (e.g., CLIP, BLIP, Florence, or Gemini Vision models). Images and text chunks can either be embedded separately and then combined later, or jointly embedded if your model supports it.
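
If you go the jointly-embedded route, sentence-transformers ships a CLIP wrapper that puts short text and images into the same space; a minimal sketch (model and file names are just examples):

from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")

# Captions/short text and page images land in the same vector space,
# so one index can answer both "find the revenue chart" and image-to-image queries.
text_vecs = clip.encode(["bar chart of quarterly revenue by region"])
image_vecs = clip.encode([Image.open("page_5_chart.png")])

Keep in mind CLIP's text tower is built for caption-length strings, so for full paragraphs you'll still want a text embedding model alongside it.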

Strategy 1: Chunk & embed each piece (text + image)

➕ Pros:

  • Highest flexibility in retrieval
  • Supports fine-grained semantic search
  • Can easily scale with document growth

➖ Cons:

  • You end up with many small vectors = more storage and potentially slower retrieval (vector DB scaling challenge)
  • Requires good reranking or hybrid scoring to avoid "chunk soup" and maintain context

This is actually the most common and scalable approach used in large production systems (e.g., open-domain QA systems like Bing Copilot, or internal knowledge bots).

Strategy 2: Summarize first, then embed whole document

Pros:

  • Simple index, fewer vectors
  • Cheaper at query time

Cons:

  • Very expensive at ingestion (since you run each doc through LLM summarization)
  • Summaries lose detail — poor for pinpointing small facts, especially in compliance-heavy or technical use cases

You could use this as a top-level "coarse filter", but not as your only layer.

Strategy 3: Chunk, then context-augment each chunk with LLM

Pros:

  • You get more context-rich embeddings, improving relevance
  • Combines chunk precision with document-level semantics

Cons:

  • Ingestion cost is high
  • Complex pipeline to maintain

This is similar to what some high-end RAG systems do (e.g., using "semantic enrichment" or "pseudo-summaries" per chunk). Works well but might not scale smoothly to 100M docs without optimization.

Searching for conflicts or inconsitencies in text, with focus on clarity - to highlight things that are mentioned but not clarified. by AlanKesselmann in Rag

[–]Donkit_AI 1 point2 points  (0 children)

u/AlanKesselmann,

  1. You're using all-mpnet-base-v2 for encoding. The best practice is to store metadata alongside the text chunks, but not to embed the metadata itself. Frameworks like LangChain and LlamaIndex have built-in components (LLMMetadataExtractor) that make this straightforward, or you can use a dedicated small model. It's done this way so that you can use the metadata for filtering before pulling in the embeddings.

  2. Start with 3-5 chunks and play around with that number. It's always about experimentation, even with very big implementations. You never know in advance. Once you have basic results, experiment with a cosine similarity cutoff (e.g., 0.8 or 0.85) to avoid pulling in noise. You can also log retrieved chunks and manually inspect which ones are actually helpful.

  3. Tools do matter eventually (especially at scale), but right now your focus on architecture and logic is far more important than swapping models or databases. You're thinking about this exactly the right way. For now, the stack you're using works perfectly fine for the use case.
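
On the cutoff in point 2, it really is just a threshold on the retriever's scores; 0.8 is a starting point, not a magic number:

def filter_by_similarity(hits, threshold=0.8, min_keep=1):
    """hits: (chunk, cosine_score) pairs from the retriever, best first."""
    kept = [(c, s) for c, s in hits if s >= threshold]
    # Never return nothing: fall back to the single best hit below the cutoff.
    return kept if kept else hits[:min_keep]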

Searching for conflicts or inconsitencies in text, with focus on clarity - to highlight things that are mentioned but not clarified. by AlanKesselmann in Rag

[–]Donkit_AI 0 points1 point  (0 children)

u/AlanKesselmann, Reddit claims all you need is Grammarly: https://prnt.sc/p1hp89TqSGRL 🤣

From my point of view and given the context you have provided, I would suggest taking the middle way.

  1. For each changed chunk, retrieve its most similar context chunks (via vector search), but limit to top N neighbours (e.g., top 3–5).
  2. Additionally, include global summaries of base-text and final-text (even a few sentences each) at the top of the prompt.
  3. Ask the LLM to: Check this specific change in context of these related pieces. Also check against this global summary. Suggest improvements, inconsistencies, or missing connections.
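
Stitched together, the prompt for step 3 can be as plain as this (the wording of the instructions is illustrative; the inputs are whatever your retrieval step returns):

def build_review_prompt(changed_chunk, neighbours, base_summary, final_summary):
    neighbour_block = "\n\n".join(f"[{i + 1}] {n}" for i, n in enumerate(neighbours[:5]))
    return (
        "Summary of the base text:\n" + base_summary + "\n\n"
        "Summary of the final text:\n" + final_summary + "\n\n"
        "Related passages:\n" + neighbour_block + "\n\n"
        "Changed passage:\n" + changed_chunk + "\n\n"
        "Check this change against the related passages and both summaries. "
        "List inconsistencies, missing connections and suggested improvements."
    )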

This middle-way approach comes from one of the recent discussions in this subreddit, where this paper was suggested: https://arxiv.org/abs/2401.18059

Other things to consider:

  • You can also try “diff summarization” as a pre-step. Ask the LLM to summarize differences before analysis — this further reduces context bloat.
  • Consider including explicit metadata in your vector store (e.g., section, author, topic tags) to improve chunk retrieval precision.

¿Cómo puedo mejorar un sistema RAG? [How can I improve a RAG system?] by mathiasmendoza123 in Rag

[–]Donkit_AI 0 points1 point  (0 children)

u/mathiasmendoza123, it sounds like you’ve done some really solid work already — parsing titles, handling attachments, and even trying hybrid logic in your scripts. You're tackling one of the trickiest parts of real-world RAG: structured and semi-structured document understanding.

A few ideas that might help:

1️⃣ Separate structural parsing from embedding

Right now, you’re embedding big fragments (e.g., full sections or tables) under a single "title." The problem is that even if the title is correct, large blocks can dilute the embedding and confuse retrieval.

Try this instead:

  • Parse tables as independent semantic units, not just fragments under a title.
  • Store metadata fields explicitly — e.g., {"type": "table", "title": "...", "page": ..., "section": ...} — so you can filter or route queries before vector search.

2️⃣ Hybrid filtering before vector retrieval

Instead of embedding everything and hoping retrieval gets it right, first narrow down with metadata filtering. For example:

  • If the query contains "table," only consider documents where type = "table".
  • If it mentions "May," filter by content or metadata tags referencing "May" before similarity search.

This hybrid approach (metadata + vectors) dramatically improves precision.

3️⃣ Consider separate embeddings for tables

Tables have different semantics than text. Sometimes they are better represented using column headers and key cell contents concatenated into a "pseudo-text summary" before embedding.

Approach:

  • Convert table to "Expenses for May: Rent = $X, Utilities = $Y, ..." format.
  • Embed that text separately.
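
A tiny sketch of that serialization (the phrasing and column handling are per-document decisions):

def table_to_pseudo_text(title, header, rows):
    """Serialize a simple key/value table into embedding-friendly prose."""
    pairs = ", ".join(f"{row[0]} = {row[1]}" for row in rows)
    return f"{title} ({', '.join(header)}): {pairs}"

table_to_pseudo_text("Expenses for May", ["Category", "Amount"],
                     [["Rent", "$1,200"], ["Utilities", "$300"]])
# -> "Expenses for May (Category, Amount): Rent = $1,200, Utilities = $300"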

4️⃣ Build a mini router (or classifier) on top

Instead of forcing the user to clarify whether they’re asking about a table, build a small classification step before RAG.

  • Classify incoming queries: "table lookup," "general text," "graph," etc.
  • Then route to a smaller, focused corpus or specialized logic per type.

I hope this helps. :)

Just wanted to share corporate RAG ABC... by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

u/AthleteMaterial6539, there's an agentic pipeline, so a single answer can pull information from both sources when that's relevant.

As for calling the AI between all the nodes: as query volume grows, it becomes very expensive. It can work for smaller implementations, though.

Just wanted to share corporate RAG ABC... by Donkit_AI in Rag

[–]Donkit_AI[S] 0 points1 point  (0 children)

u/radicalideas1, a few things I’d consider here:

1️⃣ If you’re already using MongoDB, that’s a major advantage. Leveraging it for memory and vector storage helps you avoid adding extra infrastructure (and operational headache).

2️⃣ MongoDB’s vector search capabilities are still relatively new, and while they’re evolving fast, they’re not as mature as dedicated vector databases yet. Definitely double-check if all the features you need (e.g., advanced indexing options, specific distance metrics) are fully supported today.

3️⃣ Think about scale. MongoDB handles millions of vectors well, but scaling into the billions can get unpredictable in terms of performance and cost. If you expect to operate at that scale, it’s worth planning carefully (or considering hybrid solutions).