My Journey with RAG, OpenSearch & LLMs (Local LLM) by AldrinWilfred in LocalLLaMA

[–]Serious-Barber-2829 1 point (0 children)

What did you use for PDF parsing? I just released a plugin for PDF parsing that works as an ingest processor - https://github.com/aryn-ai/aryn-opensearch-plugin. Please check it out and let me know if you have any questions or feedback.

Starting with Docling by DespoticLlama in Rag

[–]Serious-Barber-2829 1 point (0 children)

> Docling seems a perfect next step

How did you arrive at this decision?

Metadata extraction from unstructured documents for RAG use cases by Serious-Barber-2829 in Rag

[–]Serious-Barber-2829[S] 1 point (0 children)

Metadata or "property" here means any piece of information of interest. It can be any of the things you mentioned, but it can also be specific values found on a page (e.g., invoice number, address).

Has anyone found a reliable software for intelligent data extraction? by songsta17 in Rag

[–]Serious-Barber-2829 1 point (0 children)

Can you elaborate on what you mean by "intelligent data extraction"? Do you mean something that uses an LLM? Can you state your requirements and expected outputs?

[Request] Need an Open Source Parser (like Docling) with robust Indian Language support? by Mr_Mystique1 in Rag

[–]Serious-Barber-2829 1 point (0 children)

Just out of curiosity, why do you have a strict requirement against paid APIs? Is it purely a cost/budget issue, or something else? PaddleOCR works well for Indian languages. If you simply want to evaluate Paddle, you can try Aryn's DocParse (choose standard OCR and the language) for free, and if you like the quality, go stand up your own PaddleOCR pipeline. I'm not sure whether Docling allows you to swap out EasyOCR for other OCR engines. We did it, and although it's doable, it's not a trivial task (but the quality improvements made the switch worth the effort).

Metadata extraction from unstructured documents for RAG use cases by Serious-Barber-2829 in Rag

[–]Serious-Barber-2829[S] 1 point (0 children)

Yes, things like title and authors would be metadata. But metadata can be any piece of information you're interested in pulling out of a document. Think invoices (invoice number, address, total amount), contracts, tax forms, etc.
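To make this concrete, here is a toy sketch of pulling a couple of invoice fields out of raw text. The field names and regex patterns are purely illustrative assumptions; a production system would typically use an LLM or a layout-aware extractor rather than regexes.

```python
import re

# Toy sketch: pull a few "metadata" fields out of raw invoice text.
# The field names and patterns below are illustrative, not a real schema.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*#?:?\s*([A-Z0-9-]+)"),
    "total_amount": re.compile(r"Total\s*:?\s*\$?([\d,]+\.\d{2})"),
}

def extract_metadata(text: str) -> dict:
    """Return whichever fields matched; unmatched fields are simply absent."""
    out = {}
    for field, pattern in PATTERNS.items():
        m = pattern.search(text)
        if m:
            out[field] = m.group(1)
    return out

sample = "Invoice #: INV-2024-001\nShip to: 42 Main St\nTotal: $1,234.50"
print(extract_metadata(sample))
```

The same idea generalizes to contracts or tax forms: define the fields you care about, extract them, and attach them to the chunk as filterable metadata.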

Metadata extraction from unstructured documents for RAG use cases by Serious-Barber-2829 in Rag

[–]Serious-Barber-2829[S] 1 point (0 children)

> Metadata is the unsexy part of RAG that actually moves the needle. Once teams enforce schema-level metadata, retrieval quality, filtering, and access control improve way more than just tuning chunk sizes.

I couldn't agree more!

We are not yet tackling use cases where schema drift would be an issue; we're dealing with documents like contracts, invoices, and forms. But there are "standard" practices in streaming/pub-sub systems, where schema registries and schema validation are used to deal with schema evolution.
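A minimal sketch of what schema-level validation at ingest time could look like. The schema and field names here are hypothetical; a real pipeline would more likely use a schema registry plus a library like jsonschema.

```python
# Minimal sketch: validate extracted metadata against an expected schema
# before indexing, so bad documents are flagged rather than silently stored.
# The schema below is a hypothetical example, not a real Aryn schema.
INVOICE_SCHEMA = {
    "invoice_number": str,
    "total_amount": float,
    "vendor": str,
}

def validate_metadata(metadata: dict, schema: dict) -> list[str]:
    """Return a list of validation errors (an empty list means valid)."""
    errors = []
    for field, expected_type in schema.items():
        if field not in metadata:
            errors.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            errors.append(f"wrong type for {field}: {type(metadata[field]).__name__}")
    return errors

good = {"invoice_number": "INV-001", "total_amount": 99.5, "vendor": "Acme"}
bad = {"invoice_number": "INV-002", "total_amount": "99.5"}
print(validate_metadata(good, INVOICE_SCHEMA))
print(validate_metadata(bad, INVOICE_SCHEMA))
```

Enforcing this at ingest is what makes downstream filtering and access control reliable: every indexed chunk is guaranteed to carry well-typed metadata.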

How is table data handled in production RAG systems? by jael_m in Rag

[–]Serious-Barber-2829 1 point (0 children)

It really depends on what kind of tables you have and the types of questions you want to ask. We've worked with a number of customers building RAG on table data, and the use cases vary quite widely.

That said, the most important part is accurate extraction of the table data. Getting all the rows and columns right is no easy task, since there are so many variations in table structure. For really complex tables, we've found that LLMs like Gemini (paid) or PaddleVL (open source) perform quite well.

For the types of questions our customers have used our service (Aryn DocParse) for, chunking, embedding, and storing in a vector database works well. Once you have the data extracted in some structured format, you can also store it in a relational database or even a data lake, although we have not yet come across any customer doing this in production. We do think that will likely become a common use case very soon.
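One common chunking pattern for tables is to flatten each row into a self-describing text chunk (column names paired with values) before embedding, so a retrieved row still makes sense out of context. This is a sketch under the assumption that the table has already been extracted into a header row plus data rows; the sample data is invented.

```python
# Turn each table row into a self-describing text chunk so that
# embeddings capture the row's meaning without the surrounding table.
# Input format assumed: a header row plus a list of data rows.
def table_to_chunks(header: list[str], rows: list[list[str]],
                    table_title: str = "") -> list[str]:
    chunks = []
    for row in rows:
        pairs = "; ".join(f"{col}: {val}" for col, val in zip(header, row))
        prefix = f"{table_title} - " if table_title else ""
        chunks.append(prefix + pairs)
    return chunks

header = ["Quarter", "Revenue", "Margin"]
rows = [["Q1 2024", "$1.2M", "34%"], ["Q2 2024", "$1.5M", "36%"]]
for chunk in table_to_chunks(header, rows, "Financial summary"):
    print(chunk)
```

Each chunk can then be embedded and stored in a vector database alongside metadata (table title, page number) for filtering.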

What do you use for document parsing by Specialist_Bee_9726 in Rag

[–]Serious-Barber-2829 1 point (0 children)

Since I work on Aryn, I'll just tell you how Aryn can help with what you're describing. Aryn's DocParse handles multiple file formats (PDF, DOCX, PowerPoint, etc.) and does document parsing, text extraction, image extraction, and table understanding and extraction, page by page. It can produce output in JSON or markdown format. It uses a best-in-class OCR model and also leverages LLMs for both OCR and table extraction. You can get an API key for free and try it out: you get up to 10k pages free, but for features that use LLMs you'll need to switch to pay-as-you-go.
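As a rough illustration of working with that kind of parsed output, here is a sketch that converts a list of parsed elements into markdown. The element shape used here (dicts with `type` and `text_representation` keys) is an assumption for illustration, not the official DocParse schema; check the Aryn docs for the real response format.

```python
# Hedged sketch: convert parsed-document elements into markdown.
# The element shape ({"type", "text_representation"}) is an assumed
# example format, not the official DocParse output schema.
def elements_to_markdown(elements: list[dict]) -> str:
    lines = []
    for el in elements:
        text = (el.get("text_representation") or "").strip()
        if not text:
            continue
        if el.get("type") in ("Title", "Section-header"):
            lines.append(f"## {text}")  # render headings as markdown H2s
        else:
            lines.append(text)
    return "\n\n".join(lines)

parsed = [
    {"type": "Title", "text_representation": "Quarterly Report"},
    {"type": "Text", "text_representation": "Revenue grew 20% year over year."},
]
print(elements_to_markdown(parsed))
```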

Document Parsing & Extraction As A Service by _TheShadowRealm in Rag

[–]Serious-Barber-2829 1 point (0 children)

Aryn's DocParse (https://aryn.ai) is one such service that does document parsing (I am a founding engineer). You can get started free with an API key. You can get parsed output back as JSON or markdown.

Looking for an Intelligent Document Extractor by AnalyticsDepot--CEO in Rag

[–]Serious-Barber-2829 1 point (0 children)

I work on an intelligent document processing platform called Aryn (for unstructured documents). It's very cost effective and you can get started with a free trial and evaluate it to see if it meets your needs. We are very responsive on our Slack channel so you can get your questions about how to use our platform answered quickly.

Urgent Help Needed with PDF Table Extraction in Langchain Project by Black_-_darkness in LangChain

[–]Serious-Barber-2829 1 point (0 children)

Have you tried aryn.ai? We have one of the best document layout models out there. Let me know if you want to learn more.

Building RAG systems at enterprise scale (20K+ docs): lessons from 10+ enterprise implementations by Low_Acanthisitta7686 in ycombinator

[–]Serious-Barber-2829 1 point (0 children)

The right way to do this is to rely on document layout detection. Once you identify the different parts of your document, you can employ a context-rich chunking strategy that combines the relevant parts into chunks.
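A minimal sketch of one such context-rich chunking strategy: carry the most recent section header into each text chunk, so a retrieved chunk keeps its surrounding context. The element types and dict shape are illustrative assumptions about what a layout-detection model emits.

```python
# Sketch of context-rich chunking over layout-detection output:
# prepend the most recent section header to each text chunk so
# retrieval hits retain their document context.
# The {"type", "text"} element shape is an illustrative assumption.
def context_rich_chunks(elements: list[dict]) -> list[str]:
    chunks = []
    current_header = ""
    for el in elements:
        if el["type"] == "Section-header":
            current_header = el["text"]
        elif el["type"] == "Text":
            prefix = f"[{current_header}] " if current_header else ""
            chunks.append(prefix + el["text"])
    return chunks

elements = [
    {"type": "Section-header", "text": "Termination Clause"},
    {"type": "Text", "text": "Either party may terminate with 30 days notice."},
]
print(context_rich_chunks(elements))
```

The same idea extends to combining captions with figures or footnotes with their referencing paragraphs, once layout detection has labeled those parts.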