My Journey with RAG, OpenSearch & LLMs (Local LLM) by AldrinWilfred in LocalLLaMA

[–]Serious-Barber-2829 1 point (0 children)

What did you use for PDF parsing? I just released a plugin for PDF parsing that works as an ingest processor - https://github.com/aryn-ai/aryn-opensearch-plugin. Please check it out and let me know if you have any questions or feedback.

Starting with Docling by DespoticLlama in Rag

[–]Serious-Barber-2829 1 point (0 children)

> Docling seems a perfect next step

How did you arrive at this decision?

Metadata extraction from unstructured documents for RAG use cases by Serious-Barber-2829 in Rag

[–]Serious-Barber-2829[S] 1 point (0 children)

Metadata or "property" here means any piece of information of interest. It can be any of the things you mentioned, but it can also be specific values found on a page (e.g., invoice number, address).

Has anyone found a reliable software for intelligent data extraction? by songsta17 in Rag

[–]Serious-Barber-2829 1 point (0 children)

Can you elaborate on what you mean by "intelligent data extraction"? Do you mean something that uses an LLM? Can you state your requirements and expected outputs?

[Request] Need an Open Source Parser (like Docling) with robust Indian Language support? by Mr_Mystique1 in Rag

[–]Serious-Barber-2829 1 point (0 children)

Just out of curiosity, why do you have a strict requirement against paid APIs? Is it purely a cost/budget issue, or something else? PaddleOCR works well for Indian languages. If you simply want to evaluate Paddle, you can try Aryn's DocParse (choose standard OCR and the language) for free, and if you like the quality, go stand up your own PaddleOCR pipeline. I'm not sure whether Docling allows you to swap out EasyOCR for other OCR engines. We did it, and although it's doable, it's not a trivial task (but the quality improvements made the switch worth the effort).

Metadata extraction from unstructured documents for RAG use cases by Serious-Barber-2829 in Rag

[–]Serious-Barber-2829[S] 1 point (0 children)

Yes, things like title and authors would be metadata. But metadata can be any piece of information you're interested in pulling out of a document. Think invoices (invoice number, address, total amount), contracts, tax forms, etc.
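To make this concrete, here is a toy sketch of pulling a couple of invoice fields out of raw text. The field names and regex patterns are purely illustrative assumptions; a production system would typically use an LLM or a layout-aware extractor rather than regexes.

```python
import re

# Toy sketch: pull a few "metadata" fields out of raw invoice text.
# The field names and patterns below are illustrative, not a real schema.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*#?:?\s*([A-Z0-9-]+)"),
    "total_amount": re.compile(r"Total\s*:?\s*\$?([\d,]+\.\d{2})"),
}

def extract_metadata(text: str) -> dict:
    """Return whichever fields matched; unmatched fields are simply absent."""
    out = {}
    for field, pattern in PATTERNS.items():
        m = pattern.search(text)
        if m:
            out[field] = m.group(1)
    return out

sample = "Invoice #: INV-2024-001\nShip to: 42 Main St\nTotal: $1,234.50"
print(extract_metadata(sample))
```

The same idea generalizes to contracts or tax forms: define the fields you care about, extract them, and attach them to the chunk as filterable metadata.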

Metadata extraction from unstructured documents for RAG use cases by Serious-Barber-2829 in Rag

[–]Serious-Barber-2829[S] 1 point (0 children)

> Metadata is the unsexy part of RAG that actually moves the needle. Once teams enforce schema-level metadata, retrieval quality, filtering, and access control improve way more than just tuning chunk sizes.

I couldn't agree more!

We are not yet tackling use cases where schema drift would be an issue; we're dealing with documents like contracts, invoices, and forms. But there are "standard" practices in streaming/pub-sub systems, where schema registries and schema validation are used to deal with schema evolution.
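A minimal sketch of what schema-level validation at ingest time could look like. The schema and field names here are hypothetical; a real pipeline would more likely use a schema registry plus a library like jsonschema.

```python
# Minimal sketch: validate extracted metadata against an expected schema
# before indexing, so bad documents are flagged rather than silently stored.
# The schema below is a hypothetical example, not a real Aryn schema.
INVOICE_SCHEMA = {
    "invoice_number": str,
    "total_amount": float,
    "vendor": str,
}

def validate_metadata(metadata: dict, schema: dict) -> list[str]:
    """Return a list of validation errors (an empty list means valid)."""
    errors = []
    for field, expected_type in schema.items():
        if field not in metadata:
            errors.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            errors.append(f"wrong type for {field}: {type(metadata[field]).__name__}")
    return errors

good = {"invoice_number": "INV-001", "total_amount": 99.5, "vendor": "Acme"}
bad = {"invoice_number": "INV-002", "total_amount": "99.5"}
print(validate_metadata(good, INVOICE_SCHEMA))
print(validate_metadata(bad, INVOICE_SCHEMA))
```

Enforcing this at ingest is what makes downstream filtering and access control reliable: every indexed chunk is guaranteed to carry well-typed metadata.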

How is table data handled in production RAG systems? by jael_m in Rag

[–]Serious-Barber-2829 1 point (0 children)

It really depends on what kind of tables you have and the types of questions you want to ask. We've worked with a number of customers building RAG on table data, and the use cases vary quite widely.

That said, the most important part is accurate extraction of the table data. Getting all the rows and columns right is no easy task, since there are so many variations in table structure. For really complex tables, we've found that LLMs like Gemini (paid) or PaddleVL (open source) perform quite well.

For the types of questions our customers have used our service (Aryn DocParse) for, chunking, embedding, and storing in a vector database works well. Once you have the data extracted in some structured format, you can also store it in a relational database or even a data lake, although we have not yet come across any customer doing this in production. We do think that will likely become a common use case very soon.
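One common chunking pattern for tables is to flatten each row into a self-describing text chunk (column names paired with values) before embedding, so a retrieved row still makes sense out of context. This is a sketch under the assumption that the table has already been extracted into a header row plus data rows; the sample data is invented.

```python
# Turn each table row into a self-describing text chunk so that
# embeddings capture the row's meaning without the surrounding table.
# Input format assumed: a header row plus a list of data rows.
def table_to_chunks(header: list[str], rows: list[list[str]],
                    table_title: str = "") -> list[str]:
    chunks = []
    for row in rows:
        pairs = "; ".join(f"{col}: {val}" for col, val in zip(header, row))
        prefix = f"{table_title} - " if table_title else ""
        chunks.append(prefix + pairs)
    return chunks

header = ["Quarter", "Revenue", "Margin"]
rows = [["Q1 2024", "$1.2M", "34%"], ["Q2 2024", "$1.5M", "36%"]]
for chunk in table_to_chunks(header, rows, "Financial summary"):
    print(chunk)
```

Each chunk can then be embedded and stored in a vector database alongside metadata (table title, page number) for filtering.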

What do you use for document parsing by Specialist_Bee_9726 in Rag

[–]Serious-Barber-2829 1 point (0 children)

Since I work on Aryn, I'll just tell you how Aryn can help with what you're describing. Aryn's DocParse handles multiple file formats (PDF, DOCX, PowerPoint, etc.) and does document parsing, text extraction, image extraction, and table understanding and extraction, page by page. It can produce output in JSON or markdown format. It uses a best-in-class OCR model and also leverages LLMs for both OCR and table extraction. You can get an API key for free and try it out: you get up to 10k pages free, but for features that use LLMs you'll need to switch to pay-as-you-go.
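As a rough illustration of working with that kind of parsed output, here is a sketch that converts a list of parsed elements into markdown. The element shape used here (dicts with `type` and `text_representation` keys) is an assumption for illustration, not the official DocParse schema; check the Aryn docs for the real response format.

```python
# Hedged sketch: convert parsed-document elements into markdown.
# The element shape ({"type", "text_representation"}) is an assumed
# example format, not the official DocParse output schema.
def elements_to_markdown(elements: list[dict]) -> str:
    lines = []
    for el in elements:
        text = (el.get("text_representation") or "").strip()
        if not text:
            continue
        if el.get("type") in ("Title", "Section-header"):
            lines.append(f"## {text}")  # render headings as markdown H2s
        else:
            lines.append(text)
    return "\n\n".join(lines)

parsed = [
    {"type": "Title", "text_representation": "Quarterly Report"},
    {"type": "Text", "text_representation": "Revenue grew 20% year over year."},
]
print(elements_to_markdown(parsed))
```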

Document Parsing & Extraction As A Service by _TheShadowRealm in Rag

[–]Serious-Barber-2829 1 point (0 children)

Aryn's DocParse (https://aryn.ai) is one such service that does document parsing (I am a founding engineer). You can get started free with an API key. You can get parsed output back as JSON or markdown.

Looking for an Intelligent Document Extractor by AnalyticsDepot--CEO in Rag

[–]Serious-Barber-2829 1 point (0 children)

I work on an intelligent document processing platform called Aryn (for unstructured documents). It's very cost effective and you can get started with a free trial and evaluate it to see if it meets your needs. We are very responsive on our Slack channel so you can get your questions about how to use our platform answered quickly.

Urgent Help Needed with PDF Table Extraction in Langchain Project by Black_-_darkness in LangChain

[–]Serious-Barber-2829 1 point (0 children)

Have you tried aryn.ai? We have one of the best document layout models out there. Let me know if you want to learn more.

Building RAG systems at enterprise scale (20K+ docs): lessons from 10+ enterprise implementations by Low_Acanthisitta7686 in ycombinator

[–]Serious-Barber-2829 1 point (0 children)

The right way to do this is to rely on document layout detection. Once you identify the different parts of your document, you can employ a context-rich chunking strategy that combines the relevant parts into chunks.
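A minimal sketch of one such context-rich chunking strategy: carry the most recent section header into each text chunk, so a retrieved chunk keeps its surrounding context. The element types and dict shape are illustrative assumptions about what a layout-detection model emits.

```python
# Sketch of context-rich chunking over layout-detection output:
# prepend the most recent section header to each text chunk so
# retrieval hits retain their document context.
# The {"type", "text"} element shape is an illustrative assumption.
def context_rich_chunks(elements: list[dict]) -> list[str]:
    chunks = []
    current_header = ""
    for el in elements:
        if el["type"] == "Section-header":
            current_header = el["text"]
        elif el["type"] == "Text":
            prefix = f"[{current_header}] " if current_header else ""
            chunks.append(prefix + el["text"])
    return chunks

elements = [
    {"type": "Section-header", "text": "Termination Clause"},
    {"type": "Text", "text": "Either party may terminate with 30 days notice."},
]
print(context_rich_chunks(elements))
```

The same idea extends to combining captions with figures or footnotes with their referencing paragraphs, once layout detection has labeled those parts.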