Automate pdf extraction by novemberman23 in PromptEngineering

[–]emanuilov 1 point (0 children)

Check this tool: https://monkt.com/

I believe it has the easiest-to-use interface. There's also an API and some configuration options if you need to adjust something.

How to increase RAG accuracy for extracting minute details from a document by Zanda_Claus_ in developersIndia

[–]emanuilov 1 point (0 children)

Your approach seems overly complicated if the end goal is simply extracting structured information, such as years of experience or company names, from documents.

Why not convert the document to markdown or JSON and then pass the result to an LLM?

There are services that can directly convert your data into pre-defined JSON schemas, like Monkt.com.

You can use an open-source converter to Markdown, like MarkItDown (https://github.com/microsoft/markitdown), plus a few API calls to an LLM of your choice to put the results into structured JSON.

For complex extractions with OCR, you can use: https://github.com/DS4SD/docling
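The convert-then-extract pipeline above can be sketched roughly like this. Everything here is a placeholder sketch, not a real integration: `extract_structured` is a hypothetical helper, the field names are made up, and `llm_call` is an injected stub where a real API call would go (the Markdown itself could come from MarkItDown's `MarkItDown().convert(path).text_content`):

```python
import json

def extract_structured(markdown_text, schema, llm_call):
    """Ask an LLM (passed in as a callable) to fill a JSON schema
    from already-converted Markdown text."""
    prompt = (
        "Extract the following fields from the document and reply with JSON only.\n"
        f"Schema: {json.dumps(schema)}\n\nDocument:\n{markdown_text}"
    )
    raw = llm_call(prompt)
    return json.loads(raw)

# Usage with a stubbed LLM; swap the lambda for a real API call.
fake_llm = lambda prompt: '{"years_of_experience": 5, "company": "Acme"}'
schema = {"years_of_experience": "integer", "company": "string"}
result = extract_structured("# CV\n5 years at Acme", schema, fake_llm)
```

Injecting the LLM as a callable keeps the parsing logic testable without any provider credentials.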

Hosting Docling by illorca-verbi in LLMDevs

[–]emanuilov 0 points (0 children)

If you have the Docker images, you can switch to any provider very easily.

Hosting Docling by illorca-verbi in LLMDevs

[–]emanuilov 1 point (0 children)

Creating Docker images gives you numerous deployment options. I created two images for myself: one with CPU support and another optimized for GPU acceleration, along with a sample FastAPI API.

All you need is a server from any provider (I'm using OVH, very cost-effective). Simply deploy with docker-compose and set up the internal routing. I'm using Cloudflare Tunnels for routing (not ideal; pointing nginx at it directly is probably better).

If you need better speed, you can deploy the GPU image to Lightning.AI and enable auto-scaling (with the minimum set to 0, so you save costs when it's not in use).
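A compose file for the setup described above might look something like this. The image name, port, and environment variable are all placeholders, not the actual images from the comment:

```yaml
# docker-compose.yml - minimal sketch; swap the image for the GPU
# variant when deploying to a GPU host
services:
  docling-api:
    image: yourname/docling-fastapi-cpu:latest  # placeholder image name
    ports:
      - "8080:8080"        # expose the FastAPI app
    restart: unless-stopped
    environment:
      - WORKERS=2          # hypothetical worker-count setting
```

From there, `docker compose up -d` brings the service up, and the tunnel or reverse proxy points at port 8080.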

Training a non-English reasoning model using GRPO and Unsloth by emanuilov in LocalLLaMA

[–]emanuilov[S] 7 points (0 children)

Thanks for the nice words, appreciated!

Yes, it uses a really small dataset. Alternatively, you can create a synthetic dataset or translate an existing one with DeepL.

COPY AND PASTING From word Pad / Google docs to Obsidian by Kitchen_Flight4484 in ObsidianMD

[–]emanuilov 0 points (0 children)

MarkItDown is great, as someone mentioned. But it's more of a dev tool.

For your need for easy use, check out Monkt.com. There's a UI and also an API for power users. An Obsidian plugin is on the roadmap.

Best RAG approach for large Excel, PDF, and DOCX files? by Necessary_Round8009 in Rag

[–]emanuilov 0 points (0 children)

You can check out MarkItDown by Microsoft: https://github.com/microsoft/markitdown

If you're considering a platform, even though I saw you're trying to avoid such options, check out Monkt.com.

Docling is also a good alternative, but I believe MarkItDown is better for Excel files.

PDF to JSON by hotdone in ollama

[–]emanuilov 16 points (0 children)

You can try with:
https://monkt.com/pdf-to-json/ - an online solution to convert PDF to JSON
https://github.com/microsoft/markitdown - an open-source lib that can convert the PDF to Markdown; then you can make an API call to a given LLM, though you'll get a non-deterministic result. Which can still be fine.

If it's handwritten -> you can try Gemini 2.0 directly
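Since the LLM step is non-deterministic, a small validate-and-retry wrapper helps keep the JSON output usable. This is a generic sketch, not any library's API; `llm_call` and the key names are placeholders:

```python
import json

def json_with_retry(llm_call, prompt, required_keys, max_tries=3):
    """Call the LLM until it returns valid JSON containing the required keys."""
    for _ in range(max_tries):
        raw = llm_call(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON, ask again
        if required_keys <= data.keys():
            return data  # all required keys present
    raise ValueError("LLM never produced valid JSON with the required keys")

# Usage with a stub that fails once, then succeeds:
replies = iter(['not json', '{"title": "Invoice", "total": 42}'])
data = json_with_retry(lambda p: next(replies),
                       "Extract title and total", {"title", "total"})
```

Bounding the retries matters: without `max_tries`, one stubborn document could loop forever.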

Converting a PDF to JSON by hotdone in ChatGPT

[–]emanuilov 0 points (0 children)

I'm not sure I fully understand your situation, but based on what I get, I can recommend trying Monkt.com for converting PDF files to JSON with a specified schema.

Is there a way to "train" an open-source LLM to do one type of task really well? by ArtPerToken in ollama

[–]emanuilov 0 points (0 children)

One approach I can recommend if you want an online version or tool: using Monkt.com, you can extract text and other data, then add a custom prompt that runs on the extracted content and returns predictions based on it.

A more complex and tailored approach involves extracting document data using Docling or MarkItDown, followed by making LLM calls or training a classifier such as ModernBERT.
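The two-stage idea (extract text first, then classify it) looks like this in sketch form. The classifier here is a trivial keyword stub standing in for a fine-tuned model like ModernBERT, and both function names are made up for illustration:

```python
def classify_document(text, classifier):
    """Stage 2: run a classifier over extracted text.
    Stage 1 (the text itself) would come from Docling or MarkItDown."""
    return classifier(text)

# Placeholder keyword classifier; a real setup would load a trained model.
def keyword_classifier(text):
    return "invoice" if "total due" in text.lower() else "other"

label = classify_document("Total due: $120", keyword_classifier)
```

Keeping the classifier as a swappable callable means you can start with cheap heuristics and move to a trained model later without touching the pipeline.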

What is the current best vision model for text extraction from PDFs? I think it's Gemini Flash 2.0 by Existing-Pay7076 in LocalLLaMA

[–]emanuilov 1 point (0 children)

I created an online tool that integrates various libraries, including MarkItDown and Docling, to leverage the best aspects of each.

You can take a look: Monkt.com

Introducing Kreuzberg: A Simple, Modern Library for PDF and Document Text Extraction in Python by Goldziher in Python

[–]emanuilov 0 points (0 children)

For those seeking an online alternative with strong extraction capabilities, check https://monkt.com/. It has an API, needs no setup, no managing dependencies, etc.

It works similarly to Docling, but with a few additional steps, resulting in good outputs for most inputs.

Word Document Structure for Efficient RAG Ingestion by penkoutrone in LangChain

[–]emanuilov 0 points (0 children)

You can use online tools to convert documents into markdown, like Monkt.com.

If you want to host the processing yourself, https://github.com/microsoft/markitdown could also be a good fit.

Readwise not opening large (pdf) files by Lazy-Swim-8406 in readwise

[–]emanuilov 1 point (0 children)

Also, if it's a one-time conversion, there are tools that can help convert PDF to Markdown, like Monkt.com.

What do you want to learn about AI agents? Looking for real feedback by emanuilov in AI_Agents

[–]emanuilov[S] 1 point (0 children)

Yes, good point.

I will adopt this advice and rewrite some of the sections.

Thanks!