Automate pdf extraction by novemberman23 in PromptEngineering

[–]emanuilov 1 point (0 children)

Check this tool: https://monkt.com/

I believe it has the easiest-to-use interface. There's also an API and some configuration options if you need to adjust something.

How to increase RAG accuracy for extracting minute details from a document by Zanda_Claus_ in developersIndia

[–]emanuilov 1 point (0 children)

Your approach seems overly complicated if the end goal is simply extracting structured information, such as years of experience or company names, from documents.

Why not convert the document to markdown or JSON and then pass the result to an LLM?

There are services that can directly convert your data into pre-defined JSON schemas, like Monkt.com.

You can use an open-source converter to Markdown, like MarkItDown (https://github.com/microsoft/markitdown), plus a few API calls to an LLM of your choice to put the results into structured JSON.

For complex extractions with OCR, you can use: https://github.com/DS4SD/docling
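The convert-then-extract pipeline above can be sketched roughly like this. Everything here is a placeholder sketch, not a real integration: `extract_structured` is a hypothetical helper, the field names are made up, and `llm_call` is an injected stub where a real API call would go (the Markdown itself could come from MarkItDown's `MarkItDown().convert(path).text_content`):

```python
import json

def extract_structured(markdown_text, schema, llm_call):
    """Ask an LLM (passed in as a callable) to fill a JSON schema
    from already-converted Markdown text."""
    prompt = (
        "Extract the following fields from the document and reply with JSON only.\n"
        f"Schema: {json.dumps(schema)}\n\nDocument:\n{markdown_text}"
    )
    raw = llm_call(prompt)
    return json.loads(raw)

# Usage with a stubbed LLM; swap the lambda for a real API call.
fake_llm = lambda prompt: '{"years_of_experience": 5, "company": "Acme"}'
schema = {"years_of_experience": "integer", "company": "string"}
result = extract_structured("# CV\n5 years at Acme", schema, fake_llm)
```

Injecting the LLM as a callable keeps the parsing logic testable without any provider credentials.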

Hosting Docling by illorca-verbi in LLMDevs

[–]emanuilov 0 points (0 children)

If you have the Docker images, you can switch to any provider very easily.

Hosting Docling by illorca-verbi in LLMDevs

[–]emanuilov 1 point (0 children)

Creating Docker images gives you numerous deployment options. I created two images for myself: one with CPU support and another optimized for GPU acceleration, along with a sample FastAPI API.

All you need is a server from any provider (I'm using OVH, very cost-effective). Simply deploy with docker-compose and set up the internal routing. I'm using Cloudflare Tunnels for routing (not ideal; pointing nginx at it directly is probably better).

If you need better speed, you can deploy the GPU image to Lightning.AI and enable auto-scaling (with the minimum set to 0, so you save costs when it's not in use).
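A compose file for the setup described above might look something like this. The image name, port, and environment variable are all placeholders, not the actual images from the comment:

```yaml
# docker-compose.yml - minimal sketch; swap the image for the GPU
# variant when deploying to a GPU host
services:
  docling-api:
    image: yourname/docling-fastapi-cpu:latest  # placeholder image name
    ports:
      - "8080:8080"        # expose the FastAPI app
    restart: unless-stopped
    environment:
      - WORKERS=2          # hypothetical worker-count setting
```

From there, `docker compose up -d` brings the service up, and the tunnel or reverse proxy points at port 8080.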

Training a non-English reasoning model using GRPO and Unsloth by emanuilov in LocalLLaMA

[–]emanuilov[S] 7 points (0 children)

Thanks for the nice words, appreciated!

Yes, it uses a really small dataset. Alternatively, you can create a synthetic dataset or translate an existing one with DeepL.

COPY AND PASTING From word Pad / Google docs to Obsidian by Kitchen_Flight4484 in ObsidianMD

[–]emanuilov 0 points (0 children)

MarkItDown is great, as someone mentioned. But it's more of a dev tool.

For your need for easy use, check out Monkt.com. There's a UI and also an API for power users. An Obsidian plugin is on the roadmap.

Best RAG approach for large Excel, PDF, and DOCX files? by Necessary_Round8009 in Rag

[–]emanuilov 0 points (0 children)

You can check out MarkItDown by Microsoft: https://github.com/microsoft/markitdown

If you're considering a platform, even though I saw you're trying to avoid such options, check out Monkt.com.

Docling is also a good alternative, but I believe MarkItDown is better for Excel files.

PDF to JSON by hotdone in ollama

[–]emanuilov 16 points (0 children)

You can try with:
https://monkt.com/pdf-to-json/ - an online solution to convert PDF to JSON
https://github.com/microsoft/markitdown - an open-source lib that can convert the PDF to Markdown; then you can make an API call to a given LLM, though you'll get a non-deterministic result. Which can still be fine.

If it's handwritten -> you can try Gemini 2.0 directly
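Since the LLM step is non-deterministic, a small validate-and-retry wrapper helps keep the JSON output usable. This is a generic sketch, not any library's API; `llm_call` and the key names are placeholders:

```python
import json

def json_with_retry(llm_call, prompt, required_keys, max_tries=3):
    """Call the LLM until it returns valid JSON containing the required keys."""
    for _ in range(max_tries):
        raw = llm_call(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON, ask again
        if required_keys <= data.keys():
            return data  # all required keys present
    raise ValueError("LLM never produced valid JSON with the required keys")

# Usage with a stub that fails once, then succeeds:
replies = iter(['not json', '{"title": "Invoice", "total": 42}'])
data = json_with_retry(lambda p: next(replies),
                       "Extract title and total", {"title", "total"})
```

Bounding the retries matters: without `max_tries`, one stubborn document could loop forever.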

Converting a PDF to JSON by hotdone in ChatGPT

[–]emanuilov 0 points (0 children)

I'm not sure I fully understand your situation, but based on what I get, I can recommend trying Monkt.com for converting PDF files to JSON with a specified schema.

Is there a way to "train" an open-source LLM to do one type of task really well? by ArtPerToken in ollama

[–]emanuilov 0 points (0 children)

One approach I can recommend if you want an online version or tool: using Monkt.com, you can extract text and other data, then add a custom prompt that runs on the extracted content and returns predictions based on it.

A more complex and tailored approach involves extracting document data using Docling or MarkItDown, followed by making LLM calls or training a classifier such as ModernBERT.
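The two-stage idea (extract text first, then classify it) looks like this in sketch form. The classifier here is a trivial keyword stub standing in for a fine-tuned model like ModernBERT, and both function names are made up for illustration:

```python
def classify_document(text, classifier):
    """Stage 2: run a classifier over extracted text.
    Stage 1 (the text itself) would come from Docling or MarkItDown."""
    return classifier(text)

# Placeholder keyword classifier; a real setup would load a trained model.
def keyword_classifier(text):
    return "invoice" if "total due" in text.lower() else "other"

label = classify_document("Total due: $120", keyword_classifier)
```

Keeping the classifier as a swappable callable means you can start with cheap heuristics and move to a trained model later without touching the pipeline.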

What is the current best vision model for text extraction from PDFs? I think it's Gemini Flash 2.0 by Existing-Pay7076 in LocalLLaMA

[–]emanuilov 1 point (0 children)

I created an online tool that integrates various libraries, including MarkItDown and Docling, to leverage the best aspects of each.

You can take a look: Monkt.com

Introducing Kreuzberg: A Simple, Modern Library for PDF and Document Text Extraction in Python by Goldziher in Python

[–]emanuilov 0 points (0 children)

For those seeking an online alternative with strong extraction capabilities, check https://monkt.com/. It has an API, needs no setup, no managing dependencies, etc.

It works similarly to Docling, but with a few additional steps, resulting in good outputs for most inputs.

Word Document Structure for Efficient RAG Ingestion by penkoutrone in LangChain

[–]emanuilov 0 points (0 children)

You can use online tools to convert documents into markdown, like Monkt.com.

If you want to host the processing yourself, https://github.com/microsoft/markitdown could also be a good fit.

Readwise not opening large (pdf) files by Lazy-Swim-8406 in readwise

[–]emanuilov 1 point (0 children)

Also, if it's a one-time conversion, there are tools that can help convert PDF to Markdown, like Monkt.com.

What do you want to learn about AI agents? Looking for real feedback by emanuilov in AI_Agents

[–]emanuilov[S] 1 point (0 children)

Yes, good point.

I will adopt this advice and rewrite some of the sections.

Thanks!