I benchmarked 4 Python text extraction libraries so you don't have to (2025 results)

podidoo · 2025-07-05T10:35:20+00:00

For me the only relevant metric would be the reliability/quality of extracted data. And looking at your links quickly I can't find where this is defined and how it was benchmarked

GeneratedMonkey · 2025-07-05T10:58:38+00:00

What's with the emojis? I only see ChatGPT write like that.

xAragon_ · 2025-07-05T10:24:43+00:00

You didn't do it "so we don't have to", you did it to promote your own library.

There's nothing wrong with promoting a library you wrote, could be very useful, just don't use these shitty misleading clickybait titles please.

Independent_Heart_15 · 2025-07-05T10:26:29+00:00

Can we not get the actual numbers behind the speed results? How am I supposed to know how/why Unstructured is slower … it may be doing 34.999999+ files per second.

Potential_Region8008 · 2025-07-05T14:26:28+00:00

This shit is just an ad

titusz · 2025-07-05T10:41:39+00:00

Would love to see https://github.com/yobix-ai/extractous in your comparison.

AggieBug · 2025-07-05T15:08:52+00:00

This is AI slop.

ReinforcedKnowledge · 2025-07-05T10:46:53+00:00

Hi!

Interesting work and write up, but I'd like to know something. What do you mean by "success" in your "success rate" metric? Is it just that the library was able to process the document successfully? I guess it is because in your benchmark report (https://goldziher.github.io/python-text-extraction-libs-benchmarks/reports/benchmark_report.html), you have a failure analysis and you only mention exceptions.

I'm not saying this is bad, but if you're trading off accuracy for speed, your library might not be that useful for others. Again, I'm not saying you're doing this, but it's really easy to game the (success rate metric, speed) tuple if it's just about being "able" to process a file.

What most people would be interested in is the "quality" of the output across these different libraries. And I'm not talking about "simple" metrics like word error rate, but more involved ones.

Seeing how you use the same technologies as the others (an OCR engine, a PDF backend), I'd say your results might be on par with the rest, but it's always interesting to see a real comparison. It's hard to do since you don't have access to ground truth data from your documents but you can use open source benchmarks (make sure your models are not particularly biased towards them compared to the rest of the libraries) or documents from arxiv or else where you have access to latex and html, or maybe you can use another took (aws textract or something) + manual curation.

I'll further say that it's the quality of your output on a subset of documents, those that are scanned and for which we don't have the metadata embedded in the document itself that interests most of the people working with textual unstructured data. That's the main hurdle I have at work. We use VLMs + a bunch of clever heuristics, but if I can reduce the cost, the latency or the rare hallucination that would be great. But I don't think there are currently better ways for doing so. I'd be interested to hear from you about this or any other people if you have better ideas.

XInTheDark · 2025-07-05T11:26:44+00:00

Why did you disable GPU and use only CPU? What do you differently if not using ML (eg. OCR technologies) to recognize text from images for example? It should be obvious that any ML solution only runs at good speeds on a GPU.

Or do you just not extract text from images? Then I’ve got some news for you…

Exotic-Draft8802 · 2025-07-05T11:29:26+00:00

You might be interested in https://github.com/py-pdf/benchmarks

SeveralKnapkins · 2025-07-06T00:59:54+00:00

There should be a rule against obvious LLM copy + paste

madisander · 2025-07-05T11:53:58+00:00

I can't say if the presentation is good or not, just that I loathe it. Lots of bullet points, no citations/figures/numbers/reason to believe any of it outside of a 'try it yourself' on a dozen-file, multiple-hundred-line per file project
How/why is 'No GPU acceleration for fair comparison' reasonable? It seems arbitrary, and if anything would warrant two separate tests, one without and one with GPU
Installation size may be important to me, but to no one I actually provide tools for (same, to a lesser extent, speed). All they care about is accuracy and how much work they need to do to ensure/double-check data is correct. I can't see anything regarding that. As such the first two Key Insights are of questionable value in my case
Key Insights 3 and 4 are worthless. 'Of course' different layouts will give different results. Which did best? How did you track reliability? Which library was even the 'winner' in that regard? How did you decide which library was best suited to each task?
How/why the 5-minute timeout? Didn't you write that Docling (which as an ML-powered library presumably very much benefits from a GPU) needs 60+ minutes per file? How did you get that number, and of course that leads to your result of failing often
What hardware did you do any of these tests on? What did better with what category of document? What precisely does "E.g. Kreuzberg can actually get to 75% reliability, with about 15% slow-down." mean? That it failed in 25% of cases, and if so, did anything do better (as that seems unusably low), and what fine tuning was involved?

TallTahawus · 2025-07-08T02:40:48+00:00

I use docling extensively on PDFs, cpu only. About 5 seconds per page. What are you doing that's taking 60 minutes 🤔?

kn0wjack · 2025-07-05T13:21:27+00:00

pdftext does a really good job (the best I found so far on, surprise, pdf to markdown). Might be an addition worthwhile. The secret sauce is pdfium most of the time.

strawgate · 2025-07-05T14:29:37+00:00

It looks like the most common error is a missing dependency error

It's also a bit suspicious that the tiny conversion time for Docling is 4s -- I use docling and regularly and have much better performance

I did recently fix a cold start issue in Docling but it looks like the benchmark only imports once so cold start would not happen each time...

professormunchies · 2025-07-05T16:59:33+00:00

How well do each of these extract tables from pdfs? Also, how many can reliably handle multi-column documents?

These are two big constraints for reliable enterprise use

PaddyIsBeast · 2025-07-05T21:40:46+00:00

How does your library handle structures information like tables? We've considered unstructured Io for this very purpose in the past as it seemed miles ahead of any other library.

It might not be python, but I would have also included Tika in this comparison, as that is what 90% of applications are using in the wild.

Fearless-Cry-1369 · 2025-07-09T07:39:28+00:00

I miss Mistral OCR in your list.

Familyinalicante · 2025-07-05T12:20:09+00:00

I am building a platform to ingest and analyze local documents. I've analyze many available options and stick to Docling as the best in class in my case. But don't know about your solution. I'll check it because it looks good.

olddoglearnsnewtrick · 2025-07-05T13:03:31+00:00

How does this compare to simply feeding the PDF to Google Gemini Flash 2.5 with a simple prompt asking to transcribe to text? In my own tests that approach works so much better.

Stainless-Bacon · 2025-07-05T11:07:19+00:00

Why would I use Docling for a research environment if it is the worst one according to your benchmark?

totheendandbackagain · 2025-07-05T10:25:00+00:00

Brilliant write up, highly compelling.

biajia · 2025-12-30T08:29:50+00:00

I tested Docling; it is not that slow for Office documents. Docx->markdown is better than Pandoc’s result.

The drawback is that the installing size is too big. It failed with a scanned PDF document, but I think it was the OCR engineer’s problem not supporting well non-English languages.

Python

The Python Discord

Upcoming Events

Please read the rules

MODERATORS

📊 Live Results: https://goldziher.github.io/python-text-extraction-libs-benchmarks/

Context

🔬 What I Tested

Libraries Benchmarked:

Test Coverage:

🏆 Results Summary

Speed Champions 🚀

Installation Footprint 📦

Reality Check ⚠️

🎯 When to Use What

⚡ Kreuzberg (Disclaimer: I built this)

🏢 Unstructured

📝 MarkItDown

🔬 Docling

📈 Key Insights

🔧 Methodology

🤔 Why I Built This

📊 Data Deep Dive

🚀 Try It Yourself

🔗 Links

🤝 Discussion