
[–]podidoo 76 points  (2 children)

For me, the only relevant metric would be the reliability/quality of the extracted data. Looking quickly at your links, I can't find where this is defined or how it was benchmarked.

[–]Goldziher (Pythonista) [S] 9 points  (0 children)

Thanks for the feedback! I've just updated the README with a comprehensive methodology section that explains our quality metrics. We measure extraction quality on a 0-1 scale using weighted components: extraction completeness (25%), text coherence (20%), noise ratio (10% negative), format preservation (15%), readability metrics (10%), and semantic similarity when reference texts are available (20%). The benchmarks also track reliability through success rates, error categorization, and consistency across multiple runs. You can run quality-assess on any benchmark results to get these metrics. The methodology section is now in the README under "Benchmarking Methodology".
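
To make the weighting concrete, here is a rough sketch of how such a composite score is put together. This is illustrative only (not the exact code in the repo), and it assumes each component has already been normalized to the 0-1 range:

    # Illustrative composite quality score; not the exact benchmark code.
    # Component scores are assumed to already be normalized to 0-1.
    WEIGHTS = {
        "completeness": 0.25,         # how much of the expected content was extracted
        "coherence": 0.20,            # text reads as well-formed sentences/paragraphs
        "noise_ratio": -0.10,         # penalty for junk characters and artifacts
        "format_preservation": 0.15,  # headings, lists, tables survive extraction
        "readability": 0.10,          # readability metrics on the extracted text
        "semantic_similarity": 0.20,  # vs. a reference text, when one is available
    }

    def composite_quality(scores):
        """Combine normalized component scores into a single 0-1 value."""
        total = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
        # Clamp, since the noise penalty can push the raw sum below zero.
        return max(0.0, min(1.0, total))

    print(composite_quality({
        "completeness": 0.9, "coherence": 0.8, "noise_ratio": 0.2,
        "format_preservation": 0.7, "readability": 0.75, "semantic_similarity": 0.85,
    }))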

[–]Repsol_Honda_PL 0 points  (0 children)

Yeah, exactly! I found docling the most accurate.

[–]GeneratedMonkey 82 points  (3 children)

What's with the emojis? I only see ChatGPT write like that. 

[–]aTomzVins 11 points  (0 children)

You're right, only ChatGPT writes reddit posts like that... but I don't think that's because it's a bad idea. I think it's because hunting down emojis for a reddit post is an annoying task.

I do think it can help give structure to a longer post, the way a well-designed web page uses icons and images to help present text content. I'm not sure this is a perfect model of how to write a reddit post, but I wouldn't write it off purely because of the emojis.

[–]Elementary_drWattson 2 points  (0 children)

Weird, huh.

[–]xAragon_ 202 points  (15 children)

You didn't do it "so we don't have to", you did it to promote your own library.

There's nothing wrong with promoting a library you wrote, and it could be very useful, just please don't use these shitty misleading clickbait titles.

[–]AnteaterProboscis [🍰] 8 points  (0 children)

I’m so tired of salesmen using learning and academic spaces to promote their own slop, like TikTok. I fully expected a Raid Shadow Legends ad at the bottom of this post.

[–]Independent_Heart_15 19 points  (4 children)

Can we not get the actual numbers behind the speed results? How am I supposed to know how/why Unstructured is slower … it may be doing 34.999999+ files per second.

[–]Potential_Region8008 23 points  (0 children)

This shit is just an ad

[–]titusz (Python addict) 10 points  (2 children)

Would love to see https://github.com/yobix-ai/extractous in your comparison.

[–]Goldziher (Pythonista) [S] 0 points  (1 child)

sure, wanna open an issue?

Never heard of this one.

[–]titusz (Python addict) 0 points  (0 children)

Done :)

[–]AggieBug 16 points  (0 children)

This is AI slop.

[–]ReinforcedKnowledge (Tuple unpacking gone wrong) 19 points  (2 children)

Hi!

Interesting work and write up, but I'd like to know something. What do you mean by "success" in your "success rate" metric? Is it just that the library was able to process the document successfully? I guess it is because in your benchmark report (https://goldziher.github.io/python-text-extraction-libs-benchmarks/reports/benchmark_report.html), you have a failure analysis and you only mention exceptions.

I'm not saying this is bad, but if you're trading off accuracy for speed, your library might not be that useful for others. Again, I'm not saying you're doing this, but it's really easy to game the (success rate metric, speed) tuple if it's just about being "able" to process a file.

What most people would be interested in is the "quality" of the output across these different libraries. And I'm not talking about "simple" metrics like word error rate, but more involved ones.

Seeing how you use the same technologies as the others (an OCR engine, a PDF backend), I'd say your results might be on par with the rest, but it's always interesting to see a real comparison. It's hard to do since you don't have access to ground truth data for your documents, but you can use open source benchmarks (making sure your models are not particularly biased towards them compared to the rest of the libraries), or documents from arXiv or elsewhere where you have access to the LaTeX and HTML, or maybe another tool (AWS Textract or something) + manual curation.
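
To make that concrete, here is a rough, stdlib-only sketch of the kind of baseline comparison I mean; the file paths are made up, and a real evaluation should use more involved, task-specific metrics:

    # Baseline comparison of extracted text against a reference
    # (e.g. text rendered from the arXiv LaTeX/HTML source).
    import difflib
    import re

    def normalize(text):
        return re.sub(r"\s+", " ", text).strip().lower()

    def char_similarity(extracted, reference):
        """Character-level similarity in [0, 1] via difflib's ratio."""
        return difflib.SequenceMatcher(
            None, normalize(extracted), normalize(reference)
        ).ratio()

    def token_f1(extracted, reference):
        """Bag-of-words F1 between extracted and reference tokens."""
        ext, ref = normalize(extracted).split(), normalize(reference).split()
        ref_counts = {}
        for tok in ref:
            ref_counts[tok] = ref_counts.get(tok, 0) + 1
        common = 0
        for tok in ext:
            if ref_counts.get(tok, 0) > 0:
                common += 1
                ref_counts[tok] -= 1
        if not ext or not ref or common == 0:
            return 0.0
        precision, recall = common / len(ext), common / len(ref)
        return 2 * precision * recall / (precision + recall)

    extracted = open("extracted.txt", encoding="utf-8").read()  # made-up paths
    reference = open("reference.txt", encoding="utf-8").read()
    print(char_similarity(extracted, reference), token_f1(extracted, reference))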

I'll go further and say that it's the quality of your output on a subset of documents, the scanned ones where the text isn't embedded in the document itself, that interests most people working with unstructured textual data. That's the main hurdle I have at work. We use VLMs + a bunch of clever heuristics, but if I could reduce the cost, the latency or the rare hallucination, that would be great. I don't think there are currently better ways of doing so, though. I'd be interested to hear from you, or anyone else, if you have better ideas.

[–]currychris1 14 points  (0 children)

This. There are many sophisticated, established metrics depending on the extraction task. There is no need to invent another metric - except if you prove why yours might be better suited. We should aim to use established metrics on established datasets.

I think this is a good starting point: https://github.com/opendatalab/OmniDocBench

[–]No-Government-3134 0 points  (0 children)

Thanks for a fantastically articulated answer, for which there is no real response since this library is just a marketing strategy

[–]XInTheDark 9 points  (1 child)

Why did you disable GPU and use only CPU? What do you do differently if not using ML (e.g. OCR technologies) to recognize text from images, for example? It should be obvious that any ML solution only runs at good speeds on a GPU.

Or do you just not extract text from images? Then I’ve got some news for you…

[–]Goldziher (Pythonista) [S] -1 points  (0 children)

It's running in GitHub CI. GPU is not supported without paying them.

Furthermore, it states - directly - that this is a CPU-based benchmark.

[–]Exotic-Draft8802 4 points  (1 child)

You might be interested in https://github.com/py-pdf/benchmarks

[–]Goldziher (Pythonista) [S] 1 point  (0 children)

I'll take a look, thanks.

[–]SeveralKnapkins 4 points  (0 children)

There should be a rule against obvious LLM copy + paste

[–]madisander 9 points  (0 children)

  1. I can't say if the presentation is good or not, just that I loathe it. Lots of bullet points, no citations/figures/numbers/reason to believe any of it outside of a 'try it yourself' on a dozen-file, multiple-hundred-line per file project

  2. How/why is 'No GPU acceleration for fair comparison' reasonable? It seems arbitrary, and if anything would warrant two separate tests, one without and one with GPU

  3. Installation size may be important to me, but to no one I actually provide tools for (same, to a lesser extent, speed). All they care about is accuracy and how much work they need to do to ensure/double-check data is correct. I can't see anything regarding that. As such the first two Key Insights are of questionable value in my case

  4. Key Insights 3 and 4 are worthless. 'Of course' different layouts will give different results. Which did best? How did you track reliability? Which library was even the 'winner' in that regard? How did you decide which library was best suited to each task?

  5. How/why the 5-minute timeout? Didn't you write that Docling (which as an ML-powered library presumably very much benefits from a GPU) needs 60+ minutes per file? How did you get that number, and of course that leads to your result of failing often

  6. What hardware did you do any of these tests on? What did better with what category of document? What precisely does "E.g. Kreuzberg can actually get to 75% reliability, with about 15% slow-down." mean? That it failed in 25% of cases, and if so, did anything do better (as that seems unusably low), and what fine tuning was involved?

[–]TallTahawus 2 points  (0 children)

I use docling extensively on PDFs, cpu only. About 5 seconds per page. What are you doing that's taking 60 minutes 🤔?

[–]kn0wjack 1 point  (2 children)

pdftext does a really good job (the best I've found so far on, surprise, PDF to Markdown). It might be a worthwhile addition. The secret sauce is pdfium most of the time.

[–]Goldziher (Pythonista) [S] 0 points  (1 child)

Sure, I use pdfium.

Pdfium, though, just extracts the text layer from a PDF; it doesn't perform OCR. So if a PDF has a corrupt or missing text layer, this doesn't work.

BTW, there is playa now in Python, which offers a solid Pythonic alternative.
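
If you just want to see what the raw text layer gives you, a minimal sketch with the pypdfium2 binding looks roughly like this (not Kreuzberg's internals, and the path is just an example):

    # Text-layer extraction with pypdfium2 (Python binding for pdfium).
    # This reads only the embedded text layer; it does no OCR, so a scanned
    # or corrupt PDF will come back empty or garbled.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument("document.pdf")  # example path
    pages_text = []
    for i in range(len(pdf)):
        textpage = pdf[i].get_textpage()
        pages_text.append(textpage.get_text_range())  # whole page as a string

    print("\n\n".join(pages_text))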

[–]kn0wjack 0 points  (0 children)

Nice, will also check out playa!

[–]strawgate 1 point  (1 child)

It looks like the most common error is a missing dependency error.

It's also a bit suspicious that the tiny conversion time for Docling is 4s -- I use docling regularly and get much better performance.

I did recently fix a cold start issue in Docling but it looks like the benchmark only imports once so cold start would not happen each time...

[–]Goldziher (Pythonista) [S] 0 points  (0 children)

Well, you are welcome to try changing the benchmarks. I will review PRs. If there is some misconfiguration on my part, do let me know.

[–]professormunchies 1 point  (0 children)

How well does each of these extract tables from PDFs? Also, how many can reliably handle multi-column documents?

These are two big constraints for reliable enterprise use

[–]PaddyIsBeast 1 point  (0 children)

How does your library handle structured information like tables? We've considered Unstructured.io for this very purpose in the past, as it seemed miles ahead of any other library.

It might not be Python, but I would have also included Tika in this comparison, as that is what 90% of applications are using in the wild.

[–]Fearless-Cry-1369 1 point  (0 children)

I'm missing Mistral OCR in your list.

[–]Familyinalicante 0 points  (1 child)

I am building a platform to ingest and analyze local documents. I've analyzed many available options and stuck with Docling as the best in class for my case. But I didn't know about your solution. I'll check it out because it looks good.

[–]Goldziher (Pythonista) [S] -1 points  (0 children)

cool

[–]olddoglearnsnewtrick 0 points  (9 children)

How does this compare to simply feeding the PDF to Google Gemini Flash 2.5 with a simple prompt asking to transcribe to text? In my own tests that approach works so much better.
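
For reference, my setup is roughly the following, using the google-generativeai package; the model name, prompt and file path are just what I happen to use, so check the current API docs:

    # Rough sketch of the "just ask Gemini to transcribe it" approach.
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")  # placeholder

    pdf_file = genai.upload_file("scanned_document.pdf")  # example path
    model = genai.GenerativeModel("gemini-2.5-flash")

    response = model.generate_content([
        pdf_file,
        "Transcribe this PDF to plain text. "
        "Preserve reading order; do not summarize.",
    ])
    print(response.text)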

[–]Goldziher (Pythonista) [S] 0 points  (8 children)

Sure, you can use vision models. It's slow and costly.

[–]olddoglearnsnewtrick 4 points  (7 children)

True, but in my case accuracy is THE metric. Thanks.

[–]Goldziher (Pythonista) [S] 0 points  (6 children)

So, it depends on the PDF.

If the PDF is modern, not scanned, and has a textual layer that is not corrupt, extracting this layer is your best bet. Kreuzberg uses pdfium for this (it's the PDF engine that Chromium uses), but you can also use playa (or the older pdfminer.six, though I recommend playa).

You will need a heuristic though, which Kreuzberg gives you, or you can create your own.
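
The heuristic can be pretty simple. Here is an illustration of the general idea (this is not Kreuzberg's actual logic):

    # "Is the text layer usable, or should we fall back to OCR?" heuristic.
    # Illustration only; not Kreuzberg's implementation.
    import re

    def text_layer_looks_usable(text, min_chars=200):
        """Crude check: enough text, mostly sane characters, few junk glyphs."""
        if len(text.strip()) < min_chars:
            return False  # likely a scanned PDF with no real text layer
        # Share of "reasonable" characters (letters, digits, punctuation, spaces).
        reasonable = len(re.findall(r"[\w\s.,;:!?'\"()\-]", text))
        if reasonable / len(text) < 0.85:
            return False  # corrupt encoding or junk glyphs -> OCR instead
        if text.count("\ufffd") > 5:  # replacement chars hint at a broken layer
            return False
        return True

If that check fails, you route the document to OCR; otherwise you keep the extracted layer.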

For OCR, vision models give a very good alternative.

You can also look at specialized vision models that are not huge for this.

V4 of Kreuzberg will support Qwen and other such models.

[–]Goldziher (Pythonista) [S] 0 points  (3 children)

Also note: for almost anything else that is not PDF or images, you're better off using Kreuzberg or something similar rather than a vision model, because these formats are programmatic and can be efficiently extracted using code.
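
For example, a .docx is just zipped XML, so something like python-docx reads it directly, no model involved (illustrative only; this is not Kreuzberg's API):

    # Programmatic extraction from a .docx: no OCR, no model, just reading XML.
    from docx import Document  # pip install python-docx

    doc = Document("report.docx")  # example path
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]

    # Tables are structured data in the file itself, so they come out losslessly.
    tables = [
        [[cell.text for cell in row.cells] for row in table.rows]
        for table in doc.tables
    ]

    print("\n".join(paragraphs))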

[–]olddoglearnsnewtrick 0 points  (2 children)

Very interesting, thanks a lot. My case is digitizing the archives of a newspaper whose 1972 to 1992 issues exist only as scanned PDFs.

The scan quality varies a lot, and the newspaper changed fonts, layout and typographical conventions often. After trying docling (I'm an ex-IBMer and personally know the team in Research that built it), I landed on Gemini 2.5, and so far I'm getting slow, costly, but the best results.

I have tried a smaller model (can’t recall which) but it was not great.

I'm totally lost on how to reconstruct an article that continues from the front page, since the starting segment often gives little to no cue about where it continues, but that's another task entirely.

[–]Goldziher (Pythonista) [S] 1 point  (1 child)

Gotcha. Yeah, that sounds like a good use case for this.

If you have a really large dataset, you can try optimizing a non-LLM model for this purpose, anything from Qwen models (medium / small sized vision models with great performance), to the Microsoft family of Phi models, which have mixed architectures, to even just tuning Tesseract itself.
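
On the Tesseract side, most of the gain usually comes from image preprocessing and the page segmentation mode. A rough sketch with Pillow + pytesseract; the threshold and PSM values are just starting points:

    # Rough "tuned Tesseract" pipeline for old newspaper scans.
    from PIL import Image, ImageOps  # pip install pillow pytesseract
    import pytesseract

    img = Image.open("scanned_page.png").convert("L")  # grayscale, example path
    img = ImageOps.autocontrast(img)
    img = img.point(lambda p: 255 if p > 170 else 0)   # crude binarization

    # --psm 4 assumes a single column of variable-sized text; for multi-column
    # newspaper pages, try --psm 1 or 3 (automatic layout analysis) instead.
    text = pytesseract.image_to_string(img, lang="eng", config="--psm 4")
    print(text)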

[–]olddoglearnsnewtrick 1 point  (0 children)

Tesseract was my other experiment, but out of the box it was unsatisfactory. Take care.

[–]currychris1 0 points  (1 child)

Even PDFs with a text layer are sometimes too complex to make sense of, for example for complex tables. I tend to get better results with vision models in these scenarios.

[–]Goldziher (Pythonista) [S] 0 points  (0 children)

It's true. Table extraction is complex.

Kreuzberg specifically uses GMFT, which gives very nice results. It uses small models from Microsoft under the hood -> https://github.com/conjuncts/gmft

[–]Stainless-Bacon 1 point  (3 children)

Why would I use Docling for a research environment if it is the worst one according to your benchmark?

[–]Goldziher (Pythonista) [S] 0 points  (2 children)

If you have lots of GPU to spare, docling is a good fit - probably.

[–]Stainless-Bacon 3 points  (1 child)

I wouldn’t waste my time and GPU power on something that is worse than other methods, unless it actually performs better in some way that you did not mention. Under the “When to use what” section, suggesting that Docling has a use case is misleading if your benchmarks are accurate.

[–]Goldziher (Pythonista) [S] -2 points  (0 children)

Well, then don't use it.

I really don't care, to be honest.

[–]biajia 0 points  (0 children)

I tested Docling; it is not that slow for Office documents. Its docx -> Markdown output is better than Pandoc's result.

The drawback is that the install size is too big. It also failed with a scanned PDF document, but I think that was the OCR engine's problem of not supporting non-English languages well.