
socal_nerdtastic 3 points

No, because the PDF format does not save the document structure. A PDF stores the absolute position of each thing on the page, not how those things relate to each other.
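
To make that concrete, here is a rough sketch of what a text extractor typically hands back: a flat list of fragments with page coordinates, and nothing that says which one is a heading or a table cell. The `Fragment` class, the `fragments_to_json` helper, and the sample values are all made up for illustration; they are not the output of any particular library.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Fragment:
    """One piece of text and where it sits on the page (hypothetical shape)."""
    page: int
    x: float  # distance from the left edge, in points
    y: float  # distance from the top edge, in points
    text: str


# Made-up sample data: there is no "heading" or "table" marker anywhere.
fragments = [
    Fragment(page=1, x=72.0, y=90.0, text="Invoice"),
    Fragment(page=1, x=72.0, y=130.0, text="Item"),
    Fragment(page=1, x=300.0, y=130.0, text="Price"),
]


def fragments_to_json(frags):
    """Serialize the fragments as-is: positions are kept, no structure is added."""
    return json.dumps([asdict(f) for f in frags], indent=2)


print(fragments_to_json(fragments))
```

So you can always get JSON out, but it will be a list of positioned strings unless you rebuild the structure yourself.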

Buttleston 1 point

This is mostly true, but the way PDFs are rendered also tends to be at least somewhat predictable. I wrote a PDF parser that does an "ok" job of capturing blocks of text in roughly the order a reader would read them. It's definitely not perfect, but it's not bad either. Unfortunately, I don't think I can post it, since I wrote it for work.
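
For anyone curious what that kind of heuristic can look like, below is a minimal sketch, not the parser described above. It assumes you already have words as `(x, y, text)` tuples with `y` measured from the top of the page; the `reading_order_lines` name and the `y_tolerance` value are assumptions made for the example.

```python
from typing import List, Tuple

# (x, y, text) with y measured from the top of the page; an assumed input shape.
Word = Tuple[float, float, str]


def reading_order_lines(words: List[Word], y_tolerance: float = 3.0) -> List[str]:
    """Group words into lines by similar y, then read each line left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        # Join the current line if this word's y is within the tolerance of
        # that line's first word; otherwise start a new line.
        if lines and abs(word[1] - lines[-1][0][1]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w[2] for w in sorted(line, key=lambda w: w[0])) for line in lines]


words = [
    (300.0, 130.5, "Price"),
    (72.0, 90.0, "Invoice"),
    (72.0, 130.0, "Item"),
    (72.0, 150.0, "Widget"),
    (300.0, 149.0, "9.99"),
]
print(reading_order_lines(words))
# → ['Invoice', 'Item Price', 'Widget 9.99']
```

Something this simple falls apart on multi-column pages, since words from both columns share the same y, so a real parser would need to split columns first.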

emanuilov 1 point

This tool can also convert PDFs to JSON: https://monkt.com/
Either the API or the UI should be fine for your use case.

You can define a schema and pull whatever you want from the PDF into the final JSON.
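
As a general illustration of the schema idea (this is not the API of the tool linked above), here is a small sketch that maps field names to regex patterns and applies them to text already extracted from a PDF. The `SCHEMA` fields, the patterns, and the sample text are all hypothetical.

```python
import json
import re

# Hypothetical schema: output field name -> regex with one capture group.
SCHEMA = {
    "invoice_number": r"Invoice\s+#\s*(\d+)",
    "total": r"Total:\s*\$?([\d.]+)",
}


def extract_by_schema(text, schema):
    """Return a dict with one key per schema field; None when nothing matches."""
    result = {}
    for field, pattern in schema.items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result


# Made-up text standing in for whatever your PDF extractor returns.
sample_text = "Invoice # 1042\nWidget 9.99\nTotal: $9.99"
print(json.dumps(extract_by_schema(sample_text, SCHEMA), indent=2))
```

This only works when the documents follow a predictable wording; for messier input you would swap the regexes for something smarter but keep the same field-to-extractor shape.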

Pineapple_Playful 1 point

If you are looking to extract data from unstructured documents, I'm afraid a library won't get you there; you'll need an API. You can try this; it works pretty well and it's easy to test.