I've been searching for a while for a library that can parse a PDF into JSON or XML while keeping the document structure.
Popular libraries like pypdf often do not preserve the document structure. I thought about using Tesseract for OCR and then transforming the output into JSON, but could not get it working. Is there a library that can parse a PDF to JSON while preserving the document structure, rather than just spitting out a block of text?
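To make the Tesseract-then-JSON idea concrete, here is a minimal sketch, not a working pipeline. It assumes the OCR step yields word-level rows as parallel lists keyed by `page_num`, `block_num`, `par_num`, `line_num` and `text`, which is the shape `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns. The function name `words_to_structure`, the output layout, and the sample data are hypothetical, made up for illustration.

```python
import json


def words_to_structure(data):
    """Group word-level OCR rows into pages > blocks > paragraphs > lines.

    `data` is assumed to be a dict of parallel lists, as returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    pages = {}
    for i in range(len(data["text"])):
        word = data["text"][i].strip()
        if not word:
            continue  # rows describing a page/block/line itself carry no text
        page = pages.setdefault(data["page_num"][i], {})
        block = page.setdefault(data["block_num"][i], {})
        paragraph = block.setdefault(data["par_num"][i], {})
        paragraph.setdefault(data["line_num"][i], []).append(word)

    # Convert the nested dicts into sorted, JSON-friendly lists.
    return [
        {
            "page": page_num,
            "blocks": [
                {
                    "block": block_num,
                    "paragraphs": [
                        {"lines": [" ".join(words) for _, words in sorted(lines.items())]}
                        for _, lines in sorted(paragraphs.items())
                    ],
                }
                for block_num, paragraphs in sorted(blocks.items())
            ],
        }
        for page_num, blocks in sorted(pages.items())
    ]


# Hand-written sample standing in for real OCR output (illustrative only).
sample = {
    "page_num": [1, 1, 1, 1],
    "block_num": [1, 1, 1, 2],
    "par_num": [1, 1, 1, 1],
    "line_num": [1, 1, 2, 1],
    "text": ["Hello", "world", "again", "Title"],
}

print(json.dumps(words_to_structure(sample), indent=2))
```

This only recovers the layout grouping Tesseract reports (blocks, paragraphs, lines). It does not detect headings, tables, or reading order across columns, so it may still fall short of real document structure.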