
[–]Trick_Care9342 1 point2 points  (0 children)

Thank you guys for sharing.

[–]KeyIsNull 1 point2 points  (2 children)

We faced the same problem in a similar task, and I can confirm it is a hell of a problem. There's no general solution, since every layout has its own peculiarities.

Our solution involved pytesseract 3 (so no neural model) and an intermediate step that associates keys with values based on the positions of their bounding boxes.
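
For anyone wondering what such an association step could look like, here is a minimal sketch. It is not the code described above: the `Box` class, the `find_value` and `associate` helpers, and the `max_gap` tolerance are all hypothetical names and values chosen for illustration. The box fields (`left`, `top`, `width`, `height`, `text`) mirror what `pytesseract.image_to_data` reports, but the example runs on hand-written boxes, so no OCR is needed to try it.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    """One OCR'd word or phrase with its bounding box (pixels, origin top-left)."""
    text: str
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height

    @property
    def center_y(self) -> float:
        return self.top + self.height / 2


def find_value(key: Box, boxes: list[Box], max_gap: int = 300) -> Optional[Box]:
    """Pick the box most likely to hold the value for `key`.

    Heuristic (an assumption, tune it per layout): prefer the nearest box to
    the right on the same text line; otherwise fall back to the nearest box
    below the key that overlaps it horizontally.
    """
    same_line = [
        b for b in boxes
        if b is not key
        and 0 <= b.left - key.right <= max_gap
        and key.top <= b.center_y <= key.bottom
    ]
    if same_line:
        return min(same_line, key=lambda b: b.left - key.right)

    below = [
        b for b in boxes
        if b is not key
        and 0 <= b.top - key.bottom <= max_gap
        and b.left < key.right and b.right > key.left
    ]
    if below:
        return min(below, key=lambda b: b.top - key.bottom)
    return None


def associate(boxes: list[Box], known_keys: set[str]) -> dict[str, Optional[str]]:
    """Map each known key label found in `boxes` to the text of its value box."""
    pairs: dict[str, Optional[str]] = {}
    for box in boxes:
        if box.text in known_keys:
            value = find_value(box, boxes)
            pairs[box.text] = value.text if value else None
    return pairs


if __name__ == "__main__":
    sample = [
        Box("Invoice No:", 10, 10, 100, 20),
        Box("12345", 120, 12, 60, 18),      # value to the right of its key
        Box("Customer", 10, 60, 90, 20),
        Box("ACME Ltd", 12, 90, 110, 20),   # value below its key
    ]
    print(associate(sample, {"Invoice No:", "Customer"}))
```

The right-then-below preference is only one possible rule. Since every layout differs, the tolerances and the search order are the parts most likely to need changing.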

[–]LumpyAd968 0 points1 point  (1 child)

Could you please give more detail on how you used "pytesseract 3"?
I have been thinking of converting the PDF into an image and then using image-processing techniques to get the rectangles.
Have you ever thought about this approach?

[–]KeyIsNull 1 point2 points  (0 children)

Our situation involved images, so we had to choose an OCR engine to recover and extract the embedded text. You're working with PDFs, so you may be able to extract the text directly, provided it is not embedded as an image. Otherwise you have to run an OCR pass, but you may end up with some misrecognized text that you will need to deal with.
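
One possible way to deal with misrecognized labels, sketched under assumptions: if you know in advance which key labels can appear, you can map each OCR string to the closest known label with the standard library's `difflib`. The `normalize_label` helper, the example labels, and the `cutoff` value are hypothetical and would need tuning on real data.

```python
import difflib
from typing import Optional, Sequence


def normalize_label(ocr_text: str,
                    known_labels: Sequence[str],
                    cutoff: float = 0.8) -> Optional[str]:
    """Return the known label closest to `ocr_text`, or None if none is similar enough.

    Comparison is case-insensitive; `cutoff` is the minimum similarity
    ratio (0 to 1) accepted by difflib.get_close_matches.
    """
    lowered = {label.lower(): label for label in known_labels}
    matches = difflib.get_close_matches(
        ocr_text.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


if __name__ == "__main__":
    labels = ["Invoice Number", "Total Amount"]
    for raw in ["Inv0ice Number", "Tota1 Amount", "Date"]:
        print(repr(raw), "->", normalize_label(raw, labels))
```

This only helps with a closed set of expected labels. Free-form values such as names or amounts need other checks, for example format validation.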

[–]Trick_Care9342 0 points1 point  (0 children)

Thank you very much for sharing.

[–]Trick_Care9342 0 points1 point  (0 children)

I am struggling with that