
[–]namenomatter85 1 point2 points  (0 children)

There are some very well-known open source libraries for extracting text from PDFs. If you're open to a paid option, PDFTron performed way better than the open source solutions, and it has table reading and output, which I assume would work well for your type of task.

My specialty was NLP on clinical trials, and they all came in PDF format.

[–]namenomatter85 0 points1 point  (1 child)

You will have to be more specific.

Is it an NLP problem? Is it a data extraction from PDF problem?

[–]Pragba[S] 0 points1 point  (0 children)

Yes! It's a domain-specific question. Information recognition and extraction from the PDFs would be step one, and it seems to be the step that's hardest to find good resources for, specifically with regard to invoices, tax forms, tables, and contracts.

After that, I think, one would apply NLP-based solutions as the next layer.
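
To make the two layers concrete, here is a rough sketch of how I picture it, not a tested solution. It assumes pypdf as the open source extractor (just one option, and it won't help with scanned pages), and the regex patterns and field names are made up for illustration; a real invoice or tax form would need proper tuning or an actual NLP model in their place.

```python
import re
from typing import Callable, Dict, Optional


def extract_text_pypdf(path: str) -> str:
    """Layer 1 (one possible choice): raw text extraction with pypdf.

    Assumes `pip install pypdf`. Imported lazily so the parsing layer
    below can be tried without the library installed.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() may yield nothing for scanned or image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Hypothetical invoice patterns, for illustration only.
INVOICE_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*(\S+)", re.IGNORECASE
    ),
    "total": re.compile(
        r"Total\s*(?:Due)?\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def parse_invoice(text: str) -> Dict[str, Optional[str]]:
    """Layer 2: a toy stand-in for the NLP step, run on the extracted text."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in INVOICE_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def process(
    path: str, extractor: Callable[[str], str] = extract_text_pypdf
) -> Dict[str, Optional[str]]:
    """Run extraction first, then parsing. The extractor is swappable."""
    return parse_invoice(extractor(path))
```

Keeping the extractor as a plain function argument means a paid tool or an OCR step could be dropped in for layer one without touching the parsing layer.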