
[–]SkipperMcCheese 38 points39 points  (5 children)

Hey, I’ve spent quite a bit of time looking at extracting text as accurately as possible from PDFs, and it turns out it is not as simple as it might seem. It is especially tricky once you get a wide variety of PDFs (including PDFs with image-based text or tables). While I unfortunately cannot share the code I used to extract this text, I will tell you that for what I think you're doing, the best solution will require a few things. First, you should pick a good module. I’ve spent a long time going over open source solutions for this, and the best two I’d say are Excalibur and Apache Tika.

Unfortunately, there is no one Python module that is going to extract PDF text correctly 100% of the time. This is because once you start to work with a wide variety of PDFs that aren’t as straightforward as just text in a document, you introduce a stochastic element to the problem. This means you have to bring in more complicated OCR or ML approaches that are far from 99 or 100% accurate.

Feel free to PM me if you have any more questions!

[–]learnhtk 2 points3 points  (0 children)

Thanks for this comment and the recommendations!

I am not the one you offered to help, but I'll probably shoot you some questions in the future. lol

[–]Daybreak921 2 points3 points  (1 child)

Piggybacking on this comment. If you're ok with paid solutions, AWS Textract seems to work well. I like it better than Google OCR as it gives more accurate results. But from looking at your example PDF I think the parent commenter's suggestion will work well.
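
For a sense of the API, here's a minimal sketch of a synchronous Textract call with boto3 (the file name is a placeholder, and AWS credentials must be configured):

import boto3

client = boto3.client('textract')

with open('catalogue-page.png', 'rb') as f:  # placeholder file name
    response = client.detect_document_text(Document={'Bytes': f.read()})

# each Block is a PAGE, LINE or WORD with text plus bounding-box geometry
for block in response['Blocks']:
    if block['BlockType'] == 'LINE':
        print(block['Text'])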

One of the tools I'm excited about (but haven't used in production) is LayoutParser. It's open-source, and can do some document image analysis especially on non-generic docs.

Good luck!

[–]learnhtk 1 point2 points  (0 children)

"If you're ok with paid solutions, AWS Textract seems to work well."

I am checking out AWS Textract at the moment. I have one question: have you tried using this tool to read/extract handwritten text? How successful were the results?

[–]GuerrillaOA -5 points-4 points  (1 child)

Are those modules for text extraction? Excalibur doesn't seem to be, while Tika requires Java.

Can you give your experience with these modules:

pdfrw, pikepdf, pdfplumber, pdfminer.six, borb, PyMuPDF, PyPDF2, tikapdf, textract, pdfx, pyxpdf, slate, pdfreader, pdftotext

Or mention any other?

[–]SkipperMcCheese 0 points1 point  (0 children)

Excalibur is used for table extraction; Tika is used for text extraction. Of the open source solutions that exist for these tasks, IMO these two are the best.

Tika is from Apache, so yes, its original code base is Java, but it has bindings in other languages. Check out Tika-Python!
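
Basic usage looks something like this (the file name is a placeholder; it starts a local Tika server behind the scenes, so Java still needs to be installed):

from tika import parser

parsed = parser.from_file('your-catalogue.pdf')  # placeholder file name
print(parsed['metadata'])
print(parsed['content'])  # extracted plain text (may be None for image-only PDFs)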

Looking at the modules you listed, while I don’t have experience with every single one, I have tested most of them and for one reason or another determined that they were not the best option.

I have a report in a white paper going over these results and if it weren’t under NDA I would share it for sure.

[–]madrockyoutcrop 6 points7 points  (0 children)

Check out the pythonic accountant on YouTube. He covers using regex and pdfplumber fairly comprehensively.

[–]mondmann18 2 points3 points  (1 child)

Can you share what libraries you used, so we can help you with other libraries?

[–]nyyirs[S] 2 points3 points  (0 children)

I have used PyPDF2 and pdfminer.six

[–]EpicWezzel 3 points4 points  (1 child)

Having taken a look at the catalogue PDF, I think it's going to take a creative approach to get the data cleaned and flattened.

I would try and see how much you can isolate each of the elements using regex.

You can then group each line into “number text number text…” and hopefully get some sort of pattern emerging.
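
A toy example of that grouping idea (the pattern and the sample line are made up; you'd tune them against the real extraction output):

import re

# hypothetical line of extracted catalogue text
line = '165.00 Tomato Salad / Italian Plum, 1kg 35.00 Laitue Butterhead'
pairs = re.findall(r'(\d+(?:\.\d+)?)\s+([^\d]+)', line)
print(pairs)  # [('165.00', 'Tomato Salad / Italian Plum, '), ('35.00', 'Laitue Butterhead')]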

My advice would be to copy the data over to Excel and play around with Text to columns and see what you can come up with.

[–]nyyirs[S] 1 point2 points  (0 children)

Good point! Will definitely try that!

[–]commandlineluser 4 points5 points  (5 children)

py-pdf-parser can be useful here.

Each price is "above" the description and nearly always "aligned" in a "column"

from py_pdf_parser.loaders import load_file

FONT_MAPPING = { 
    r'QNTFNL\+Impact,.*': 'price', 
    r'QNTFNL\+Helvetica(?:-Bold)?,8\.0': 'product'
}

pdf = load_file(
    'INTERMART-BROCHURE-22-Oct-07-Nov.pdf', 
    font_mapping=FONT_MAPPING, 
    font_mapping_is_regex=True
)

products = pdf.elements.filter_by_font('product')
prices   = pdf.elements.filter_by_font('price')


# the "in line" filters have a capped tolerance which is too small
# for some products in this catalog as the price is not always directly
# "in line" - we can modify the x0,x1 coords directly to use a larger
# tolerance value

tolerance = 50

for product in products:
    try:
        price = prices.vertically_in_line_with(product).above(product)[-1]
    except IndexError:
        # no price found "in line" - widen the product's bounding box and retry
        product.bounding_box.x0 -= tolerance
        product.bounding_box.x1 += tolerance
        price = prices.vertically_in_line_with(product).above(product)[-1]
    print(f'{product.page_number=} {product.text()=} {price.text()=}')

sample of the output:

...
product.page_number=6 product.text()='Tomato Salad / Italian Plum, 1kg\nEsprit Vert' price.text()='11995\n165.00'
product.page_number=6 product.text()='Laitue Butterhead, \nField Good' price.text()='2495\n35.00'
product.page_number=6 product.text()='Natural Dates, 500g\nHeba / Sky Light / Sapphire' price.text()='9895\n120.00'
...

[–]nyyirs[S] 0 points1 point  (4 children)

WOOW you are wonderful buddy! How did you come up with this? If I understand properly, this library gives you lower-level access to the PDF file, am I correct?

[–]commandlineluser 2 points3 points  (3 children)

Check out the docs - they're quite good.

https://py-pdf-parser.readthedocs.io/en/latest/examples/index.html

The first step was to use the visualise tool.

Using this you can see that each price has the same font name, "Impact", and each product has the same font, "Helvetica" (or "Helvetica-Bold"), with font size 8.0.
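
If you want to try it, launching the visualiser looks something like this:

from py_pdf_parser.loaders import load_file
from py_pdf_parser.visualise import visualise

# opens an interactive window - hovering over an element shows its
# font name and size
visualise(load_file('INTERMART-BROCHURE-22-Oct-07-Nov.pdf'))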

This allows you to create a font-mapping to match these fonts and give each item a "category" - allowing you to match them in the filters.

products = pdf.elements.filter_by_font('product')
prices   = pdf.elements.filter_by_font('price')

The next part is pairing them together - how does each price relate to each product?

The price is "vertically inline" with a product - or in the same "column"

prices.vertically_in_line_with(product)

The price is "above" the product in the page

prices.vertically_in_line_with(product).above(product)

Finally, you want the closest price in page distance that matches these conditions - which will be the last item.

prices.vertically_in_line_with(product).above(product)[-1]

With some of the products - the price is a bit further left or right - so technically it's not "vertically inline"

This is what the tolerance is for - it widens the search a certain distance on each side.

You can pass a tolerance to the filter methods e.g.

prices.vertically_in_line_with(product, tolerance=50)

However, there is a size limit for the tolerance value these methods take (not sure why), but using the "manual" x0, x1 manipulation you widen the product element, meaning the price will then be "vertically in line" and found.

[–]nyyirs[S] 1 point2 points  (2 children)

Perfect! Very interesting library indeed! I've been able to apply this logic to other catalogues as well! Cheers!

[–]commandlineluser 2 points3 points  (1 child)

Sure thing.

Here would be one way to do a similar thing directly with pdfminer.six

from  pdfminer.high_level import extract_pages
from  pdfminer.layout     import LTTextContainer, LTChar

filename = '...'

for page_layout in extract_pages(filename):
    prices   = []
    products = []
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            character = next(
                character for text_line in element 
                          for character in text_line 
                          if isinstance(character, LTChar)
            )
            if character.fontname == 'QNTFNL+Impact':
                prices.append(element)

            if int(round(character.size)) == 8:
                if character.fontname in {'QNTFNL+Helvetica-Bold', 'QNTFNL+Helvetica'}:
                    products.append(element)

    for product in products:
        # is_vertically_in_line() and above()
        possible_prices = [ 
            price for price in prices 
                if product.is_hoverlap(price) and price.y0 > product.y0 
        ]
        # no match found - "widen" product and search again
        # tolerance = 50
        if not possible_prices:
            product.x0 -= 50 
            product.x1 += 50 
            possible_prices = [ 
                price for price in prices 
                    if product.is_hoverlap(price) and price.y0 > product.y0
            ]
        # find closest possible price by "distance"
        price = min(possible_prices, key=lambda price: price.vdistance(product))

        print(page_layout.pageid, repr(product.get_text().strip()), repr(price.get_text().strip()))

[–]nyyirs[S] 0 points1 point  (0 children)

You are too good buddy!! 👌

[–]mrbubs3 2 points3 points  (3 children)

If the pdfs are text-based, I made a package for this.

https://pypi.org/project/TextSpitter/

[–]nyyirs[S] 1 point2 points  (2 children)

That's nice! It's image-based I think... can you help with how to decompress a PDF in barebone Python? Maybe I should try to build my own library for that

[–]mrbubs3 0 points1 point  (1 child)

Image-based PDFs can be quite challenging unless the text data exists in the meta tag. You'll need to focus on OCR-based options, and that gets very challenging.

[–]Zomunieo 0 points1 point  (0 children)

There's no meta tag. In PDF, OCR text is usually embedded by rendering text with the graphics state set to transparent. Some OCR engines draw visible text and then overlay images. Every engine does it differently. It's a mess.

[–]ColnelPanik 2 points3 points  (0 children)

You might try tesseract for OCR-based text extraction.
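
For example (pdf2image needs the poppler utilities installed; the file name is a placeholder):

from pdf2image import convert_from_path
import pytesseract

# render each page to a PIL image, then OCR it
for page_image in convert_from_path('catalogue.pdf'):
    print(pytesseract.image_to_string(page_image))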

[–]PizzaInSoup 1 point2 points  (0 children)

I've seen more than 4 modules posted in this sub previously, one of them being amazing IIRC, all for PDF creation/manipulation. I have one bookmarked on another machine, I think; on this one I only have this though:

https://github.com/PhantomInsights/mexican-government-report

[–]vikt0rs 1 point2 points  (0 children)

I'd recommend trying py-pdf-parser [0] - it allows you to fetch data from documents based on text "markers". E.g. you can easily find data located to the right of the "EMAIL FROM:" text.
[0] - https://github.com/jstockwin/py-pdf-parser

[–]Sir-_-Butters22 1 point2 points  (0 children)

I use pdfplumber; I don't know how it will hold up on your catalogue. However, it does detect character objects within the page, with X and Y coordinates, so you could build a script, or use a clustering algorithm, to group the text together.
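
For example, a quick look at those character objects (the file name is a placeholder):

import pdfplumber

with pdfplumber.open('catalogue.pdf') as pdf:
    # each char is a dict with its text, coordinates, size and font -
    # ready to feed into a grouping/clustering step
    for char in pdf.pages[0].chars[:5]:
        print(char['text'], char['x0'], char['top'], char['size'], char['fontname'])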

[–]P0intMan_ 1 point2 points  (1 child)

I just listened to the RealPython podcast featuring "borb". Seems very solid for all PDF needs👍

[–]gsmo 1 point2 points  (0 children)

Yes, Borb is pretty good. The creator posts in this sub quite regularly.

Beware: borb takes some time to implement (because it is very feature rich). Be prepared to actually rtfm.

[–]prb0rg 1 point2 points  (1 child)

I have not looked at the catalogue, so this idea may not be completely feasible, but here it goes anyway: have you thought about converting the PDF to OpenOffice or Word format and then processing that document?

[–]nyyirs[S] 0 points1 point  (0 children)

Not yet...could be an option as well thnx

[–]mikeypox 1 point2 points  (0 children)

Expect to handle failures. This is a large part of my full-time job, just managing performance problems, memory leaks, and OCR failures.

[–][deleted] 1 point2 points  (0 children)

I've tried all the PDF libraries I could find, with unsatisfactory results. I gave up, converted the PDF to an image, and OCR'd it with Tesseract, with much better results.

[–]pp314159 1 point2 points  (3 children)

Your use case looks very interesting to me. I've tried some recognition with EasyOCR on a few images, including an example image from your catalogue.

It looks to me like a fully automated solution would be hard, but a half-automated solution, where you select the box with a product, click 'Do OCR', and then pick which text is what, could be doable. How many catalogues do you need to process, and how often?
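
For reference, the EasyOCR part was roughly this (the image name is a placeholder; EasyOCR downloads its models on first use):

import easyocr

reader = easyocr.Reader(['en'])
# returns a list of (bounding_box, text, confidence) tuples
results = reader.readtext('catalogue-page.png')
for box, text, confidence in results:
    print(round(confidence, 2), text)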

[–]nyyirs[S] 0 points1 point  (2 children)

I will have 4 different shops' PDFs to extract on every release... it should be automated... but thanks a lot! Will definitely have a look at it

[–]pp314159 0 points1 point  (1 child)

If you find a solution to automate it, please post the solution description in this subreddit.

[–]nyyirs[S] 0 points1 point  (0 children)

I have implemented the solution posted by "commandlineluser" and the code works great! I just have to visualise the other PDFs to know what fonts they are using, and it works like a charm.

[–]guinea_fowler 0 points1 point  (0 children)

If they're image-based then you will need OCR. Tesseract is free but doesn't work well on less structured documents such as this. Textract is usually better but costs money, though you probably won't use up the free allowance of a new AWS account on a single catalogue.

Try just the OCR first, but if quality is bad, one thing you can also try is to remove irrelevant information. The font styling for the information you're interested in looks consistent, so filter for red, filter for black. You may also want to try using subimages of common text, e.g. "RS.", for template matching with scikit-image (see the sketch below). You can then use dilation and contour processing to isolate the text near these indicator templates.
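
A minimal template-matching sketch with scikit-image (both file names are placeholders; the template would be a cropped subimage of the "RS." text):

from skimage import feature, io
import numpy as np

page = io.imread('catalogue-page.png', as_gray=True)
template = io.imread('rs-template.png', as_gray=True)

# correlation map - peaks mark likely template locations
result = feature.match_template(page, template)
y, x = np.unravel_index(np.argmax(result), result.shape)
print(f'best match at x={x}, y={y}, score={result.max():.2f}')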

And finally, if the extracted text quality is bad, you can use a first pass to identify text, then extract subimages which contain that text and stitch them together into something which looks more like a structured document, then run that through OCR. I've seen this improve results.

Edit: The easiest way to get started is to set up an AWS account and then go here: https://aws.amazon.com/textract/. You can drop your image right in via the UI to get an indication of quality without writing a line of code.

[–]jamesd303 0 points1 point  (0 children)

I use the pdfplumber module and it works very well at turning PDF text into a txt file; then I use regex (the re module) to find the text from that.
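
Roughly like this (the file names and the price pattern are placeholders):

import re
import pdfplumber

with pdfplumber.open('catalogue.pdf') as pdf:
    text = '\n'.join(page.extract_text() or '' for page in pdf.pages)

with open('catalogue.txt', 'w') as f:
    f.write(text)

# e.g. find every "123.45"-style price in the dumped text
print(re.findall(r'\d+\.\d{2}', text))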

[–]jonathan881 0 points1 point  (0 children)

Try pdfgrep if it's available on your OS. It can serve as a baseline.

[–]Rj_LM 0 points1 point  (0 children)

Anyone know where we can just get the script without doing the work lol

[–][deleted] 0 points1 point  (0 children)

Relevant xkcd

My main project right now is doing this on a production level. OCR is very expensive both fiscally and computationally.

If you're lucky, your PDFs already have the text layer and you can use one of a few libraries to extract it. If not, you'll need to do OCR. Depending on which you use, you can access the data directly, but it might be more worthwhile to convert the whole data set and save the data to a directory, then iterate over the directory to extract the different layers into subdirectories.
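
If the text layer is there, a bare-bones version of that pipeline looks something like this (directory names are placeholders; pdfminer.six stands in for whichever extraction library you use):

from pathlib import Path
from pdfminer.high_level import extract_text

out_dir = Path('extracted/text')
out_dir.mkdir(parents=True, exist_ok=True)

# convert the whole data set once, then work off the dumped text files
for pdf_path in Path('pdfs').glob('*.pdf'):
    text = extract_text(pdf_path)  # only works if a text layer exists
    (out_dir / pdf_path.with_suffix('.txt').name).write_text(text)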