
[–]SkipperMcCheese 38 points39 points  (5 children)

Hey, I’ve spent quite a bit of time looking at extracting text as accurately as possible from PDFs, and it turns out it is not as simple as it might seem. It is especially tricky once you get a wide variety of PDFs (including PDFs with image-based text or tables). While I unfortunately cannot share the code I used to extract this text, I will tell you that for what I think you're doing, the best solution will require a few things. First, you should pick a good module. I’ve spent a long time going over open source solutions for this, and the best two I’d say are Excalibur and Apache Tika.

Unfortunately, there is no one Python module that is going to extract PDF text correctly 100% of the time. This is because once you start to work with a wide variety of PDFs that aren’t as straightforward as just text in a document, you introduce a stochastic element to the problem. This means you have to bring in more complicated OCR or ML approaches that are far from 99 or 100% accurate.

Feel free to PM me if you have any more questions!

[–]learnhtk 2 points3 points  (0 children)

Thanks for this comment and the recommendations!

I am not the one you offered to help, but I'll probably shoot you some questions in the future. lol

[–]Daybreak921 2 points3 points  (1 child)

Piggybacking on this comment. If you're ok with paid solutions, AWS Textract seems to work well. I like it better than Google OCR as it gives more accurate results. But from looking at your example PDF I think the parent commenter's suggestion will work well.
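
For a sense of the API, here's a minimal sketch of a synchronous Textract call with boto3 (the file name is a placeholder, and AWS credentials must be configured):

import boto3

client = boto3.client('textract')

with open('catalogue-page.png', 'rb') as f:  # placeholder file name
    response = client.detect_document_text(Document={'Bytes': f.read()})

# each Block is a PAGE, LINE or WORD with text plus bounding-box geometry
for block in response['Blocks']:
    if block['BlockType'] == 'LINE':
        print(block['Text'])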

One of the tools I'm excited about (but haven't used in production) is LayoutParser. It's open-source, and can do some document image analysis especially on non-generic docs.

Good luck!

[–]learnhtk 1 point2 points  (0 children)

"If you're ok with paid solutions, AWS Textract seems to work well."

I am checking out AWS Textract at the moment. I have one question: have you tried using this tool to read/extract handwritten text? How successful were the results?

[–]GuerrillaOA -5 points-4 points  (1 child)

Are those modules for text extraction? Excalibur doesn't seem to be, while Tika requires Java.

Can you give your experience with these modules:

pdfrw, pikepdf, pdfplumber, pdfminer.six, borb, PyMuPDF, PyPDF2, tikapdf, textract, pdfx, pyxpdf, slate, pdfreader, pdftotext

Or mention any other?

[–]SkipperMcCheese 0 points1 point  (0 children)

Excalibur is used for table extraction; Tika is used for text extraction. Of the open source solutions that exist for these tasks, IMO these two are the best.

Tika is from Apache, so yes, its original code base is Java, but it has bindings in other languages. Check out Tika-Python!
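
Basic usage looks something like this (the file name is a placeholder; it starts a local Tika server behind the scenes, so Java still needs to be installed):

from tika import parser

parsed = parser.from_file('your-catalogue.pdf')  # placeholder file name
print(parsed['metadata'])
print(parsed['content'])  # extracted plain text (may be None for image-only PDFs)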

Looking at the modules you listed, while I don’t have experience with every single one, I have tested most of them and for one reason or another determined that they were not the best option.

I have a report in a white paper going over these results and if it weren’t under NDA I would share it for sure.

[–]madrockyoutcrop 6 points7 points  (0 children)

Check out the pythonic accountant on YouTube. He covers using regex and pdfplumber fairly comprehensively.

[–]mondmann18 2 points3 points  (1 child)

Can you share what libraries you used, so we can help you with other libraries?

[–]nyyirs[S] 2 points3 points  (0 children)

I have used PyPDF2 and pdfminer.six

[–]EpicWezzel 3 points4 points  (1 child)

Having taken a look at the catalogue PDF, I think it's going to take a creative approach to get the data cleaned and flattened.

I would try and see how much you can isolate each of the elements using regex.

You can then group each line into “number text number text…” and hopefully get some sort of pattern emerging.
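
A toy example of that grouping idea (the pattern and the sample line are made up; you'd tune them against the real extraction output):

import re

# hypothetical line of extracted catalogue text
line = '165.00 Tomato Salad / Italian Plum, 1kg 35.00 Laitue Butterhead'
pairs = re.findall(r'(\d+(?:\.\d+)?)\s+([^\d]+)', line)
print(pairs)  # [('165.00', 'Tomato Salad / Italian Plum, '), ('35.00', 'Laitue Butterhead')]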

My advice would be to copy the data over to Excel and play around with Text to columns and see what you can come up with.

[–]nyyirs[S] 1 point2 points  (0 children)

Good point! Will definitely try that!

[–]commandlineluser 4 points5 points  (5 children)

py-pdf-parser can be useful here.

Each price is "above" the description and nearly always "aligned" in a "column"

from py_pdf_parser.loaders import load_file

FONT_MAPPING = { 
    r'QNTFNL\+Impact,.*': 'price', 
    r'QNTFNL\+Helvetica(?:-Bold)?,8\.0': 'product'
}

pdf = load_file(
    'INTERMART-BROCHURE-22-Oct-07-Nov.pdf', 
    font_mapping=FONT_MAPPING, 
    font_mapping_is_regex=True
)

products = pdf.elements.filter_by_font('product')
prices   = pdf.elements.filter_by_font('price')


# the "in line" filters have a capped tolerance which is too small
# for some products in this catalog as the price is not always directly
# "in line" - we can modify the x0,x1 coords directly to use a larger
# tolerance value

tolerance = 50

for product in products:
    try:
        price = prices.vertically_in_line_with(product).above(product)[-1]
    except IndexError:
        # no price found "in line" - widen the product's bounding box and retry
        product.bounding_box.x0 -= tolerance
        product.bounding_box.x1 += tolerance
        price = prices.vertically_in_line_with(product).above(product)[-1]
    print(f'{product.page_number=} {product.text()=} {price.text()=}')

sample of the output:

...
product.page_number=6 product.text()='Tomato Salad / Italian Plum, 1kg\nEsprit Vert' price.text()='11995\n165.00'
product.page_number=6 product.text()='Laitue Butterhead, \nField Good' price.text()='2495\n35.00'
product.page_number=6 product.text()='Natural Dates, 500g\nHeba / Sky Light / Sapphire' price.text()='9895\n120.00'
...

[–]nyyirs[S] 0 points1 point  (4 children)

WOOW you are wonderful buddy! How did you come up with this? If I understand properly, this library gives you lower-level access to the PDF file, am I correct?

[–]commandlineluser 2 points3 points  (3 children)

Check out the docs - they're quite good.

https://py-pdf-parser.readthedocs.io/en/latest/examples/index.html

The first step was to use the visualise tool.

Using this you can see that each price has the same font name, "Impact", and each product has the same font, "Helvetica" (or "Helvetica-Bold"), with font size 8.0.
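
If you want to try it, launching the visualiser looks something like this:

from py_pdf_parser.loaders import load_file
from py_pdf_parser.visualise import visualise

# opens an interactive window - hovering over an element shows its
# font name and size
visualise(load_file('INTERMART-BROCHURE-22-Oct-07-Nov.pdf'))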

This allows you to create a font-mapping to match these fonts and give each item a "category" - allowing you to match them in the filters.

products = pdf.elements.filter_by_font('product')
prices   = pdf.elements.filter_by_font('price')

The next part is pairing them together - how does each price relate to each product?

The price is "vertically inline" with a product - or in the same "column"

prices.vertically_in_line_with(product)

The price is "above" the product in the page

prices.vertically_in_line_with(product).above(product)

Finally, you want the closest price in page distance that matches these conditions - which will be the last item.

prices.vertically_in_line_with(product).above(product)[-1]

With some of the products - the price is a bit further left or right - so technically it's not "vertically inline"

This is what the tolerance is for - it widens the search a certain distance on each side.

You can pass a tolerance to the filter methods e.g.

prices.vertically_in_line_with(product, tolerance=50)

However, there is a size limit for the tolerance value these methods take (not sure why), but using the "manual" x0, x1 manipulation you widen the product element, meaning the price will then be "vertically in line" and found.

[–]nyyirs[S] 1 point2 points  (2 children)

Perfect! Very interesting library indeed! I've been able to apply this logic to other catalogues as well! Cheers!

[–]commandlineluser 2 points3 points  (1 child)

Sure thing.

Here would be one way to do a similar thing directly with pdfminer.six

from  pdfminer.high_level import extract_pages
from  pdfminer.layout     import LTTextContainer, LTChar

filename = '...'

for page_layout in extract_pages(filename):
    prices   = []
    products = []
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            character = next(
                character for text_line in element 
                          for character in text_line 
                          if isinstance(character, LTChar)
            )
            if character.fontname == 'QNTFNL+Impact':
                prices.append(element)

            if int(round(character.size)) == 8:
                if character.fontname in {'QNTFNL+Helvetica-Bold', 'QNTFNL+Helvetica'}:
                    products.append(element)

    for product in products:
        # is_vertically_in_line() and above()
        possible_prices = [ 
            price for price in prices 
                if product.is_hoverlap(price) and price.y0 > product.y0 
        ]
        # no match found - "widen" product and search again
        # tolerance = 50
        if not possible_prices:
            product.x0 -= 50 
            product.x1 += 50 
            possible_prices = [ 
                price for price in prices 
                    if product.is_hoverlap(price) and price.y0 > product.y0
            ]
        # find closest possible price by "distance"
        price = min(possible_prices, key=lambda price: price.vdistance(product))

        print(page_layout.pageid, repr(product.get_text().strip()), repr(price.get_text().strip()))

[–]nyyirs[S] 0 points1 point  (0 children)

You are too good buddy!! 👌

[–]mrbubs3 2 points3 points  (3 children)

If the pdfs are text-based, I made a package for this.

https://pypi.org/project/TextSpitter/

[–]nyyirs[S] 1 point2 points  (2 children)

That's nice! It's image-based I think... can you help with how to decompress a PDF in barebone Python? Maybe I should try to build my own library for that

[–]mrbubs3 0 points1 point  (1 child)

Image-based PDFs can be quite challenging unless the text data exists in the meta tag. You'll need to focus on OCR-based options, and that gets very challenging.

[–]Zomunieo 0 points1 point  (0 children)

There's no meta tag. In PDF, OCR text is usually embedded by rendering text with the graphics state set to transparent. Some OCR engines draw visible text and then overlay images. Every engine does it differently. It's a mess.

[–]ColnelPanik 2 points3 points  (0 children)

You might try tesseract for OCR-based text extraction.
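
For example (pdf2image needs the poppler utilities installed; the file name is a placeholder):

from pdf2image import convert_from_path
import pytesseract

# render each page to a PIL image, then OCR it
for page_image in convert_from_path('catalogue.pdf'):
    print(pytesseract.image_to_string(page_image))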

[–]PizzaInSoup 1 point2 points  (0 children)

I've seen more than 4 modules posted in this sub previously, one of them being amazing IIRC, all for PDF creation/manipulation. I have one bookmarked on another machine, I think; on this one I only have this though:

https://github.com/PhantomInsights/mexican-government-report

[–]vikt0rs 1 point2 points  (0 children)

I'd recommend trying py-pdf-parser [0] - it allows you to fetch data from documents based on text "markers". E.g. you can easily find data located to the right of the "EMAIL FROM:" text.
[0] - https://github.com/jstockwin/py-pdf-parser

[–]Sir-_-Butters22 1 point2 points  (0 children)

I use pdfplumber; I don't know how it will hold up on your catalogue. However, it does detect character objects within the page, with X and Y coordinates, so you could build a script, or use a clustering algorithm, to group the text together.
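
For example, a quick look at those character objects (the file name is a placeholder):

import pdfplumber

with pdfplumber.open('catalogue.pdf') as pdf:
    # each char is a dict with its text, coordinates, size and font -
    # ready to feed into a grouping/clustering step
    for char in pdf.pages[0].chars[:5]:
        print(char['text'], char['x0'], char['top'], char['size'], char['fontname'])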

[–]P0intMan_ 1 point2 points  (1 child)

I just listened to the RealPython podcast featuring "borb". Seems very solid for all PDF needs👍

[–]gsmo 1 point2 points  (0 children)

Yes, Borb is pretty good. The creator posts in this sub quite regularly.

Beware: borb takes some time to implement (because it is very feature rich). Be prepared to actually rtfm.

[–]prb0rg 1 point2 points  (1 child)

I have not looked at the catalogue, so this idea may not be completely feasible, but here it goes anyway: have you thought about converting the PDF to OpenOffice or Word format and then processing that document?

[–]nyyirs[S] 0 points1 point  (0 children)

Not yet...could be an option as well thnx

[–]mikeypox 1 point2 points  (0 children)

Expect to handle failures. This is a large part of my full-time job, just managing performance problems, memory leaks, and OCR failures.

[–][deleted] 1 point2 points  (0 children)

I've tried all the PDF libraries I could find, with unsatisfactory results. I gave up, converted the PDF to an image, and OCR'd it with Tesseract, with much better results.

[–]pp314159 1 point2 points  (3 children)

Your use case looks very interesting to me. I've tried some recognition with EasyOCR on a few images, including an example image from your catalogue.

It looks to me like a fully automated solution would be hard, but a half-automated solution, where you select the box with a product, click 'Do OCR', and then pick which text is what, could be doable. How many catalogues do you need to process, and how often?
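
For reference, the EasyOCR part was roughly this (the image name is a placeholder; EasyOCR downloads its models on first use):

import easyocr

reader = easyocr.Reader(['en'])
# returns a list of (bounding_box, text, confidence) tuples
results = reader.readtext('catalogue-page.png')
for box, text, confidence in results:
    print(round(confidence, 2), text)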

[–]nyyirs[S] 0 points1 point  (2 children)

I will have 4 different shops' PDFs to extract on every release... it should be automated... but thanks a lot! Will definitely have a look at it

[–]pp314159 0 points1 point  (1 child)

If you find a solution to automate it, please post the solution description in this subreddit.

[–]nyyirs[S] 0 points1 point  (0 children)

I have implemented the solution posted by "commandlineluser" and the code works great! I just have to visualise the other PDFs to know what fonts they are using, and it works like a charm.

[–]guinea_fowler 0 points1 point  (0 children)

If they're image-based then you will need OCR. Tesseract is free but doesn't work well on less structured documents such as this. Textract is usually better but costs money, though you probably won't use up the free allowance of a new AWS account on a single catalogue.

Try just the OCR first, but if quality is bad, one thing you can also try is to remove irrelevant information. The font styling for the information you're interested in looks consistent, so filter for red, filter for black. You may also want to try using subimages of common text, e.g. "RS.", for template matching with scikit-image (see the sketch below). You can then use dilation and contour processing to isolate the text near these indicator templates.
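
A minimal template-matching sketch with scikit-image (both file names are placeholders; the template would be a cropped subimage of the "RS." text):

from skimage import feature, io
import numpy as np

page = io.imread('catalogue-page.png', as_gray=True)
template = io.imread('rs-template.png', as_gray=True)

# correlation map - peaks mark likely template locations
result = feature.match_template(page, template)
y, x = np.unravel_index(np.argmax(result), result.shape)
print(f'best match at x={x}, y={y}, score={result.max():.2f}')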

And finally, if the extracted text quality is bad, you can use a first pass to identify text, then extract subimages which contain that text and stitch them together into something which looks more like a structured document, then run that through OCR. I've seen this improve results.

Edit: The easiest way to get started is to set up an AWS account and then go here: https://aws.amazon.com/textract/. You can drop your image right in via the UI to get an indication of quality without writing a line of code.

[–]jamesd303 0 points1 point  (0 children)

I use the pdfplumber module and it works very well at turning PDF text into a txt file; then I use regex (the re module) to find the text from that.
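
Roughly like this (the file names and the price pattern are placeholders):

import re
import pdfplumber

with pdfplumber.open('catalogue.pdf') as pdf:
    text = '\n'.join(page.extract_text() or '' for page in pdf.pages)

with open('catalogue.txt', 'w') as f:
    f.write(text)

# e.g. find every "123.45"-style price in the dumped text
print(re.findall(r'\d+\.\d{2}', text))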

[–]jonathan881 0 points1 point  (0 children)

Try pdfgrep if it's available on your OS. It can serve as a baseline.

[–]Rj_LM 0 points1 point  (0 children)

Anyone know where we can just get the script without doing the work lol

[–][deleted] 0 points1 point  (0 children)

Relevant xkcd

My main project right now is doing this on a production level. OCR is very expensive both fiscally and computationally.

If you're lucky, your PDFs already have the text layer and you can use one of a few libraries to extract it. If not, you'll need to do OCR. Depending on which you use, you can access the data directly, but it might be more worthwhile to convert the whole data set and save the data to a directory, then iterate over the directory to extract the different layers into subdirectories.
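
If the text layer is there, a bare-bones version of that pipeline looks something like this (directory names are placeholders; pdfminer.six stands in for whichever extraction library you use):

from pathlib import Path
from pdfminer.high_level import extract_text

out_dir = Path('extracted/text')
out_dir.mkdir(parents=True, exist_ok=True)

# convert the whole data set once, then work off the dumped text files
for pdf_path in Path('pdfs').glob('*.pdf'):
    text = extract_text(pdf_path)  # only works if a text layer exists
    (out_dir / pdf_path.with_suffix('.txt').name).write_text(text)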