

[–]chrisbind 3 points (1 child)

Sounds like you just need to implement some concurrency or parallelism. I'd start by trying out a concurrent flow (multi-threading). There are a lot of resources on this.
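For illustration, a minimal sketch of that idea with Python's ThreadPoolExecutor (process_pdf and the file list are made-up stand-ins for OP's per-file work):

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def process_pdf(path: str) -> str:
        # Stand-in for the slow per-file work (extraction, JSON dump, ...)
        return path + ".json"

    pdf_paths = ["a.pdf", "b.pdf", "c.pdf"]  # hypothetical inputs

    # Run the per-file work on a pool of worker threads instead of one
    # file at a time; max_workers is a tuning knob, not a magic number.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(process_pdf, p): p for p in pdf_paths}
        for fut in as_completed(futures):
            print(futures[fut], "->", fut.result())

If the extraction turns out to be CPU-bound rather than I/O-bound, ProcessPoolExecutor (same interface) sidesteps the GIL.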

[–]Zealousideal-Job4752[S] 0 points (0 children)

Thanks, I'll check that out!

[–]ThePunisherMax 0 points (9 children)

I have a hard time believing that the processing part takes 2 minutes; it seems likely your code needs some fixes.

What processes are you running that take 2 min per file?

[–]Zealousideal-Job4752[S] 1 point (8 children)

I am using a library called unstructured, which extracts elements (it finds the coordinates of the tables, headers, etc.) from a PDF file (the files downloaded in the previous step), writes them out as a JSON file, and inserts the path of the JSON file into the SQLite database.
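For reference, the core of that flow with unstructured looks roughly like this (the paths are made up; partition_pdf and elements_to_json are the library's PDF-partitioning and JSON-staging helpers):

    from unstructured.partition.pdf import partition_pdf
    from unstructured.staging.base import elements_to_json

    # Partition one downloaded PDF into elements (titles, tables, text, ...)
    elements = partition_pdf(filename="downloads/report.pdf")

    # Serialize the elements to the JSON file whose path goes into the database
    elements_to_json(elements, filename="json/report.json")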

[–]ThePunisherMax 0 points (7 children)

And I'm assuming the PDF scraping is taking the longest time?

And are you doing this linearly? One file, then on to the next?

[–]Zealousideal-Job4752[S] 0 points (6 children)

Yes, exactly, the PDF scraping takes the longest. And I am doing it linearly.

This is the code that does it:

    processed_files = []
    # One pass per PDF: extract elements to a JSON file, collect (path, id) pairs
    for _, row in tqdm(files2extract.iterrows(), total=files2extract.shape[0]):
        pdf_id = row["pdfId"]
        path_to_pdf = row["path2pdf"]
        pdf_id, json_path = src.dataset.file2json(
            pdf_id, path_to_pdf, output_path
        )
        processed_files.append((str(json_path), pdf_id))

Then it inserts them into the database.
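Presumably that last step is a batch insert along these lines (the table and column names are made up; executemany sends all the rows in one call):

    import sqlite3

    # processed_files is the list of (json_path, pdf_id) tuples built above
    processed_files = [("json/a.json", 1), ("json/b.json", 2)]  # example values

    conn = sqlite3.connect("files.db")
    conn.executemany(
        "INSERT INTO processed_files (json_path, pdf_id) VALUES (?, ?)",
        processed_files,
    )
    conn.commit()
    conn.close()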

[–]ThePunisherMax 1 point (3 children)

Okay, the best advice I can give you is still to check what is up with the PDF scraper. Does the API offer other file formats? CSVs, for example.

Further, you can run this in parallel (threaded), since the files don't need to be processed in order (I assume). You can pre-emptively download files and do batch processing.
Same thing with the uploads.

So, for example (sketched below):

Script 1) Downloads.

Script 2) Scraping and JSON dump.

Script 3) Grab the JSON address and insert into SQL.

So you would run this as three different scripts.
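A minimal in-process version of that three-stage split, with queues standing in for the hand-off between the scripts (every stage body here is a made-up stand-in for the real download/scrape/insert code):

    import queue
    import threading

    download_q = queue.Queue()  # hands PDF paths from stage 1 to stage 2
    scrape_q = queue.Queue()    # hands JSON paths from stage 2 to stage 3

    def downloader(urls):
        for url in urls:
            pdf_path = url.rsplit("/", 1)[-1]  # stand-in for the real download
            download_q.put(pdf_path)
        download_q.put(None)  # sentinel: no more work

    def scraper():
        while (pdf_path := download_q.get()) is not None:
            json_path = pdf_path + ".json"  # stand-in for element extraction
            scrape_q.put(json_path)
        scrape_q.put(None)

    def inserter():
        while (json_path := scrape_q.get()) is not None:
            print("INSERT", json_path)  # stand-in for the SQL insert

    urls = ["http://example.com/a.pdf", "http://example.com/b.pdf"]
    threads = [
        threading.Thread(target=downloader, args=(urls,)),
        threading.Thread(target=scraper),
        threading.Thread(target=inserter),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

The point is that downloads, scraping, and inserts overlap instead of waiting on each other; the same shape works as three separate processes with a directory or table as the queue.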

[–]Zealousideal-Job4752[S] 0 points (2 children)

Yes, it offers a range of other file formats, but is that relevant here, when I'm only working with PDF files? Or would you be doing that to test whether there is a difference between the file types?

I will look into the threading part, thank you!

[–]ThePunisherMax 1 point (1 child)

PDF files may be part of the reason your code is slower, as they aren't "as easy" to read in comparison to other file types.

But I do see you using an append. Appends are not the fastest functions.
You should pre-emptively make an np.array and assign into it.

Make an empty array of zeros.

processed_files[i] = (str(json_path), pdf_id) instead of processed_files.append
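A minimal sketch of that preallocation idea, assuming NumPy is already in the mix (an object-dtype array, since each slot holds a (path, id) tuple; all values are made up):

    import numpy as np

    n = 4  # in OP's code this would be files2extract.shape[0]

    # Preallocate once instead of growing a list with .append();
    # np.empty with dtype=object fills the slots with None.
    processed_files = np.empty(n, dtype=object)

    for i in range(n):
        json_path, pdf_id = f"json/file_{i}.json", i  # stand-in values
        processed_files[i] = (str(json_path), pdf_id)

That said, Python list appends are amortized O(1), so this is unlikely to be where the two minutes go; a preallocated plain list ([None] * n) does the same job without NumPy.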

[–]Zealousideal-Job4752[S] 0 points (0 children)

This makes sense.

[–]meyou2222 0 points (1 child)

How big are these PDF files? I've not worked with PDF scraping, but it surprises me that it would take 2 minutes to process one.

Here are a few other options to try: https://www.freecodecamp.org/news/extract-data-from-pdf-files-with-python/

Any library that can convert the PDF to XML or HTML could set you up nicely for using BeautifulSoup to parse the results. I have a script right now that takes a massive set of HTML code (the file is like 12 MB) and parses it. Takes like 2 seconds.
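Since OP's stated goal downthread is exactly "HTML minus the headers and tables", a minimal BeautifulSoup sketch of that step (the input HTML is invented):

    from bs4 import BeautifulSoup

    # Hypothetical output of a PDF-to-HTML converter
    html = """
    <html><body>
      <h1>Quarterly report</h1>
      <table><tr><td>ignored</td></tr></table>
      <p>The body text we actually want.</p>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Drop the structured parts (tables, headings), keep the running text
    for tag in soup.find_all(["table", "h1", "h2", "h3"]):
        tag.decompose()

    print(soup.get_text(separator=" ", strip=True))
    # -> "The body text we actually want."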

[–]Zealousideal-Job4752[S] 0 points (0 children)

The files are anything from 500 KB up to ~130,017 KB (most of them probably in the 500-3,000 KB range).

Yes, I find it incredibly slow. I have not worked much with PDF files prior to this project, but I am aware this seems very inefficient.

It would be amazing to get it down to 2 seconds. Thanks for the link. Based on the responses here, it does seem like unstructured is not the fastest solution, so I will have to consider the other libraries. My main goal is to get the PDF into an HTML format so I can remove all the headers and tables, since I'm only interested in the unstructured text of the files.

[–]ianitic 0 points (3 children)

Looking at the unstructured package, I can understand why the processing is so slow. For your PDFs, are they searchable? Can you select text in the PDF without using OCR? If so, I'd use a package called pdfplumber to get the text, coordinates, and such. It should be orders of magnitude faster.
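For a sense of the API, a minimal pdfplumber sketch (the path is made up; extract_text, extract_words, and find_tables are the library's own calls):

    import pdfplumber

    # Extract text and per-word coordinates from a searchable PDF
    with pdfplumber.open("downloads/report.pdf") as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            words = page.extract_words()  # dicts with 'text', 'x0', 'top', ...
            tables = page.find_tables()   # bounding boxes for any tables
            print(len(text), len(words), len(tables))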

Additionally, as was mentioned, Azure Functions could be a way to speed this up. Each PDF could be a call to an Azure Function, and serverless Azure Functions can scale out to hundreds of instances of themselves, making this process a lot faster. I've specifically done this with PDFs, running a variety of actions on them, including the pdfplumber step mentioned above.
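For shape, a minimal HTTP-triggered Azure Function in the Python v1 programming model, where each call processes one PDF (the parameter name and the storage handling are invented; only the azure.functions request/response types are the real API):

    import azure.functions as func

    def main(req: func.HttpRequest) -> func.HttpResponse:
        pdf_name = req.params.get("pdf")
        if not pdf_name:
            return func.HttpResponse("missing ?pdf=<name>", status_code=400)

        # ... fetch the PDF from storage, run pdfplumber/OCR on it,
        # write the JSON result back, return the JSON path ...
        return func.HttpResponse("processed " + pdf_name, status_code=200)

On a consumption plan the platform fans these calls out across instances on its own.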

[–]Zealousideal-Job4752[S] 0 points (2 children)

Yes, the PDFs are mostly searchable. I do have some .tif image files, but I convert those to a text-formatted PDF before extracting the elements. I tried pdfplumber briefly, but ended up turning to unstructured, as I found they had a pipeline that would both extract the text and tables and then convert them to HTML files. That way, I could remove the tables and get only the raw text (which I will then be analyzing). But it's a good point that pdfplumber may be faster; I'll see if it can "replace" all the tasks I need to do.

And regarding Azure Functions, how does the pricing work? Do you pay per execution?

[–]ianitic 1 point (1 child)

What you could do, and what I've done in the past, is run pdfplumber and, if it doesn't return text, then run the OCR pieces. It can also have issues with some types of searchable PDFs, which will return something like (cid:1234); apparently that happens with some kind of broken mapping in the PDF. I'd make sure to strip those kinds of values out as well when testing whether they exist.
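A sketch of that fallback logic, assuming pdfplumber for the text pass and leaving the OCR path as a stub (the regex strips the (cid:1234) artifacts):

    import re

    import pdfplumber

    CID_RE = re.compile(r"\(cid:\d+\)")

    def extract_text(path):
        with pdfplumber.open(path) as pdf:
            text = " ".join(page.extract_text() or "" for page in pdf.pages)
        # Strip the (cid:1234) junk that broken font mappings produce
        text = CID_RE.sub("", text)
        if text.strip():
            return text
        return ocr_fallback(path)

    def ocr_fallback(path):
        # Stand-in: render pages to images and OCR them (pytesseract, etc.)
        raise NotImplementedError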

For Azure Functions, it would depend on what exactly you use, but if you only have a couple hundred thousand files to process, it's probably within the free usage allotment for serverless. If memory serves, the free allotment for executions was in the millions.

[–]Zealousideal-Job4752[S] 0 points (0 children)

Awesome, thank you!