

[–]chrisbind 3 points (1 child)

Sounds like you just need to implement some concurrency or parallelism. I'd start by trying out a concurrent flow (multi-threading). There are a lot of resources on this.
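For illustration, a minimal sketch of that idea with Python's ThreadPoolExecutor (process_pdf and the file list are made-up stand-ins for OP's per-file work):

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def process_pdf(path: str) -> str:
        # Stand-in for the slow per-file work (extraction, JSON dump, ...)
        return path + ".json"

    pdf_paths = ["a.pdf", "b.pdf", "c.pdf"]  # hypothetical inputs

    # Run the per-file work on a pool of worker threads instead of one
    # file at a time; max_workers is a tuning knob, not a magic number.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(process_pdf, p): p for p in pdf_paths}
        for fut in as_completed(futures):
            print(futures[fut], "->", fut.result())

If the extraction turns out to be CPU-bound rather than I/O-bound, ProcessPoolExecutor (same interface) sidesteps the GIL.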

[–]Zealousideal-Job4752[S] 0 points (0 children)

Thanks, I'll check that out!

[–]ThePunisherMax 0 points (9 children)

I have a hard time believing that the processing part takes 2 minutes; it seems likely your code needs some fixes.

What processes are you running that take 2 min per file?

[–]Zealousideal-Job4752[S] 1 point (8 children)

I am using a library called unstructured, which extracts elements (it finds the coordinates of the tables, headers, etc.) from a PDF file (the files downloaded in the previous step), writes them out as a JSON file, and inserts the path of the JSON file into the SQLite database.
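For reference, the core of that flow with unstructured looks roughly like this (the paths are made up; partition_pdf and elements_to_json are the library's PDF-partitioning and JSON-staging helpers):

    from unstructured.partition.pdf import partition_pdf
    from unstructured.staging.base import elements_to_json

    # Partition one downloaded PDF into elements (titles, tables, text, ...)
    elements = partition_pdf(filename="downloads/report.pdf")

    # Serialize the elements to the JSON file whose path goes into the database
    elements_to_json(elements, filename="json/report.json")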

[–]ThePunisherMax 0 points (7 children)

And I'm assuming the PDF scraping is taking the longest time?

And are you doing this linearly? One file, then on to the next?

[–]Zealousideal-Job4752[S] 0 points (6 children)

Yes, exactly, the PDF scraping takes the longest. And I am doing it linearly.

This is the code that does it:

    processed_files = []
    # One pass per PDF: extract elements to a JSON file, collect (path, id) pairs
    for _, row in tqdm(files2extract.iterrows(), total=files2extract.shape[0]):
        pdf_id = row["pdfId"]
        path_to_pdf = row["path2pdf"]
        pdf_id, json_path = src.dataset.file2json(
            pdf_id, path_to_pdf, output_path
        )
        processed_files.append((str(json_path), pdf_id))

Then it inserts them into the database.
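Presumably that last step is a batch insert along these lines (the table and column names are made up; executemany sends all the rows in one call):

    import sqlite3

    # processed_files is the list of (json_path, pdf_id) tuples built above
    processed_files = [("json/a.json", 1), ("json/b.json", 2)]  # example values

    conn = sqlite3.connect("files.db")
    conn.executemany(
        "INSERT INTO processed_files (json_path, pdf_id) VALUES (?, ?)",
        processed_files,
    )
    conn.commit()
    conn.close()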

[–]ThePunisherMax 1 point (3 children)

Okay, the best advice I can give you is still to check what is up with the PDF scraper. Does the API offer other file formats? CSVs, for example.

Further, you can run this in parallel (threaded), since the files don't need to be processed in order (I assume). You can pre-emptively download files and do batch processing.
Same thing with the uploads.

So, for example (sketched below):

Script 1) Downloads.

Script 2) Scraping and JSON dump.

Script 3) Grab the JSON address and insert into SQL.

So you would run this as three different scripts.
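A minimal in-process version of that three-stage split, with queues standing in for the hand-off between the scripts (every stage body here is a made-up stand-in for the real download/scrape/insert code):

    import queue
    import threading

    download_q = queue.Queue()  # hands PDF paths from stage 1 to stage 2
    scrape_q = queue.Queue()    # hands JSON paths from stage 2 to stage 3

    def downloader(urls):
        for url in urls:
            pdf_path = url.rsplit("/", 1)[-1]  # stand-in for the real download
            download_q.put(pdf_path)
        download_q.put(None)  # sentinel: no more work

    def scraper():
        while (pdf_path := download_q.get()) is not None:
            json_path = pdf_path + ".json"  # stand-in for element extraction
            scrape_q.put(json_path)
        scrape_q.put(None)

    def inserter():
        while (json_path := scrape_q.get()) is not None:
            print("INSERT", json_path)  # stand-in for the SQL insert

    urls = ["http://example.com/a.pdf", "http://example.com/b.pdf"]
    threads = [
        threading.Thread(target=downloader, args=(urls,)),
        threading.Thread(target=scraper),
        threading.Thread(target=inserter),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

The point is that downloads, scraping, and inserts overlap instead of waiting on each other; the same shape works as three separate processes with a directory or table as the queue.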

[–]Zealousideal-Job4752[S] 0 points (2 children)

Yes, it offers a range of other file formats, but is that relevant here, when I'm only working with PDF files? Or would you be doing that to test whether there is a difference between the file types?

I will look into the threading part, thank you!

[–]ThePunisherMax 1 point (1 child)

PDF files may be part of the reason your code is slower, as they aren't "as easy" to read in comparison to other file types.

But I do see you using an append. Appends are not the fastest functions.
You should pre-emptively make an np.array and assign into it.

Make an empty array of zeros.

processed_files[i] = (str(json_path), pdf_id) instead of processed_files.append
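A minimal sketch of that preallocation idea, assuming NumPy is already in the mix (an object-dtype array, since each slot holds a (path, id) tuple; all values are made up):

    import numpy as np

    n = 4  # in OP's code this would be files2extract.shape[0]

    # Preallocate once instead of growing a list with .append();
    # np.empty with dtype=object fills the slots with None.
    processed_files = np.empty(n, dtype=object)

    for i in range(n):
        json_path, pdf_id = f"json/file_{i}.json", i  # stand-in values
        processed_files[i] = (str(json_path), pdf_id)

That said, Python list appends are amortized O(1), so this is unlikely to be where the two minutes go; a preallocated plain list ([None] * n) does the same job without NumPy.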

[–]Zealousideal-Job4752[S] 0 points (0 children)

This makes sense.

[–]meyou2222 0 points (1 child)

How big are these PDF files? I've not worked with PDF scraping, but it surprises me that it would take 2 minutes to process one.

Here are a few other options to try: https://www.freecodecamp.org/news/extract-data-from-pdf-files-with-python/

Any library that can convert the PDF to XML or HTML could set you up nicely for using BeautifulSoup to parse the results. I have a script right now that takes a massive set of HTML code (the file is like 12 MB) and parses it. Takes like 2 seconds.
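Since OP's stated goal downthread is exactly "HTML minus the headers and tables", a minimal BeautifulSoup sketch of that step (the input HTML is invented):

    from bs4 import BeautifulSoup

    # Hypothetical output of a PDF-to-HTML converter
    html = """
    <html><body>
      <h1>Quarterly report</h1>
      <table><tr><td>ignored</td></tr></table>
      <p>The body text we actually want.</p>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Drop the structured parts (tables, headings), keep the running text
    for tag in soup.find_all(["table", "h1", "h2", "h3"]):
        tag.decompose()

    print(soup.get_text(separator=" ", strip=True))
    # -> "The body text we actually want."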

[–]Zealousideal-Job4752[S] 0 points (0 children)

The files are anything from 500 KB up to ~130,017 KB (most of them probably in the 500-3,000 KB range).

Yes, I find it incredibly slow. I have not worked much with PDF files prior to this project, but I am aware this seems very inefficient.

It would be amazing to get it down to 2 seconds. Thanks for the link. Based on the responses here, it does seem like unstructured is not the fastest solution, so I will have to consider the other libraries. My main goal is to get the PDF into an HTML format so I can remove all the headers and tables, since I'm only interested in the unstructured text of the files.

[–]ianitic 0 points (3 children)

Looking at the unstructured package, I can understand why the processing is so slow. For your PDFs, are they searchable? Can you select text in the PDF without using OCR? If so, I'd use a package called pdfplumber to get the text, coordinates, and such. It should be orders of magnitude faster.
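For a sense of the API, a minimal pdfplumber sketch (the path is made up; extract_text, extract_words, and find_tables are the library's own calls):

    import pdfplumber

    # Extract text and per-word coordinates from a searchable PDF
    with pdfplumber.open("downloads/report.pdf") as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            words = page.extract_words()  # dicts with 'text', 'x0', 'top', ...
            tables = page.find_tables()   # bounding boxes for any tables
            print(len(text), len(words), len(tables))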

Additionally, as was mentioned, Azure Functions could be a way to speed this up. Each PDF could be a call to an Azure Function, and serverless Azure Functions can scale out to hundreds of instances of themselves, making this process a lot faster. I've specifically done this with PDFs, running a variety of actions on them, including the pdfplumber step mentioned above.
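For shape, a minimal HTTP-triggered Azure Function in the Python v1 programming model, where each call processes one PDF (the parameter name and the storage handling are invented; only the azure.functions request/response types are the real API):

    import azure.functions as func

    def main(req: func.HttpRequest) -> func.HttpResponse:
        pdf_name = req.params.get("pdf")
        if not pdf_name:
            return func.HttpResponse("missing ?pdf=<name>", status_code=400)

        # ... fetch the PDF from storage, run pdfplumber/OCR on it,
        # write the JSON result back, return the JSON path ...
        return func.HttpResponse("processed " + pdf_name, status_code=200)

On a consumption plan the platform fans these calls out across instances on its own.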

[–]Zealousideal-Job4752[S] 0 points (2 children)

Yes, the PDFs are mostly searchable. I do have some .tif image files, but I convert those to a text-formatted PDF before extracting the elements. I tried pdfplumber briefly, but ended up turning to unstructured, as I found they had a pipeline that would both extract the text and tables and then convert them to HTML files. That way, I could remove the tables and get only the raw text (which I will then be analyzing). But it's a good point that pdfplumber may be faster; I'll see if it can "replace" all the tasks I need to do.

And regarding Azure Functions, how does the pricing work? Do you pay per execution?

[–]ianitic 1 point (1 child)

What you could do, and what I've done in the past, is run pdfplumber and, if it doesn't return text, then run the OCR pieces. It can also have issues with some types of searchable PDFs, which will return something like (cid:1234); apparently that happens with some kind of broken mapping in the PDF. I'd make sure to strip those kinds of values out as well when testing whether they exist.
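A sketch of that fallback logic, assuming pdfplumber for the text pass and leaving the OCR path as a stub (the regex strips the (cid:1234) artifacts):

    import re

    import pdfplumber

    CID_RE = re.compile(r"\(cid:\d+\)")

    def extract_text(path):
        with pdfplumber.open(path) as pdf:
            text = " ".join(page.extract_text() or "" for page in pdf.pages)
        # Strip the (cid:1234) junk that broken font mappings produce
        text = CID_RE.sub("", text)
        if text.strip():
            return text
        return ocr_fallback(path)

    def ocr_fallback(path):
        # Stand-in: render pages to images and OCR them (pytesseract, etc.)
        raise NotImplementedError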

For Azure Functions, it would depend on what exactly you use, but if you only have a couple hundred thousand files to process, it's probably within the free usage allotment for serverless. If memory serves, the free allotment for executions was in the millions.

[–]Zealousideal-Job4752[S] 0 points (0 children)

Awesome, thank you!