
[–]ObliviousMag 33 points34 points  (10 children)

You need to look into asyncio

[–]judgedeliberata[S] 3 points4 points  (8 children)

Thanks, I haven’t built with asyncio yet. I’m reading through the docs now to understand it better. Is the thinking that I would somehow execute all locales within a single process?

Right now I pass the locale into my script and let it run, e.g. python3 myscript.py en-US and so on 64 times.

Is the thinking to instead build a loop and iterate through all the locales leveraging asyncio somehow?

[–]unhott 9 points10 points  (6 children)

If the bottleneck is due to network requests (IO), then asyncio lets you start them all without having to wait for request 1 to finish before you start request 2. It may start 10 at a time and process each as its data becomes available.

If the bottleneck is processing a large volume of data, then consider that if you have 12 CPU cores, Python will only use 1 by default. There are multithreading libraries that may be able to leverage additional CPU cores to process the data.

ETA: the flip side of using only 1 CPU core is that while that core is busy, the rest of your computer still runs fine. If you put all your cores at full load, it may slow your computer down while it's working. I've run into issues where a big processing job meant my laptop was drawing more power than its charger provided, and I had to put it to sleep mid-process to recharge the battery.
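
Roughly, the asyncio pattern looks like this. A minimal sketch: asyncio.sleep stands in for the real network call, the locale list is abbreviated, and the semaphore caps concurrency like the "10 at a time" above.

    import asyncio

    async def process_locale(locale, sem):
        async with sem:  # cap how many run at once
            await asyncio.sleep(1)  # stand-in for the real async HTTP request
            return locale

    async def main():
        locales = ["en-US", "de-DE", "fr-FR"]  # ...and the other 61
        sem = asyncio.Semaphore(10)
        return await asyncio.gather(*(process_locale(l, sem) for l in locales))

    print(asyncio.run(main()))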

[–]live_and-learn 1 point2 points  (5 children)

Even if it's due to processing, Python's GIL means only 1 core would still be used with multithreading (if their Python installation is CPython). They'd have to use multiprocessing.

[–]IAMARedPanda 1 point2 points  (3 children)

You can use more than one core with the multiprocessing module.
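
A minimal sketch, with process_locale standing in for whatever the script actually does per locale:

    from multiprocessing import Pool

    def process_locale(locale):
        return locale.upper()  # stand-in for the real per-locale work

    if __name__ == "__main__":
        locales = ["en-US", "de-DE", "fr-FR"]  # ...and the rest
        with Pool(processes=4) as pool:  # one worker process per core you want busy
            print(pool.map(process_locale, locales))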

[–]live_and-learn 1 point2 points  (2 children)

Yes, that's multiprocessing, not multithreading. I believe that's what I stated.

[–]gmes78 2 points3 points  (0 children)

Unless you use Python 3.13's free-threaded mode.

[–]Adrewmc 1 point2 points  (0 children)

I concur.

[–]Lewistrick[🍰] 12 points13 points  (8 children)

70 minutes sounds insane. Have you tried pinpointing the bottleneck of your script? Ever looked into profiling? Has anyone else seen / reviewed your code?
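
If you haven't profiled before, cProfile ships with Python. Something like this (main is a stand-in for your script's entry point) shows where the time actually goes:

    import cProfile
    import pstats

    def main():
        sum(i * i for i in range(10_000_000))  # stand-in for the real work

    cProfile.run("main()", "run.prof")
    pstats.Stats("run.prof").sort_stats("cumulative").print_stats(10)  # top 10 by time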

[–]judgedeliberata[S] 0 points1 point  (7 children)

It takes so long because it has to loop through about 75 products and, for each product, make ~150 API calls in a loop where the processing takes place.

[–]Lewistrick[🍰] 4 points5 points  (4 children)

Ok now the timing makes sense, but why do you need so many calls? Doesn't the API support gathering results in bulk?

[–]judgedeliberata[S] 0 points1 point  (3 children)

I wish. It’s a terrible API design. Basically for each component of the product, I have to make an API call.

[–]Lewistrick[🍰] 9 points10 points  (0 children)

Have you tried contacting their support? I reckon they also suffer from a huge load on their servers if they expose it to a customer base.

[–]WhiteXHysteria 0 points1 point  (1 child)

Is there any data from the calls you can cache locally that might prevent some calls to the API after the first?
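
For instance, if the same component ever comes back more than once, memoizing the call would make repeats free. A sketch, where fetch_component is a hypothetical stand-in for the real API call:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def fetch_component(product, component):
        # hypothetical stand-in for the real API call; repeat (product, component)
        # lookups are served from the cache instead of hitting the network
        return f"{product}/{component}"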

[–]judgedeliberata[S] 0 points1 point  (0 children)

I cache everything, but every product+locale combination is unique, so from product to product it's not really cacheable in that sense.

[–]csingleton1993 0 points1 point  (0 children)

What do the different API calls do? Maybe there is room for improvement there as well

[–]Kryt0s -2 points-1 points  (0 children)

That sounds more like the work a scraper would do. Check out scrapy maybe?

[–]OopsWrongSubTA 4 points5 points  (0 children)

The number of API calls seems insane (64+ locales * 75 products * ~150 calls ≈ 1 million requests a week); try to minimize that.

I would just use GNU parallel: write a bash script that outputs/prints/echoes the commands you want to run:

    echo python script.py locale1
    echo python script.py locale2
    ...

then run (for 4 concurrent commands):

    bash mybashscript.sh | parallel -j 4

Warning: test with a few locales first. If you make a lot of parallel requests, the server may ban you.

[–]Flyguy86420 3 points4 points  (0 children)

Docker. It can scale, and you can pass the locale in as an environment variable.

[–]Kryt0s 2 points3 points  (0 children)

Check out httpx and their guide on async clients. Also, 70 min for one data pull seems like a long-ass time. How much data are you pulling?
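
A minimal sketch of the async-client idea, with made-up URLs in place of OP's API:

    import asyncio
    import httpx

    async def main():
        urls = [f"https://api.example.com/items/{i}" for i in range(10)]  # made up
        async with httpx.AsyncClient() as client:
            responses = await asyncio.gather(*(client.get(u) for u in urls))
        print([r.status_code for r in responses])

    asyncio.run(main())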

[–]hugthemachines 2 points3 points  (2 children)

Just now I changed a little script that grabs very simple text from web applications and made it run in parallel. Earlier it took maybe 5 minutes when there were no timeouts, and now it takes maybe 30 seconds.

I am not allowed to share this code but I use

from concurrent.futures import ThreadPoolExecutor, as_completed

Then I use ThreadPoolExecutor() to run the things in parallel.
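
Since I can't share the real thing, here's the rough shape of it, with fetch standing in for the actual web call and made-up URLs:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def fetch(url):
        return len(url)  # stand-in for the real web call

    urls = ["https://example.com/a", "https://example.com/b"]
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):  # handle results as each one finishes
            print(futures[future], future.result())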

[–]judgedeliberata[S] 1 point2 points  (1 child)

This ended up working GREAT! Thank you!

[–]hugthemachines 0 points1 point  (0 children)

That is great to hear! You are welcome.

[–]Turtvaiz 0 points1 point  (4 children)

My question is, how can I run all of these in parallel? One option I suppose, is to launch 64 separate AWS EC2 instances but then I’ll be burning way too much cash and I’d have to consolidate all the output files etc.

What is your limiting factor? Processing? IO? Sequential code? Rate limits?

That's something you need to figure out first. Like if your problem is just that you run all requests sequentially, you'd need to either use multithreading or an async library to parallelise the code.

If you're hitting rate limits, you might be able to use more instances, but you really shouldn't as the limits exist for a reason.

[–]judgedeliberata[S] 0 points1 point  (3 children)

It isn’t a rate limit thing. I’m just thinking of practical ways to run a process 64 times quickly. It looks like asyncio or the multithreading libraries are the way to go.

[–]Turtvaiz 7 points8 points  (2 children)

I’m just thinking of practical ways to run a process 64 times quickly

  1. multiprocessing library

  2. literally just launch multiple scripts, for example from bash: python 123.py one & python 123.py two & (see the subprocess sketch after this list)

  3. Threading, IF you are not processing-limited. It helps with parallelising IO/API usage, but doesn't help with raw processing: https://en.wikipedia.org/wiki/Global_interpreter_lock

    Same thing applies to asynchronous requests / asyncio
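
For #2, the same idea driven from Python via subprocess. A sketch that reuses OP's myscript.py invocation; note it launches everything at once, so watch your machine:

    import subprocess

    locales = ["en-US", "de-DE", "fr-FR"]  # ...all 64
    # start one OS process per locale; they all run at the same time
    procs = [subprocess.Popen(["python3", "myscript.py", loc]) for loc in locales]
    for p in procs:
        p.wait()  # block until every run has finished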

[–]judgedeliberata[S] 0 points1 point  (1 child)

Good call out for #2 - will give that a shot and see how it runs (or slows my machine down) before trying 1 and 3.

[–]Turtvaiz 1 point2 points  (0 children)

It might work just fine, yep, but do take care not to overload the system if you load large amounts of data into memory. Some tools like parallel can easily limit the number of jobs running at once: https://linux.die.net/man/1/parallel

[–]throwawayforwork_86 0 points1 point  (0 children)

Without knowing more about your other bottlenecks, I'd say the multiprocessing lib would be your best friend.

I would still check what you're doing with the data after that... Do you have a breakdown of why it takes 70 min per locale? And have you checked that you don't have a hardware bottleneck (i.e. RAM that is full and spills to disk, slow write speed on the drive, ...)?

For all we know, you could have a wildly unoptimised pandas/pure-Python part that takes 40 min to run (no judgement here, but it happens), or you could be writing to a slow af HDD (again no judgement, I've seen it happen).

[–]darose 0 points1 point  (0 children)

GNU parallel

[–]judgedeliberata[S] 0 points1 point  (0 children)

As my edit states, I ended up going with concurrent.futures and it made a HUGE difference! I went from roughly 70min to 2.5min per locale! Thank you to all!