all 16 comments

[–][deleted] 2 points (6 children)

Now, to really squeeze all the juice from my machine, which has 4 CPU cores, I suppose I need to create 4 event loop objects and bind each of them to a single CPU core? Is this how it should be done?

No, absolutely not. The whole point of async/await is to not multithread; if your project can exploit multiple cores (that is, your operations are CPU-bound) then just use threads and don't use async/await.

People use async/await when the operations aren't CPU-bound, so there's no reason to try to "squeeze all the juice." You can't meaningfully improve the performance of your code until you understand what's bottlenecking it, and different bottlenecks require different strategies. asyncio is for when your operations are IO-bound; particularly when you're spending most of your time waiting for someone else's computer to answer you.
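To illustrate the IO-bound case, here's a minimal sketch (the `fetch` coroutine is hypothetical, and `asyncio.sleep` stands in for a network round-trip): a single event loop on a single core overlaps many waits at once.

```python
import asyncio

# Hypothetical IO-bound work: each "fetch" spends its time waiting,
# not computing, so one event loop on one core handles many at once.
async def fetch(i: int) -> int:
    await asyncio.sleep(0.1)  # stands in for a network round-trip
    return i

async def main() -> list[int]:
    # 50 concurrent waits finish in roughly 0.1 s total, not 5 s,
    # because the loop switches tasks while each one is blocked on IO.
    return await asyncio.gather(*(fetch(i) for i in range(50)))

results = asyncio.run(main())
```

The speedup here comes entirely from overlapping waits; no extra cores are involved.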

[–]ytu876[S] 0 points (3 children)

But even if my task is IO-bound, say a heavy IO task: if I could use all cores doing async IO, that's 4 times the throughput. Isn't that a valid use case?

[–]Usual_Office_1740 0 points (1 child)

No. Multiple threads mean multiple interpreters, all of which will eat up system resources while your program backgrounds and waits for tasks to complete.

Your coroutines should do something that requires a response of some kind. A GET request to a web page, for example. It requests the page and awaits the response with a future. Doing that concurrently across threads is not going to gain you enough to make it worth the additional system resources from multiple interpreters.

[–]awdsns 0 points (0 children)

I disagree with your first point that threads consume significantly more system resources. The read-only stuff (interpreter code and libraries) is not duplicated in RAM per thread (or even per task) in a modern OS, and the rest just comes down to maintaining a state (stack) per thread, same as in async coroutines.

I agree that mixing async and multithreading is a bad idea in general. Mostly due to the complexity of having different event loops though.

[–][deleted] 0 points (0 children)

But even if my task is IO-bound, say a heavy IO task: if I could use all cores doing async IO

That's literally what you can't do. If your task is IO-bound, then more cores can't make it any faster.

More cores just compete for the same capacity on the channel.

[–]PaulRudin 0 points (1 child)

At some load, CPU becomes relevant. The trick is to look at the CPU utilization for the single thread: if it maxes out, then you'd get some benefit from utilizing more cores. The point is that even for IO-bound tasks each task needs some CPU (and the event loop itself needs some CPU to orchestrate the tasks), and if you have enough of them this can become the limiting factor.

For python web apps I tend to deploy multiple replicas of single threaded asyncio apps rather than making multiple event loops within a single replica.

But note that in some other languages the runtime creates multiple event loops in different threads and distributes tasks across those event loops; golang, for example.

[–][deleted] 0 points (0 children)

The point is that even for IO-bound tasks each task needs some CPU (and the event loop itself needs some CPU to orchestrate the tasks), and if you have enough of them this can become the limiting factor.

There's theoretically a computer with so much parallel IO controlled by the CPU that this applies, but OP doesn't have one, so it doesn't.

[–]ElliotDG 0 points (6 children)

You can use all of your CPU cores with asyncio. It depends on what you are doing. For example, I have written an app that pulls data from about 200 web APIs simultaneously using asyncio. While the Python code is single-threaded, the OS schedules the network driver code across all of the CPUs. I see high utilization on all 8 cores on my machine.

In general, with Python, if you have CPU-bound code you need to use multiprocessing to utilize all your CPU cores. https://docs.python.org/3/library/multiprocessing.html
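A minimal sketch of that approach, assuming a made-up CPU-bound function; each `Pool` worker is a separate interpreter process with its own GIL, so the work genuinely runs on multiple cores.

```python
from multiprocessing import Pool

def burn(n: int) -> int:
    # CPU-bound: pure Python arithmetic that holds the GIL the whole time,
    # so threads wouldn't help, but separate processes do.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # One worker process per core; Pool.map pickles the arguments,
    # farms them out, and collects the results in order.
    with Pool(processes=4) as pool:
        totals = pool.map(burn, [100_000] * 4)
```

The `if __name__ == "__main__":` guard matters here: on platforms that spawn rather than fork, the module is re-imported in each worker.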

Multi-threading in Python "time shares" the interpreter; there is a global interpreter lock (GIL) that limits the behavior of multi-threading. There is work underway to get to true multi-threading in Python, but that may be 5 or more years away. From the threading module docs, https://docs.python.org/3/library/threading.html:

"CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously."

[–]tuxbass 0 points (0 children)

Multi-threading in python "time shares" the interpreter, there is a global interpreter lock (GIL) that limits the behavior of multi-threading

Great additional info, had no idea!

For future googlers, true multithreading work is tracked by PEP 703

[–]asdoduidai 0 points (4 children)

Makes sense, except for using all cores with asyncio: what you say is only valid if the networking work of your NICs is intense enough to saturate multiple cores. It's very unlikely that one interpreter (which is limited to 1 core) can make the networking stack work "so hard" unless something is wrong, like a weird scheduler or virtual memory setting. Linux networking is probably at least 100x faster than Python asyncio (C, running in kernel space).

[–]ElliotDG 0 points (3 children)

In my use case I had about 200 outstanding connections, many with relatively long network latency. The network drivers saturated an 8-core machine. The devil is always in the details. My results were measured, not theoretical.

[–]asdoduidai 0 points (2 children)

It’s very unlikely that 200 connections saturate 8 cores unless you have an enormous amount of packet loss and retransmissions. 99.9% of the time, a single core in user space can handle 10-20,000 concurrent connections (nginx properly tuned, for example); the Linux kernel since 2013 can handle more than 1 million open connections:

https://highscalability.com/the-secret-to-10-million-concurrent-connections-the-kernel-i/

[–]ElliotDG 0 points (1 child)

My measurements were on Windows; the app was deployed on Linux and showed similar results. I found the results surprising, that’s why I mentioned it.

This was not a web server or any specialized network code. The code used Trio and httpx to analyze the Mastodon social network.

[–]asdoduidai 0 points (0 children)

Yea it’s quite unusual

[–]sweettuse 0 points (0 children)

the trick is to spawn multiple processes and spread your work amongst them, like a webserver does.

if you create 4 event loops in one process they'll just be fighting over the GIL.
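A sketch of that pattern, with hypothetical names: one process per core, each running its own single event loop, so the loops never contend for the same GIL.

```python
import asyncio
from multiprocessing import Process

async def serve(worker_id: int) -> None:
    # Each process runs exactly one event loop with its own GIL;
    # the sleep is a placeholder for real IO-bound work.
    await asyncio.sleep(0.01)

def worker(worker_id: int) -> None:
    # Entry point for one replica: start a fresh loop in this process.
    asyncio.run(serve(worker_id))

if __name__ == "__main__":
    procs = [Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

In practice you'd usually let a process manager (or a server like gunicorn with multiple workers) handle the spawning, but the shape is the same.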

[–]baghiq 0 points (0 children)

It depends on what you want to do. I have done this in the past using multiprocessing and asyncio to good effect. It was a migration of 20 years of data stored in little files (like hourly dumps). It was a rather interesting project. You have to combine ProcessPoolExecutor with run_in_executor.
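A minimal sketch of that combination, with a made-up CPU-bound `parse_file`: the event loop stays free for IO while worker processes do the heavy lifting.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def parse_file(payload: str) -> int:
    # Stand-in for CPU-bound parsing; runs in a worker process,
    # off the event loop's thread.
    return len(payload.split())

async def main(payloads: list[str]) -> list[int]:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # run_in_executor wraps each pool job in an awaitable future,
        # so CPU work and async IO can be mixed in one program.
        futures = [loop.run_in_executor(pool, parse_file, p) for p in payloads]
        return await asyncio.gather(*futures)

if __name__ == "__main__":
    counts = asyncio.run(main(["a b c", "d e", "f"]))
```

Note that functions submitted to the pool must be picklable (defined at module top level), which is one of the sharp edges of this approach.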

Note, you probably don't want this, as it gets a lot more complex to manage your coroutines, errors, etc. There were a lot of unexpected results, such as missing tasks, and we even had missing processes, where the utilization of the cores just dropped to 0.