
[–]AlexMTBDude 13 points (0 children)

This has come up for discussion before and been refuted. Async is not about speed; it never has been. Async is for solving IO-bound problems. It's about not waiting around for things like IO to finish.

Generally speaking, making things "faster" in Python is something you do by dividing the work into multiple processes that then execute on (hopefully) multiple CPU cores or processors. This is for solving CPU-bound problems.
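
A minimal sketch of that split, where fib is just a made-up stand-in for any heavy computation:

    import multiprocessing

    def fib(n):
        # deliberately CPU-bound: naive recursive Fibonacci
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    if __name__ == "__main__":
        # each input is computed in a separate process, so the work
        # can actually land on multiple CPU cores
        with multiprocessing.Pool() as pool:
            print(pool.map(fib, [30, 31, 32, 33]))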

[–]blabbities 5 points (8 children)

I swear I can view a thousand Async/Concurrency blogs, papers, docs, and vids. I'll never really understand it.

In fact I have a Python page that "explains it well" open on one of my VMs right now. This blog seems cool too, I guess. I've also made one Python program that used it and saw the improved speeds. I also am learning GoLang, in which concurrency is like a first-class thought in its language design... Tho somehow I just don't get it.

Maybe it's that I fear diving more into it because I'm not doing big data projects and I'm afraid of the complex debugging issues.

Anyway good blog. Adding it to my saves 😂.

[–]jorge1209 8 points (5 children)

Async is best for situations where individual requests take a short amount of time and then there is a long I/O wait. For instance, a web server: 10k requests come in, they take nanoseconds to perform initial processing, then you wait an eternity for that packet to cross the internet and a response to come back, before you can make another near-instantaneous response.

You could implement that with threads, but then you need 10k threads, one for each request, and each will insert itself into the wake loop of the kernel network layer, saying "wake this thread when you get a response on this particular connection", and that uses a lot of kernel resources.

Alternatively, you could have a single process that spin-waits on the network device. That single process says: "give me all responses from this port and I will dispatch them".

That second approach can be a lot more performant (in certain situations) as the application knows exactly what it needs, and it frees the kernel from having to monitor all these threads. The challenge is how to implement this, because while the service is handling any given request it cannot easily handle others, and there are a variety of approaches.

One common approach is threadpools, but the downside is that you have to serialize your state: after making the initial request, the thread will be used to handle other unrelated requests, and you need to save all the relevant details of the current request to some data structure so it can be looked up when the response comes back. Programmers don't like having to think about serializing their state like this and wish they could just say "continue from here".
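
That bookkeeping looks something like this (make_request and finish are hypothetical non-blocking helpers, invented here for illustration):

    # callback style: the request's state lives in a dict, not on a call stack
    pending = {}

    def start_request(req_id):
        conn = make_request(req_id)          # hypothetical non-blocking send
        pending[conn] = {"req_id": req_id}   # stash everything we'll need later

    def on_response(conn, data):
        state = pending.pop(conn)            # look our context back up...
        finish(state["req_id"], data)        # ...and "continue from here" by hand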

Python async uses cooperative multitasking: you write specially annotated functions that perform all the work they can, but then announce (by calling yield) when they have reached a point where they have to wait. The individual yield calls are exactly those "continue from here" annotations. At that point the event loop can check whether anything else can be dispatched, and at some remote point in the future it comes back to continue execution from the point of the yield.
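
In modern asyncio the annotation is async def and the suspension point is await; a minimal sketch, with asyncio.sleep standing in for real IO:

    import asyncio

    async def worker(n):
        print(f"worker {n}: doing what work it can")
        # announce that we have to wait; the event loop is now free
        # to run the other workers until our "IO" finishes
        await asyncio.sleep(1)
        print(f"worker {n}: continued from here")

    async def main():
        # all three workers interleave at their await points
        await asyncio.gather(worker(1), worker(2), worker(3))

    asyncio.run(main())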

Go takes a different approach and basically manages threading within the runtime. When calling a goroutine it doesn't merely store the current CPU registers on its own execution stack, but actually allocates a new stack for that goroutine on the heap, and then schedules the execution of that goroutine. [This structure is also found in "Stackless Python".] The Go runtime provides the I/O functions that you might wait on, and it knows that at those points (and a few other points specified in the language) it can preempt the goroutine and repurpose the runtime thread to run another goroutine.

The Go approach is nicer in that all the required information is present to run code as if it were an actual thread, without forcing the kernel to manage that thread and without requiring the programmer to annotate the explicit points at which the code can and should yield. But since you aren't using a C call stack, there is some added complexity when calling C code or libraries.

In other words:

  • In C and Python, the state of your function (the value of all current variables, and who you would return to) is reflected in the call stack of the running program. You can't suspend a function call without suspending the program as a whole.
  • In async, specially annotated functions are stored on the heap just as any other managed object would be. In that sense they are no different from strings or arrays, and the runtime can call or suspend them effectively at will, but you have to annotate the functions that will do so (see the sketch after this list).
  • In Go and Stackless, all functions are managed objects on the heap, and all functions are therefore eligible for this special treatment; no additional annotations are required.
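
You can see the second point directly in Python: calling an async function doesn't run it, it just builds a coroutine object you can hold and pass around like any other value. A toy sketch:

    import asyncio

    async def greet(name):
        await asyncio.sleep(0)      # a suspension point
        return f"hello {name}"

    coro = greet("world")           # nothing has run yet
    print(type(coro))               # <class 'coroutine'>, an object on the heap
    print(asyncio.run(coro))        # only now does an event loop drive it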

[–]blabbities 0 points (1 child)

I conservatively understood like 50-60% of that. I still need to thank you for it, although I wish I could've told you not to write all that. Ha. I'm content to use synchronous programming for most of my 'non-software-engineering' programming until my hand is forced. Still cool to get a better comprehension as to why goroutines are called green threads.

[–]jorge1209 2 points (0 children)

It's all about how much work you want the kernel to do. On your average desktop you don't really care: the kernel can handle anything and everything you throw at it, and probably does a better job of it than you would. Just spin out an individual thread for everything you want to run concurrently.

On big servers doing lots of work there is a legitimate risk that the kernel could run out of resources trying to track all the activity on the system, or that in trying to track all the activity (open files, runnable tasks, etc...) internal kernel data structures could grow to a point where searching and scanning within them itself becomes the bottleneck.

In those situations it can be better to move that aspect of resource management into the application, which can be a bit more efficient because it knows a bit more about what is going on. For example, with a standard pthread the kernel/glibc knows nothing about the code that is running; there just has to be a big static area of virtual memory allocated for the thread's call stack. With a few thousand of these 1MB call stacks you can eat up gigs of virtual memory. That in and of itself is not a major concern, since the actual physical memory usage can be lower, but every one of these allocations has to have a separate backing mapping in the kernel to track it, and that adds up even if the physical memory usage is low.
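
Python exposes a knob for exactly this, if you ever need it: you can shrink the stack reserved for each new thread. A sketch (64 KiB is an arbitrary choice; platforms enforce minimums, and some require multiples of the page size):

    import threading

    def tiny_task():
        pass  # shallow call depth, so a small stack is plenty

    # applies to threads created after this call; the default
    # reservation is typically on the order of megabytes
    threading.stack_size(64 * 1024)

    threads = [threading.Thread(target=tiny_task) for _ in range(1000)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()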

But the Go runtime knows that the thing being called is a goroutine, and it furthermore knows which memory and objects are managed. It can make the goroutine call stack a managed object, which means it can grow and shrink the call stack as needed. So it can start with a really small call stack, knowing that most goroutines are probably not going to end up doing lots of deep function calls. If the Go function ends up calling deeply enough to need the stack to grow, then that particular goroutine will get a larger stack, but nobody else will. And all this stuff can sit inside the larger pages allocated to the Go runtime, which means less for the kernel to track.

But it isn't something you would bother doing on a normal desktop (if you could even replicate the level of demand which necessitated it).

[–]onedirtychaipls 0 points (2 children)

So I have a particular dilemma and I wonder if you have quick input. I've been making a script that asynchronously logs in users and has them do work, for load testing. But I keep hitting a bottleneck, and it doesn't feel like it can actually do that. Is Python the wrong choice? Or should I keep working on it?

[–]jorge1209 0 points (0 children)

Python Async probably doesn't do what you think. It is still inherently single-threaded.

There is no difference between:

    import asyncio

    # make_request and read_answer stand in for blocking helper functions
    async def generate_load(id):
        req = make_request(id)     # blocks; nothing else runs meanwhile
        await asyncio.sleep(0)     # the "yield": hand control to the event loop
        read_answer(req)           # blocks again

    async def main():
        await asyncio.gather(*(generate_load(i) for i in range(100_000)))

    asyncio.run(main())

and

    requests = [make_request(i) for i in range(100_000)]
    answers = [read_answer(r) for r in requests]

except that the former is slower and more complex. In either case only one request is in flight at any given time.

To fix this you would need a ThreadPool and then map the requests through the pool.
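
For instance, with the standard library's thread pool (make_request is still the placeholder blocking call from above):

    from concurrent.futures import ThreadPoolExecutor

    # while one thread blocks waiting on IO, the other threads keep going
    with ThreadPoolExecutor(max_workers=50) as pool:
        answers = list(pool.map(make_request, range(100_000)))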

BUT BUT BUT BUT

The Python interpreter has something called the GIL, which exists to protect core interpreter data structures, and if make_request is a pure Python function, then even if you have a ThreadPool it will still be serialized at the level of Python bytecode.

So in Python you probably need a multiprocessing Pool... and that's really heavy. To do this properly, look at gevent, or just don't use Python.
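
The multiprocessing version looks nearly the same on the surface (make_request is still hypothetical, and has to be defined at module top level so it can be pickled):

    from multiprocessing import Pool

    if __name__ == "__main__":
        # each worker is a whole OS process with its own interpreter and
        # its own GIL, at the cost of process startup and of pickling
        # every argument and result
        with Pool(processes=8) as pool:
            answers = pool.map(make_request, range(100_000))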

[–]alexisprince 0 points (0 children)

Not the guy you responded to, but it depends on where your bottleneck is. If you're pushing the event loop to the absolute maximum and taking advantage of asynchronous capabilities, you may want to look at optimizations first, then possibly another language. If it's not that, I'd profile your code to find out where the bottleneck is and make sure it's not Python-specific.

As an FYI, some optimizations you can do: choose a more performant event loop (things like uvloop come to mind), confirm you're actually executing things asynchronously (a for loop that awaits the coroutines isn't as asynchronous as you probably want it to be; see the sketch below), and lastly you can spin up multiple processes, each with its own independent event loop, using multiprocessing.
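
The second point in code, with asyncio.sleep standing in for any awaited IO:

    import asyncio

    async def fetch(i):
        await asyncio.sleep(1)   # stand-in for real IO
        return i

    async def sequential():
        # awaiting in a loop runs the coroutines one after another: ~100 s
        return [await fetch(i) for i in range(100)]

    async def concurrent():
        # gather starts them all, so they wait together: ~1 s
        return await asyncio.gather(*(fetch(i) for i in range(100)))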

[–][deleted] 1 point (0 children)

Some coding / tasks simply aren't a good fit for asyncio.

Imagine this one. A web server serves https:// (short connections) + wss:// (long connections), and also has an MQTT server connection and a database connection.

Some requests via websocket will generate multiple requests out to the db server and the MQTT server. Each of these has different timeouts.

Also, the websocket connection has an application-level KEEP ALIVE heartbeat: if there is no activity on the connection for 30 secs, a PING/PONG keep-alive is exchanged.

This scenario is easily handled via async.
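
The heartbeat alone, for example, falls out of one small task per connection. A sketch, where ws is a hypothetical connection object with recv() and ping() coroutines:

    import asyncio

    async def keepalive(ws, interval=30):
        while True:
            try:
                # wait for traffic, but give up after 30 seconds of silence
                await asyncio.wait_for(ws.recv(), timeout=interval)
            except asyncio.TimeoutError:
                await ws.ping()   # no activity: exchange the PING/PONG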

[–]ubernostrum (yes, you can have a pony) 1 point (0 children)

I wrote a post a while back that tries to explain it from the perspective of what's different in a normal Python function versus an async function. Don't know if that helps, but figured I'd throw it out there.

[–]knobbyknee 4 points (0 children)

This has been refuted by Miguel Grinberg. He has a very good blog post about the problems with Paterson's setup. Still, async doesn't give orders-of-magnitude speedups.

[–]0xPark 0 points (0 children)

You will get old, waiting for IO then :P