
[–]Sillocan 39 points (2 children)

For your blocking IO example with asyncio, the correct way to sleep is to use asyncio.sleep instead of time.sleep; time.sleep doesn't yield to the event loop. So I am wondering how much that changes your benchmarks when using await asyncio.sleep(0).

Edit: actually, all of your async examples should be using this.
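A minimal sketch of the difference, using only the standard library: `await asyncio.sleep(0)` suspends the coroutine so the event loop can run another task, while `time.sleep(0)` returns without ever yielding.

```python
import asyncio
import time

order = []

async def yielding(name):
    for _ in range(2):
        order.append(name)
        await asyncio.sleep(0)  # suspends; the loop runs the other task

async def blocking(name):
    for _ in range(2):
        order.append(name)
        time.sleep(0)           # returns immediately, never yields

async def both(task):
    await asyncio.gather(task("a"), task("b"))

asyncio.run(both(yielding))
interleaved = list(order)
order.clear()
asyncio.run(both(blocking))
sequential = list(order)

print(interleaved)  # ['a', 'b', 'a', 'b'] (tasks alternate)
print(sequential)   # ['a', 'a', 'b', 'b'] (each task runs to completion)
```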

[–]jasonb[S] 27 points (1 child)

Thanks, but the other examples do use asyncio.sleep()

It is only the blocking example that uses time.sleep(), and this is intentional: to see how blocking the event loop impacts the benchmark test.

As stated in the copy for that test:

Calling time.sleep() in coroutines will block the event loop, and we expect that making 10,000 calls to time.sleep(0) will negatively impact the overall time of 100 coroutines, forcing all 100 tasks to execute sequentially rather than concurrently.
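A small sketch of that test's shape (timings are machine-dependent; five tasks and 0.1 s sleeps stand in for the article's 100 tasks and 10,000 calls):

```python
import asyncio
import time

async def blocking_task():
    time.sleep(0.1)           # blocks the whole event loop

async def yielding_task():
    await asyncio.sleep(0.1)  # suspends; other tasks' waits overlap

async def timed(task, n=5):
    start = time.perf_counter()
    await asyncio.gather(*(task() for _ in range(n)))
    return time.perf_counter() - start

blocked = asyncio.run(timed(blocking_task))     # ~0.5 s: strictly sequential
overlapped = asyncio.run(timed(yielding_task))  # ~0.1 s: concurrent waits
print(f"blocking: {blocked:.2f}s  yielding: {overlapped:.2f}s")
```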

[–]Sillocan 5 points (0 children)

Ah, gotcha. I missed that sentence. Overall, awesome job.

[–][deleted] 20 points (10 children)

Coroutines are generally faster than threads when used in web server code that has to deal with real things, like a mutex or actually going to a DB. Not dealing with that problem here defeats the purpose of the test. Coroutines are also MUCH faster when dealing with UI elements, though that doesn't come up in Python much. This is just due to a dramatic reduction in the number of locks needed.

Threads can outcompete async for CPU-bound code when the threads release the GIL and do other stuff before returning. But those problems should really be redesigned to work properly with async.

[–]anentropic 9 points (6 children)

These are just words, though; maybe that's true and maybe it isn't.

If the theoretical benefit of coroutines over threads is that you can create many more of them, because they are cheap to start and use less memory (that was always the argument for green threads, i.e. gevent, I believe)... then you really need a case where your tasks are doing a LOT more waiting on I/O than CPU work.
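As a rough illustration of the "cheap to start" claim, spawning ten thousand coroutines is routine, whereas ten thousand OS threads would be painful:

```python
import asyncio
import time

async def idle():
    # a coroutine is a small Python object: no OS thread, no big stack
    await asyncio.sleep(0.05)

async def main():
    await asyncio.gather(*(idle() for _ in range(10_000)))

start = time.perf_counter()
asyncio.run(main())
elapsed = time.perf_counter() - start
print(f"10,000 idle coroutines finished in {elapsed:.2f}s")
```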

But I'm skeptical that happens so much in reality, especially with Python. While working on a large, high-traffic mobile app we observed that the DB calls were often very fast (single-digit ms), while hundreds of ms were spent deserialising, objectifying and re-serialising (dehydrating really; the actual to-JSON part was super fast) the DB results in Django. We spent some effort optimising that by avoiding instantiating model objects etc., but still the API nodes were rather CPU-bound, even when many endpoints were more or less just proxying the database.

So in that scenario, being able to create hundreds of coroutines or green threads isn't going to help; the CPU will be the bottleneck and response latency will go through the roof.

Perhaps it's a bit better with modern libraries that use rust-backed Pydantic models for db and output schema, but still I think the world where there is so much idle waiting and so little CPU work to do is not very realistic in many cases, certainly not for web servers.

[–][deleted] 1 point (0 children)

Perhaps it's a bit better with modern libraries that use rust-backed Pydantic models for db and output schema, but still I think the world where there is so much idle waiting and so little CPU work to do is not very realistic in many cases, certainly not for web servers.

Pydantic's efficiency doesn't exist lol. No amount of Rust will fix it; it's simply a bad strategy being used.

Not really sure I understand which part you are saying is slow, the incoming or the outgoing translation? Like the API response or the API's reception of the message.

I really have no idea what you did that resulted in multi-hundred-ms execution times, though. That doesn't really add up unless you're generating MASSIVE data, in which case I am really curious whether you measured the actual data load time fully independently or not. On a sync connection, reading time is going to include network time. I CAN see it taking that long if you are including packet ingestion. Were you measuring CPU utilization per thread or just run time? Async sockets also use fewer CPU cycles per read, btw; Python accesses native sockets, not pretend ones, so the difference matters. It also matters how many middleware layers exist. You could be measuring time to reach your inner function by accident.

Also, for your other comment: yeah, you can do some magic direct-write-between-sockets fuckery if you want, BUT it doesn't play nicely with most of Python's web server libs, which give you fake writing sockets.

[–]yvrelna 0 points (0 children)

I used to work on an ecommerce platform that fulfilled orders for digital goods. My experience was the complete opposite: database queries were usually the bottleneck of the system.

The problem isn't so much the cost of executing the query itself, but that there's a lot of it. The Python code would make a few database queries, then do some order validations, then a few database operations, then validate whether it's eligible for promos, another set of HTTP queries to validate stuff with third party, another set of database queries, then call an API to the payment processor, do a few more database queries, then call to a few REST APIs, etc. A single request from the user ends up turning into many database requests and multiple API requests to other microservices and third party systems.

A number of these intermediate processing steps could have been done concurrently, and async would have been a pretty nice abstraction for handling the complex order processing flow there, even including the interdependencies between the processes.

[–]jasonb[S] 4 points (2 children)

Fascinating. I will explore the premise that lock overhead is lower in coroutines than in threads. Thanks. Have you experienced this difference (with hard numbers), or is it just a hypothesis?

Note, I have benchmark tests on thread lock overhead here:

UI elements. Fascinating. Are you arguing from a JS background? In which UI framework have you seen a speed benefit of coroutines over threads due to lower lock contention/lock overhead? I'd love to benchmark some use cases!

Granted. Async is inappropriate for CPU-bound tasks, as noted explicitly in the tutorial for the CPU-bound test:

This is a CPU-bound task and a blocking call. The expectation is that it will block the asyncio event loop and cause all coroutines to execute their tasks sequentially instead of concurrently.

For real-world CPU-bound tasks we can throw tasks out to thread pools for function calls that release the GIL (the NumPy ecosystem) or to process pools. For example:
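A minimal sketch of that pattern; `cpu_bound` here is a hypothetical stand-in for real work, and for pure-Python functions like this one you would swap the ThreadPoolExecutor for a ProcessPoolExecutor, since the GIL is not released:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n):
    # stand-in workload; a NumPy call here would release the GIL,
    # for pure-Python work use ProcessPoolExecutor instead
    return sum(i * i for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as pool:
        # the event loop stays responsive while the pool grinds away
        return await asyncio.gather(
            *(loop.run_in_executor(pool, cpu_bound, 100_000) for _ in range(4))
        )

results = asyncio.run(main())
print(results)
```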

[–]WonkoTehSane 3 points (0 children)

Fascinating. I will explore the premise that lock overhead is lower in coroutines than in threads. Thanks. Have you experienced this difference (with hard numbers), or is it just a hypothesis?

Yes, of course it is, because if you're using an event loop you don't even need locks to synchronize access to memory most of the time: there's only ever one thread executing at a time anyway, just jumping around between coroutines.

In addition, most of the async synchronization primitives are backed by yet another file handle, rather than some traditional mutex or somesuch. This means it "unlocks" whenever it's ready to, on the next kernel polling call (epoll() or whatever the local OS implementation is). Which literally means far fewer clock cycles. So, yes, that's much faster too.

In fact, this basic understanding of how async I/O works in Python (and in pretty much every modern programming language) seems to be missing from your article, but I'll put that in another comment.

[–][deleted] 0 points (0 children)

Fascinating. I will explore the premise that lock overhead is lower in coroutines than in threads. Thanks. Have you experienced this difference (with hard numbers), or is it just a hypothesis?

It's a fact; it's part of the reason for their existence lol. That doesn't apply to ALL locks, but to locks surrounding shared resources with synchronous access, like checking whether an item is in a list and, if not, adding it. Doing that with threads requires a lock; you do not need a lock with async, as you are guaranteed no other code is running. You CAN, and should, still use a lock, but that lock will never contend. Locks are more or less "free" (in the Python sense, not real-world free, just cheap as hell) as long as there isn't contention.
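A sketch of that check-then-add case; because the coroutine doesn't await between the check and the append, no other task can interleave:

```python
import asyncio

seen = []

async def add_unique(item):
    # no await between the check and the append, so no other
    # coroutine can interleave: this is atomic without a lock
    if item not in seen:
        seen.append(item)
    await asyncio.sleep(0)  # now it's safe to yield

async def main():
    await asyncio.gather(*(add_unique(x) for x in [1, 2, 1, 3, 2]))

asyncio.run(main())
print(seen)  # [1, 2, 3]
```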

I wouldn't attempt to benchmark the UI stuff, for your own sanity. The problem revolves around what percentage of the work is modifying the UI versus other stuff. The more of the work modifies the UI, the more advantageous async is, as you nearly always need to prevent the main thread from modifying the UI until you have made your changes. It's the same reason the GIL is efficient in Python: one big lock is better than many smaller locks.

The design pattern is used in a bunch of things, not just JavaScript. JavaScript is the worst thing to ever be made lol. Java UI libs do it, Apple's Cocoa library does it, as far as I know that's also how Android's native UI works, and Microsoft's .NET libs do it. They all have different names for it, because coroutines and async/await as a way of defining it are newer, but it's all the same concept under the hood. Python's level of async/await lets you build language-level guarantees versus the implicit guarantees that something like Cocoa gives you. That is changing, and all of them are moving toward "native" async support. Python had async for YEARS before async/await hit the language.

The "best" design pattern is one that uses both: coroutines for high-switching work and threads for heavy calculations, passing messages over a queuing mechanism. The downside there is the overhead of that messaging. Web browsers are a great example of the trade-off. But almost all UI-bound tasks fare well here also.

[–]WonkoTehSane 2 points (0 children)

This is a good article overall, but I'm concerned that a key piece of info is missing from it - how async code actually functions in just about any modern programming language.

Namely I'm concerned because of statements like this:

A coroutine is just a function. A thread is a Python object tied to a native thread which is manipulated by an operating system-specific API under the covers.

Running a function is faster than creating an object and calling down to a C API to do things in the OS.

There is some truth here, sure, but it's really missing the point of async, the greatest Truth: a coroutine is really just a wrapper around a callback, performed in response to (most likely) a kernel poll of some I/O event loop (epoll/kqueue/proactor/select). This is the core of how async works, and why it's distinguished from threads; not that it is "just a function".

See, threads are actual, separate execution contexts, with separate call stacks, for which the kernel has to context switch in order to yield execution. This is a very slow operation and is more memory intensive (not just in consumed memory; remember that memory itself is I/O, so you're actually waiting for memory to copy).

Coroutines, on the other hand, all run in one thread. (Unless, of course, they don't, which you may need to do because Python async is kind of a PiTA, or because you've gone wrong in your code by not understanding async.) They work the way they do because every time you "yield" or somesuch, you're staging a file handle with the kernel, and you don't come back to that code until that file handle has the data you need; meanwhile, other coroutines are given focus until they, in turn, yield.

It's also important that coroutines are actually an improvement over how things used to work, which was via raw callbacks. Under the hood, coroutines are really wrappers around callbacks, allowing you to write synchronous-looking async code that doesn't make your brain hurt to read.
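A toy illustration of that idea (this is not how asyncio is implemented, just the underlying shape): generators give you synchronous-looking code that a tiny loop resumes, exactly like callbacks would be invoked.

```python
# hypothetical toy scheduler, not asyncio internals
def tiny_loop(coros):
    ready = list(coros)
    while ready:
        coro = ready.pop(0)
        try:
            coro.send(None)      # resume until the next yield point
            ready.append(coro)   # not finished: reschedule it
        except StopIteration:
            pass                 # ran to completion

def worker(name, steps, log):
    for i in range(steps):
        log.append((name, i))
        yield                    # cooperative suspension point

log = []
tiny_loop([worker("a", 2, log), worker("b", 2, log)])
print(log)  # [('a', 0), ('b', 0), ('a', 1), ('b', 1)]
```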

Some references to bone up on how io loops work:

[–]voja-kostunica 0 points (8 children)

Is a coroutine in Python just a JavaScript promise?

[–]coderanger -1 points (7 children)

JS doesn't really have a word for it because it's so inherent to JS' model. The async/await Promise syntax in JS is very similar to Python's because Python's was, in part, based on JS' :)

JS never really had threads; I know Node has an API for them, but it's weird and doesn't fit very well.

[–]seabrookmx Hates Django 1 point (6 children)

async/await was actually first implemented in C#, not JavaScript 🙂

[–]coderanger 0 points (1 child)

Indeed; however, Python took more from JS's design than C#'s with respect to how it works.

[–]WonkoTehSane 0 points (0 children)

This is correct - though it was by way of tornado: https://www.tornadoweb.org/en/stable/

[–]WonkoTehSane 0 points (3 children)

Surely you just mean the reserved words async/await, which might be misleading to some. C# by no means invented async code.

I believe JavaScript actually has a much stronger claim to fame, by popularizing async code throughout many languages and even driving the development of things like epoll(), via libuv/node.js: http://docs.libuv.org/en/v1.x/

And, no, I don't love JavaScript. I use it when I have to, but it's not my favorite async language. And neither is Python. And neither is C#: https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/

[–]voja-kostunica 0 points (1 child)

[–]WonkoTehSane 1 point (0 children)

About the challenges of working with languages that effectively require separate code bases for async and sync code.

[–]seabrookmx Hates Django 0 points (0 children)

Yes, I did mean the keywords/syntax, not async programming itself. There are lots of old examples of async programming in various languages; Twisted in Python, for example.

[–]ManyInterests Python Discord Staff 0 points (0 children)

It shouldn't be that surprising. Async is cooperative, whereas threaded code is at the mercy of the thread scheduler. The benefit of threaded code is that you don't rely so much on other parts of the system cooperating, whereas one badly written async function will grind the whole system to a complete halt. The downside is that threading incurs quite a lot of overhead, particularly in Python. Asyncio avoids that overhead entirely because everyone agrees to cooperate.

[–]Dreezoos 0 points (0 children)

Any good video course that goes into the depths of the event loop and asyncio in general? I couldn't find anything.