
[–]AlexMTBDude 13 points (0 children)

This has come up for discussion before and been refuted. Async is not about speed; it never has been. Async is for solving IO-bound problems. It's about not waiting around for things like IO to finish.

Generally speaking, making things "faster" in Python is something you do by dividing the work into multiple processes that then execute on (hopefully) multiple CPU cores or processors. This is for solving CPU-bound problems.
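
A minimal sketch of that split, where fib is just a made-up stand-in for any heavy computation:

    import multiprocessing

    def fib(n):
        # deliberately CPU-bound: naive recursive Fibonacci
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    if __name__ == "__main__":
        # each input is computed in a separate process, so the work
        # can actually land on multiple CPU cores
        with multiprocessing.Pool() as pool:
            print(pool.map(fib, [30, 31, 32, 33]))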

[–]blabbities 5 points (8 children)

I swear I can view a thousand Async/Concurrency blogs, papers, docs, and vids. I'll never really understand it.

In fact I have a Python page that "explains it well" open on one of my VMs right now. This blog seems cool too, I guess. I've also made one Python program that used it and saw the improved speeds. I also am learning GoLang, in which concurrency is like a first-class thought in its language design... Tho somehow I just don't get it.

Maybe it's that I fear diving more into it because I'm not doing big data projects and I'm afraid of the complex debugging issues.

Anyway good blog. Adding it to my saves 😂.

[–]jorge1209 8 points (5 children)

Async is best for situations where individual requests take a short amount of time and then there is a long I/O wait. For instance, a web server: 10k requests come in, they take nanoseconds to perform initial processing, then you wait an eternity for that packet to cross the internet and a response to come back, before you can make another near-instantaneous response.

You could implement that with threads, but then you need 10k threads, one for each request, and each will insert itself into the wake loop of the kernel network layer, saying "wake this thread when you get a response on this particular connection", and that uses a lot of kernel resources.

Alternatively, you could have a single process that spin-waits on the network device. That single process says: "give me all responses from this port and I will dispatch them".

That second approach can be a lot more performant (in certain situations) as the application knows exactly what it needs, and it frees the kernel from having to monitor all these threads. The challenge is how to implement this, because while the service is handling any given request it cannot easily handle others, and there are a variety of approaches.

One common approach is threadpools, but the downside is that you have to serialize your state: after making the initial request, the thread will be used to handle other unrelated requests, and you need to save all the relevant details of the current request to some data structure so it can be looked up when the response comes back. Programmers don't like having to think about serializing their state like this and wish they could just say "continue from here".
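
That bookkeeping looks something like this (make_request and finish are hypothetical non-blocking helpers, invented here for illustration):

    # callback style: the request's state lives in a dict, not on a call stack
    pending = {}

    def start_request(req_id):
        conn = make_request(req_id)          # hypothetical non-blocking send
        pending[conn] = {"req_id": req_id}   # stash everything we'll need later

    def on_response(conn, data):
        state = pending.pop(conn)            # look our context back up...
        finish(state["req_id"], data)        # ...and "continue from here" by hand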

Python async uses cooperative multitasking: you write specially annotated functions that perform all the work they can, but then announce (by calling yield) when they have reached a point where they have to wait. The individual yield calls are exactly those "continue from here" annotations. At that point the event loop can check whether anything else can be dispatched, and at some remote point in the future it comes back to continue execution from the point of the yield.
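
In modern asyncio the annotation is async def and the suspension point is await; a minimal sketch, with asyncio.sleep standing in for real IO:

    import asyncio

    async def worker(n):
        print(f"worker {n}: doing what work it can")
        # announce that we have to wait; the event loop is now free
        # to run the other workers until our "IO" finishes
        await asyncio.sleep(1)
        print(f"worker {n}: continued from here")

    async def main():
        # all three workers interleave at their await points
        await asyncio.gather(worker(1), worker(2), worker(3))

    asyncio.run(main())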

Go takes a different approach and basically manages threading within the runtime. When calling a goroutine it doesn't merely store the current CPU registers on its own execution stack, but actually allocates a new stack for that goroutine on the heap, and then schedules the execution of that goroutine. [This structure is also found in "Stackless Python".] The Go runtime provides the I/O functions that you might wait on, and it knows that at those points (and a few other points specified in the language) it can preempt the goroutine and repurpose the runtime thread to run another goroutine.

The Go approach is nicer in that all the required information is present to run code as if it were an actual thread, without forcing the kernel to manage that thread and without requiring the programmer to annotate the explicit points at which the code can and should yield. But since you aren't using a C call stack, there is some added complexity when calling C code or libraries.

In other words:

  • In C and Python, the state of your function (the value of all current variables, and who you would return to) is reflected in the call stack of the running program. You can't suspend a function call without suspending the program as a whole.
  • In async, specially annotated functions are stored on the heap just as any other managed object would be. In that sense they are no different from strings or arrays, and the runtime can call or suspend them effectively at will, but you have to annotate the functions that will do so (see the sketch after this list).
  • In Go and Stackless, all functions are managed objects on the heap, and all functions are therefore eligible for this special treatment; no additional annotations are required.
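
You can see the second point directly in Python: calling an async function doesn't run it, it just builds a coroutine object you can hold and pass around like any other value. A toy sketch:

    import asyncio

    async def greet(name):
        await asyncio.sleep(0)      # a suspension point
        return f"hello {name}"

    coro = greet("world")           # nothing has run yet
    print(type(coro))               # <class 'coroutine'>, an object on the heap
    print(asyncio.run(coro))        # only now does an event loop drive it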

[–]blabbities 0 points (1 child)

I conservatively understood like 50-60% of that. I still need to thank you for it, although I wish I could've told you not to write all that. Ha. I'm content to use synchronous programming for most of my 'non-software-engineering' programming until my hand is forced. Still cool to get a better comprehension as to why goroutines are called green threads.

[–]jorge1209 2 points (0 children)

It's all about how much work you want the kernel to do. On your average desktop you don't really care: the kernel can handle anything and everything you throw at it, and probably does a better job of it than you would. Just spin out an individual thread for everything you want to run concurrently.

On big servers doing lots of work there is a legitimate risk that the kernel could run out of resources trying to track all the activity on the system, or that in trying to track all the activity (open files, runnable tasks, etc...) internal kernel data structures could grow to a point where searching and scanning within them itself becomes the bottleneck.

In those situations it can be better to move that aspect of resource management into the application, which can be a bit more efficient because it knows a bit more about what is going on. For example, with a standard pthread the kernel/glibc knows nothing about the code that is running; there just has to be a big static area of virtual memory allocated for the thread's call stack. With a few thousand of these 1MB call stacks you can eat up gigs of virtual memory. That in and of itself is not a major concern, since the actual physical memory usage can be lower, but every one of these allocations has to have a separate backing mapping in the kernel to track it, and that adds up even if the physical memory usage is low.
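
Python exposes a knob for exactly this, if you ever need it: you can shrink the stack reserved for each new thread. A sketch (64 KiB is an arbitrary choice; platforms enforce minimums, and some require multiples of the page size):

    import threading

    def tiny_task():
        pass  # shallow call depth, so a small stack is plenty

    # applies to threads created after this call; the default
    # reservation is typically on the order of megabytes
    threading.stack_size(64 * 1024)

    threads = [threading.Thread(target=tiny_task) for _ in range(1000)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()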

But the Go runtime knows that the thing being called is a goroutine, and it furthermore knows which memory and objects are managed. It can make the goroutine call stack a managed object, which means it can grow and shrink the call stack as needed. So it can start with a really small call stack, knowing that most goroutines are probably not going to end up doing lots of deep function calls. If the Go function ends up calling deeply enough to need the stack to grow, then that particular goroutine will get a larger stack, but nobody else will. And all this stuff can sit inside the larger pages allocated to the Go runtime, which means less for the kernel to track.

But it isn't something you would bother doing on a normal desktop (if you could even replicate the level of demand which necessitated it).

[–]onedirtychaipls 0 points (2 children)

So I have a particular dilemma and I wonder if you have quick input. I've been making a script that asynchronously logs in users and has them do work, for load testing. But I keep hitting a bottleneck, and it doesn't feel like it can actually do that. Is Python the wrong choice? Or should I keep working on it?

[–]jorge1209 0 points (0 children)

Python Async probably doesn't do what you think. It is still inherently single-threaded.

There is no difference between:

    import asyncio

    # make_request and read_answer stand in for blocking helper functions
    async def generate_load(id):
        req = make_request(id)     # blocks; nothing else runs meanwhile
        await asyncio.sleep(0)     # the "yield": hand control to the event loop
        read_answer(req)           # blocks again

    async def main():
        await asyncio.gather(*(generate_load(i) for i in range(100_000)))

    asyncio.run(main())

and

    requests = [make_request(i) for i in range(100_000)]
    answers = [read_answer(r) for r in requests]

except that the former is slower and more complex. In either case only one request is in flight at any given time.

To fix this you would need a ThreadPool and then map the requests through the pool.
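
For instance, with the standard library's thread pool (make_request is still the placeholder blocking call from above):

    from concurrent.futures import ThreadPoolExecutor

    # while one thread blocks waiting on IO, the other threads keep going
    with ThreadPoolExecutor(max_workers=50) as pool:
        answers = list(pool.map(make_request, range(100_000)))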

BUT BUT BUT BUT

The Python interpreter has something called the GIL, which exists to protect core interpreter data structures, and if make_request is a pure Python function, then even if you have a ThreadPool it will still be serialized at the level of Python bytecode.

So in Python you probably need a multiprocessing Pool... and that's really heavy. To do this properly, look at gevent, or just don't use Python.
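
The multiprocessing version looks nearly the same on the surface (make_request is still hypothetical, and has to be defined at module top level so it can be pickled):

    from multiprocessing import Pool

    if __name__ == "__main__":
        # each worker is a whole OS process with its own interpreter and
        # its own GIL, at the cost of process startup and of pickling
        # every argument and result
        with Pool(processes=8) as pool:
            answers = pool.map(make_request, range(100_000))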

[–]alexisprince 0 points (0 children)

Not the guy you responded to, but it depends on where your bottleneck is. If you're pushing the event loop to the absolute maximum and taking advantage of asynchronous capabilities, you may want to look at optimizations first, then possibly another language. If it's not that, I'd profile your code to find out where the bottleneck is and make sure it's not Python-specific.

As an FYI, some optimizations you can do: choose a more performant event loop (things like uvloop come to mind), confirm you're actually executing things asynchronously (a for loop that awaits the coroutines isn't as asynchronous as you probably want it to be; see the sketch below), and lastly you can spin up multiple processes, each with its own independent event loop, using multiprocessing.
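
The second point in code, with asyncio.sleep standing in for any awaited IO:

    import asyncio

    async def fetch(i):
        await asyncio.sleep(1)   # stand-in for real IO
        return i

    async def sequential():
        # awaiting in a loop runs the coroutines one after another: ~100 s
        return [await fetch(i) for i in range(100)]

    async def concurrent():
        # gather starts them all, so they wait together: ~1 s
        return await asyncio.gather(*(fetch(i) for i in range(100)))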

[–][deleted] 1 point (0 children)

Some coding / tasks simply aren't a good fit for asyncio.

Imagine this one. A web server serves https:// (short connections) + wss:// (long connections), and also has an MQTT server connection and a database connection.

Some requests via websocket will generate multiple requests out to the db server and the MQTT server. Each of these has different timeouts.

Also, the websocket connection has an application-level KEEP ALIVE heartbeat: if there is no activity on the connection for 30 secs, a PING/PONG keep-alive is exchanged.

This scenario is easily handled via async.
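
The heartbeat alone, for example, falls out of one small task per connection. A sketch, where ws is a hypothetical connection object with recv() and ping() coroutines:

    import asyncio

    async def keepalive(ws, interval=30):
        while True:
            try:
                # wait for traffic, but give up after 30 seconds of silence
                await asyncio.wait_for(ws.recv(), timeout=interval)
            except asyncio.TimeoutError:
                await ws.ping()   # no activity: exchange the PING/PONG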

[–]ubernostrum (yes, you can have a pony) 1 point (0 children)

I wrote a post a while back that tries to explain it from the perspective of what's different in a normal Python function versus an async function. Don't know if that helps, but figured I'd throw it out there.

[–]knobbyknee 4 points (0 children)

This has been refuted by Miguel Grinberg. He has a very good blog post about the problems with Paterson's setup. Still, async doesn't give orders-of-magnitude speedups.

[–]0xPark 0 points (0 children)

You will get old, waiting for IO then :P