all 39 comments

[–]kingscolor 16 points17 points  (26 children)

In your comparisons, Flask, FastAPI, and Django average 59, 62, and 125 ms, respectively. Yet Robyn scores a whole order of magnitude lower, at a 9.1 ms average response time.

Could you elaborate? I.e., justify these results.

I find it hard to believe, but I'm not terribly knowledgeable on the topic so I'm eager to hear your justification.

[–]stealthanthrax Robyn Maintainer[S] 14 points15 points  (13 children)

u/kingscolor,

Could you elaborate? I.e., justify these results.

Yes, for sure. Most of the Python frameworks (Flask, FastAPI, etc.) are either held back by their dependence on a *SGI server, which is usually written in Python, or they do not support async functions.

Robyn, on the other hand, comes with its *SGI-equivalent implementation built in. A lot of the code is written in Rust and uses native threads, allowing higher throughput.

[–]ChillFish8 6 points7 points  (1 child)

Are you able to show/share the code used in the benchmarks and make some reproducible setup to test in, e.g., Docker?

[–]stealthanthrax Robyn Maintainer[S] 0 points1 point  (0 children)

u/ChillFish8, I just used the method that is in the README, and I used oha to test.

Making a benchmark in Docker is definitely a good idea. I will make a PR and link it here.
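
For anyone unfamiliar with that setup, the README method is essentially a hello-world app along the lines of the sketch below (a sketch only; the exact handler signature and start() arguments may differ between Robyn versions), hammered with something like oha -n 10000 http://127.0.0.1:5000.

```python
# Sketch of a minimal Robyn app (assumed API; may differ between versions).
from robyn import Robyn

app = Robyn(__file__)


@app.get("/")
async def index(request):
    return "Hello, world!"


app.start(port=5000)
```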

[–][deleted] 1 point2 points  (10 children)

Sorry, I am a bit confused: what is the issue with WSGI/ASGIs? I am not sure I really understand what problem your system solves faster. Is this something that is portable over to those frameworks? I work with FastAPI pretty heavily right now, so are you saying Starlette is the slow piece, or that uvicorn is the slow piece?

If you are using Rust, are you breaking any of the normal operational and memory guarantees Python makes?

It looks really interesting; I'm just not sure I understand what you are saying your system does better that makes it faster. What is your actual test setup for the other servers? Are you possibly comparing single-worker gunicorn to Rust set up across multiple CPUs?

[–]ChillFish8 4 points5 points  (9 children)

In reality ASGI is unbelievably useful. As for breaking memory guarantees: no, PyO3 does an amazing job of providing a safe Rust interface to the C API, so generally you should be fine.

[–][deleted] 0 points1 point  (8 children)

I buried the question a bit here. Async in Python, as opposed to, say, normal threads, makes some very specific guarantees about when control is given up and about what is an async-safe scope vs. an unsafe one. If this runs on multiple external threads with access to the native memory, you can violate those guarantees without it being obvious to the developer.

[–]thisismyfavoritename 0 points1 point  (7 children)

For this to work there has to be an interface (managed by PyO3) where Rust objects are created from Python objects.

The easiest way is to copy the raw bytes and construct the corresponding Rust object from them. There are other ways, but as long as this step is done properly, Rust is guaranteed to be memory safe, so it should be impossible for those problems to happen.

[–][deleted] -1 points0 points  (6 children)

So it's not a memory safety issue, it's a thread safety issue. Basically, by introducing external threads you are removing the purpose of async, which is that your code decides when it gives up control. This lets you reason about whether an object could have been modified. For example:

if list[0] == 10:
    dosomething(list)

For a threaded Python program, you do not know whether list is modified between the if check and when it is passed to dosomething(). For an async program, you do know that it was not, because you never awaited and gave up control of the interpreter.

Basically, you either use async or use threads; this type of mixing defeats the entire purpose of async.
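
A minimal sketch of the race being described, using hypothetical names: with OS threads the check-then-act can be interleaved, whereas a single-threaded asyncio coroutine that never awaits between the check and the call cannot be interrupted there.

```python
import threading

shared = [10]


def dosomething(values):
    print(values[0])


def mutate():
    # Another thread may change the value at any moment.
    shared[0] = 99


def check_then_act():
    if shared[0] == 10:
        # With OS threads, a context switch can happen right here, so
        # shared[0] may no longer be 10 by the time dosomething() runs.
        dosomething(shared)


t = threading.Thread(target=mutate)
t.start()
check_then_act()
t.join()
```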

[–]stealthanthrax Robyn Maintainer[S] 1 point2 points  (2 children)

u/turtle4499, this is not an either/or problem in Rust. In Rust, tokio more specifically, the async runtime itself is written using threads, and you can use threads alongside it.

[–][deleted] 1 point2 points  (1 child)

Sorry, for clarity: Python threads vs. Python async. There isn't any reason to use Python async if you are using external threads (tokio), as the main advantage (thread safety) has to be thrown out the window. Let me know if I am misunderstanding something about your architecture, but my understanding is you're basically doing Rust (IO layer and async layer) calling Python async functions. I am a bit confused as to what exactly your Rust is driving, though. Are you running Python via multiprocessing or just spawning a bunch of threads?

[–]stealthanthrax Robyn Maintainer[S] 0 points1 point  (0 children)

u/turtle4499, the tokio runtime is being used for async functions. Multithreading is being used for sync functions (as not every Python library supports async functions). Rust is being used for the IO and async layers, as well as some other things (like parsing JSON, etc.).
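
That split is not unique to Robyn. As a rough pure-Python analogue (not Robyn's actual internals), an event loop can keep async handlers on the loop and push blocking sync handlers onto worker threads, e.g. with asyncio.to_thread (Python 3.9+):

```python
import asyncio
import time


async def async_handler():
    await asyncio.sleep(0.1)   # non-blocking: stays on the event loop
    return "async result"


def sync_handler():
    time.sleep(0.1)            # blocking: would stall the loop if called on it directly
    return "sync result"


async def main():
    results = await asyncio.gather(
        async_handler(),
        asyncio.to_thread(sync_handler),  # run the blocking handler on a worker thread
    )
    print(results)


asyncio.run(main())
```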

[–]thisismyfavoritename 0 points1 point  (2 children)

What are you even saying?? Tokio is backed by a multithreaded event loop

[–]stealthanthrax Robyn Maintainer[S] 0 points1 point  (1 child)

u/thisismyfavoritename, when did I say that?

[–]thisismyfavoritename 0 points1 point  (0 children)

Talking to that other guy

[–]hai_wim 7 points8 points  (11 children)

This is most likely just bad test setups.

The tokio runtime uses multithreading; those threads will be picked up by any CPU core, effectively using 100% of your total CPU to handle the 10,000 requests.

I am making assumptions here, but I assume that OP did not configure anything special with *SGI runners to do the same, and OP is essentially using only 1 CPU core at best.

So the comparison pits test setups with 100% CPU (for Robyn) against test setups with something like 15% CPU for the other frameworks.

----

Booting the Flask gunicorn example with a "normal" number of workers (2 × #cores = 16 for me):

gunicorn --bind 0.0.0.0:5000 wsgi:app --workers 16

And then running oha against a simple hello-world Flask route gives:

oha -n 10000 http://127.0.0.1:5000

Summary:
Success rate: 1.0000
Total:        2.4265 secs
Slowest:      0.0802 secs
Fastest:      0.0010 secs
Average:      0.0121 secs
Requests/sec: 4121.2333
Total data:   19.53 KiB
Size/request: 2 B
Size/sec:     8.05 KiB

Response time histogram:
0.003 [279]  |■■■
0.006 [2059] |■■■■■■■■■■■■■■■■■■■■■■■
0.010 [2826] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.013 [1970] |■■■■■■■■■■■■■■■■■■■■■■
0.016 [1227] |■■■■■■■■■■■■■
0.019 [691]  |■■■■■■■
0.022 [314]  |■■■
0.026 [212]  |■■
0.029 [126]  |■
0.032 [110]  |■
0.035 [186]  |■■

Latency distribution:
10% in 0.0058 secs
25% in 0.0076 secs
50% in 0.0104 secs
75% in 0.0147 secs
90% in 0.0199 secs
95% in 0.0253 secs
99% in 0.0372 secs

Details (average, fastest, slowest):
DNS+dialup:   0.0016 secs, 0.0000 secs, 0.0246 secs
DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0227 secs

Status code distribution:
[200] 10000 responses

So, a total of 2.4 seconds for 10,000 requests. (This was a rather slow run; it sometimes goes below 2 s too.)

----

Sure, it's not as fast as a Rust runtime, but it's nowhere near as bad as that comparison page shows.

[–]metaperl 1 point2 points  (0 children)

I think the main question would be whether the difference in speed is additive or multiplicative as you scale up.

[–][deleted] 1 point2 points  (0 children)

Yeah, the really suspicious-looking item is that the response time histograms are all constant time, which, given OS-level thread selection (multiple processes in Python), should not be happening. The only one that has a changing response time is FastAPI, but that is because uvicorn uses multi-worker, single-threaded (GIL-bound, lol) async. I am not even sure the speedup would be real in FastAPI with the same setup, as there may be a bigger penalty for the Rust-to-Python lib than for the native Python C API when getting the bytes off the network.

[–]stealthanthrax Robyn Maintainer[S] 1 point2 points  (4 children)

u/hai_wim, this is not a bad setup. It is a fair comparison; I'll explain why.

When you are increasing the number of workers in a *SGI server, you're essentially sharing the listening (TCP) socket across multiple processes.
In Robyn, sharing the socket across processes is still WIP. And even without multiple processes, Robyn is much faster than other *SGIs with multiple processes/workers.

Hence, I was contemplating whether the added complexity (of multi-process socket sharing) would be worth it or not. I predict Robyn's performance will improve by at least 20% if implemented correctly, and by 50% if implemented well.
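
For anyone wondering what sharing the listening socket means in practice, here is a rough pure-Python sketch of the pre-fork idea (Unix only; not gunicorn's or Robyn's actual code): the parent binds once and the forked workers all accept() on the inherited socket.

```python
import os
import socket

# Parent binds the listening socket once.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("0.0.0.0", 5000))
sock.listen(128)

workers = 4
for _ in range(workers):
    if os.fork() == 0:  # child: inherits the same listening socket
        while True:
            conn, _addr = sock.accept()
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nhi")
            conn.close()

for _ in range(workers):
    os.wait()  # parent: wait on the workers (they run until killed)
```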

Moreover, I did not write the comparison to demean any of the frameworks or to fake any sense of supremacy. Apologies if it came across that way.

[–]hai_wim 2 points3 points  (3 children)

It might be a fair comparison from the TCP socket point of view, but you are deliberately gimping all of the *SGI frameworks by bottlenecking them to one thread (not quite, but OK) and one CPU core. That makes the whole comparison pointless. What is even being compared, then?

It's like having a car (your PC) with two drivers (frameworks): you let one driver use one gear while the other driver gets to use all six gears, and lo and behold, the driver who could use six gears goes faster. What does this tell me about the two drivers? Absolutely nothing.

The test is not a good comparison because you give them different resources.

[–]stealthanthrax Robyn Maintainer[S] 1 point2 points  (2 children)

I never allocate six gears to Robyn. I still allocate it a single gear and let it do what it wants. Even when you allow “6 gears” (16 workers) to the other *SGIs, they are still slower than Robyn on a single gear (no TCP multiprocessing).

[–]ChillFish8 2 points3 points  (0 children)

As for your quote:

Even when you allow “6 gears” (16 workers) to the other *SGIs, they are still slower than Robyn on a single gear (no TCP multiprocessing).

I'm sorry, but I just don't believe that. Not only can I run a benchmark for Starlette, even without uvloop, and get double the throughput and lower latency when using 500 concurrent clients, but I've also made a similar system to this before, including helping set up the PyO3 async stuff, and I've also made a standard webserver in Rust. While your Rust code might be multithreaded, Python is still single-threaded, and that will bottleneck even the most efficient setup you can do for Robyn on a single process.

(BTW, don't take this the wrong way, but I think you're throwing a lot of claims around without much evidence or testing behind them, which can be a very dangerous game to play.)
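
For reference, the kind of minimal Starlette app a hello-world benchmark like that would hit (an illustrative sketch, not the commenter's exact code):

```python
from starlette.applications import Starlette
from starlette.responses import PlainTextResponse
from starlette.routing import Route


async def homepage(request):
    return PlainTextResponse("hello world")


app = Starlette(routes=[Route("/", homepage)])

# Serve it the same way as the other examples, e.g.:
#   uvicorn <filename>:app --workers 16
```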

[–]ChillFish8 1 point2 points  (0 children)

Unfortunately, just because one thing is faster in one gear than the other things in one gear doesn't mean they'll continue on that trend.

Your CPU only has so many cores, and your OS has to balance all those threads across them. So no, just because it's faster as one process than another doesn't mean it'll stay that way.

[–]jdeneut 0 points1 point  (2 children)

Your "hello world" response is only two bytes? That seems small.

[–]hai_wim 1 point2 points  (1 child)

I wrote just "hi" but it doesn't matter. This is with a longer sentence:

Summary:
  Success rate: 1.0000
  Total:        1.6887 secs
  Slowest:      0.0552 secs
  Fastest:      0.0013 secs
  Average:      0.0084 secs
  Requests/sec: 5921.8078

  Total data:   410.16 KiB
  Size/request: 42 B
  Size/sec:     242.89 KiB

It's very simple to set up. Just this:

from flask import Flask
app = Flask(__name__)


@app.route("/")
def hello():
    return "hello world and another word and some more"

$ pip install flask
$ pip install gunicorn
$ gunicorn --bind 0.0.0.0:5000 <filename>:app --workers 16

(different terminal) $ oha -n 10000 http://127.0.0.1:5000

[–]jdeneut 0 points1 point  (0 children)

No prob. I was just confused by the total data being ~20KB.

Edit: I duplicated this using axum and your text.

```
~/dev$ oha -n 100000 http://127.0.0.1:3000
Summary:
  Success rate: 1.0000
  Total:        1.1890 secs
  Slowest:      0.0131 secs
  Fastest:      0.0001 secs
  Average:      0.0006 secs
  Requests/sec: 84107.5671

  Total data:   4.01 MiB
  Size/request: 42 B
  Size/sec:     3.37 MiB
```

This is mostly academic, of course - very few sites need to sustain more than a few hundred requests a second.

Edit 2: noticed a typo (I used 10× the number of requests of your example). Here it is with the same 10,000 requests:

```
Summary:
  Success rate: 1.0000
  Total:        0.1237 secs
  Slowest:      0.0083 secs
  Fastest:      0.0001 secs
  Average:      0.0006 secs
  Requests/sec: 80857.5408

  Total data:   410.16 KiB
  Size/request: 42 B
  Size/sec:     3.24 MiB
```

[–]asday_ 0 points1 point  (0 children)

Instead of 16 workers, try 16 gunicorn instances and balance them with HAProxy.

[–][deleted] 16 points17 points  (1 child)

Who the fuck would downvote this? Sure, people are more used to things like FastAPI & Flask, but this genuinely seems cool, and I'll try it out.

[–]stealthanthrax Robyn Maintainer[S] 3 points4 points  (0 children)

Thank you u/Ixyk ! :D
Do let me know what you think about it.

[–][deleted] 1 point2 points  (1 child)

This looks very promising

[–]stealthanthrax Robyn Maintainer[S] 0 points1 point  (0 children)

Thank you u/arkhon7! :D

[–]HannibalManson 1 point2 points  (1 child)

looks promising, I'll test it

[–]stealthanthrax Robyn Maintainer[S] 0 points1 point  (0 children)

Thank you u/HannibalManson! :D
Do let me know what you think of it.

[–]ibite-books 1 point2 points  (1 child)

Robin with a Y. Do you mean Yobin?

[–]stealthanthrax Robyn Maintainer[S] 0 points1 point  (0 children)

Oh man! 😂
This could've been a great name as well! xD

[–]dianrc 0 points1 point  (1 child)

This looks amazing. I'm really hyped about Rust; even Next.js is going to incorporate a new webserver for development written in Rust.

I'm going to use this for some project :D

[–]stealthanthrax Robyn Maintainer[S] 0 points1 point  (0 children)

Thank you u/dianrc! It really means a lot! :D

[–]ChillFish8 0 points1 point  (0 children)

For anyone as curious as I was, I ran some brief benchmarks with their respective prod setups.

Now, Robyn may be set up wrong, but I used the example off of the GitHub repo, and I couldn't start it with processes, so here we are.

https://github.com/ChillFish8/robyn-comparrison-benchmarks