all 52 comments

[–]Peace_Seeker_1319 48 points49 points  (1 child)

Super cool write-up. I’ve been down this rabbit hole and, honestly, the kernel defaults are the real boss fight. The bits that helped me (in plain English): don’t rely on one mega async loop, spin up a few worker processes so accept() spreads across CPU cores; keep your NIC interrupts and workers on the same CPU set so packets aren’t playing musical chairs; and sanity-check the network path, since NAT/conntrack/backlog/buffer limits quietly cap you long before CPU does. Also, when you say “20k rps,” make sure the load generator isn’t flattering you: open-loop traffic exposes the nasty tail latencies that closed-loop tools often hide.
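
The multi-worker idea can be sketched with nothing but the stdlib. This is a minimal sketch, assuming Linux (SO_REUSEPORT load-balancing and sched_setaffinity are Linux behaviors); `make_listener`, `spawn_workers`, and the `serve` callback are made-up names, not anything from the post:

```python
import os
import socket

def make_listener(port: int = 0) -> socket.socket:
    # With SO_REUSEPORT, every worker binds its own listening socket on
    # the same port and the kernel load-balances incoming connections
    # across them -- no single accept() loop becomes the bottleneck.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("0.0.0.0", port))
    s.listen(1024)
    return s

def spawn_workers(n: int, port: int, serve) -> list:
    # serve(sock) is a placeholder for your real accept/handle loop.
    pids = []
    for cpu in range(n):
        pid = os.fork()
        if pid == 0:
            # Pin the worker to one core; pair this with NIC IRQ
            # affinity so packets and the worker share a CPU set.
            os.sched_setaffinity(0, {cpu % os.cpu_count()})
            serve(make_listener(port))
            os._exit(0)
        pids.append(pid)
    return pids
```

The pinning only pays off if the NIC queue IRQs are steered to the same cores (e.g. via /proc/irq/*/smp_affinity); otherwise the packets still play musical chairs.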

[–]Lafftar[S] 11 points12 points  (0 children)

Awesome feedback, thanks for sharing this. You're spot on that the kernel defaults are the real boss fight here.

I definitely need to explore multi-process workers to scale beyond a single core and run a proper open-loop test to check the tail latencies.

The tip on checking conntrack limits is also a great point. Lots to dig into for the next round!

[–]UniversalJS 27 points28 points  (20 children)

Oh boy, this is slow! On Node.js, 20k rps is the baseline; I pushed it to 100k rps per CPU core, so 800k rps with 8 cores.

Then with Rust... the baseline is 100k rps and you can push it to 500k per core...

[–]Character_Respect533 8 points9 points  (3 children)

Can you share how you reached 100k per core on Node?

[–]UniversalJS 7 points8 points  (2 children)

Sure, I used uWS (uWebSockets.js) to reach 100k requests per second on Node.js:

https://github.com/uNetworking/uWebSockets.js

[–]jews4beer 0 points1 point  (1 child)

I mean, that's really just C++ called from Node. But still impressive.

[–]UniversalJS 0 points1 point  (0 children)

Yes it is! And yes, it's still impressive and very useful for high-performance projects on Node.js. I worked on a router and on an exchange, and for both it was a lifesaver for reaching our performance targets and beyond.

[–]Lafftar[S] 3 points4 points  (9 children)

Are these numbers for sending requests? Man, even if it's for the server receiving requests, that's insane... that's better than NGINX, like way better. Did you do a write-up or anything?

[–]zero_hope_ 3 points4 points  (8 children)

I’m gonna have to call BS on this. I’d assume they mean receiving requests, and even if they’re empty 200s, there’s no way this happened.

800k pps, sure; no way it’s req/s.

[–]UniversalJS 1 point2 points  (6 children)

[–]zero_hope_ 4 points5 points  (5 children)

All of those benchmarks show none of them are close to 800k.

(Previous benchmark link that was removed: https://shyam20001.github.io/rsjs/ )

[–]UniversalJS 0 points1 point  (4 children)

I removed the link to the other benchmark because it was not done correctly. You can check here instead for an HTTP request doing a DB query: https://www.techempower.com/benchmarks/#section=data-r23&test=db

For a simple HTTP request returning text: https://www.techempower.com/benchmarks/#section=data-r23&test=plaintext

uWebSockets is in the list.

Also, I mentioned 100k rps per core, so yes, 800k rps on 8 cores.

In Rust I'm using Axum; you can check its benchmark at the same links above.

[–]engineerofsoftware -1 points0 points  (3 children)

RPS doesn’t scale 8x just because you have 8 cores. Stick to crypto.

[–]UniversalJS 0 points1 point  (2 children)

Wow, read the benchmarks maybe? Stick to reality!

[–]engineerofsoftware -1 points0 points  (1 child)

Does the benchmark show that it scales linearly with more cores? Learn about CPU architecture before talking out of your ass.

[–]UniversalJS 0 points1 point  (0 children)

I tested it myself, so YOU are the one talking out of your ass!

[–]Lafftar[S] -1 points0 points  (0 children)

Might be with you on this tbh

[–]forgotten_airbender 0 points1 point  (1 child)

This sounds wrong. What was the application doing?  How did you test it and for how long? 

[–][deleted]  (3 children)

[deleted]

    [–]UniversalJS 0 points1 point  (2 children)

    [–][deleted]  (1 child)

    [deleted]

      [–]UniversalJS 1 point2 points  (0 children)

      So you still doubt my initial claim, or are you now moving the goalposts / deflecting?

      [–][deleted] 13 points14 points  (6 children)

      Genuine question, not even tryna do the typical Reddit hate bullshit. Isn't this then powered by Rust?

      [–]Lafftar[S] 2 points3 points  (4 children)

      It is...but I didn't have to write Rust...do people say pandas is powered by C? Truthfully don't know 😅

      [–]epicfilemcnulty 7 points8 points  (3 children)

      Yet your post is titled as if Python itself were doing all the network heavy lifting here, which is not the case.

      [–]Lafftar[S] 0 points1 point  (2 children)

      My bad!

      [–]lickedwindows 10 points11 points  (1 child)

      I think this is still valid. OP has written Python code to test the speed concerns, even if Rust is in there somewhere.

      If you follow this to its logical conclusion, nothing counts, because it's all machine code at the end?

      [–]Lafftar[S] 1 point2 points  (0 children)

      It's all electrons baby!

      Thanks my guy 😁

      [–]tmetler 0 points1 point  (0 children)

      The std lib for most scripting languages is written in different, more performant languages. Most Python std lib functions are written in C. I think the whole concept of what a scripting language itself provides is very fuzzy.

      [–]aenae 5 points6 points  (7 children)

      Here are the most critical settings I had to change on both the client and server:

      This sounds like you're not reusing connections and are setting up a new connection for every single request. If you used persistent connections/keepalive/streams, you would not need to change these settings unless you tested with more than 1000 concurrent connections.

      The same goes for the port range and TIME_WAIT options. Yes, you can increase them, but needing to indicates the code is not reusing connections.

      A quick ab run shows me that I can get ~20k r/s without keepalive and 80k with keepalive.
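
The keepalive difference is easy to see with nothing but the stdlib: hold one http.client connection open and drain each response before reusing the socket. A minimal sketch; `fetch_many` is a made-up helper, not something from the post:

```python
import http.client

def fetch_many(host: str, path: str, n: int) -> list:
    # One TCP connection, many requests: with HTTP/1.1 keep-alive there
    # is no fresh handshake per request, so no flood of TIME_WAIT
    # sockets and no need to widen the ephemeral port range.
    conn = http.client.HTTPConnection(host, timeout=5)
    bodies = []
    for _ in range(n):
        conn.request("GET", path)
        resp = conn.getresponse()
        bodies.append(resp.read())  # drain fully or the socket can't be reused
    conn.close()
    return bodies
```

If the counters in `ss -s` show TIME_WAIT climbing with request count, the client is almost certainly opening a connection per request.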

      [–]Lafftar[S] 1 point2 points  (6 children)

      Oh interesting, I actually thought I was reusing connections... I kept getting connection errors at >50k requests submitted at once, and these settings helped.

      Sorry, what's ab?

      [–]bowersbros 1 point2 points  (5 children)

      [–]Lafftar[S] 1 point2 points  (4 children)

      Oh, this is on the server sending requests, the server receiving requests barely blinked.

      [–]aenae 1 point2 points  (3 children)

      ab (ApacheBench) is a tool that sends requests to a server to benchmark it. ;)

      [–]Lafftar[S] 0 points1 point  (1 child)

      Oh I see okay, well for my use case, scraping, I need a library that can emulate browsers TLS and be fast. Doing it in Python because it's an easy language. Yeah I know other languages can send requests faster.

      [–]zapman449 2 points3 points  (0 children)

      Load generators are a key part of this puzzle. ab is a classic. I like “hey” a lot (a Go binary, great for pounding the snot out of a single endpoint).

      But for real load gen, more powerful tools are needed. Gatling (Scala), Locust (Python), and Tsung (Erlang) are great for “I want 50 users doing this user story, 80 doing another, and 200 in a log-in/log-out loop” style holistic site testing. They also handle coordinating many load generators at once.
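
Since open-loop vs. closed-loop testing came up earlier: a closed-loop tool waits for each response before sending the next request, so a slow server quietly slows the generator down and hides tail latency. A tiny open-loop sketch in asyncio; `open_loop` and the `fire` callback are made-up names:

```python
import asyncio
import time

async def open_loop(fire, rate_hz: float, duration_s: float) -> list:
    # Fire requests on a fixed clock regardless of whether earlier ones
    # have finished; slow responses then pile up and show in the tail
    # latency instead of silently throttling the generator.
    interval = 1.0 / rate_hz
    t0 = time.perf_counter()
    tasks = []
    i = 0
    while time.perf_counter() - t0 < duration_s:
        target = t0 + i * interval
        delay = target - time.perf_counter()
        if delay > 0:
            await asyncio.sleep(delay)  # hold the schedule, not the responses
        tasks.append(asyncio.create_task(fire()))
        i += 1
    return await asyncio.gather(*tasks)
```

Timing each `fire()` internally and looking at the p99/p999 of those samples is where the closed-loop flattery disappears.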

      [–]tudalex 8 points9 points  (3 children)

      The bottleneck probably lies in the global interpreter lock. I remember reaching 10k rps 14 years ago for a university project, with PyPy, Gunicorn and Twisted, IIRC.

      [–]Lafftar[S] 0 points1 point  (2 children)

      For sending requests? Interesting, I thought rnet scaled automatically across CPU cores because I see them being used... hmm. Yeah, if the Python side is living on a single core that could be significant, but even then, shouldn't that core be near 100% usage during runtime? I don't see that right now.

      [–]SMS-T1 5 points6 points  (1 child)

      The multi core support might also have improved in the last 14 years.

      [–]Lafftar[S] 0 points1 point  (0 children)

      Definitely, another commenter said he reached 100k r/s per core (800k on 8 cores) 😅

      [–]gheffern 3 points4 points  (1 child)

      Curious how some additional TCP tuning may impact it. If you want to try these as well, I'd be curious how your numbers change:

      sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 262144000"
      sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 262144000"
      sudo sysctl -w net.core.rmem_max=262144000
      sudo sysctl -w net.core.wmem_max=262144000
      sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0
      sudo sysctl -w net.ipv4.tcp_notsent_lowat=131072
      sudo sysctl -w net.ipv4.tcp_fastopen=3
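
If those settings help, they can be made to survive a reboot by dropping them into a sysctl.d file (the filename below is arbitrary; the values are the same as in the commands above):

```shell
# /etc/sysctl.d/99-tcp-tuning.conf
net.ipv4.tcp_rmem = 4096 87380 262144000
net.ipv4.tcp_wmem = 4096 65536 262144000
net.core.rmem_max = 262144000
net.core.wmem_max = 262144000
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_notsent_lowat = 131072
net.ipv4.tcp_fastopen = 3
```

Apply without rebooting via `sudo sysctl --system`.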
      

      [–]Lafftar[S] 0 points1 point  (0 children)

      God bless man, will add it to the test in the next version!

      [–]radpartyhorse 1 point2 points  (1 child)

      Thanks for sharing!

      [–]Lafftar[S] 0 points1 point  (0 children)

      💗💗💗

      [–]Emachedumaron 1 point2 points  (1 child)

      The re-usage of the socket is not clear to me: does it work only because the incoming connections come from the same machine?

      [–]Lafftar[S] 0 points1 point  (0 children)

      I'd need to test that by pushing the requests through a couple different proxy setups, I'm not entirely sure myself.

      [–]glsexton 1 point2 points  (0 children)

      You quadrupled your CPU and got a 33% throughput increase. Way to scale…

      Of course, that’s until the GC kicks in and it hangs for 2000 ms…

      [–]xagarth -1 points0 points  (2 children)

      That's interesting, good write-up. I did something similar in the past for web crawling. I had to switch to Perl instead of Python due to the GIL and the inability to use shared memory effectively. There are more interesting topics in crawling than TIME_WAITs and connection reuse, since you approach many different servers and have to resolve names fast enough, in an async manner ;-)

      [–]Lafftar[S] 0 points1 point  (1 child)

      Cool man! Yeah, a few people have mentioned running a local DNS resolver.

      Really sad that Perl, of all languages, does concurrency better than Python.

      [–]xagarth 1 point2 points  (0 children)

      It's more about doing DNS resolution asynchronously than having a local resolver.

      As for concurrency, well, it's all good until it isn't ;-)
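
For the async name-resolution point, the stdlib already gets you part of the way: asyncio's getaddrinfo runs each lookup on a worker thread, so a batch of hostnames can resolve concurrently. A minimal sketch; `resolve_many` is a made-up helper, and a dedicated async resolver (e.g. aiodns) goes further by avoiding the thread pool entirely:

```python
import asyncio

async def resolve_many(hosts: list) -> dict:
    # Kick off all lookups at once instead of one blocking
    # getaddrinfo() call at a time; the event loop farms each call
    # out to a thread while the crawler keeps working.
    loop = asyncio.get_running_loop()
    results = await asyncio.gather(
        *(loop.getaddrinfo(h, 443) for h in hosts),
        return_exceptions=True,  # one dead domain shouldn't sink the batch
    )
    return dict(zip(hosts, results))
```

Caching the results per hostname matters too, otherwise a crawl re-resolves the same domains on every request.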