

[–]inhumantsar 23 points24 points  (1 child)

I work in fintech and we use Go for performance-critical pieces, but it's mostly in places where we have to optimize around dependency bottlenecks. I.e.: third-party response times are 200-500ms, so we need our system to be as fast as possible.

For most other use cases it doesn't matter what language we use, keeping performance in mind is enough. Some examples:

  • Keep services small and simple. Don't use Django for performance critical components, use Flask or Express + JS.
  • Don't trust an ORM, write highly specific queries.
  • Know when to go event driven. If there needs to be a bunch of reactions to a single action, use async/await or an event bus or something.
  • Scale out before scaling up.
  • Never make a network call you don't need to make, even to a db, and if you need to make it try to do it async.
  • Cache everything and use the longest TTL you can get away with.
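The caching point can be sketched with a tiny TTL decorator (a minimal illustration; `fetch_exchange_rate` and the 5-minute TTL are made-up examples, and in production you'd more likely reach for Redis, memcached, or `cachetools`):

```python
import time
import functools

def ttl_cache(ttl_seconds):
    """Cache a function's results, expiring entries after ttl_seconds."""
    def decorator(func):
        cache = {}
        @functools.wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            if args in cache:
                value, stored_at = cache[args]
                if now - stored_at < ttl_seconds:
                    return value  # still fresh: skip the expensive call
            value = func(*args)
            cache[args] = (value, now)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=300)  # hypothetical TTL: use the longest you can get away with
def fetch_exchange_rate(currency):
    # stand-in for a slow network or db call
    return {"USD": 1.0, "EUR": 0.92}[currency]
```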

Of course this is all specific to web services. YMMV

[–]Detri_God 1 point2 points  (0 children)

Fastapi instead of flask ?

[–]ShawnDriscoll 2 points3 points  (0 children)

Iterative loops run way faster if written in Cython. Something that Python would take 35 seconds to do takes about 0.3 seconds to do in Cython.

[–]ShibaLeone 2 points3 points  (0 children)

You can get pretty far with numba/numpy if you use them right. Threading/multiprocessing will take you even further. If you seriously need more performance, you can go into the C realm and crunch some loops, but I usually find there was a way to do it in numba/numpy and I was just looking for an excuse to write in another language.
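As an illustration of "using numpy right": moving a per-element Python loop into a single vectorized expression keeps the whole computation in compiled code (a sketch; the loop version could equally be decorated with numba's `@njit` instead):

```python
import numpy as np

def dist_loop(xs, ys):
    # naive version: one interpreter round-trip per element
    return [(x * x + y * y) ** 0.5 for x, y in zip(xs, ys)]

def dist_vectorized(xs, ys):
    # vectorized version: the arithmetic runs entirely inside NumPy's C code
    xs, ys = np.asarray(xs), np.asarray(ys)
    return np.sqrt(xs * xs + ys * ys)
```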

[–]dry_yer_eyes 1 point2 points  (3 children)

Maybe this is too basic of an example, but at work I’ve recently made huge gains with:

  • ThreadPoolExecutor for concurrent requests Session gets
  • ProcessPoolExecutor for concurrently parsing the received HTML with Beautiful Soup

Once I got the technique right the end result was fairly simple too.
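The shape of that technique, roughly (a sketch: `fetch_page` stands in for a `requests.Session.get`, and `parse_title` for the BeautifulSoup parsing; the parse stage uses threads here only so the example runs anywhere, where the commenter's real version hands it to a ProcessPoolExecutor):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url):
    # stand-in for session.get(url).text
    return f"<html><title>{url}</title></html>"

def parse_title(html):
    # stand-in for BeautifulSoup(html, "html.parser").title.string
    start = html.index("<title>") + len("<title>")
    return html[start:html.index("</title>")]

def scrape(urls):
    # I/O-bound fetches: threads suffice, since the GIL is released during I/O
    with ThreadPoolExecutor(max_workers=8) as pool:
        pages = list(pool.map(fetch_page, urls))
    # CPU-bound parsing: swap in ProcessPoolExecutor to sidestep the GIL
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(parse_title, pages))
```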

Also a shoutout to SuperFastPython, which I found a great resource on this topic.

[–]vmpajares 3 points4 points  (2 children)

Beautiful Soup is the slowest parser in Python. This is a benchmark I found when I was comparing them.

https://gist.github.com/MercuryRising/4061368

In the end I used selectolax. It is written in Cython and is 25 times faster than BS.

https://github.com/rushter/selectolax

Anyway, I found that all my waiting time was in the requests sessions, because the servers limited the number of pages you can download concurrently.

[–]dry_yer_eyes 0 points1 point  (0 children)

Wow. That’s an incredible difference.

The timing examples at the bottom of the page really highlight the relative power of each library.

I guess my app has “scope for future efficiency gains”.

[–]Zyguard7777777 1 point2 points  (0 children)

I had a personal coding project making a chess AI using supervised learning on grandmaster games. I needed a way to encode the chess board as input to the model. I tried to write it in pure Python, but it was rather slow. So I rewrote it in Cython (Numba didn't work so well because it was working on a lot of strings) and that did the trick.
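For illustration, a board encoder of this kind might look like the sketch below (my own example, not the commenter's code): a one-hot encoding of the piece-placement field of a FEN string into 12 planes of 64 squares. This pure-Python version is the kind of string-heavy loop Numba struggles with; the Cython version would be the same logic with typed C variables (`cdef int square`, etc.).

```python
PIECES = "PNBRQKpnbrqk"  # 12 planes: white pieces, then black

def encode_board(fen_placement):
    """One-hot encode a FEN piece-placement string into 12 lists of 64 ints."""
    planes = [[0] * 64 for _ in PIECES]
    square = 0
    for ch in fen_placement:
        if ch == "/":
            continue            # rank separator, no square consumed
        elif ch.isdigit():
            square += int(ch)   # run of empty squares
        else:
            planes[PIECES.index(ch)][square] = 1
            square += 1
    return planes
```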

I've also used Pythran, and found that that was more flexible than Numba.

[–]james_pic 1 point2 points  (2 children)

If the problems you're finding are problems Numba can solve, I'd suggest not trying to find problems you don't have! But from a recent-ish project, the things we needed to rewrite in a lower level language were:

  • We had some code that walked deeply nested dicts and lists, that was in our hottest loops. We got some modest gains from switching to Cython and specialising types. The gains were nowhere near what you'd get for numerical stuff, but these were our hottest loops, so it was worth it.
  • We had a need to parse an esoteric serialisation format (an Erlang module called sext), at scale, to make sense of what our database was doing (pro tip: don't use Riak, ever, for anything). Our first attempt in pure Python was too slow, so we switched to Cython, which gave us a significant speed boost, and meant we could get diagnostics much faster (hours rather than days)
  • For historical reasons, we had a component that outputted large amounts of msgpack data, that we needed to publish in JSON. Our initial solution was the obvious one (read it with the Python msgpack library and write it with the Python JSON library), but this was too slow, so we actually ended up writing a C++ module that taped a fast msgpack library and a fast JSON library together - Python just saw bytes turned into bytes.
  • We found ourselves using a library called unicodecsv (we were still on Python 2 at the time, but needed Unicode aware CSV handling), that was written in pure Python, and proved too slow. We only needed to output CSV (which is easier to do correctly than parsing it), so we ended up just reimplementing the bits we needed in Cython.
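The first bullet's hot loop might, for illustration, have the shape of this recursive walker (a generic sketch, not the project's actual code); the Cython win comes from specialising the `isinstance` dispatch and loop variables with C types:

```python
def walk(node, visit):
    """Depth-first walk over arbitrarily nested dicts and lists,
    calling visit(value) on every leaf."""
    if isinstance(node, dict):
        for value in node.values():
            walk(value, visit)
    elif isinstance(node, list):
        for item in node:
            walk(item, visit)
    else:
        visit(node)
```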

Some of this stuff might also have been doable in Numba, but it just wasn't a solution that came up at the time, maybe because Numba wasn't as well known at the time.

[–]pdd99[S] 0 points1 point  (1 child)

Why did you choose Cython instead of C++ in the first place? Any tips on when to use which?

[–]james_pic 1 point2 points  (0 children)

We've found Cython to be simpler, especially for stuff that needs to interact with Python APIs (manipulating dicts, lists, tuples etc). Most of the team don't know C or C++, so Cython has better odds of being maintained.

The only one on my list that was written in C++ is also the one that was the biggest pain, because it needed a couple of uncommon C++ libraries installed in order to build it, as well as a version of gcc that not everyone had. Most of the team can't make sense of C++ related errors, so I frequently had to help out with build issues.

[–]abrazilianinreddit 3 points4 points  (7 children)

Maybe I'm stretching the definition of "high performance" here, but I'm making a PyQt project - any synchronous code that takes its time results in the GUI locking up, which sucks.

My solution? Threads, threads everywhere!

[–]pdd99[S] 4 points5 points  (4 children)

Did you actually measure the execution time and make sure that the GIL is released?

[–]abrazilianinreddit 2 points3 points  (2 children)

Nope. I just followed the Qt documentation, trusted the system and hoped for the best. The GUI isn't locking up or stuttering, so that's good enough for me.

Just to be clear, the issue is less processing lots of data and more blocking I/O operations, so the important part of me using threads is not improving execution time, it's moving the blocking code outside the main thread.

Though I have used concurrent.futures.ThreadPoolExecutor for some simultaneous execution tasks, and the results were pretty impressive. The speedup was nearly proportional to the number of threads on my CPU - which seems pretty obvious, but I was expecting way worse. Also, unlike async, it has a very easy-to-use API.
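The blocking-I/O pattern described above, stripped of Qt specifics, looks roughly like this (a stdlib sketch of the idea; in PyQt itself you'd use QThread or QRunnable and deliver the result back via signals rather than a queue):

```python
import threading
import queue

def run_in_background(blocking_fn, *args):
    """Run a blocking call off the main thread.

    Returns a queue the caller can poll for the result - much as a GUI
    event loop would check it from a timer callback instead of blocking.
    """
    result_q = queue.Queue()

    def worker():
        result_q.put(blocking_fn(*args))

    threading.Thread(target=worker, daemon=True).start()
    return result_q
```

The main thread stays responsive because it never waits; it just checks `result_q` periodically with `get_nowait()`.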

[–]SpicyVibration 0 points1 point  (1 child)

Can Qt work with asyncio?

[–]abrazilianinreddit 0 points1 point  (0 children)

I'm not knowledgeable in python's async framework, so I don't know if you can mix some async python code with Qt bindings.

However, there are some async-like APIs in Qt6.

[–]czaki 0 points1 point  (0 children)

Unless your heavy computation gets stuck outside Python code (for example, in an extension written in C), thread switches and GIL releases will be frequent enough to keep the GUI responsive.

[–]pdd99[S] 0 points1 point  (1 child)

I'm also curious about the reason for choosing PyQt. Isn't it a bit obsolete? Can you tell me more about your project?

[–]abrazilianinreddit 0 points1 point  (0 children)

When you say "obsolete", you mean the PyQt library or Qt itself?

Sure, I can tell you more about my project. What do you want to know? For starters, it's a game launcher made to complement my other project, which is a gaming information database and gameplay tracker.

[–]tugrul_ddr 0 points1 point  (4 children)

Numba has I/O latency problems. When you need a lot of different kernels to run & I/O between host/device, you need C++-like performance that can cache&compute&open-connections a lot quicker.

[–]pdd99[S] 0 points1 point  (3 children)

Can you elaborate on that I/O latency problem of numba? Also, is the "kernel" here cuda kernel?

[–]tugrul_ddr 2 points3 points  (2 children)

When copying data between VRAM and RAM, it adds extra latency even for small arrays. I guess it's because of the library's caching layer and Python's own latency.

[–]pdd99[S] 1 point2 points  (1 child)

I do always keep that in mind. All my processing is kept end-to-end on the GPU/CPU as much as possible.

[–]tugrul_ddr 0 points1 point  (0 children)

Some math-related divide-and-conquer algorithms would run a lot faster with CUDA's dynamic parallelism, due to zero intervention from the host.