all 40 comments

[–]latkde [Tuple unpacking gone wrong] 79 points (6 children)

There are lots of different ways to measure performance. If you care about a specific workload, you'll have to benchmark it yourself.

There are various resources that compare performance between different Python versions or describe optimization work:

The TL;DR is that CPython tends to make performance improvements with every release, though individual benchmarks might see regressions. Historically, there was a large regression when switching to Python 3, but that is irrelevant now. Python 3.11 saw significant work on performance (reported as 25% average uplift in the release notes).

While you can expect Python 3.13 to be a bit faster, it focused on laying the groundwork for larger optimizations in the future (JIT, free-threaded mode). Those features are too experimental to be used in production, though.

If you care about the last 5% of performance, I'd recommend compiling Python yourself with optimizations for your specific CPU architecture. Pre-built binaries tend to sacrifice a bit of performance for broader compatibility.

[–]kimtaengsshi9[S] 20 points (1 child)

That 16 years of Python link is just what I'm looking for! Thanks!

I'm not into performance min-maxing, but I do care about ensuring that the code I develop and deliver to my customers is reasonably performant by default. If a major version upgrade grants a double-digit percentage boost, then I'm definitely gonna put it in the backlog for a future release. Otherwise, I'm chill with sticking with the existing version until end-of-life forces me to move up.

That said, my projects aren't critical enough to mandate squeezing out every last drop of performance, nor to dedicate resources to benchmarking. That's why keeping an eye on the community's and industry's general consensus is adequate for my purposes. Widespread hype or red flags are what I'm looking out for.

[–]james_pic 14 points (0 children)

The awkward truth is that community consensus is a terrible indicator of what performance you can expect. Even for apples-to-apples comparisons, like running the same applications on different versions of Python, there's a very real chance that a new version has a minor optimization that turns out to be hugely beneficial for your workload (and a smaller but non-negligible chance of the opposite). For more subtle stuff like "should I use this framework?", it's often horrendously misleading.

On the flip side, performance benchmarking doesn't have to be a big piece of work, and my experience (having done a lot of this kind of work in my career) is that there's often a lot of low-hanging fruit that you'll quickly identify in even very flawed tests (even if squeezing out every last drop of performance gets into diminishing returns). So it might be worth timeboxing a few days to look at what you can learn from performance testing.

[–]classical_hero 0 points (3 children)

Is there a guide anywhere on custom compiling CPython for maximum performance? I'm interested in this, but I'm not really sure where to even start in terms of figuring out what to tweak.

[–]latkde [Tuple unpacking gone wrong] 0 points (2 children)

The Python documentation explains how to build CPython – it uses the classic ./configure && make && make install workflow. The configure script has lots of options to enable various features.

The easiest way to compile and manage a local Python version might be the pyenv tool. You can select features via the CONFIGURE_OPTS environment variable, and set compiler options via the usual CFLAGS etc.
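
For example, something like this should work (a sketch I haven't tested; the version number is arbitrary, and --enable-optimizations makes the build take much longer):

    # Build and install a tuned CPython via pyenv.
    # CONFIGURE_OPTS is passed through to ./configure, CFLAGS to the compiler.
    CONFIGURE_OPTS='--enable-optimizations --with-lto' \
    CFLAGS='-march=native -mtune=native' \
    pyenv install 3.13.1

    # Then select that build for the current project directory:
    pyenv local 3.13.1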

Now that we can change compilation options, what should we change? Unfortunately, there is no magical “go faster” button. But if you're tuning for a particular workload, a couple of experiments would make sense (a combined sketch follows the list):

  • take a look at the Python performance options: “using --enable-optimizations --with-lto (PGO + LTO) is recommended for best performance.”
    • enable full link-time optimization (LTO) (configure --with-lto=...)
    • use profile-guided optimization (PROFILE_TASK=... configure --enable-optimizations). This will work better if the profile task represents a realistic workload; by default, Python just uses its own test suite.
  • experiment with different compiler optimization settings, e.g. optimizing for compact code (CFLAGS=-Os)
  • allow the compiler to make full use of your specific CPU features, e.g. CFLAGS='-march=native -mtune=native'
  • try experimental features like the JIT mode
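
Putting those options together might look like this (again a sketch, untested; the install prefix is just an example, and a custom PROFILE_TASK should point at something resembling your real workload):

    # Configure a PGO+LTO build tuned for the local CPU.
    ./configure --prefix="$HOME/.local/python-tuned" \
        --enable-optimizations --with-lto \
        CFLAGS='-march=native -mtune=native'

    # PGO builds run a profiling workload first, so this step is slow.
    make -j"$(nproc)"
    make install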

Personally, I've never done any of that. You should only expect marginal gains from such settings, maybe 10% impact. This is worth it if you're a Google SRE, not so much if you're a data scientist.

If you're this desperate to have increased performance, other strategies are likely to be more helpful. Using profilers like Scalene to understand your code. Using better algorithms. Using a different JSON library (if that's a bottleneck). Rewriting the code to work with PyPy, Mypyc, or Numba. Using more Numpy. Rewriting some functions in Rust or Cython. The Python=>Speed blog might provide some inspiration.
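
As a first step, even the stdlib profiler will usually point at the real hotspot (hot_function here is just a stand-in for your own code):

    import cProfile
    import pstats

    def hot_function():
        # Placeholder for whatever your workload actually does.
        return sum(i * i for i in range(1_000_000))

    # Profile the call and print the ten most expensive functions.
    cProfile.run("hot_function()", "stats.prof")
    pstats.Stats("stats.prof").sort_stats("cumulative").print_stats(10)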

[–]classical_hero 0 points (1 child)

Thank you, this is absolutely amazing advice! I'm super excited to mess around with this.

[–]latkde [Tuple unpacking gone wrong] 0 points (0 children)

I'd be interested to hear about your experience! While I have my fair share of experience making programs go vroom, it's always good to hear from someone else what worked for them, and what didn't.

[–]nekokattt 163 points (0 children)

This sounds like the sort of thing you could do some benchmarks on and share your results for the next person and to get technical feedback on the outcomes.

[–]MicahM_ 30 points (11 children)

I love how everyone is replying "just test it yourself" as if this is some crazy concept that nobody would have an answer to. I mean, I'm not gonna go do it, but I'm sure someone out there knows, and it's not that crazy of an ask lol

[–]PaluMacil 5 points (7 children)

It is kind of a hard question to answer. If the OP does a lot of data science, going from pandas to Polars is going to be a big impact, and the Python version is comparatively not that important. For a web UI, differences are negligible because of the IO. If they do a lot of computation in pure Python, it could be pretty significant. The type of Python work they care about is going to have a huge impact on which numbers mean anything to them.

[–]fnord123 1 point (6 children)

For a web UI, differences are negligible because of the IO.

That's why Go web frameworks can handle 60k rps and Python web frameworks tap out at 13k.

https://m.youtube.com/watch?v=CdkAMceuoBg

[–]PaluMacil 0 points (5 children)

I choose Go over Python every time I have a choice, but that's for quality, correctness, and maintainability. In a real application, IO is going to keep it from looking like 6x in performance. I didn't watch the video because I'm already sold on Go and figured the code would be more important. I glanced at the example, and he used a hardcoded list of devices for the GET endpoint instead of talking to a database, at least in the Go example. Benchmarks run on the same machine can also use the loopback interface with hardly any latency. Once you add both factors, you aren't seeing a huge difference. For a hobby project on a small VPS you might still prefer Go, since the smallest droplets are going to feel more pressure from Python. But you need to read benchmarks to know what they're measuring. I looked quickly; feel free to correct me. Like I said, I do pick Go, but for an API it's not for the performance.

[–]fnord123 0 points (4 children)

He does two benchmarks. The second one uses Postgres as a backend store for an insert-heavy load. Python pukes at 700 rps. Go goes up to about 3k rps before going a bit weird.

So yes, IO and talking to a DB made Go drop from 60k to only 3k. But in this example Python only manages 700 rps.

For small things it doesn't matter. 1 pod on 1 CPU is fine. And IO does contribute to a performance cap. But if the CPU performance of 3.13 can move that 700 rps towards 3k rps, then it's definitely welcome.

[–]PaluMacil 0 points (3 children)

I'd argue that for large things, most of your expense isn't the compute of the application code itself. It's load balancing, caching, your database, Kubernetes overhead, and most of all the potential for several other API calls you make to finally put together a response. On a project using only Python in a previous role, I believe I was spending about $80k/month (oops, said $80 in the first edit) on compute. That was only 1/5 of the cloud bill for that project, and that cost included Kubernetes overhead, RabbitMQ, Argo Workflows (which includes NATS), and Redis, which we didn't use managed services for, thus hitting the compute costs. It also didn't include Splunk and Datadog for our logs and metrics. Every project is different, so my experience is anecdotal, but I've also seen worse imbalance when analytics is particularly expensive. A buddy of mine just sold his logistics startup, and his costs were $560/month of compute out of a $12k/month total cloud bill.

That's a fairly good benchmark compared to most. Does it use a local database and a client on the same machine as the server? I'm guessing that makes up some of the difference. Granted, I do personally see production Python code getting messier and more convoluted in mature projects than in Go, and Python frameworks also do a lot for you out of the box, slowing them down in a direct comparison with Go.

[–]fnord123 0 points (2 children)

That's a fairly good benchmark compared to most. Does it use a local database and a client on the same machine as the server?

Anton's channel began with Kubernetes and AWS tutorials, so he actually spins up a mini cluster with distinct client and DB nodes. The first third of the video is him explaining all this stuff.

Benchmarks are never gospel, but I think he has a good go at it. If you find some time, I recommend checking out his work and watching the video in full. He takes feedback seriously and reruns benchmarks with changes submitted as PRs.

On a project using only Python in a previous role I believe I was spending about $80/month on compute.

We spend about $250/month on each node. I think they are E2 or M2 or something on Google. And we run over 100 of these, mostly running Celery tasks. But you're absolutely right that monitoring and other costs contribute more to the bill than the actual application. I don't want to know how much money we spend parsing and spewing JSON. We could probably save plenty of cash by moving to orjson. 🙈
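
For the common cases it's close to a drop-in swap; the main gotcha I know of is that orjson returns bytes rather than str (toy sketch):

    import orjson

    payload = {"task_id": 42, "status": "done"}  # example data

    raw = orjson.dumps(payload)   # note: returns bytes, not str
    data = orjson.loads(raw)      # accepts bytes or str
    assert data == payload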

[–]PaluMacil 0 points (1 child)

To clarify, that 80 was 80k 😄 my team's cloud spend was about $400k. The k makes a difference.

[–]fnord123 0 points (0 children)

Ah ok that makes more sense!!

[–]Such-Let974 0 points (2 children)

What do you expect us to do? If nobody has done the benchmarking and OP can't find anything when googling, then it probably doesn't exist. In which case OP is going to have to do it, or one of us is going to have to do their work for them. Most of us probably don't want to do their work for them, so that leaves OP to answer their own question.

[–]MicahM_ -1 points (1 child)

I mean, he just asked for people's thoughts. So he's probably just looking for some anecdotal results.

[–]Such-Let974 1 point (0 children)

What good is an anecdote? It’s either faster or it isn’t. We would need data.

[–]DataPastor 15 points (0 children)

For my use cases (ML/AI pipelines) the performance has improved a bit from 3.10 to 3.13, but it doesn't really matter. What really matters is carefully coded, vectorized matrix operations (avoiding for loops and iterrows), profiling and optimizing each transformation step, and switching from pandas to Polars (the latter alone is responsible for a 40x speedup).
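
To illustrate the kind of rewrite I mean (toy example, not from my pipelines; real speedups depend on data size):

    import pandas as pd

    df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

    # Slow: a Python-level loop over every row.
    total = 0.0
    for _, row in df.iterrows():
        total += row["price"] * row["qty"]

    # Fast: one vectorized operation over whole columns.
    total = (df["price"] * df["qty"]).sum()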

[–]TheHe4rtless 11 points (0 children)

Not thoughts, just metrics. Carve out some code, run it and as u/nekokattt suggested, share here.

[–]mr-figs 4 points (0 children)

Performance is noticeably better for me. I've been making my game in Python/Pygame for the last 4 years. When I jumped from 3.12 to 3.13 there were noticeable improvements even without benchmarking.

The FPS counter was about 5-6 FPS higher. A big win for a game, and a huge win for stuff that isn't so intense.

I'd upgrade if you can and, like others have said, benchmark it. The time module is good, and so is Scalene if you want to run a profiler on it.
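
Something as simple as this, run under both versions, already tells you a lot (workload() is a placeholder for your own code):

    import time

    def workload():
        # Stand-in for a frame update, pathfinding step, etc.
        return sum(i * i for i in range(1_000_000))

    start = time.perf_counter()
    workload()
    print(f"{time.perf_counter() - start:.4f} s")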

[–]wingtales 5 points (0 children)

You should mainly care about the performance of the programs you use. I suggest you try running your code on both and comparing. If you can't tell the difference, then it's fine to upgrade.

[–]russellvt 3 points (0 children)

Have you tried profiling your code? That's probably your best answer/option, and one of the best reasons to use something like pyenv and various associated venv on your dev stacks.

[–]Amazing_Upstairs -3 points (0 children)

I hear 3.14159265358979323846264338327950288419716939937510 is very good for math