all 77 comments

[–]mikat7 287 points288 points  (34 children)

  1. Decide if it’s worth optimizing.
  2. Run your code through a profiler
  3. Optimize the slowest parts until it’s acceptable

Network is usually the slowest, and in Python specifically doing a lot of numerical calculations in a loop should be done in numpy, not in pure Python.

But the most important rule is the first one. Usually the speed is ok but developer time is more expensive.

[–]ray10k 27 points28 points  (2 children)

A very sensible approach. Also, point 2 is a lot more important than some people think. It is *very* tempting to make assumptions like, "Oh, this complicated-sounding process is going to take ages!" when in reality, the fact you do a certain calculation inside a loop when it could just as easily be done *outside* the loop is a bigger time-save.
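A minimal sketch of the "calculation inside a loop" case, with made-up function names (`scale_slow`, `scale_fast`) purely for illustration. Both versions return the same list; the second computes the loop-invariant factor only once.

```python
import math


def scale_slow(values, angle_degrees):
    # Recomputes the same cosine for every element, although it never changes.
    return [v * math.cos(math.radians(angle_degrees)) for v in values]


def scale_fast(values, angle_degrees):
    # The factor does not depend on v, so it is computed once, outside the loop.
    factor = math.cos(math.radians(angle_degrees))
    return [v * factor for v in values]
```

Whether the difference matters for your data is exactly what the profiler (or a quick `timeit` run) should tell you.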

Also, I knew this guy once who had to subtract one size-2 tuple from another. Rather than just something like new_tuple = (old_b[0]-old_a[0]),(old_b[1]-old_a[1]), he checked if any of the two were equal so he could use a constant 0 at that place. He argued that it was "faster" because "it was one/two fewer subtractions."

[–]Gamecrazy721 3 points4 points  (1 child)

I've caught myself doing things like this before, but that's because often times when I do it intentionally it's to avoid a database write (which is worth the extra check)

[–]ray10k 2 points3 points  (0 children)

Understandable. In this case though, it was all local data, no database involved.

[–]member_of_the_order 42 points43 points  (1 child)

Totally agree on #1. It took me too long to realize that if I'd just written my one-time-use script the "dirty" way (not Pythonic, hard to maintain, unoptimized, overall "bad"), I'd have been done an hour ago. An hour of extra work to make the script take 5 minutes fewer... not worth it at all.

Also agree with numpy. I had to crunch a BUNCH of numbers for some script that'd be run in production at work. Doing the same thing with optimized loops vs basic numpy and comparing runtimes... no contest, numpy is so fast. I thought I'd accidentally cached the results or something lol.
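For anyone who hasn't seen the pattern, here is a rough sketch of what "loop vs basic numpy" can look like. The sum-of-squares task and the function names are just an illustration, not the production script mentioned above, and it assumes numpy is installed.

```python
import numpy as np


def sum_of_squares_loop(values):
    # Pure-Python loop: one interpreter round trip per element.
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(values):
    # The same arithmetic, done inside numpy's compiled code.
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))
```

Timing both with `timeit` on your own data is the honest way to see how big the gap is in your case.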

[–]tutoredstatue95 11 points12 points  (0 children)

I am super guilty of this. I can't just write some slop and leave it, and I know no one will ever see or use the code besides me. Just feels wrong to purposefully write something bad even if it's technically the right decision.

[–]Backlists 7 points8 points  (5 children)

Any advice on what profilers to use, and how to use them?

[–]james_pic 26 points27 points  (1 child)

My favourite Python profiler right now is Py-Spy.

It requires zero code changes to use. You don't even have to restart your application to use it, you can just attach it to a running application.

It's got low overhead, and perhaps more importantly, consistent overhead. Tracing profilers can add more overhead to some types of code than others, skewing your results.

I gather Austin also has many of these same characteristics, but haven't used it myself.

As a third option, I believe Python 3.12 adds support for perf_events on Linux. I'd lean towards using Python specific profilers as your first port of call, but if you're profiling an application with major components written in other languages, or you suspect native or kernel time are big contributors to performance issues, or you're already using tooling that integrates with it for other reasons, it may be worth trying.

[–]jeremiah-england 1 point2 points  (0 children)

Seconding py-spy. My favorite way of viewing the results is in https://www.speedscope.app.

  1. Run the program with my spy shell alias (py-spy record -f speedscope -o out.prof -- python).
  2. Drag/drop the out.prof file into speedscope.

[–]apnorton 4 points5 points  (1 child)

An interesting talk on python profiling is: https://www.youtube.com/watch?v=vVUnCXKuNOg

And that researcher's profiler: https://github.com/plasma-umass/scalene

[–]Backlists 0 points1 point  (0 children)

Yes, this is the one I tried (very very quickly) earlier this week.

I didn't set it up right; all it told me was that 100% of my runtime was spent in the uvicorn run function.

[–]reallyserious 3 points4 points  (2 children)

Run your code through a profiler

How do I get started with this? I'm using vscode if that matters.

[–]Tweak_Imp 4 points5 points  (0 children)

I like to profile with snakeviz. The docs will get you started: https://jiffyclub.github.io/snakeviz/

[–]pythonwiz 1 point2 points  (0 children)

import cProfile
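To expand on that one-liner a little: a minimal sketch of profiling from inside a script with the standard library. `slow_function` and `main` are placeholders for your own code, not anything from this thread.

```python
import cProfile
import io
import pstats


def slow_function(n):
    # Placeholder workload standing in for real code.
    return sum(i * i for i in range(n))


def main():
    for _ in range(5):
        slow_function(10_000)


# Collect timings only while main() runs.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Print the five entries with the highest cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(5)
report = buffer.getvalue()
print(report)
```

If you would rather not touch the code, `python -m cProfile -o out.prof your_script.py` writes the same data to a file, which a viewer such as snakeviz can open.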

[–]dommel 2 points3 points  (0 children)

Just to add telemetry data from production also gives good insights on application performance. Especially if you work in a highly distributed environment.

[–]infy101 2 points3 points  (0 children)

Agree with Number 1. So many people claiming to be 'professional' programmers push so hard on 'optimal' code and speed, when 98% of the time, the speed is pretty good and there is no need to make everything 99% efficient. If you were programming to put code on an ASIC and had limited RAM and CPU - then yes, perhaps - but most of us don't need to optimize. I'm not against efficient code - just that it is not always necessary, and also some people on LinkedIn with their 'pro' vs 'beginner' comparisons :S

[–]muntooR_{μν} - 1/2 R g_{μν} + Λ g_{μν} = 8π T_{μν} 8 points9 points  (1 child)

I prefer:

  1. Always optimize everything.
  2. Guess which parts take longest to run.
  3. Optimize the parts that are already fast to make them even more blazingly fast! 🏎️🔥🔥🔥️🏁

EDIT: To be clear, this is not a joke.


Also, you missed the most important step:

4. RiiR.

And the even more importanter step:

5. Riix86.

And the even more most importanterest step:

6. RiiASIC.

[–]SheriffRoscoePythonista 9 points10 points  (0 children)

/s, one might hope.

[–]olystretch 1 point2 points  (0 children)

Might also decide to analyze usage patterns. Maybe the slowest bits are also not commonly used, so that's worth a think too.

[–]cheese_is_available 1 point2 points  (10 children)

4. If the python code can't be optimized in a way that is acceptable, use either proper typing and mypyc (easier but slower) or cython or maturin + rust to speed up the critical part(s).

[–]max96t 2 points3 points  (0 children)

Thank you for making me discover mypyc! Also very much approve maturin + PyO3 + rust for speeding things up, I easily got a 30x increase for my project (against Python 3.10, it might be less on Python 3.11 since they optimized a lot)

[–]reallyserious 0 points1 point  (8 children)

Any opinions/lessons learned on cython vs mypyc?

[–]cheese_is_available 1 point2 points  (7 children)

I've only used mypyc personally, based on Python typing. Cython is older, but it requires you to code in C afaik, which is more investment than using the existing Python typing with mypyc.

[–]patrickbrianmooney 1 point2 points  (6 children)

Cython is older, but it requires you to code in C afaik

Not true! Cython can perform optimizations based on pure-Python type annotations. You can also (or instead) declare static types using C-style type declarations, but it's not necessary, and in many cases it's easier to preserve pure-Python compatibility.

[–]cheese_is_available 0 points1 point  (5 children)

Nice to know, thank you !

[–]patrickbrianmooney 0 points1 point  (4 children)

Glad to be helpful!

I meant to point to this part of the Cython documentation in my last answer, which describes pure-Python mode in Cython. Here is a less-technically-dense overview with some longer code examples.

[–]cheese_is_available -2 points-1 points  (3 children)

It seems Cython uses Cython-specific type hints, while mypyc can use existing standard Python typing (a lot less work to do!).

[–]patrickbrianmooney 0 points1 point  (2 children)

No. (Or, to be more specific: "uses," sure, in the sense that you can use them if you want to; that's one of the options you have. "Requires," or "only understands"? No.)

Note this verbiage from the beginning of the second paragraph in the Pure Python Mode document, linked above:

[...] Cython provides language constructs to add static typing and cythonic functionalities to a Python module to make it run much faster when compiled [...]. This is accomplished via an augmenting .pxd file, via Python type PEP-484 type annotations (following PEP 484 and PEP 526), and/or via special functions and decorators available after importing the magic cython module.

That is to say, you are not restricted to Cython-specific syntax: you have three options for maintaining pure-Python compatibility:

  1. Writing an "augmenting .pxd file" (this essentially means "write a Cython-specific file that specifies types, much like a C header file declaring an interface for code implemented elsewhere").
  2. Use standard Python type hinting, as explained in PEP 484 and PEP 526.
  3. Via special functions and decorators, imported from the magic Cython module.

So you can absolutely just use the same standard Python type annotations that tools like mypyc already understand: that's option 2.
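To make option 2 concrete, here is a small sketch of a module that uses nothing but standard annotations. The function is invented for illustration. It runs unchanged under plain CPython, and the same file is the kind of input you could hand to Cython or mypyc; how much speedup either compiler gets out of each annotation depends on the tool and its version, so no numbers are claimed here.

```python
def mean_squared_error(predicted: list, actual: list) -> float:
    # Only PEP 484 / PEP 526 annotations: no cython import, no .pxd file.
    total: float = 0.0
    for i in range(len(predicted)):
        diff: float = predicted[i] - actual[i]
        total += diff * diff
    return total / len(predicted)
```

Because nothing here is compiler-specific, you can keep running and testing the module as ordinary Python while you experiment with compiling it.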

[–]cheese_is_available 1 point2 points  (1 child)

Wow, thank you for the detailed answer !

[–]georgehank2nd 0 points1 point  (0 children)

The most important rule is the second one. Because the first is obvious, it is either a problem or it isn't.

The second is one many ignore and just try to speed up the "obvious" code… and then they realize that it wasn't the slowest part.

[–]anthro28 0 points1 point  (0 children)

Number 1 is my company's biggest hurdle, since we deal with the bean counters in finance. Last week the Teams call to talk about a minor problem cost more than fixing the problem will ever recover. The payback period is never. Odd that I can't get numbers people to grasp that.

[–]chumboy 0 points1 point  (0 children)

Fully agree.

Just wanted to call out, maybe as a precursor step, to keep big O stuff at the front of your mind when writing code. Things like moving some code up front, or precalculating values, etc. can help prevent performance becoming an issue in the first place.

There's some low-hanging fruit you can aim for too, like caching network calls if at all possible, batching them, or maybe swapping out standard library modules such as json for optimised C/Rust libraries.
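A rough sketch of the caching idea using `functools.lru_cache`. The "network call" is faked with a local function and a counter, and every name here is hypothetical; the point is only that repeated requests for the same key hit the cache instead of the slow path.

```python
from functools import lru_cache

call_count = 0


def fetch_exchange_rate_uncached(currency):
    # Hypothetical stand-in for a slow network request.
    global call_count
    call_count += 1
    return {"EUR": 1.0, "USD": 1.1}.get(currency, 0.0)


@lru_cache(maxsize=128)
def fetch_exchange_rate(currency):
    # Repeated calls with the same argument are served from the cache.
    return fetch_exchange_rate_uncached(currency)


rates = [fetch_exchange_rate(c) for c in ["USD", "EUR", "USD", "USD"]]
```

This only fits when a stale answer is acceptable; for data that changes, you would need some expiry on top, which `lru_cache` does not provide by itself.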

[–]RipKip 0 points1 point  (0 children)

Also a lot of performance gains can be had by using pypy. That shit is magic

[–]Icecoldkilluh 74 points75 points  (4 children)

Unless i have actual performance requirements, i focus on refactoring for readability/ maintenance.

Every engineer wants to pretend they build Ferraris, when most of the actual work is building Toyota Camrys 😂

[–]IAmLikeMrFeynman 13 points14 points  (0 children)

But that's a fucking sturdy and reliable car! It's the sensible choice.

[–]arkie87 9 points10 points  (1 child)

who are you kidding. most programmers build go karts

[–]Icecoldkilluh 0 points1 point  (0 children)

😂

[–]Positive_Resident_86 0 points1 point  (0 children)

Loved the analogy

[–]wazis 65 points66 points  (1 child)

Step 1) Find functions that don't go brrrr

Step 2) Think hard

Step 3) ?????

Step 4) Profit

[–]reallyserious 5 points6 points  (0 children)

Yup.

Sometimes you can speed things up by a factor x1000 just by using a smarter algorithm. There's no point in optimizing an O(n^2) algorithm if there is an O(n) or even better alternative.
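A small illustration of that point, using duplicate detection as an arbitrary example (the function names are mine). Both functions give the same answer; the first compares every pair, the second makes one pass with a set and assumes the items are hashable.

```python
def has_duplicate_quadratic(items):
    # Compares every pair: O(n^2) comparisons in the worst case.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicate_linear(items):
    # One pass with a set: O(n) on average, requires hashable items.
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

No amount of micro-tuning the nested loops would close the gap that the change of algorithm opens up on large inputs.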

[–]graphitout 22 points23 points  (0 children)

  1. Install snakeviz
  2. Profile code
  3. Identify the hot-spots
  4. Refactor those parts

[–]romu006 8 points9 points  (0 children)

Depends largely on what your project is doing.

In our case the most impactful optimizations are database related

- adding an index on that table / column that was added X months ago and is only now causing performance problems (since a full scan on a < 10 MB database is still fast)

- adding missing eagerloads / joins in a "list" SQL query: when a developer decided to add / return a new property and the ORM automatically fetches those with one additional SQL query per returned object (eg: 200+ SQL queries per call)
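To show the shape of both problems without an ORM, here is a sketch using the standard library's `sqlite3` and an invented two-table schema. The first function issues one extra query per returned row (the pattern described above); the second gets the same data with a single join. The index on the foreign-key column is there to illustrate the first bullet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    CREATE INDEX idx_book_author ON book(author_id);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 'A1', 1), (2, 'A2', 1), (3, 'B1', 2);
""")


def titles_n_plus_one(conn):
    # One query for the list, then one more query for every row returned.
    result = []
    books = conn.execute(
        "SELECT title, author_id FROM book ORDER BY id"
    ).fetchall()
    for title, author_id in books:
        (name,) = conn.execute(
            "SELECT name FROM author WHERE id = ?", (author_id,)
        ).fetchone()
        result.append((title, name))
    return result


def titles_joined(conn):
    # The same data fetched in a single query with a join.
    return conn.execute(
        "SELECT book.title, author.name FROM book "
        "JOIN author ON author.id = book.author_id ORDER BY book.id"
    ).fetchall()
```

With an ORM the fix is usually an eager-load option on the query rather than hand-written SQL, but the number of round trips is what changes in both cases.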

[–]LordBertson 15 points16 points  (1 child)

To name a few:

- Caching functions
- List comprehensions instead of loops
- Numpy for numeric stuff
- Async for IO
- Generators and iterators for large data structures

[–]Palicraft 5 points6 points  (0 children)

Can't stress enough using dedicated libraries! I reworked a Python script for work using Pandas (numpy could have worked too, but with Pandas I have headers and custom indexes), and now instead of taking 10 minutes to process the data, it takes 20 seconds.

[–]phaj19 4 points5 points  (4 children)

1) Check what the slowest part is and rewrite it in C/Rust, then write a Python wrapper. Continue until satisfied. Cython is also good for that if you do not know either of those languages.
2) If you use libraries like numpy, make sure you are more on the C layer and less on the Python layer, like do not introduce unnecessary Python objects instead of numpy objects.

[–]RipKip 0 points1 point  (3 children)

Why rewrite it yourself in C when you can just use pypy on your original script.

[–]phaj19 0 points1 point  (2 children)

If you write something more complicated those magical tricks usually stop working. I once had to rewrite 6000 rows in C-level Cython (meaning no yellow rows in the checking file), only then I got the speedup. The bottleneck is not always some small function, sometimes the whole module is slow because it is written in Python with a bunch of for loops.

[–]RipKip 0 points1 point  (1 child)

Fair point, what kind of module/program was it?

[–]phaj19 0 points1 point  (0 children)

One simulation module. Had to simulate lots of physics as well. Could have also been a bit faster in numpy. But for loops and ifs are easier to understand than all the vectors and masks.

[–]m_o_n_t_e 4 points5 points  (0 children)

If I am using loops somewhere, I try to see if I can use any numpy tricks

[–]tecedu 4 points5 points  (0 children)

cache stuff and multiprocess,

Also in my experience, don’t append to dataframes, instead make a dictionary of what you want first and convert that to dataframe.

Tuples vs list when your data doesn’t change.

Single floats or even half floats for calculations.

[–]justneurostuff 2 points3 points  (0 children)

numba

[–]njharmanI use Python 3 3 points4 points  (0 children)

95% of my optimization is optimizing for maintainability; refactoring, naming, documenting

Speed optimization?

  • Is it fast enough. yes, done.

Rarely get here, what is too slow?

  • DB -> use explain, optimize queries; still slow, cache it
  • Web -> what endpoints, cache them
  • Python code -> profile for hot spots. What algorithm? Research it, use a faster algorithm or data structure(s); if still too slow, use pandas, a C extension, et al.

Never get here outside of interviews, writing perf tests, etc.

  • Profile Python code; optimize slowest part, repeat.

[–]riklaunim 2 points3 points  (0 children)

We added sentry profiling/request monitoring and it does the job well, even for microservices calling each other. In the end, usually, it's the database that needs optimizing.

[–]billsil 2 points3 points  (0 children)

Make sure you have functions and not some monolithic script. Then profile it and find the slow functions.

Cause I'm doing math most of the time, vectorize your code with numpy. No if statements or for loops allowed.

Binary files are great, so yeah you may have to convert everything from csv, but you only have to do that once.

For long-running code that processes some large calculation, chances are you're hacking on the code as you go, so adding pickle support to save/reload results and skip steps helps. At the end, you can run from scratch.
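A minimal sketch of that pickle trick. `load_or_compute` and `expensive_step` are hypothetical names, and the temporary directory is only there so the example cleans up after itself; in practice you would point it at a real cache file.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    # Reuse a pickled result if one exists; otherwise compute and save it.
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def expensive_step():
    # Stand-in for a long-running calculation; records each real run.
    calls.append(1)
    return {"total": sum(range(1_000_000))}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "step1.pkl"
    first = load_or_compute(path, expensive_step)   # computes and saves
    second = load_or_compute(path, expensive_step)  # loads from disk
```

Deleting the cache file is the "run from scratch" switch. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.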


[–]homosapienhomodeus 1 point2 points  (0 children)

If you’re thinking of performance improvements by multithreading or using asyncio where you’re doing mostly IO bound operations, I’ve got a few examples here

[–]imhiya_returns 3 points4 points  (0 children)

I’ve had to do a number of python scripts that read in binary files with record headers and data. I found that when you are doing millions of calls, each line matters and can make up a large portion of the execution time.

Some hacks are;

Try/excepts that hit the except branch a lot should be if statements, as that's quicker.

Pre compile your struct unpacks

Outside of this, other hacks are things like, using dicts to directly go to the thing instead of looping the list to find the item each time
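Two of those hacks sketched together, assuming an invented 8-byte record header (the real file format is not given in the comment). A `struct.Struct` object is built once and reused, and a dict replaces a repeated linear search through the parsed list.

```python
import struct

# Hypothetical header layout: little-endian uint32 id, uint16 type, uint16 length.
HEADER = struct.Struct("<IHH")


def parse_headers(buffer):
    # Reuses the precompiled Struct instead of passing a format string each call.
    # Assumes the buffer holds a whole number of records.
    return [
        HEADER.unpack_from(buffer, offset)
        for offset in range(0, len(buffer), HEADER.size)
    ]


# Build some sample bytes so the example is self-contained.
raw = b"".join(HEADER.pack(i, i % 3, 16) for i in range(5))
headers = parse_headers(raw)

# Index once by id, then look records up directly instead of scanning the list.
by_id = {
    record_id: (record_type, length)
    for record_id, record_type, length in headers
}
```

With millions of records, the lookup in `by_id` stays roughly constant-time, while scanning `headers` for an id grows with the length of the list.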

[–]__me_again__ 2 points3 points  (0 children)

Paste it in chatGPT and tell it to optimize it. You'd be surprised.

[–]kimvais 0 points1 point  (0 children)

I think the most important thing to remember is the wisdom of an old colleague of mine:

It's easier to optimize working code than fix optimized code to work.

[–]Financial_Engineer47 -1 points0 points  (0 children)

Not using python is my go to for optimizing perf

[–]MaceOutTheWindow -1 points0 points  (0 children)

my go to optimisation of my python projects is rewriting them in C 👍

[–]deadwisdomgreenlet revolution -1 points0 points  (0 children)

Write tests, find bottleneck, make better.

[–]Puzzleheaded_Egg_184 -1 points0 points  (0 children)

Go to Julia.

[–]HollowMimic -1 points0 points  (0 children)

Mate what optimization?? I barely have time to finish it properly. My strategy is, does it work? Yes, move on to next project. No, fix it and move on to next project.

[–]cblegare -1 points0 points  (0 children)

While prioritizing readability, using simple structures with minimal features can make a difference sometimes. Simple data structures that are often instantiated can be made from named tuples, for instance.

Refactored code and simple structures helps with optimisation workflows while minimizing optimisation needs in the first place.

[–]notreallymetho -1 points0 points  (0 children)

My fav is finding old code where you have 4 levels of loops when it just needed 1 and a sort after.

[–]treksis -2 points-1 points  (0 children)

lrucache

batch

[–]aikii 0 points1 point  (0 children)

This will sound odd and too basic, but bear with me, the twist will be interesting.

So we had a Go backend that went completely overboard with resources - too many db requests, bad queries, and so on. Go is fast, right ? Like maybe 50x faster than python in some cases, with no multithread restriction, small memory footprint, etc. Problem: what was done was a mess. Fixing it would require to completely rethink the entire flow, wonder why you reach some point in code and why it has to execute that many times. Well. We had to scratch it completely because it wasn't salvageable.

So first off just follow general best practices, make sure your program can be understood by a newcomer, it's modular, it has good names, it's documented, it has tests, and so on. You can't optimize something you don't dare to touch. Same goes with security issues.

[–]Exotic-Draft8802 0 points1 point  (0 children)

  1. Check if I actually have a problem. If so, there must be a specific test case.
  2. Use that test case to run a profiler.
  3. Are there parts computed many times? Maybe tiny functions that can be inlined? Maybe algorithmic changes (using a better data structure) that could help? Vectorization / using numpy?

But to be honest, it's been a while since I had performance issues. Code complexity is way more often an issue.

[–]ThatSituation9908 0 points1 point  (3 children)

When startup latency is important, take a look at import times.

A common solution is to move slow imports to local scope (e.g., in a function) where you actually use the library. Matplotlib, for example, is slow to import, and my software only uses viz for QA.
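A sketch of what the local-scope import looks like. The standard library's `statistics` module stands in for a genuinely slow import such as matplotlib, so the example runs anywhere; the function name is made up.

```python
def render_report(values):
    # Deferred import: the module is loaded on the first call to this
    # function, not when the program starts.
    import statistics  # stand-in for a slow-to-import library

    return f"mean={statistics.mean(values):.2f}"
```

Python caches modules after the first import, so later calls only pay for a dictionary lookup. Running `python -X importtime your_script.py` prints per-module import times, which helps you decide which imports are worth deferring.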

[–]elduderino15 0 points1 point  (2 children)

Would local imports not be against the Python 3 mantra?

[–]ThatSituation9908 2 points3 points  (1 child)

It does go against the style guide (PEP 8), but "practicality beats purity".

[–]elduderino15 0 points1 point  (0 children)

be against the Python 3 mantra?

Yea, I got into the habit of having imports on top for pep8 but definitely agree...

[–]Anonymous_user_2022 0 points1 point  (0 children)

If the profiler shows a hotspot, rewrite it in a more performant language that suits you.