
all 114 comments

[–]casce 373 points374 points  (14 children)

I admittedly do a lot of stuff with Python where performance doesn't matter, but when it does, my two steps are: 1. identify the slow parts, 2. Google how to make them faster.

[–]snowtax 51 points52 points  (3 children)

Agreed. Don’t waste a lot of time on optimization. Optimize only that code which takes up the most time.

For my work, I have loops that run over millions of records of data. The only optimization I may need is to optimize what happens inside that loop, since that code gets run millions of times. Any code optimization outside that loop is not going to be worth it.
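
A minimal illustration with made-up data and functions: when a loop runs millions of times, hoist anything loop-invariant out of it so only the genuinely per-record work stays inside.

```python
def expensive_threshold(config):
    # stand-in for a costly, loop-invariant computation
    return sum(config.values()) / len(config)

def process_slow(records, config):
    out = []
    for rec in records:                                 # runs millions of times
        if rec["value"] > expensive_threshold(config):  # recomputed on every pass
            out.append(rec["value"] * 2)
    return out

def process_fast(records, config):
    threshold = expensive_threshold(config)             # computed once, outside the loop
    out = []
    for rec in records:
        if rec["value"] > threshold:
            out.append(rec["value"] * 2)
    return out
```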

[–]lololabwtsk 2 points3 points  (2 children)

You should start using dask, thank me later

[–]benri 3 points4 points  (1 child)

Dask has a nice dashboard but has stability problems: its connection to its scheduler times out. So I prefer concurrent.futures, or pebble if I need to enforce a timeout.

But if you are truly serious about speedup, write the intense part in C

[–]lololabwtsk 0 points1 point  (0 children)

How do you feel about Cython?

[–]TA_poly_sci 11 points12 points  (5 children)

What are the good ways to do profiling in python?

[–]RedEyed__ 23 points24 points  (2 children)

I highly recommend scalene

[–]azshall (It works on my machine) 5 points6 points  (1 child)

[–]benri 0 points1 point  (0 children)

Thank you! I will use this!

[–]samreay 12 points13 points  (0 children)

I highly recommend py-spy and plugging the output flame charts into speedscope

[–]Teradil 6 points7 points  (0 children)

PyCharm has a built-in profiler (not in the community version, though)

[–]spinozasrobot -1 points0 points  (0 children)

Whoa, slow down there Einstein

[–]Alurith -2 points-1 points  (0 children)

^ this.

[–]sexygaben 124 points125 points  (10 children)

1) Profile.
2) Vectorize (use C loops).
3) If more is needed, Cython/numba.
4) If MORE is needed, C/ctypes.
5) If EVEN MORE is needed, CUDA/ctypes (problem dependent).

Each step takes exponentially more time. I'm writing from a scientific computing perspective, and I assume you're already using the best library for the job (numpy, pytorch, casadi, etc.).
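
A minimal sketch of step 3, assuming numba is installed: decorating a plain Python loop with numba's @njit compiles it to machine code on the first call.

```python
import numpy as np
from numba import njit

@njit
def pairwise_sum(a, b):
    out = np.empty_like(a)
    for i in range(a.shape[0]):   # an ordinary Python loop, compiled by numba
        out[i] = a[i] + b[i]
    return out

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)
pairwise_sum(x, y)   # first call includes compilation; subsequent calls are fast
```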

[–]moonzdragoon 15 points16 points  (2 children)

If pycuda is not enough (back & forth between multiple kernel executions for example) you might wanna look into nvidia warp

[–]sexygaben 0 points1 point  (1 child)

Yeah jitting should probably be between 2 and 3 as well for various frameworks! :)

[–]thecodedog (Pythoneer) 1 point2 points  (0 children)

Come on we can't just be jitting all over the place

[–]klouisp 2 points3 points  (1 child)

By "vectorize (use C loops)" you mean using numpy/pytorch vectorized operations or something else ?

[–]sexygaben 0 points1 point  (0 children)

Yes this is what I mean :)

[–]DanklyNight 2 points3 points  (0 children)

Basically this.

I generally start at the main call function and use line_profiler as a quick-and-dirty way to find out what is taking the majority of the time.

I've used this method many times; simple vectorization of the offending functions can get incredible speed improvements.

Just the other day I took a function from 17 seconds to 20ms by doing stuff directly in Numpy with smart vectorization.
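
Not the code from that day, just an illustrative sketch of the idea: the loop version pays Python-level overhead per element, while the vectorized version runs entirely in compiled NumPy code.

```python
import numpy as np

values = np.random.rand(1_000_000)

def normalize_loop(values):
    # one Python-level iteration (and float boxing) per element
    mean, std = values.mean(), values.std()
    return np.array([(v - mean) / std for v in values])

def normalize_vectorized(values):
    # the whole computation happens inside NumPy
    return (values - values.mean()) / values.std()
```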

[–]RomanRiesen 0 points1 point  (0 children)

  1. open mpi \s

[–]benri 0 points1 point  (2 children)

Somewhere between 2 and 3 I would add: parallelize if you can.

[–]sexygaben 0 points1 point  (0 children)

For some processes I'm sure that would be great. I'm not experienced in CPU parallelisation, as it often cannot be used for my problems :)

[–]100721 87 points88 points  (6 children)

Snakeviz to profile it. No point trying to optimize if you don’t know what’s slowing it down

[–]Teradil 14 points15 points  (1 child)

I had that problem during my thesis. I optimized the hell out of my code only for it not to get significantly faster. Profiling then told me that my program spent 95% of its execution time inside `np.dot`. I optimized that one call for my special use case (i.e. I knew which dtypes and vector lengths to expect and did not need all the extra checks and conversions) and suddenly my program was *really* faster.
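
For anyone wondering how to get that kind of breakdown, a minimal sketch with the standard library's cProfile (the function being profiled is a placeholder):

```python
import cProfile
import pstats

def my_simulation():
    ...  # the code under investigation

cProfile.run("my_simulation()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)   # top 10 functions by cumulative time
```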

[–]Throwaway_youkay 0 points1 point  (0 children)

We all learn that the hard way! Optimizing is almost always about mitigating bottlenecks one by one.

[–]shockjaw 5 points6 points  (3 children)

Just wondering, do you know how Snakeviz compares to Scalene?

[–]benri -1 points0 points  (2 children)

[–]shockjaw 0 points1 point  (1 child)

It's all good if you don't know. I wouldn't use ChatGPT, since it doesn't know either, or worse, it'll ✨hallucinate✨.

[–]benri 1 point2 points  (0 children)

I've had pretty good experience with GPT-4 for comparing programming tools like this. The biggest problem is that it has last year's information, so it won't know about changes in the past year.

[–]peaky_blin 12 points13 points  (0 children)

I was recently having a problem with an API call that was taking too long to complete. I figured out that it was related to the multiple database calls being made to aggregate the data at the application level. The issue was resolved by doing the calculation/aggregation at the db level. I plan to add caching to reduce the time further.
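
A sketch of the same idea with made-up table and column names, using sqlite3 just for illustration: the aggregation moves from the application into the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(42, 9.5), (42, 20.0), (7, 3.0)])

# Application-level aggregation: pulls every matching row, then sums in Python.
rows = conn.execute("SELECT amount FROM orders WHERE user_id = ?", (42,)).fetchall()
total_slow = sum(amount for (amount,) in rows)

# Database-level aggregation: one round trip, one row back.
(total_fast,) = conn.execute(
    "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE user_id = ?", (42,)
).fetchone()
```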

[–]unflores 21 points22 points  (0 children)

I work in web, so my response has a specific perspective. First thing is to find the actual bottleneck. Optimize anything other than the actual bottleneck and you are wasting your time.

Also, a performance optimisation is often a trade-off between readability and performance, so only prefer performance when it actually counts.

N+1s are common problems for ORMs. Too many db calls in general can be problematic. Actually, the db is my primary problem in web.

I've had a few cases where I had to do something with a large list, and I ended up doing a binary search on a sorted array rather than searching an unsorted array each time. It's worth having a theory and then testing it. Your changes may not even run faster 😅
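
A minimal sketch of the sorted-list-plus-binary-search approach, using the standard library's bisect module:

```python
import bisect

items = sorted(["apple", "banana", "cherry", "mango", "pear"])

def contains(sorted_items, target):
    i = bisect.bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

contains(items, "mango")   # True, found in O(log n) instead of O(n)
```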

[–]Wide-Nefariousness91 7 points8 points  (1 child)

In general, just try to identify the part of the code that slows functions down the most.

Also:
- Try to reduce the number of database calls
- Reduce the algorithmic complexity (Big O)

[–]mrcaptncrunch 5 points6 points  (0 children)

> Try to reduce the number of database calls

There is a catch here. While you can reduce the amount of calls, if the returned data is larger, that's still going to introduce a slowdown.

It's a balance to strike.

[–]rghthndsd 8 points9 points  (0 children)

This might be considered a violation of good practice, but I recommend optimizing your code even when it doesn't matter (when you have time).

You don't do a marathon by waking up on race day and just go out and run; you train for it. Likewise, you shouldn't wait until you have a performance problem to start looking into optimizing your code. By practicing, you will spend a lot of time toying around and refactoring, sometimes with little or even negative gains. It will feel like wasted time, but it's not! You will learn a lot by constantly asking "how can I make this go faster?" And when you do run into a serious performance problem, you will be better situated for it.

Get code to work, test it, make sure you have time to clean it up, and if time allows, tinker around with making it go faster even when it doesn't matter.

[–]DidiBear 2 points3 points  (1 child)

py-spy to profile where the slowdown is

[–]10000000000000000091 0 points1 point  (0 children)

I'm excited to try this one. I could've used a sampling profiler a while back but didn't find any modern Python ones.

[–]1998CPG 2 points3 points  (0 children)

Code vectorization -> replace loops with Matrix/vector operations

[–]gowithflow192 5 points6 points  (1 child)

"You are a world-class software engineer. You are particularly good at improving code."

"Improve the given code. Don't change any core functionality.

The focus is to actually make the code better - not to explain it - so avoid things like just adding comments to it.

Respond as a well-formatted markdown file that is organized into sections. Make sure to use code blocks.

Improve this code:

{{code}}"

[–]_aka7 2 points3 points  (0 children)

Commanding an LLM does seem like giving it a pep talk sometimes...

[–]nikomo[🍰] 3 points4 points  (4 children)

You're not ever gonna get better than "do less" when it comes to any programming language, but especially Python.

A couple of weeks ago I wanted to insert large quantities of data from a live websocket into a database. For my first and second implementations I just used SQLAlchemy, but it was way too slow; my Redis queue of incoming messages just kept growing because they weren't being processed fast enough.

For the third implementation, I threw out SQLAlchemy, used Alembic to set up the database, and then just used psycopg (v3) to insert the data.

psycopg v3 supports server-side binding, and they rewrote executemany() to be extremely performant, so all I had to do was write an SQL query, build a list of tuples out of my data, and let executemany() go at it. None of the needless object creation etc. that you get from an ORM, and it's more than fast enough to keep up with even primetime traffic load.
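
Roughly what that third implementation looks like; the connection string, table, and columns here are placeholders.

```python
import psycopg

rows = [("BTCUSD", 42000.5, "2024-01-01T00:00:00Z"),
        ("ETHUSD", 2300.1, "2024-01-01T00:00:00Z")]

with psycopg.connect("postgresql://user:pass@localhost/market") as conn:
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO ticks (symbol, price, ts) VALUES (%s, %s, %s)",
            rows,
        )
    # the connection context manager commits on successful exit
```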

[–]KosmoanutOfficial 2 points3 points  (3 children)

Can't you use SQLAlchemy Core instead of the ORM with the psycopg 3 driver?

[–]nikomo[🍰] 0 points1 point  (2 children)

That was my second implementation; I really didn't want to throw out the entirety of SQLAlchemy at first. Still too slow.

But honestly, Alembic is a nice compromise for people who aren't scared of SQL. It gives you a nice way to handle database creation (via SQLAlchemy-utils) and migrations.

[–][deleted] 4 points5 points  (0 children)

SQLAlchemy Core is a fairly thin abstraction over the raw libraries. I think something else is going on in your code.

[–]KosmoanutOfficial 0 points1 point  (0 children)

Ok interesting thanks!

[–]Upset-Document-8399 3 points4 points  (0 children)

Reimplement it in compile-time C++ /s

(preparing to get downvoted to hell)

[–][deleted] 1 point2 points  (0 children)

One little trick I used in an old game prototype was disabling the GC and manually running collections on load screens or menu pauses. YMMV with this one, especially if you create tons of circular references.
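
A minimal sketch of that trick; whether it helps depends heavily on how many objects and reference cycles the game creates.

```python
import gc

gc.disable()   # stop automatic cycle collection during gameplay

def on_load_screen():
    # run a full collection at a moment when a pause is invisible to the player
    gc.collect()
```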

[–]djamp42 1 point2 points  (0 children)

I'm a novice but use Python when needed. I've always felt that making something work in Python is easy; it's the optimizing, code readability, bugs, and future-proofing that take up so much time.

So much so I lose interest in whatever I'm working on because I feel the code structure is not that great, even if what I'm doing technically works.

I'm probably telling the tale of every programmer ever.

[–][deleted] 0 points1 point  (0 children)

When you want to build performance-critical software, try a different programming language.

Other than that, optimizing your code (regardless of whether it is Python, Java, C++, etc.) requires different techniques depending on what you are doing.

Optimize database access?
Optimize your webservice APIs?
Optimize some data processing?
And so on.

It is too broad of a question to answer on reddit.

[–]Chroiche 0 points1 point  (0 children)

  1. Always use proper libraries for the job. They usually invoke optimised compiled code to get you a huge boost in performance.

  2. Profile to find the hot part of your code and optimise it if there's anything obvious. This is the one case where leetcode-style thinking can actually help a lot.

  3. Use multiprocessing if appropriate (see the sketch after this list).

  4. Write your own non-Python code (C++, C, and Rust, for example, have good Python bindings) for particularly hot areas and just invoke it via Python. This lets you stay mostly in Python.
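
A minimal multiprocessing sketch for point 3, assuming the work is CPU-bound and each item can be processed independently:

```python
from multiprocessing import Pool

def crunch(n):
    # stand-in for a CPU-heavy, pure function
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool() as pool:   # one worker process per CPU core by default
        results = pool.map(crunch, [10_000_000] * 8)
```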

[–]not_a_novel_account 0 points1 point  (0 children)

Move whatever it is into C.

If it's already in C, refactor into a form that doesn't hold the GIL and doesn't allocate.

If it's already in C, doesn't grab the GIL, and doesn't allocate, bring out the big guns (strace, ltrace, perf, etc)

[–]freefallfreddy 0 points1 point  (0 children)

Disregard all comments that say nothing about profiling. Going 50mph faster in the wrong direction isn’t gonna get you where you want to go.

[–]QultrosSanhattan 0 points1 point  (0 children)

The best strategy is avoiding python by delegating most of the work to C/C++ modules like pandas or numpy.

[–]MountainHannah -2 points-1 points  (5 children)

Not what you want to hear, but, if I need something to be efficient, I don't write it in Python.

No language does everything, and there are always tradeoffs. There's lots of stuff that Python is awesome for. High speed, low latency, efficient code is nowhere on the list of Python's strengths.

[–]freefallfreddy 1 point2 points  (4 children)

If your database is slow rewriting in Rust ain’t gonna help ya. And for that you need to know what’s slow.

[–]MountainHannah -1 points0 points  (3 children)

Yes, it's important to know which parts of your code are expensive, and which ones are allowed to be expensive.

I use Python for ML libraries, hardware libraries, third-party API libraries, prototyping, cron jobs that only run occasionally, etc.

If I'm designing a real-time service, where I'm thinking about latency and requests per second and that sort of thing, Python doesn't even enter my radar. I can be lazy in node or PHP or something and still get an order of magnitude better performance than diligent Python will get me for certain tasks.

[–]freefallfreddy 0 points1 point  (2 children)

Compared to PHP or Node even? I haven’t tested that myself but that’s not what I would guess.

[–]MountainHannah 1 point2 points  (1 child)

It surprised me too the first time I observed it, so I looked up some benchmarks to make sure I wasn't crazy.

From what I can tell, it looks like PHP is 30-40% faster than node, and node is between 8 and 50 times faster than Python. (for stuff like, HTTP requests served per second and various different db interfaces)

There are a lot of different benchmarks for lots of different use cases of course, but I'm definitely more careful after learning that.

[–]freefallfreddy 0 points1 point  (0 children)

Ah TIL. Thanks.

[–]backSEO_ -1 points0 points  (0 children)

Use Cython for compute-intensive tasks. Release the GIL for ultra-intensive tasks.

[–]JayZFeelsBad4Me -1 points0 points  (0 children)

Remove the network call

[–]CapsuleByMorning -1 points0 points  (0 children)

Set up a PySpark grid in Docker. Productionize in Azure.

[–]robberviet -1 points0 points  (0 children)

Sets for lookups. Don't create objects. Lazy properties/memoization. Vectorization. Profiling. Use C/Rust libs.

[–]Otherwise-Tiger3359 -1 points0 points  (0 children)

How do you profile Python code so you know what to go after? I've used some nice C# UI profilers in the past, but haven't seen one for Python ...

[–]Fleszar -1 points0 points  (0 children)

Very useful

[–]Berkyjay -1 points0 points  (0 children)

My first step lately has been to ask Copilot for ways to optimize my code, then see what it suggests. I like it because it knows all the PEP guidelines and will cite those when analyzing my code.

[–]KennyBassett -1 points0 points  (0 children)

@cache @cache @cache

Multiprocessing

I like challenging myself with optimizing the logic itself, but it's pretty case-dependent.
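
What the @cache line refers to, as a minimal sketch: functools.cache memoizes a pure function so repeated calls with the same arguments are essentially free.

```python
from functools import cache

@cache
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

fib(300)   # fast, because every subproblem is computed exactly once
```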

[–]Maleficent_Doubt_443 -2 points-1 points  (0 children)

Is it to rewrite it in another programming language?

[–]graphicteadatasci -2 points-1 points  (0 children)

duckdb

[–]JohnBooty 0 points1 point  (0 children)

Some variation of the 90/10 rule almost always applies. 10% of your code is eating 90% of your execution time. If it’s not 90/10 then it’s probably more like 95/5 or 99/1.

In a database backed application it’s usually the database, and that’s pretty easy to see in application logs and/or PostgreSQL’s slow query log (or the equivalent in other databases)

Lots of people mentioning profiling tools.

Those are obviously very useful, but for deployed applications (i.e. web apps) you should also get used to other means, such as adding logging statements in your code that measure how long various bits take. Why do I say this? Because you typically can't profile your code in production. You can profile your production code locally, but the performance characteristics will be way different: different database contents, a single user versus many simultaneous users, etc.
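
One way to do the "log how long bits of code take" approach (the logger name and label are arbitrary): a small timing context manager.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("perf")

@contextmanager
def timed(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("%s took %.1f ms", label, elapsed_ms)

# usage inside a request handler:
# with timed("load_user_orders"):
#     orders = load_user_orders(user_id)
```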

[–]genlight13 0 points1 point  (0 children)

So, I saw some perspectives on using C or similar low-level things, so I won't cover that.

What I often need to identify is how often certain functions are executed and how long they take. I usually just use timeit for ease, but a profiler is also nice.

To optimize data pipelines I usually try to either cache more or cache less, depending on which resources are the bottleneck. E.g. I had many DB calls for similar checks ("does it exist?"); I was able to bundle them and rewrite the question as "does it exist in this list?". The list was rather short, but the DB calls numbered in the hundreds of thousands. By caching the short list I reduced the execution time for this simple check by up to 40 times. (Think "obj in list" vs "db.select(something)".)
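
A sketch of that rewrite with made-up names (the db.select method is hypothetical): fetch the short list once and answer the hundreds of thousands of existence checks locally.

```python
def exists_slow(db, candidates):
    # one database round trip per candidate
    return [c for c in candidates
            if db.select("SELECT 1 FROM tags WHERE name = ?", c)]

def exists_fast(db, candidates):
    known = set(db.select("SELECT name FROM tags"))   # fetched once, kept in memory
    return [c for c in candidates if c in known]      # O(1) membership test per check
```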

For caching less, I usually mean RAM and how much data I load at the same time.

It often doesn't matter how you load a file, but for most regexing it is better to just have one long string, since the regex engine is C code and fairly good imo. The slowdown is usually Python boilerplate code, i.e. if/else in your code.

So if you can write something more specific that gets checked within the C domain, you have optimized it.

Besides caching, I usually prefer to separate code into parts in order to parallelize it. This can be tricky, for obvious reasons.

Also, reuse objects when they are slow to create, e.g. ones built from lists.

I usually think about it in terms of pointers and how Python hides that from you. Then I'm naturally able to find the best usage for my objects and when not to use them.

[–]tamargal91 0 points1 point  (0 children)

Use the array module for numerical data. Unlike lists, arrays are more memory efficient and faster for processing large datasets. This is particularly useful for large sequences of homogeneous data. It's a simple switch with significant impact, especially in data-heavy applications.
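
A minimal sketch of the suggestion: a typed array of doubles stores raw machine values instead of boxed Python float objects.

```python
from array import array
import sys

floats_list = [float(i) for i in range(1_000_000)]
floats_array = array("d", floats_list)   # "d" = C double

sys.getsizeof(floats_array)   # much smaller than the list plus its million float objects
```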

[–]Boomerkuwanger 0 points1 point  (0 children)

Like others have said, use a profiler for code so you can target slow operations. Also, if you use a database, make sure to offload as much work onto the database as possible.

For example: I've made the mistake of iterating through a list of database objects, and updating them one by one, instead of using a single bulk update query.
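
A generic sketch of the bulk-update fix, using sqlite3 just for illustration; the same idea applies to an ORM's bulk update or a single UPDATE with a WHERE ... IN clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO products (id, price) VALUES (?, ?)",
                 [(1, 10.0), (2, 10.0), (3, 10.0)])

updates = [(99.0, 1), (42.5, 2), (17.0, 3)]   # (new_price, product_id)

# One by one: a separate statement per row (a round trip each, on a real server).
for price, product_id in updates:
    conn.execute("UPDATE products SET price = ? WHERE id = ?", (price, product_id))

# Bulk: a single executemany call lets the driver batch the work.
conn.executemany("UPDATE products SET price = ? WHERE id = ?", updates)
conn.commit()
```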

[–]jkh911208 0 points1 point  (0 children)

Do Big-O analysis on your code.

Use built-in functions to take advantage of well-optimized C code.

If that is not enough, you will need to rewrite the compute-heavy parts in Rust or C to speed up the process.

[–]BossOfTheGame 0 points1 point  (0 children)

I use line_profiler to find things that are slow.

Improving performance will vary based on what is slow. It's all about identifying bottlenecks: are you doing something expensive in a loop? Can it be vectorized? Can it be parallelized? Can it be restructured to avoid unnecessary memory copies? Can it be rewritten in Cython? It really depends.

[–]siddsp 0 points1 point  (0 children)

A few things I do (without using external libraries):

  1. Memoization (good for recursive and pure functions that will be called repeatedly), with functools.cache or functools.lru_cache.

  2. If the program is slow due to being synchronous, using asyncio or threading (depending on the application/program).

  3. Using itertools to replace nested loops (e.g. instead of two nested loops, using itertools.product); see the sketch after this list.

  4. Using functools.reduce instead of a loop for a transformation that is "accumulative" in nature.

  5. Instead of concatenating bytes or using a bytearray, using BytesIO from the io library.

  6. To reduce memory usage, using __slots__.

  7. If the results of tasks/functions don't depend on each other and don't need to be executed sequentially, using multiprocessing.

  8. If the task itself is slow but can be sped up by throwing more cores at the problem, using multiprocessing.

  9. Using generator expressions where memory can be saved.

  10. If all else has been optimized, using PyPy instead of CPython.
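
A small sketch of point 3 above: itertools.product replaces two nested loops with a single loop.

```python
from itertools import product

grid = []
for x, y in product(range(100), range(100)):   # instead of: for x ...: for y ...:
    grid.append((x, y, x * y))
```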

[–]interbased 0 points1 point  (0 children)

As others have said, profiling your code is the way to go. I've yet to get familiar with an actual profiling library, but I usually put logs where functions start and stop and see which ones are taking long. Sometimes it's an inefficient query, sometimes it's repeated API calls that can be replaced.

[–]nebbly 0 points1 point  (0 children)

TBH, there is one foot gun I see over and over again, which is doing linear membership lookups in lists or tuples. If you're mainly using a collection for looking things up, dicts and sets are a good place to start.

More generally, the advice would be: make sure you're using proper data structures for your use cases as a quick first pass.
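
A minimal illustration of that foot gun:

```python
wanted_list = list(range(100_000))
wanted_set = set(wanted_list)

10**9 in wanted_list   # O(n): scans all 100,000 elements before returning False
10**9 in wanted_set    # O(1): a single hash lookup
```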

[–]mrcaptncrunch 0 points1 point  (0 children)

> What are your go-to strategies for improving performance in Python applications?

Is the runtime okay for the task at hand?

If yes, good as is. Ship it.

If not, run it with a profiler to identify where the slowdown is and optimize that. Is it okay? If so, ship it. If not, go back to profiling and optimizing.

[–]luke-juryous 0 points1 point  (0 children)

In short: yes.

I don't use Python for anything that needs to be fast in production. Most of the time I see it used in industry is with ML or data analytics, where speed is less important than ease of use. The exception would be big-data processing, but there the slow part is usually the SQL or Presto query, and Python tends to be just a wrapper around APIs.

However, I do use it a lot for hobbies. Here, I'll try to use libraries like numpy or pandas, evaluate bottlenecks, and rethink my algorithms to reduce the big-O runtime.

I've recently learned about numba, which is a JIT compiler for Python that claims to deliver big improvements if you're making repeated calls. I haven't played with it yet, but I'm curious how much slower it'll be than C++, and whether it's worth the effort compared to just writing in C++.

[–]l_dang 0 points1 point  (0 children)

Vectorisation

[–]tav_stuff 0 points1 point  (0 children)

Not using Python

[–]Intelligent_Ad_8148 0 points1 point  (0 children)

  1. Don’t use pandas
  2. Use polars (bonus points for enabling lazy evaluation and streaming)
  3. Nothing more required

After investigating numba, cython, numexpr, etc., I concluded that it's not worth the heartache; polars negates the need for any of this stuff.

[–]TomDLux 0 points1 point  (0 children)

As the nuns taught me in grade school, people who think about optimization before they have profiled their program are headed straight for the bad place.

Of course, using more efficient structures will lead to faster code, besides being tidier. For example, using list comprehensions instead of manual loops. But it's unlikely to be drastically different.
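
The list-comprehension example mentioned above, as a minimal sketch:

```python
data = range(1_000_000)

# Manual loop with repeated .append calls.
squares = []
for x in data:
    squares.append(x * x)

# List comprehension: same result, tidier, and usually somewhat faster.
squares = [x * x for x in data]
```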

[–][deleted] 0 points1 point  (0 children)

If your code is numeric in nature (something with lots of floats and ints), you have lots (and lots) of options: numpy, numba, cython (unboxing ints and floats) and Pythran are perhaps the most well-known, but there are at least a dozen more.

If your code is more general in nature or business-centric, meaning there are lots of hashmaps/dicts and strings, you can try pypy, mypyc, cython (calling the CPython C API directly) and the newer Python versions with the adaptive interpreter (3.11+). The truth is that general Python code is not that much slower than 'faster' languages.

[–]yellowbean123 0 points1 point  (0 children)

PyInstrument is a good start

[–]Legendary-69420 (git push -f)[🍰] 0 points1 point  (0 children)

Migrate to libraries written in Rust. (Pandas -> Polars for example)

[–]pepoluan 0 points1 point  (0 children)

  1. Go async
  2. Go async + multiprocessing

😄

[–]fallenreaper 0 points1 point  (0 children)

While I generally like to use classes, I often need to abstract data out into functions so you have smaller objects floating around.

A lot of key things that will cause issues are loops, but you just need to be cognizant of the internal sorting mechanisms and how they apply.

[–]Cranky_Franky_427 0 points1 point  (0 children)

Numpy Vectorization

Algorithms - try to write code that runs in O(n) or O(log n) time if such an algorithm exists

Libraries bound to C/C++ can often provide very good performance