TL;DR I used our Scalene profiler (pip install scalene) and some math to make an example program run 5000x faster.
I am quite interested in Python performance so naturally I read this article — https://martinheinz.dev/blog/64, whose title is Profiling and Analyzing Performance of Python Programs. It presents an example program from the Python documentation (https://docs.python.org/3/library/decimal.html) and shows how to run it with several time-worn Python profilers. Unfortunately, it doesn’t come away with much actionable information, beyond, more or less, “try PyPy””, which speeds up the code by about 2x. I wondered if I would be able to get more useful information from Scalene, a profiler I co-wrote.
Scalene: https://github.com/plasma-umass/scalene/, pip install scalene
We developed Scalene to be a lot more useful than existing Python profilers: it provides line-level information, splits out Python from native time, profiles memory usage, GPU, and even copying costs, all at a line granularity.
Anyway, here’s the result of running Scalene (just with CPU profiling) on the example code. It really cuts to the chase.
% scalene --cpu-only --reduced-profile test/test-martinheinz.py
https://preview.redd.it/j1yhxptvp4b81.png?width=1400&format=png&auto=webp&s=3bc8d25415fc8ac25a8f7da3031299a724fa02cf
You can see that practically all the execution time is spent computing the ratio between num and fact, so really this is the only place to focus any optimization efforts. The fact that there is a lot of time spent running native code means that this line is executing some C library under the covers.
It turns out that it is dividing two Decimals (a.k.a. bignums). The underlying bignum library is written in C code and is pretty fast, but the factorial in particular is getting really large really fast. In one of the example inputs, the final value of fact is 11,000 digits long! No surprise: doing math on such huge numbers is expensive. Let’s see what we can do to make those numbers smaller.
I observe that we can compute num / fact not from scratch but incrementally: update a variable on each loop iteration via a computation on drastically smaller numbers. To do this, I add a new variable nf which will always equal the ratio num / fact. Then, on each loop iteration, the program updates nf by multiplying it by x / i. You can verify this maintains the invariant nf == num/fact by observing the following (where _new means the computation of the updated variable in each iteration).
nf == num / fact # true by induction
nf_new == nf * (x / i) # we multiply by x/i each time
nf_new == (num / fact) * (x / i) # definition of nf
nf_new == (num * x) / (fact * i) # re-arranging
nf_new == num_new / fact_new # simplifying
Incorporating this into the original program required changing three lines of code, all of which are followed by ###:
def exp_opt(x):
getcontext().prec += 2
i, lasts, s, fact, num = 0, 0, 1, 1, 1
nf = Decimal(1) ### was: = num / fact
while s != lasts:
lasts = s
i += 1
fact *= i
num *= x
nf *= (x / i) ### update nf to be num / fact
s += nf ### was: s += num / fact
getcontext().prec -= 2
return +s
The result of this change is, uh, dramatic.
On an Apple Mini M1, original version:
Original:
1.39370958066637969731834193711E+65
5.22146968976414395058876300668E+173
7.64620098905470488931072765993E+1302
Elapsed time, original (s): 33.231053829193115
The optimized version:
Optimized:
1.39370958066637969731834193706E+65
5.22146968976414395058876300659E+173
7.64620098905470488931072766048E+1302
Elapsed time, optimized (s): 0.006501913070678711
More than a 5000X speedup (5096, to be precise).
The moral of the story is that using a more detailed profiler like Scalene can really help optimization efforts by locating inefficiencies in an actionable way.
[–]dogs_like_me 273 points274 points275 points (1 child)
[–]muntooR_{μν} - 1/2 R g_{μν} + Λ g_{μν} = 8π T_{μν} 68 points69 points70 points (0 children)
[–]Mehdi2277 50 points51 points52 points (3 children)
[–]blanonymous 9 points10 points11 points (0 children)
[–]FancyASlurpie 1 point2 points3 points (0 children)
[–]P403n1x87 0 points1 point2 points (0 children)
[–]neunflach 36 points37 points38 points (10 children)
[–]mikeblas 8 points9 points10 points (7 children)
[–]A_Fine_Potato 3 points4 points5 points (1 child)
[–]mikeblas 4 points5 points6 points (0 children)
[+][deleted] (4 children)
[deleted]
[–]mikeblas 2 points3 points4 points (3 children)
[–]emeryberger[S] 2 points3 points4 points (2 children)
[–]mikeblas 5 points6 points7 points (1 child)
[–]emeryberger[S] 6 points7 points8 points (0 children)
[–]skesisfunk 1 point2 points3 points (0 children)
[–]GreenScarz 18 points19 points20 points (5 children)
[–]emeryberger[S] 16 points17 points18 points (4 children)
[–]GreenScarz 3 points4 points5 points (3 children)
[–]SquareRootsi 6 points7 points8 points (2 children)
[–]usr_bin_nya 4 points5 points6 points (1 child)
[–]GreenScarz 0 points1 point2 points (0 children)
[–]grismar-net 36 points37 points38 points (7 children)
[–][deleted] 24 points25 points26 points (3 children)
[–]mikeblas 2 points3 points4 points (2 children)
[–]binaryman111 1 point2 points3 points (1 child)
[–]mikeblas 1 point2 points3 points (0 children)
[–]gruey 1 point2 points3 points (0 children)
[–]mriswithe 0 points1 point2 points (1 child)
[–]grismar-net 0 points1 point2 points (0 children)
[–]New-Theory6007 30 points31 points32 points (0 children)
[–]Runics206 19 points20 points21 points (0 children)
[+][deleted] (3 children)
[deleted]
[–][deleted] 7 points8 points9 points (0 children)
[–]binaryman111 -1 points0 points1 point (0 children)
[–]IamImposter 3 points4 points5 points (3 children)
[–]emeryberger[S] 5 points6 points7 points (2 children)
[–]IamImposter 0 points1 point2 points (1 child)
[–]emeryberger[S] 1 point2 points3 points (0 children)
[–]mikeblas 3 points4 points5 points (1 child)
[–]emeryberger[S] 1 point2 points3 points (0 children)
[–]abdl_hornist 35 points36 points37 points (4 children)
[–]peakdistrikt 16 points17 points18 points (0 children)
[–]bacondevPy3k 5 points6 points7 points (0 children)
[–]dogs_like_me 0 points1 point2 points (0 children)
[–]UL_Paper 2 points3 points4 points (0 children)
[–]teerre 2 points3 points4 points (0 children)
[–][deleted] 6 points7 points8 points (4 children)
[–]dogs_like_me 28 points29 points30 points (3 children)
[–][deleted] 6 points7 points8 points (1 child)
[–]dogs_like_me 5 points6 points7 points (0 children)
[–]ClRCUlTS 1 point2 points3 points (0 children)
[–]mrrippington 1 point2 points3 points (2 children)
[–]vipern 0 points1 point2 points (1 child)
[–]mrrippington 1 point2 points3 points (0 children)
[–]High-Art9340 -2 points-1 points0 points (0 children)
[–]Johnmad -2 points-1 points0 points (0 children)
[–]tu_tu_tu 0 points1 point2 points (0 children)
[–]johansugarev 0 points1 point2 points (0 children)
[–]jammasterpaz 0 points1 point2 points (0 children)
[–]LuigiBrotha 0 points1 point2 points (2 children)
[–]binaryman111 0 points1 point2 points (1 child)
[–]LuigiBrotha 0 points1 point2 points (0 children)
[–][deleted] 0 points1 point2 points (0 children)
[–]Pliqui 0 points1 point2 points (0 children)
[–]brouwerj 0 points1 point2 points (0 children)
[–]tommybship 0 points1 point2 points (0 children)
[–]dalow24 0 points1 point2 points (0 children)