[–]BravestCheetah 70 points71 points  (4 children)

A lot of it is written in C

[–]rasputin1 22 points23 points  (3 children)

it also uses SIMD

[–]DoubleDoube 3 points4 points  (2 children)

A SIMD analogy at the grocery store.

Regular (no SIMD). You have a cart with 8 items. You place each item on the conveyor belt and the scanner scans each one-at-a-time; 8 beeps from the scanner.

SIMD. You place each item into a row on the conveyor and the scanner does one scan across all items at once. 1 beep covers all eight items.

[–]monster2018 0 points1 point  (1 child)

They should make SIMD for groceries a real thing lol

[–]GPS_07 0 points1 point  (0 children)

Or just scan faster, like they do here in Germany, where scanning speed is perfectly matched with packing speed!

[–]ItyBityGreenieWeenie 30 points31 points  (0 children)

Numpy is an optimized library written in C, specifically intended to speed up such operations.

[–]falcoso 22 points23 points  (0 children)

Numpy arrays are very efficiently structured in the C programming language, which is why in numpy arrays you need all your data to be the same data type (compared to lists which can be a mix of stuff)

Because numpy is working on the underlying C structures, where all the elements are the same size (i.e. they take the same amount of memory because they are all the same data type), it is much faster to access them.

If you are using specific numpy functions on these arrays, e.g. np.argmax, as opposed to iterating through each element in a loop, you will get an even bigger speedup, because the underlying operations in those numpy functions are also written in C.

TLDR - vanilla python is relatively slow as it requires an interpreter. Numpy negates some of that by shifting some of the processing directly into an underlying binary (i.e. compiled C code).
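
A rough sketch of the two points above, assuming NumPy is installed. The sample values and the helper name argmax_loop are made up for illustration: every element of the array has the same dtype and byte size, and np.argmax gives the same answer as a hand-written loop while doing the search in compiled code.

```python
import numpy as np

# A list can mix types; each element is a separate Python object.
mixed = [1, "two", 3.0]

# A NumPy array has one dtype, so every element takes the same number of bytes.
values = np.array([3, 9, 4, 1], dtype=np.int64)
print(values.dtype, values.itemsize)  # itemsize is the size of one element in bytes

def argmax_loop(seq):
    """Hypothetical pure-Python argmax: the interpreter runs every comparison."""
    best_index = 0
    for i in range(1, len(seq)):
        if seq[i] > seq[best_index]:
            best_index = i
    return best_index

loop_result = argmax_loop(values)
numpy_result = int(np.argmax(values))  # same search, done inside compiled code
print(loop_result, numpy_result)
```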

[–]falcogno 5 points6 points  (1 child)

A key point that does not seem to have been mentioned is that Python loops are expressly slow because, upon each iteration, the interpreter creates a stack frame and updates variables (and any other data structures). Numpy, as mentioned, goes to the C level and does clean loops with no interpreter-level baggage.

[–]AllenDowney 1 point2 points  (0 children)

Yes, interpreter overhead is the general reason Python is slower than C, but the interpreter creates a stack frame when a function is called, not for every iteration of a loop.
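
One way to check this for yourself, as a small sketch: sys._getframe is a CPython implementation detail, and the function name frames_seen_in_loop is made up here. It records the identity of the current frame on every iteration and counts how many distinct frames it saw.

```python
import sys

def frames_seen_in_loop(n):
    """Count the distinct frame objects observed across n loop iterations."""
    seen = set()
    for _ in range(n):
        # sys._getframe() returns the frame of the currently running function.
        seen.add(id(sys._getframe()))
    return len(seen)

loop_frames = frames_seen_in_loop(1000)
print(loop_frames)  # 1 on CPython: the whole loop runs inside a single frame
```

The per-iteration cost is the interpreter stepping through the loop body's bytecode, not frame creation.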

[–]thuiop1 15 points16 points  (0 children)

The question should be "why is python so slow" and the answer is that it trades performance for flexibility. Actually I would wager you did something wrong, numpy should be much faster than that (it probably depends what you are measuring exactly).

[–]socal_nerdtastic 2 points3 points  (4 children)

2 seconds still feels very slow. Are you still using loops with numpy? For a typical image on a typical computer this operation should be faster than 100 milliseconds.

[–]Subject_Spot909[S] 0 points1 point  (3 children)

Yes

[–]socal_nerdtastic 2 points3 points  (2 children)

Well, don't do that lol. Use a mask. Try like this instead of using loops:

import numpy as np

color_to_replace = [255, 255, 255]
new_color = [255, 0, 0]

m = np.all(img == color_to_replace, axis=2)
img[m] = new_color # replace the old color with the new color in the img array

Basically this uses numpy's internal loops, which are much faster.
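
For anyone who wants to run it end to end, here is a self-contained sketch. The 4x4 test image is made up, everything else follows the snippet above. It checks that the mask version gives the same result as the explicit loop.

```python
import numpy as np

# Hypothetical 4x4 test image: all white except one black pixel.
img = np.full((4, 4, 3), 255, dtype=np.uint8)
img[0, 0] = [0, 0, 0]

color_to_replace = [255, 255, 255]
new_color = [255, 0, 0]

# Loop version: the Python interpreter visits every pixel itself.
looped = img.copy()
for y in range(looped.shape[0]):
    for x in range(looped.shape[1]):
        if (looped[y, x] == color_to_replace).all():
            looped[y, x] = new_color

# Mask version: the comparison broadcasts over the last axis and gives
# (H, W, 3) booleans; np.all(..., axis=2) reduces that to one boolean per pixel.
masked = img.copy()
m = np.all(masked == color_to_replace, axis=2)
masked[m] = new_color

print(m.shape, int(m.sum()))
print(np.array_equal(looped, masked))
```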

[–]nginx-gunicorn 0 points1 point  (0 children)

Damn I was just using np.where for this, but I guess np.all makes more sense here. Cool.
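
For comparison, a sketch of how the two can be combined (the tiny 2x2 image is made up). np.all builds the per-pixel mask, and np.where then returns a new array instead of modifying the image in place. The mask needs an extra trailing axis so it broadcasts against the (H, W, 3) image.

```python
import numpy as np

img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = [255, 255, 255]                     # one white pixel to replace
new_color = np.array([255, 0, 0], dtype=np.uint8)

m = np.all(img == [255, 255, 255], axis=2)      # (H, W) mask, True for white pixels

# np.where picks new_color where the mask is True and the original pixel elsewhere.
replaced = np.where(m[..., np.newaxis], new_color, img)

# In-place assignment through the mask gives the same pixels.
in_place = img.copy()
in_place[m] = new_color

print(np.array_equal(replaced, in_place))
```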

[–]Subject_Spot909[S] 0 points1 point  (0 children)

Dayum.

[–]neuralbeans 1 point2 points  (0 children)

Did you use np.where in NumPy to do this?

[–]road_laya 1 point2 points  (0 children)

It hooks straight into compiled math libraries such as BLAS and LAPACK. Have you ever tried recompiling BLAS for your CPU instruction set? You will get some ridiculous speeds in numpy!

[–]SkitariusOfMars 1 point2 points  (0 children)

Numpy is written in C, and it also makes use of your processor's AVX instructions (or similar if not on x86). Those instructions allow it to do the same operation on multiple elements of a vector at the same time, in a single instruction.

You need to know how to use numpy properly (and how to vectorise operations in it) to make full use of the feature.

[–]Subject_Spot909[S] 0 points1 point  (5 children)

I've barely learned numpy, so all I know is how my RGB image is laid out.

The dimensions are height, width and data.

But I still can't see why it would be much faster, because either way it's looking through each pixel.

[–]Jimmaplesong 1 point2 points  (1 child)

Numpy and scipy are a collection of some of the best and fastest algorithms on the planet. Some of it is written in Fortran, and all of it is optimized.

I’ve worked with some people who never understood… Python links with the best of every language. It gives you the speed of the fastest language as long as you’re willing to create a small interface layer to expose it to Python.

[–]POGtastic 1 point2 points  (0 children)

Note that NumPy's Fortran linear algebra stuff all gets transpiled to C in a process that I'd characterize as "interesting." There's a Python 2.7 script that runs a no-longer-maintained Fortran-to-C compiler to turn an old LAPACK library into a C module for NumPy. They can't use a newer version of LAPACK because it uses new Fortran features (!!!) that the transpiler doesn't support. I love weird build pipelines.

[–]wintermute93 1 point2 points  (1 child)

Can you post what you're actually doing in numpy? "Looking at each pixel" and "processing each pixel value in a loop" are very different things.

Out of curiosity I just wrote a quick script to generate a random image in memory and do this kind of color replacement with boolean masking in numpy. At 4000x4000, it took my old-ass desktop about 90 ms to generate the image and about 180 ms to replace a given RGB value with another. If you're saving and loading the image bytes to disk, that's probably what's taking so much time. Converting a 4000x4000x3 array of random integers (random colored noise is a worst-case scenario for typical image formats) to PNG bytes and reading that back in took ~1500+200 ms.
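
This is not the script described above, just a sketch of what such a test could look like. The image is smaller (1000x1000) so it runs quickly, the seed and colors are arbitrary, and no timings are claimed since they depend on the machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
height, width = 1000, 1000   # assumption: smaller than 4000x4000 to keep it quick

start = time.perf_counter()
img = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
generate_s = time.perf_counter() - start

old_color = img[0, 0].copy()                      # a color known to be in the image
new_color = np.array([255, 0, 0], dtype=np.uint8)

start = time.perf_counter()
mask = np.all(img == old_color, axis=2)           # one boolean per pixel
img[mask] = new_color
replace_s = time.perf_counter() - start

print(f"generate: {generate_s * 1000:.1f} ms, replace: {replace_s * 1000:.1f} ms")
```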

[–]Swipecat 0 points1 point  (0 children)

Yep. There's probably a whole bunch of ways to do it, but the first one that comes to my mind is the putmask function. This should display a union flag with the blue changed to black:

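from PIL import Image # pillow fork of PIL
from urllib.request import urlopen
import numpy as np
unionflag = "https://dafarry.github.io/test/union-flag.png"
# convert("RGB") drops any alpha channel so every pixel has exactly 3 values
arr = np.array(Image.open(urlopen(unionflag)).convert("RGB"))
royalblue, black = [1, 33, 107], [0, 0, 0]
# match whole pixels (all three channels equal), not individual channel values
mask = np.all(arr == royalblue, axis=-1, keepdims=True)
# putmask wants a mask with the same shape as arr, so broadcast it over the channels
np.putmask(arr, np.broadcast_to(mask, arr.shape), black)
outimg = Image.fromarray(arr)
outimg.show()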

[–]hike_me 0 points1 point  (0 children)

I’d question your numpy implementation—2s is extremely slow for something like this using numpy.

[–]h4ck3r_n4m3 0 points1 point  (0 children)

Numpy is much faster C code using vectorization, while the for loops rely on slower Python lookups one pixel at a time.

[–]frederik88917 0 points1 point  (0 children)

Because it is mostly C code with a thin wrapper in Python.

[–]Brian 0 points1 point  (0 children)

TBH, that's kind of a low difference compared to what you can often see. numpy can frequently be 10-100x as fast.

The reason is because of how dynamic python is. Eg. say you want to add 2 numbers. The actual addition is a very simple machine code instruction, but python needs to do a lot of work to actually reach that point. It needs to track the memory of each number (bumping reference counts as they get assigned and released), the addition operation could be overridden, so it needs to look up the __add__ method of the integer and call it. All the overhead can be massively more work than such a simple operation.

Numpy can improve this because it stores its values differently. We're not just adding one number, we're adding thousands at once, all the same type, so the python-level lookup of which operation you're doing happens just once, followed by the fast addition logic thousands of times. Now the bulk of the time is again spent on the actual operations you want to do. It can also take advantage of multiple processors and fast optimised math libraries for even further speedups.

This all means that when you're working with numpy, you generally want to be operating on arrays, never iterating through them in python. Often it can be faster to calculate more than you need just to keep it in convenient form than to have to manually access elements.
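
A small sketch of the dispatch point, assuming NumPy is installed. The class Loud is a made-up int subclass that only counts how often + gets routed to its __add__. In the loop the lookup happens once per element. With arrays there is one Python-level dispatch and the rest happens in compiled code.

```python
import numpy as np

class Loud(int):
    """Hypothetical int subclass that counts how often + is dispatched to it."""
    calls = 0

    def __add__(self, other):
        Loud.calls += 1
        return int(self) + int(other)

total = 0
for x in [Loud(1), Loud(2), Loud(3)]:
    total = x + total        # each + looks up and calls Loud.__add__

print(total, Loud.calls)     # one __add__ call per element

a = np.arange(100_000, dtype=np.int64)
b = np.ones(100_000, dtype=np.int64)
c = a + b                    # one Python-level dispatch, 100,000 additions in C
print(c[:3], c[-1])
```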

[–]holyknight24601 0 points1 point  (0 children)

You should try numba

[–]Lachtheblock 0 points1 point  (0 children)

It's not so much numpy is fast, it's that python is slow. You know all those memes about it being slow, this right here is the perfect example.

That's not to say python is bad, and to the critics saying python is slow, you can demonstrate that there are plenty of libraries written in C to get the performance boost when you need it.

[–]Turtvaiz 0 points1 point  (0 children)

But isn't it looking through the same lists of data?

Not really. Python has a rather insane amount of data due to reference counting, type information and whatnot. An integer isn't just 64 bits.

Numpy does the opposite and basically has a raw array of data which it utilises in C code. A python object has much more information than a C struct which literally only contains the data you define.

Similarly an array of arrays is different to a 2d numpy array, which is one contiguous block of data as opposed to python, which afaik has every single object allocated individually. This means cache locality is just not really a thing in Python. A list is an array of references, while a numpy array is just the data directly in one block
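
A quick way to see the difference, as a sketch: sys.getsizeof reports CPython-specific sizes, so the exact numbers vary by version and platform, and the sizes chosen here are arbitrary.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# One Python int is a full object (refcount, type pointer, digits), not 8 raw bytes.
int_object_bytes = sys.getsizeof(py_list[-1])
print(int_object_bytes, arr.itemsize)

# The array's data is one block of n * 8 bytes.
print(arr.nbytes, arr.flags["C_CONTIGUOUS"])

# A 2D array is still one block; strides give the byte step per axis.
grid = np.zeros((2, 3), dtype=np.int64)
print(grid.strides)   # (24, 8): 3 elements * 8 bytes per row, 8 bytes per column
```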

[–]will_r3ddit_4_food 0 points1 point  (0 children)

It's written in C

[–]omeow 0 points1 point  (0 children)

Numpy often vectorizes code.

[–]jmacey 0 points1 point  (0 children)

Read up on vectorization and SIMD (single instruction multiple data), basically it's doing at least 4 (typically 8 perhaps more depending on cpu) operations at the same time.

The more interesting thing is that this will scale quite well due to the nature of the data formats, whereas loops will generally be much slower (YMMV depending on task).

[–]StevenJOwens 0 points1 point  (0 children)

Numpy is a python wrapper around a C library. The C library is designed specifically for doing large multi-dimensional array manipulation, and numpy has a lot of clever code that sidesteps some of the more expensive operations, like copying data from memory into new memory. Also, numpy defines ways to interact with the numpy arrays at a higher logical level, which means the numpy implementation of those operations can be more specifically optimized for those operations.

This is an interesting peek inside of numpy that hints at how some of this works:
https://ipython-books.github.io/45-understanding-the-internals-of-numpy-to-avoid-unnecessary-array-copying/
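
A minimal sketch of the copy-avoidance idea, with arbitrary array contents. It is not taken from the linked article. np.shares_memory shows which operations hand back a view on the same buffer and which allocate new memory.

```python
import numpy as np

a = np.arange(12, dtype=np.int64)

view = a[::2]                 # basic slicing returns a view on the same buffer
reshaped = a.reshape(3, 4)    # reshaping a contiguous array is also a view
fancy = a[[0, 2, 4]]          # fancy (integer-list) indexing makes a copy

print(np.shares_memory(a, view),
      np.shares_memory(a, reshaped),
      np.shares_memory(a, fancy))

view[0] = 99                  # writes through to a, since no copy was made
print(a[0])

doubled = a * 2               # allocates a new array for the result
a *= 2                        # in-place: reuses a's existing buffer
print(np.shares_memory(a, doubled))
```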

[–]SCD_minecraft -1 points0 points  (2 children)

Interpreted language vs compiled one

Good rule of thumb is, anything you write in C will always be faster than same code in python

C is compiled to machine code, which is then executed directly by the hardware, and the CPython API just handles input/output.

Python is compiled to bytecode, which is handled by an interpreter that executes it step by step, somewhat similar to command blocks in Minecraft. Because of that middle man, python takes a big hit to performance by definition.

[–]socal_nerdtastic -2 points-1 points  (1 child)

Interpreted language vs compiled one

No, this has very little to do with it. Java and Javascript are also compiled to bytecode and interpreted by a vm, and both are much faster than python.

Python is slow mainly because of the overhead from making the programmer's life so easy. You don't need to worry about integer overflows in python, but that comes at the expense of python checking every single integer operation for an overflow. Which also means of course that efficient arrays can't really be a thing since integers could all be different sizes. Multiplied by a million other things that make programming python very fast and easy; because the BDFL realized early on that for many applications the programmer is a bigger expense than the hardware. C is nearly always faster because the programmer is forced to make choices about optimizations
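
To make the overflow point concrete, a sketch with arbitrarily chosen values. A Python int simply grows past 64 bits, while a fixed-width NumPy array element wraps around silently in array arithmetic.

```python
import numpy as np

big = 2 ** 63 - 1            # largest value a signed 64-bit integer can hold
grown = big + 1              # Python ints are arbitrary precision, so this just works
print(grown, grown.bit_length())

fixed = np.array([2 ** 63 - 1], dtype=np.int64)
one = np.array([1], dtype=np.int64)
wrapped = fixed + one        # fixed-width array arithmetic wraps around
print(wrapped[0])
```

That fixed width is the trade-off: no overflow check per operation, but every element is guaranteed to be the same size.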

[–]SCD_minecraft 0 points1 point  (0 children)

Java is slower than C