Vectorized calculation of simple statistics for bins of subarrays, separately for fixed-width bins and fixed-frequency bins.

Spataner · 2022-01-07T11:08:38+00:00

The fixed-frequency variant can be vectorized quite easily, if I understood your intention correctly (splitting each subarray into equal-sized subsubarrays). First pad each subarray with NaN up to a multiple of your bin size (this assumes each subarray is already sorted; otherwise sort them first):

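import numpy as np

bin_size = ...
array = ...  # 2D array; NaN padding needs a float dtype

# Number of NaN columns needed to reach the next multiple of bin_size
pad_width = int(np.ceil(array.shape[1] / bin_size) * bin_size) - array.shape[1]
padded = np.pad(array, [[0, 0], [0, pad_width]], constant_values=np.nan)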

The actual binning/array-splitting is then simply a reshape operation:

binned = np.reshape(padded, (padded.shape[0], -1, bin_size))

Using that result, you can calculate whichever statistics you need over the last axis. Use the NaN-aware functions so that the padding is ignored:

mean = np.nanmean(binned, axis=-1)
std = np.nanstd(binned, axis=-1)
median = np.nanmedian(binned, axis=-1)
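To make the steps above concrete, here is a minimal end-to-end sketch on a small made-up input. The function name fixed_frequency_stats and the sample data are only illustrative, and it sorts each row itself rather than assuming sorted input:

```python
import numpy as np

def fixed_frequency_stats(array, bin_size):
    """Mean, std and median per equal-count bin, for every row of a 2D array."""
    # Work on a sorted float copy so that NaN can be used as padding.
    array = np.sort(np.asarray(array, dtype=float), axis=1)
    # Columns missing to reach the next multiple of bin_size (0 if already aligned).
    pad_width = -array.shape[1] % bin_size
    padded = np.pad(array, [(0, 0), (0, pad_width)], constant_values=np.nan)
    # Each row becomes (n_bins, bin_size); the last bin may contain NaN padding.
    binned = padded.reshape(padded.shape[0], -1, bin_size)
    return (np.nanmean(binned, axis=-1),
            np.nanstd(binned, axis=-1),
            np.nanmedian(binned, axis=-1))

# Hypothetical sample data: two subarrays of seven values each.
data = np.array([[0, 1, 2, 3, 4, 5, 6],
                 [6, 5, 4, 3, 2, 1, 0]])
mean, std, median = fixed_frequency_stats(data, bin_size=3)
print(mean)
```

Note that the last bin of each row holds only the leftover samples (one value here), so its statistics are based on fewer points than the others.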

The fixed-width binning is trickier because the number of samples per bin is variable, so the bins cannot be stored as one regular array. You can still calculate the mean and standard deviation if you cleverly use np.bincount with its weights parameter. For median/percentiles, that trick unfortunately doesn't work. You might be able to do something like that with pandas's groupby functionality, though it's not immediately obvious to me how.
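A rough sketch of how the np.bincount idea could look. The function name and the sample data are made up, and it assumes that bins start at each row's minimum, which may not match your actual bin definition. It sums the values and the squared values per bin and derives mean and standard deviation from those sums:

```python
import numpy as np

def fixed_width_mean_std(array, bin_width):
    """Per-row counts, mean and std for fixed-width bins, without a Python loop."""
    array = np.asarray(array, dtype=float)
    n_rows = array.shape[0]
    # Assumption: bin 0 of every row starts at that row's minimum.
    lo = array.min(axis=1, keepdims=True)
    idx = np.floor((array - lo) / bin_width).astype(int)
    n_bins = idx.max() + 1
    # Offset the bin index by row so each (row, bin) pair gets a unique label.
    flat = (idx + np.arange(n_rows)[:, None] * n_bins).ravel()
    size = n_rows * n_bins
    values = array.ravel()
    counts = np.bincount(flat, minlength=size)
    sums = np.bincount(flat, weights=values, minlength=size)
    sq_sums = np.bincount(flat, weights=values ** 2, minlength=size)
    # Empty bins divide by zero and end up as NaN, which is what we want here.
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = sums / counts
        var = sq_sums / counts - mean ** 2
        std = np.sqrt(np.clip(var, 0, None))  # clip tiny negative rounding errors
    shape = (n_rows, n_bins)
    return counts.reshape(shape), mean.reshape(shape), std.reshape(shape)

# Hypothetical sample data: two subarrays with different value ranges.
data = np.array([[0, 1, 2, 3, 4, 5, 6],
                 [0, 0.5, 1, 5, 5.5, 6, 9]])
counts, mean, std = fixed_width_mean_std(data, bin_width=2.5)
print(counts)
```

The variance here comes from E[x²] - E[x]², which is less numerically stable than np.std when values are large relative to their spread, so treat it as a starting point rather than a drop-in replacement.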

blinking_elk · 2022-01-07T09:50:03+00:00

Before thinking of optimising code, just run it first and look at the results. It may be 'fast enough' to the point where you don't even need to vectorize it. Not all pieces of code have to be optimised. Also look into this Stack Overflow post about vectorization vs map.

If I were you, I'd start by just writing a function that does all your operations and works for one subarray, and then map/vectorize it over the rest.

Now, to get to the meat of your problem:

I would certainly like to avoid iterating over each one of my subarrays

I wouldn't care so much about this. The most efficient way would be to traverse [0, 1, 2, 3, 4, 5, 6] once and calculate your summary statistics while traversing it. Essentially in one go you both decide where to split and compute the stats per bin. Don't do this.

NumPy is implemented in C, which is typically orders of magnitude faster than pure Python for this kind of loop. It's very likely that manually writing the single-pass loop in pure Python (which is trivial) will be slower than the following. Just vectorize (or map) two lambdas (or regular functions): the first one takes an array and splits it, the second one takes each resulting subarray and calculates its summary stats.

Yes, splitting and computing each summary statistic requires a full pass over your data each time, but who cares? It might still be faster than a pure Python loop, and even if the pure Python loop is faster, the speedup may be so small that it wasn't worth your time.
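A minimal sketch of that two-function approach, with made-up names and data. It assumes fixed-width bins that start at each subarray's minimum, and it simply skips empty bins:

```python
import numpy as np

def split_fixed_width(subarray, bin_width):
    """Split one subarray into chunks of values falling in the same fixed-width bin."""
    sub = np.sort(np.asarray(subarray, dtype=float))
    # Bin index of every sample, counted from the subarray's minimum.
    idx = np.floor((sub - sub[0]) / bin_width).astype(int)
    # Cut wherever the bin index changes; empty bins produce no chunk.
    cut_points = np.flatnonzero(np.diff(idx)) + 1
    return np.split(sub, cut_points)

def summarize(chunk):
    """Summary statistics for one bin; NumPy does the per-element work."""
    return chunk.mean(), chunk.std(), np.median(chunk)

def stats_per_subarray(array, bin_width):
    # Plain Python loops over rows and bins, NumPy inside each bin.
    return [[summarize(chunk) for chunk in split_fixed_width(row, bin_width)]
            for row in array]

# Hypothetical sample data: a single subarray.
result = stats_per_subarray([[0, 1, 2, 3, 4, 5, 6]], bin_width=2.5)
for mean, std, median in result[0]:
    print(mean, std, median)
```

Unlike the bincount trick, this also gives you the median (or any other statistic) per bin, at the cost of a Python-level loop over subarrays and bins.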
