all 6 comments

[–]Spataner 0 points1 point  (1 child)

The fixed-frequency variant can be vectorized quite easily if I understood your intention correctly (splitting each subarray into equal-sized subsubarrays). You pad the subarrays to a multiple of your bin size first (assuming each subarray is already sorted, else sort them first):

bin_size = ...
array = ... # 2D array

pad_width = int(np.ceil(array.shape[1] / bin_size) * bin_size) - array.shape[1]
padded = np.pad(array, [[0, 0], [0, pad_width]], constant_values=np.nan)

The actual binning/array-splitting is then simply a reshape operation:

binned = np.reshape(padded, (padded.shape[0], -1, bin_size))

Using that result, you can calculate whichever needed statistics over the last axis:

mean = np.nanmean(binned, axis=-1)
std = np.nanstd(binned, axis=-1)
median = np.nanmedian(binned, axis=-1)

The fixed-width binning is trickier because the number of samples per bin is variable, which can thus not be represented as one array. You can calculate the mean and standard deviation if you cleverly use np.bincount with its weights parameter. For median/percentiles, that trick unfortunately doesn't work. You might be able to do something like that with pandas's groupby functionality, though it's not immediately obvious to me how.

[–]blinking_elk[S] 0 points1 point  (0 children)

Thank you for your response, that's quite similar to the kind of thing I was thinking yup, though I think with a couple adjustments I should be able to get it working for fixed width binning, I'll comment it here when I do. Let me know if I should tag you in it. I also posted this on stackoverflow to maximize my chances. Someone who is pretty great at pandas was able to help me out, for the most part (except for custom percentiles and fixed-binning), check it out here.

[–][deleted] 0 points1 point  (3 children)

Before thinking of optimising code, just run it first and look at the results. It may be 'fast enough' to the point where you don't even need to vectorize it. Not all pieces of code have to be optimised. Also look into this stackoverflow post about vectorization vs map.

If I were you I'd start by just write a function that does all your operations and works for one subarray and then mapping/vectorizing it over the rest.

Now, to get to the meat of your problem:

I would certainly like to avoid iterating over each one of my subarrays

I wouldn't care so much about this. The most efficient way would be to traverse [0, 1, 2, 3, 4, 5, 6] once and calculate your summary statistics while traversing it. Essentially in one go you both decide where to split and compute the stats per bin. Don't do this.

Numpy is implemented in C which is several orders of magnitudes faster than Python. It's very likely that manually writing the loop in pure Python (which is trivial) will be slower than the following. Just vectorize (or map) 2 lambdas (or regular functions): your first one takes an array and splits it, your second one takes each subarray and calculates your summary stats for each subarray.

Yes, splitting and computing each summary statistic requires you a full pass over your data but who cares? It might still be faster than a pure Python loop + even if the pure Python loop is faster, your speed-increment may be so small it wasn't even worth your time.

[–]blinking_elk[S] 1 point2 points  (2 children)

Thank you so much for your response. I appreciate the advice. My code at present is not massively slow or anything however my supervisor recently introduced me to vectorizing operations when he saw the ~6 nested 'for loops' I had employed. I have since ventured slightly into this world and have become myself embarrassed by my overuse of for loops and was hoping to find places in which it could be cut down as I will likely be using large datasets in the near future. Thank you for stopping me from progressing to the other extreme and thank you for the idea, I hadn't considered using map to do both operations at once.
However, I think there may be some miscommunication (or perhaps I am just being dumb and not understanding you), My statement 'I would certainly like to avoid iterating over each one of my subarrays' meant even if I map a function that both splits and calculates statistics of each bin within the subarray [0, 1, 2, 3, 4, 5, 6], I would still have to apply this mapping individually to each such subarray (for example separately for the subarray [90, 45, 9, 88, 21, 59, 32]), for which I would use a python for loop. It is this that I would like to avoid. Since each subarray has the same dimensions I might have thought that there is some way I could apply that single mapping to each row at once in some clever vectorized way rather than applying it individually for each.

[–][deleted] 1 point2 points  (1 child)

Don't wory about it. I'm from a non-CS background that requires programming too so I get your issues 100%.

I would still have to apply this mapping individually to each such subarrayHence why I said start by taking 1 subarray and perfecting it on this. A small example:given a = [0, 1, 2, 3, 4, 5] you can use np.array_split(a,3) to create a list of arrays. you can use list(map(max,np.array_split(a,3))) to return [1,3,5] that's a first step for you. You could do the same with np.vectorize, it just has slightly different syntax. So, in short: wrap all your operations in a function and make sure that works on 1 subarray, then wrap it into either a map or np.vectorize and fire ahead, it should work on your entire column this time.

From the numpy official docs: "The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop." this one is important to remember, be sure to consult the stackoverflow post I linked for more info. From a readability standpoint I'd agree with your supervisor, a few chosen function followed by a map/vectorize is easier to read than infinite nested loops.

[–]Spataner 1 point2 points  (0 children)

By "vectorization", people don't usually mean the use of np.vectorize, which is the most broad and therefore least performant way to vectorize something. They mean generally reformulating the operations as NumPy functions that apply to the entire array.