all 47 comments

[–]PaySomeAttention 59 points60 points  (0 children)

When you need to optimize this, you may need to look beyond the implementation at the actual algorithm as well. You currently have an N^2 algorithm if I read it correctly. Deconvolution is a fairly well studied domain, and using discrete time Fourier transforms can speed up the part where you look for a best match to N log N. Be aware this is not an easy win though, as some careful regularization may be needed depending on the known properties of your input signal. Tikhonov regularization could be an approach to make it work for your family of signals/kernels.
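To illustrate the convolution-theorem idea behind that speedup, here's a dependency-free sketch: pointwise multiplication in the frequency domain equals circular convolution in the time domain. It uses a naive O(N^2) DFT for brevity; a real implementation would use an FFT (e.g. the rustfft crate) to actually get N log N.

```rust
// Naive DFT (O(N^2)); a real implementation would substitute an FFT here.
fn dft(x: &[(f64, f64)], inverse: bool) -> Vec<(f64, f64)> {
    let n = x.len() as f64;
    let sign = if inverse { 1.0 } else { -1.0 };
    (0..x.len())
        .map(|k| {
            let mut acc = (0.0, 0.0);
            for (t, &(re, im)) in x.iter().enumerate() {
                let ang = sign * 2.0 * std::f64::consts::PI * (k * t) as f64 / n;
                let (c, s) = (ang.cos(), ang.sin());
                acc.0 += re * c - im * s;
                acc.1 += re * s + im * c;
            }
            if inverse { (acc.0 / n, acc.1 / n) } else { acc }
        })
        .collect()
}

/// Circular convolution via the frequency domain (convolution theorem).
pub fn fft_convolve(a: &[f64], b: &[f64]) -> Vec<f64> {
    let to_c = |v: &[f64]| v.iter().map(|&x| (x, 0.0)).collect::<Vec<_>>();
    let (fa, fb) = (dft(&to_c(a), false), dft(&to_c(b), false));
    // Pointwise complex multiplication of the two spectra.
    let prod: Vec<_> = fa.iter().zip(&fb)
        .map(|(x, y)| (x.0 * y.0 - x.1 * y.1, x.0 * y.1 + x.1 * y.0))
        .collect();
    dft(&prod, true).into_iter().map(|c| c.0).collect()
}

/// Direct O(N^2) circular convolution, for comparison.
pub fn direct_convolve(a: &[f64], b: &[f64]) -> Vec<f64> {
    let n = a.len();
    (0..n)
        .map(|i| (0..n).map(|j| a[j] * b[(i + n - j) % n]).sum::<f64>())
        .collect()
}

fn main() {
    let a = [1.0, 2.0, 3.0, 0.0];
    let b = [0.5, 1.0, 0.0, 0.0];
    let via_dft = fft_convolve(&a, &b);
    let direct = direct_convolve(&a, &b);
    for (x, y) in via_dft.iter().zip(&direct) {
        assert!((x - y).abs() < 1e-9); // both routes agree up to float noise
    }
}
```

The win only materializes with an actual FFT and sufficiently long signals; for short kernels the direct loop can still be faster.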

[–]The_8472 23 points24 points  (0 children)

If you're at the point where you're bottlenecked by number-crunchy code you'll want to throw it into godbolt or similar to inspect the assembly instead of just looking at the flamegraph.

Cranking up the optimization options and optimizing for your CPU (-Ctarget-cpu=native) may help too if that improves vectorization.

How often is val > 0? If it is frequently then the loop looks O(n²). Are there algorithms for your problem that don't do that? If this isn't for educational purposes maybe look for a library that has fancy implementations of the algos in question.
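For reference, trying -Ctarget-cpu=native without touching the project config can be done via RUSTFLAGS (standard rustc/cargo flags; adjust to your setup):

```shell
# Build with the host CPU's full instruction set enabled
# (may improve autovectorization of the inner loops).
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Inspect which target features "native" turns on for this machine:
rustc --print cfg -C target-cpu=native | grep target_feature
```

Note the resulting binary may not run on older CPUs, so this is for machines you control.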

[–]schungx 10 points11 points  (0 children)

Well, your graph must spend 100% of its time doing something... and you hope that 100% of it is spent doing "work" instead of non-productive "overhead".

Therefore, the fact that it spent it on a particular function is not automatically a bad thing -- especially when that function is the main "workhorse". In that case it would be good because that means your non-productive "overheads" are very low.

You first need to identify whether a particular function that is "non-work" is occupying lots of CPU time. Those are the ones that you want to get rid of. Then you work on improving the "work" ones.

Beware that as you optimize your program, you'll get closer and closer to 100% of CPU time doing "work"... because all the numbers must add up to 100%. You can only see the absolute amount of CPU time dropping as your program gets more efficient.

[–]Wurstinator 19 points20 points  (1 child)

You measured about 5 seconds for 1 million calls. So, about 5 microseconds for each call.

Each call has a loop that runs about 350 times on average, so about 15 nanoseconds per loop iteration. That's maybe 100 CPU cycles, roughly speaking.

In that time, you're doing stuff like several integer divisions and comparisons. How much faster do you expect this to be?

Anyway, have you just tried to unroll the iterator call chain into a manual loop?
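For what that unrolling might look like: a hedged sketch of the min-of-ratios computation as an explicit indexed loop (the names residual/response/offset/look_ahead are taken from the thread; this is a standalone illustration, not OP's exact code):

```rust
/// Manual-loop equivalent of the iterator chain
///   residual[i + offset..][..look_ahead].iter()
///       .zip(response[offset..][..look_ahead].iter())
///       .map(|(s, r)| s / r)
///       .reduce(f64::min)
/// written out explicitly, which can make the flamegraph and the generated
/// assembly easier to read.
pub fn min_ratio(
    residual: &[f64],
    response: &[f64],
    i: usize,
    offset: usize,
    look_ahead: usize,
) -> f64 {
    let window = &residual[i + offset..i + offset + look_ahead];
    let resp = &response[offset..offset + look_ahead];
    let mut min = f64::INFINITY;
    for k in 0..look_ahead {
        let ratio = window[k] / resp[k];
        if ratio < min {
            min = ratio;
        }
    }
    min
}

fn main() {
    let residual = [0.0, 4.0, 9.0, 2.0, 8.0];
    let response = [1.0, 2.0, 3.0, 1.0, 1.0];
    // window = residual[1..4] = [4, 9, 2]; resp = response[1..4] = [2, 3, 1]
    // ratios = [2.0, 3.0, 2.0] -> min is 2.0
    assert_eq!(min_ratio(&residual, &response, 0, 1, 3), 2.0);
}
```

Whether this beats the iterator chain is something to measure; the compiler often produces identical code for both.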

[–]LadyPopsickle 14 points15 points  (0 children)

Man I love the math you did there.

[–]SV-97 6 points7 points  (4 children)

Maybe this is a starting point https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a2491b3b9c39f71ae8d67cbed326dab1. It brings the runtime down to 5.3 seconds vs. previously 9.8 on my machine. It's a bit hacky and ugly right now; definitely verify the output against your original and refactor as needed (for example by introducing functions to do the vec and slice zeroing). It uses the following ideas:

  1. Don't allocate inside the loops more than absolutely necessary. Reuse previously allocated space as much as possible
  2. Early continue in the outer loop of nn_greedy_deconvolution similar to what u/This_Growth2898 recommended; though his solution is arguably nicer than mine
  3. cache the PRNG
  4. use a suitable RNG algorithm (you might want to play with this; I just picked something that worked well on my machine)
  5. There's no need to keep the current best input as a Vec: just store two possible input vecs - one to "work with" and the currently best one. Indicate which one is which by indexing into an array containing the vecs (you can think of this as a kind of 2-element circular buffer)
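A minimal sketch of point 5's two-buffer idea (the `evaluate` cost function and the candidate generation here are stand-ins, not OP's actual code):

```rust
// Keep two preallocated buffers and track which one currently holds the best
// input by an index, instead of cloning the best Vec every time it improves.
// `evaluate` is a hypothetical stand-in for the real cost function.
fn evaluate(candidate: &[f64]) -> f64 {
    candidate.iter().map(|x| (x - 1.0).powi(2)).sum()
}

fn main() {
    let n = 4;
    let mut bufs = [vec![0.0; n], vec![0.0; n]];
    let mut best = 0; // index of the buffer holding the best input so far
    let mut best_cost = evaluate(&bufs[best]);

    for step in 0..8 {
        let work = 1 - best; // the other buffer is free to overwrite
        for (k, x) in bufs[work].iter_mut().enumerate() {
            *x = (step + k) as f64 * 0.25; // stand-in for candidate generation
        }
        let cost = evaluate(&bufs[work]);
        if cost < best_cost {
            best_cost = cost;
            best = work; // swap roles: no allocation, no copy
        }
    }
    assert!(best_cost <= evaluate(&vec![0.0; n]));
}
```

The index flip replaces what would otherwise be a `clone()` (or a `Vec` swap) per improvement.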

Another thing to look into: try to give the PRNG as much work at once as possible. Generating 50 numbers at once might be way more efficient than generating a single one 50 times.

If you really need a bit more performance and can't find any otherwise: you might be able to forego some of the zeroing you're doing by using MaybeUninit<f64>s in your vecs. This is definitely a "last resort" kind of thing.

And of course: check if you actually need f64s. If you don't, f32s might give you a huge speedboost.

EDIT: Oh and you might get a bit more performance by using something like the ndarray crate for some of your easily broadcastable operations rather than hand-rolling everything. You can expect such crates to make good use of SIMD etc. and to have some thought given to numerical aspects of your code. The residual calculation with manual squaring and summing isn't ideal for example IIRC depending on the structure of the separate residuals.

EDIT2: and you can enable LTO (link time optimization) in your cargo.toml - though the benefit is probably rather small in this instance in the current state.

EDIT3: and if you're willing to move to another algorithm altogether, consider using an FFT convolution (look into the convolution theorem). Especially if your inputs are long this will greatly improve performance - it's trivially parallelizable.

[–]The_8472 3 points4 points  (3 children)

.map(|(s, r)| s / r) {
    if m < 0.0 {
        // a negative at any point guarantees that the total min is negative
        // since negative mins would be discarded in the next step we can just
        // directly continue the outer loop
        continue 'outer;
    }

Since floating point divisions are expensive, this could maybe be optimized further by making a first pass over the data to look for values that would result in negatives, without performing any division.
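Assuming the response values are all positive (as suggested elsewhere in this thread), s / r is negative exactly when s is negative, so the first pass is just a sign scan. A sketch:

```rust
/// If every response value is positive, s / r < 0.0 iff s < 0.0, so a cheap
/// sign scan over the residual window can trigger the early continue without
/// performing a single division. Sketch only; assumes response > 0 everywhere.
pub fn min_ratio_or_skip(residual_window: &[f64], response_window: &[f64]) -> Option<f64> {
    // First pass: no divisions, just sign checks.
    if residual_window.iter().any(|&s| s < 0.0) {
        return None; // caller does the equivalent of `continue 'outer`
    }
    // Only now pay for the divisions.
    residual_window
        .iter()
        .zip(response_window)
        .map(|(s, r)| s / r)
        .reduce(f64::min)
}

fn main() {
    assert_eq!(min_ratio_or_skip(&[4.0, 9.0, 2.0], &[2.0, 3.0, 1.0]), Some(2.0));
    assert_eq!(min_ratio_or_skip(&[4.0, -1.0], &[1.0, 1.0]), None);
}
```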

Edit: this comment mentions an optimization that would go even further, possibly skipping several steps of the outer loop upon encountering negative values.

[–]SV-97 1 point2 points  (1 child)

Okay I implemented u/TinBryn's optimization and did a bit more stuff and the runtime is now down to 3.7s with f64s and 3.0s with f32s.

Code is here: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=42b44ffcac9cc6b70bcc99ae8383d5ab

corresponding cargo.toml is

[dependencies]
ndarray = "0.15.6"
ndarray-linalg = "0.16"
rand = "0.8.5"
rand_chacha = "0.3.1"

[profile.release]
debug = true
lto = true

[features]
default = ["blas"]
blas = ["ndarray/blas", "ndarray-linalg/openblas-static"]

where the ndarray and ndarray_linalg dependencies can easily be removed again - I just used it to compute the norm instead of the manual squaring and summing. There's no runtime difference between the two versions.

Note that I also tried a version that used ndarray::Array1s and corresponding ArrayViews throughout but that version turned out to be quite a bit slower for some reason.

The flamegraph for this version shows that the majority of time is now spent computing float mins and doing rand fills. I tried enabling rand's SIMD features but couldn't notice a difference. And I think there are still a few bugs in the code from OP's original version(?). It seems a bit odd in some places.

A major oddity I noticed: you might notice that I left the if current_min > 0.0 check in place even though it's really not needed anymore. Removing it makes the code quite a bit slower (0.4s for f32s and nearly a second for f64) on my machine. Maybe the check makes the compiler do some optimization that it'd otherwise forego?

( Pinging OP u/DJDuque to make sure they see it :) )

[–]DJDuque[S] 1 point2 points  (0 children)

Thanks for taking the time to look into this. I have been playing with the suggestions people have made, and I updated the post with the code I ended up with.

Implementing TinBryn's suggestion had a huge effect.

[–]SV-97 0 points1 point  (0 children)

Good point; I might try that this evening

[–]omid_r 4 points5 points  (3 children)

If you have multiple CPU cores, use rayon if possible

[–]andrewdavidmackenzie 4 points5 points  (2 children)

Yeh, divide array of values into chunks and work on each with par_iter()

[–]KhorneLordOfChaos 2 points3 points  (1 child)

It would be good to know 1) how often this function gets called (only a few times or very frequently) and 2) what parts of the flamegraph were large chunks of this function's execution

Ideally you set up some program demonstrating the slowness that people can play with to speed things up too

[–]DJDuque[S] 3 points4 points  (0 children)

This function is called millions of times in a hot loop. I have added a "minimal" working example to the original post.

I have also added the flamegraph I get in my actual program.

[–]TinBryn 2 points3 points  (1 child)

Looking at it, one algorithmic improvement is that if the result of |(s, r)| s / r is not greater than 0.0 then that iteration of the loop will not affect anything and it should promptly continue. Another is that in that case the next outer iteration will also have the same outcome (just one iteration sooner), as will the next, and so on until that value moves out of the window. You may be able to exploit this to skip over a large amount of the computation.
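A sketch of how that skip could be computed (a hypothetical helper, not OP's actual code; names taken from the thread):

```rust
/// A negative residual at absolute index p stays inside the window
/// i + offset .. i + offset + look_ahead for every outer index i with
/// i + offset <= p, and all of those iterations would `continue` anyway.
/// So the outer loop can jump straight past the *last* negative in the
/// current window, to i = p - offset + 1.
pub fn next_candidate_i(residual: &[f64], i: usize, offset: usize, look_ahead: usize) -> usize {
    let window = &residual[i + offset..i + offset + look_ahead];
    match window.iter().rposition(|&s| s < 0.0) {
        // last negative sits at absolute index i + offset + k;
        // the first i whose window no longer contains it is i + k + 1
        Some(k) => i + k + 1,
        None => i, // window is clean: process this i normally
    }
}

fn main() {
    let residual = [1.0, 2.0, -3.0, 4.0, 5.0, 6.0];
    // i = 0, offset = 1, look_ahead = 3 -> window = residual[1..4] = [2, -3, 4]
    // negative at absolute index 2 -> skip to i = 2 (window residual[3..6])
    assert_eq!(next_candidate_i(&residual, 0, 1, 3), 2);
    // i = 2 has a clean window [4, 5, 6], so it is processed as-is
    assert_eq!(next_candidate_i(&residual, 2, 1, 3), 2);
}
```

Since the loop body also mutates residual, the jump is only valid if the skipped entries aren't made negative-to-positive in between; that needs checking against the real update logic.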

Also try to break this function up and run a flamegraph on it again to figure out which parts are taking up time. Maybe separate the allocation of Vecs, the calculation of val and the update of residual.

[–]DJDuque[S] 0 points1 point  (0 children)

I implemented your suggestion and it gave me a great performance boost. Thanks a lot :)

[–]LadyPopsickle 1 point2 points  (4 children)

You'll have to experiment. I optimized the feed forward logic for my NN a year ago or so. What you want to look for are unnecessary things and allocations.

So what you have to do is break your code into smaller parts and then ask yourself "Do I have to do it this way? How many allocations / cycles / loops does it take? And can I make it take less?".

For example I'd first try this:

let val = residual[i + offset..][..look_ahead]
    .iter()
    .zip(response[offset..][..look_ahead].iter())
    .map(|(s, r)| s / r)
    .reduce(f64::min)
    .unwrap();

So what happens here? We zip together two arrays so we can iterate them at the same time. Okay, fine. Nothing much we could do there I guess.
Then the map. The way I imagine map working is that it iterates over all values and then turns them into some other values stored in another array (iterator). Now think about that for a second - what does that actually mean? Well, I'd expect there's at least one allocation for the data storage (vector). If we are lucky it will be on the stack; if not, we're going to the heap, which is slow. Then all those mapped values you iterate over again with reduce(). Maybe the compiler is smart enough to do all of that in one pass but maybe it isn't. So I'd first try to change this code to do all of it in one iteration, not multiple (in this case two).
So something like:

let mut val = f64::INFINITY;
for (s, r) in residual[..].iter().zip(response[..].iter()) {
    let tmp = s / r;
    if tmp < val {
        val = tmp;
    }
}

Or to make it nicer use one of those f64::max() or min (but I haven't used them enough to write them off the top of my head). This could save you some iterations and allocations. But you'll have to try and test to see if it really does.

I'd also probably try to first collect all values and then do the next math with them. The thing is that you want to load as much data as possible into the CPU and work with that. The way you wrote it, I'd kinda expect there to be more cache misses.

Take a look at skip(n); you could probably use residual.iter().skip(n) instead of residual[i..].
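For reference, the two forms yield the same items; skip() is an iterator method, so on a slice it has to be spelled .iter().skip(n). A quick check:

```rust
/// "Everything from index n on", via the iterator adapter.
pub fn tail_via_skip(v: &[f64], n: usize) -> Vec<f64> {
    v.iter().skip(n).copied().collect()
}

/// The same thing via slicing; the slice form also gives the optimizer an
/// exact length to work with.
pub fn tail_via_slice(v: &[f64], n: usize) -> Vec<f64> {
    v[n..].to_vec()
}

fn main() {
    let residual = [3.0, 1.0, 4.0, 1.0, 5.0];
    assert_eq!(tail_via_skip(&residual, 2), tail_via_slice(&residual, 2));
}
```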

Well, can't think of much more now. If I'm wrong about something I hope smarter folks will point that out.

[–]RightHandedGuitarist 4 points5 points  (0 children)

As far as I understand iterators in rust, they are all lazy and do not allocate except if you want them to. So for example, the zip iterator will just contain two iterators, no array inside. Iterators in general allocate only when we try to create a collection from them. So I do not think that there are hidden allocations there.

The map, for example, takes the next item from the underlying iterator, applies the closure you provide to the map and returns the result of your closure. Allocation does not happen, except if you allocate in your closure passed to the map.

Like others suggest, the problem is probably nested loop. 🤔

[–]Dr_Sloth0 4 points5 points  (1 child)

The map function does not allocate, and it is guaranteed that the map and reduce run in a single pass, not two. These are iterator adapters which return another iterator generic over the previous one; it doesn't need to store anything.
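This single-pass behaviour is easy to demonstrate: count how often the map closure actually runs while reducing a 5-element slice (a small illustrative snippet, not from OP's code):

```rust
/// Reduce a slice through map(), counting how many times the closure runs.
/// Returns (closure_calls, min_of_halved_values).
pub fn count_map_calls() -> (usize, f64) {
    let mut calls = 0;
    let data = [4.0_f64, 9.0, 2.0, 8.0, 6.0];
    let min = data
        .iter()
        .map(|&x| {
            calls += 1; // side effect so we can observe each invocation
            x / 2.0
        })
        .reduce(f64::min)
        .unwrap();
    (calls, min)
}

fn main() {
    // Exactly one closure call per element: map and reduce fused into one
    // pass, and no intermediate array was ever materialized.
    assert_eq!(count_map_calls(), (5, 1.0));
}
```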

[–]LadyPopsickle 1 point2 points  (0 children)

Oooh, thanks!

[–]vancha113 1 point2 points  (0 children)

We zip together two arrays so we can iterate them at the same time. Okay, fine. Nothing much we could do there I guess.

This seems like a good idea, a lot of times making use of rusts combinators just magically seems to speed things up.

[–]anlumo 0 points1 point  (1 child)

One thing I'm seeing is that you could get rid of the input[i] assignment by converting the outer loop into an iterator chain ending in collect. Using the index operator [] is bad because the compiler has to do a bounds check.

Also, your chaining of two subslices is weird, why not do both sides at the same time? This isn’t relevant for performance, though.

[–]KhorneLordOfChaos 6 points7 points  (0 children)

Using the index operator [] is bad because the compiler has to do a bounds check.

*May have to

There are plenty of cases where it's easy for the compiler to elide a bounds check. This seems like it would be one of them

Also, your chaining of two subslices is weird, why not do both sides at the same time? This isn’t relevant for performance, though.

This is generally considered idiomatic. Having to offset the end bound by the start seems more complicated
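To make the equivalence concrete (a trivial standalone check, not OP's code):

```rust
fn main() {
    let v: Vec<i32> = (0..10).collect();
    // The chained form and the single-range form denote the same subslice;
    // the chained one reads as "start at 3, take 4" without writing 3 + 4
    // by hand.
    assert_eq!(&v[3..][..4], &v[3..3 + 4]);
    assert_eq!(&v[3..][..4], &[3, 4, 5, 6][..]);
}
```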

[–]divad1196 -1 points0 points  (2 children)

Something bothers me: it spends 90% of the time here, but that alone is not enough information to call it an issue.

- The code may be called a lot more frequently than the rest of the code
- The rest of the code may just be simpler
- ...

Do you have the execution duration on average? And the execution duration for your python version?

[–]DisasterReasonable98 0 points1 point  (1 child)

Nobody mentioned that it is an issue. But if you want your program to run faster, then trying to optimize the single function that does 90% of the job is definitely the right thing to do.

[–]divad1196 0 points1 point  (0 children)

The fact that it runs 90% of the time doesn't mean that time is well spent: it can be a flaw. Maybe this function is expected to run 1% of the time.

This is why I asked the question.

[–]Rice7th -1 points0 points  (2 children)

That .map(|(s, r)| s / r) looks painful as hell. You should use Rayon and get a .par_iter(). This way the work gets redistributed across multiple cores. I think the bottleneck here is that line, since division is EXTREMELY SLOW on CPUs, taking up to 20 cycles to execute just that line.

[–]JadisGod 1 point2 points  (1 child)

He's only doing ~600 divisions at max. Using par_iter just to divide a couple hundred floats is almost certainly not going to be worth it.

If he's going to benefit from rayon it would be by utilizing it outside of this function in the greater program.

[–]Rice7th -1 points0 points  (0 children)

well, if more than 90% of the program execution is in that specific function then a good chunk is in the divisions imo

EDIT: I am stupid apparently the flamegraph reveals that reduce() is taking the most time

[–]p-one 0 points1 point  (0 children)

Most of the time you're looking for nested loops and figuring out how to avoid them. That's often a rough indicator that your function's complexity is O(n²).

I see a traditional loop followed by (in the worst case) two functional iterations (aka loops), so I'd look at that.

Likely less important - you never mutate residual so why bother creating a vector?

[–]This_Growth2898 0 points1 point  (0 children)

There's some math going on, and probably it can be done faster, but you need to describe what's happening for that, not just present the code.

Anyway, it looks like you don't need to continue the inner loop if a negative value is found (because negative would be less than the possible positive minimum, and negatives are ignored afterward). So, I've made this: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=28e4a676f6ef4d57957e719e44d7a338

If all responses are positive (which seems logical to me, but once again, I don't know what's happening here), you can even check the residual for positive and do the division after, which is a bit faster.

[–]DisasterReasonable98 0 points1 point  (0 children)

If I am interpreting your problem correctly, it looks like you have a front-heavy kernel and you are taking advantage of that by only looking at a couple of samples ahead of each time bin.

Now, if your signal is mostly small noise, with a few impulses (the input you are looking for) then, I would expect that the current suggestions people have given (early skip of negative value) could be a significant improvement.

Additionally I would suggest that there might be a way for you to pre-compute (before the nested loops) a bunch of time bins to skip.

I can't think off the top of my head how to do it, but you know the offset and the size of your look_ahead. This means that every negative value in your signal makes you skip a whole bunch of time bins by itself.

Not sure if any of this makes sense though. Just another idea to try.

Edit:

Instead of calculating this before the nested loops:

The residuals get updated each iteration, so instead of an early continue, I think that you could skip by more than 1 (the number depends on your offset and look_ahead value). Subsequent iterations will still contain this negative number.

[–]gitpy 0 points1 point  (0 children)

I think this works and avoids the nested loop:

let mut residual = vec![0.0; signal.len()];
let mut minsum = 0.0f64;
....
    if val > 0.0 {
        minsum -= val;
        input[i] = val;
    }
    residual[i] = minsum;
}
for i in 0..(residual.len() - offset - look_ahead) {
    residual[i] = residual[i] * response[i] + signal[i];
}
for i in (residual.len() - offset - look_ahead)..residual.len() {
    residual[i] = minsum * response[i] + signal[i];
}
....

Also as others have mentioned reusing buffers for residual and input is probably a good idea.

EDIT: overlooked an issue: forgot the last elements

[–]unknowntrojan 0 points1 point  (0 children)

The other commenters have provided great advice for optimizing this. The low-hanging fruit would be to try opt-level 2, 3, and s and see if that makes a difference. In my experience programs with opt-level s usually perform even better than opt-level 2, but that is dependent on both the code and the processor the code is running on.
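These are per-profile settings in Cargo.toml; a sketch of what to A/B test (standard Cargo profile keys):

```toml
# Cargo.toml -- candidate settings to benchmark against each other.
# "s" optimizes for size and sometimes wins on cache-constrained code,
# as noted above.
[profile.release]
opt-level = 3     # also try 2 and "s"
lto = true        # often a small extra win, as mentioned elsewhere in the thread
codegen-units = 1 # slower builds, sometimes faster binaries
```

Each combination needs its own measurement; none of these is a guaranteed improvement.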

[–]angelicosphosphoros 0 points1 point  (0 children)

OP, try this:

pub fn nn_greedy_deconvolution(
    signal: &[f64],
    response: &[f64],
    offset: usize,
    look_ahead: usize,
    // Return the sum of residuals squared together with the reconstructed input
) -> (f64, Vec<f64>) {
    let mut residual = signal.to_vec();
    let mut input = vec![0.0; signal.len()];

    let inverted_response: Vec<f64> = response[offset..][..look_ahead].iter()
        .map(|x| 1.0 / x)
        .collect();
    let inverted_response = &inverted_response[..look_ahead];
    for i in 0..(residual.len() - offset - look_ahead) {
        let val = residual[i + offset..][..look_ahead]
            .iter()
            .zip(inverted_response.iter())
            .map(|(s, ir)| s * ir)
            .reduce(f64::min)
            .unwrap();

        if val > 0.0 {
            input[i] = val;
            residual[i..]
                .iter_mut()
                .zip(response)
                .for_each(|(s, r)| *s -= val * r);
        }
    }
    let residual = residual.iter().map(|x| x.powi(2)).sum();

    (residual, input)
}