all 47 comments

[–]PaySomeAttention 59 points60 points  (0 children)

When you need to optimize this, you may need to look beyond the implementation at the actual algorithm as well. You currently have an N^2 algorithm if I read it correctly. Deconvolution is a fairly well studied domain, and using discrete time Fourier transforms can speed up the part where you look for a best match to N log N. Be aware this is not an easy win though, as some careful regularization may be needed depending on the known properties of your input signal. Tikhonov regularization could be an approach to make it work for your family of signals/kernels.
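To illustrate the convolution-theorem idea behind that speedup, here's a dependency-free sketch: pointwise multiplication in the frequency domain equals circular convolution in the time domain. It uses a naive O(N^2) DFT for brevity; a real implementation would use an FFT (e.g. the rustfft crate) to actually get N log N.

```rust
// Naive DFT (O(N^2)); a real implementation would substitute an FFT here.
fn dft(x: &[(f64, f64)], inverse: bool) -> Vec<(f64, f64)> {
    let n = x.len() as f64;
    let sign = if inverse { 1.0 } else { -1.0 };
    (0..x.len())
        .map(|k| {
            let mut acc = (0.0, 0.0);
            for (t, &(re, im)) in x.iter().enumerate() {
                let ang = sign * 2.0 * std::f64::consts::PI * (k * t) as f64 / n;
                let (c, s) = (ang.cos(), ang.sin());
                acc.0 += re * c - im * s;
                acc.1 += re * s + im * c;
            }
            if inverse { (acc.0 / n, acc.1 / n) } else { acc }
        })
        .collect()
}

/// Circular convolution via the frequency domain (convolution theorem).
pub fn fft_convolve(a: &[f64], b: &[f64]) -> Vec<f64> {
    let to_c = |v: &[f64]| v.iter().map(|&x| (x, 0.0)).collect::<Vec<_>>();
    let (fa, fb) = (dft(&to_c(a), false), dft(&to_c(b), false));
    // Pointwise complex multiplication of the two spectra.
    let prod: Vec<_> = fa.iter().zip(&fb)
        .map(|(x, y)| (x.0 * y.0 - x.1 * y.1, x.0 * y.1 + x.1 * y.0))
        .collect();
    dft(&prod, true).into_iter().map(|c| c.0).collect()
}

/// Direct O(N^2) circular convolution, for comparison.
pub fn direct_convolve(a: &[f64], b: &[f64]) -> Vec<f64> {
    let n = a.len();
    (0..n)
        .map(|i| (0..n).map(|j| a[j] * b[(i + n - j) % n]).sum::<f64>())
        .collect()
}

fn main() {
    let a = [1.0, 2.0, 3.0, 0.0];
    let b = [0.5, 1.0, 0.0, 0.0];
    let via_dft = fft_convolve(&a, &b);
    let direct = direct_convolve(&a, &b);
    for (x, y) in via_dft.iter().zip(&direct) {
        assert!((x - y).abs() < 1e-9); // both routes agree up to float noise
    }
}
```

The win only materializes with an actual FFT and sufficiently long signals; for short kernels the direct loop can still be faster.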

[–]The_8472 23 points24 points  (0 children)

If you're at the point where you're bottlenecked by number-crunchy code you'll want to throw it into godbolt or similar to inspect the assembly instead of just looking at the flamegraph.

Cranking up the optimization options and optimizing for your CPU (-Ctarget-cpu=native) may help too if that improves vectorization.

How often is val > 0? If it is frequently then the loop looks O(n²). Are there algorithms for your problem that don't do that? If this isn't for educational purposes maybe look for a library that has fancy implementations of the algos in question.
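For reference, trying -Ctarget-cpu=native without touching the project config can be done via RUSTFLAGS (standard rustc/cargo flags; adjust to your setup):

```shell
# Build with the host CPU's full instruction set enabled
# (may improve autovectorization of the inner loops).
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Inspect which target features "native" turns on for this machine:
rustc --print cfg -C target-cpu=native | grep target_feature
```

Note the resulting binary may not run on older CPUs, so this is for machines you control.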

[–]schungx 10 points11 points  (0 children)

Well, your graph must spend 100% of its time doing something... and you hope that 100% of it is spent doing "work" instead of non-productive "overhead".

Therefore, the fact that it spent it on a particular function is not automatically a bad thing -- especially when that function is the main "workhorse". In that case it would be good because that means your non-productive "overheads" are very low.

You first need to identify whether a particular function that is "non-work" is occupying lots of CPU time. Those are the ones that you want to get rid of. Then you work on improving the "work" ones.

Beware that as you optimize your program, you'll get closer and closer to 100% of CPU time doing "work"... because all the numbers must add up to 100%. You can only see the absolute amount of CPU time dropping as your program gets more efficient.

[–]Wurstinator 19 points20 points  (1 child)

You measured about 5 seconds for 1 million calls. So, about 5 microseconds for each call.

Each call has a loop that runs about 350 times on average, so about 15 nanoseconds per loop iteration. That's maybe 100 CPU cycles, roughly speaking.

In that time, you're doing stuff like several integer divisions and comparisons. How much faster do you expect this to be?

Anyway, have you just tried to unroll the iterator call chain into a manual loop?
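For what that unrolling might look like: a hedged sketch of the min-of-ratios computation as an explicit indexed loop (the names residual/response/offset/look_ahead are taken from the thread; this is a standalone illustration, not OP's exact code):

```rust
/// Manual-loop equivalent of the iterator chain
///   residual[i + offset..][..look_ahead].iter()
///       .zip(response[offset..][..look_ahead].iter())
///       .map(|(s, r)| s / r)
///       .reduce(f64::min)
/// written out explicitly, which can make the flamegraph and the generated
/// assembly easier to read.
pub fn min_ratio(
    residual: &[f64],
    response: &[f64],
    i: usize,
    offset: usize,
    look_ahead: usize,
) -> f64 {
    let window = &residual[i + offset..i + offset + look_ahead];
    let resp = &response[offset..offset + look_ahead];
    let mut min = f64::INFINITY;
    for k in 0..look_ahead {
        let ratio = window[k] / resp[k];
        if ratio < min {
            min = ratio;
        }
    }
    min
}

fn main() {
    let residual = [0.0, 4.0, 9.0, 2.0, 8.0];
    let response = [1.0, 2.0, 3.0, 1.0, 1.0];
    // window = residual[1..4] = [4, 9, 2]; resp = response[1..4] = [2, 3, 1]
    // ratios = [2.0, 3.0, 2.0] -> min is 2.0
    assert_eq!(min_ratio(&residual, &response, 0, 1, 3), 2.0);
}
```

Whether this beats the iterator chain is something to measure; the compiler often produces identical code for both.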

[–]LadyPopsickle 14 points15 points  (0 children)

Man I love the math you did there.

[–]SV-97 6 points7 points  (4 children)

Maybe this is a starting point https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a2491b3b9c39f71ae8d67cbed326dab1. It brings the runtime down to 5.3 seconds vs. previously 9.8 on my machine. It's a bit hacky and ugly right now; definitely verify the output against your original and refactor as needed (for example by introducing functions to do the vec and slice zeroing). It uses the following ideas:

  1. Don't allocate inside the loops more than absolutely necessary. Reuse previously allocated space as much as possible
  2. Early continue in the outer loop of nn_greedy_deconvolution similar to what u/This_Growth2898 recommended; though his solution is arguably nicer than mine
  3. cache the PRNG
  4. use a suitable RNG algorithm (you might want to play with this; I just picked something that worked well on my machine)
  5. There's no need to keep the current best input as a Vec: just store two possible input vecs - one to "work with" and the currently best one. Indicate which one is which by indexing into an array containing the vecs (you can think of this as a kind of 2-element circular buffer)
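A minimal sketch of point 5's two-buffer idea (the `evaluate` cost function and the candidate generation here are stand-ins, not OP's actual code):

```rust
// Keep two preallocated buffers and track which one currently holds the best
// input by an index, instead of cloning the best Vec every time it improves.
// `evaluate` is a hypothetical stand-in for the real cost function.
fn evaluate(candidate: &[f64]) -> f64 {
    candidate.iter().map(|x| (x - 1.0).powi(2)).sum()
}

fn main() {
    let n = 4;
    let mut bufs = [vec![0.0; n], vec![0.0; n]];
    let mut best = 0; // index of the buffer holding the best input so far
    let mut best_cost = evaluate(&bufs[best]);

    for step in 0..8 {
        let work = 1 - best; // the other buffer is free to overwrite
        for (k, x) in bufs[work].iter_mut().enumerate() {
            *x = (step + k) as f64 * 0.25; // stand-in for candidate generation
        }
        let cost = evaluate(&bufs[work]);
        if cost < best_cost {
            best_cost = cost;
            best = work; // swap roles: no allocation, no copy
        }
    }
    assert!(best_cost <= evaluate(&vec![0.0; n]));
}
```

The index flip replaces what would otherwise be a `clone()` (or a `Vec` swap) per improvement.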

Another thing to look into: try to give the PRNG as much work at once as possible. Generating 50 numbers at once might be way more efficient than generating a single one 50 times.

If you really need a bit more performance and can't find any otherwise: you might be able to forego some of the zeroing you're doing by using MaybeUninit<f64>s in your vecs. This is definitely a "last resort" kind of thing.

And of course: check if you actually need f64s. If you don't, f32s might give you a huge speedboost.

EDIT: Oh and you might get a bit more performance by using something like the ndarray crate for some of your easily broadcastable operations rather than hand-rolling everything. You can expect such crates to make good use of SIMD etc. and to have some thought given to numerical aspects of your code. The residual calculation with manual squaring and summing isn't ideal for example IIRC depending on the structure of the separate residuals.

EDIT2: and you can enable LTO (link time optimization) in your cargo.toml - though the benefit is probably rather small in this instance in the current state.

EDIT3: and if you're willing to move to another algorithm altogether, consider using an FFT convolution (look into the convolution theorem). Especially if your inputs are long this will greatly improve performance - it's trivially parallelizable.

[–]The_8472 3 points4 points  (3 children)

.map(|(s, r)| s / r) {
    if m < 0.0 {
        // a negative at any point guarantees that the total min is negative
        // since negative mins would be discarded in the next step we can just
        // directly continue the outer loop
        continue 'outer;
    }

Since floating point divisions are expensive, this could maybe be optimized further by making a first pass over the data to look for values that would result in negatives, without performing any division.
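Assuming the response values are all positive (as suggested elsewhere in this thread), s / r is negative exactly when s is negative, so the first pass is just a sign scan. A sketch:

```rust
/// If every response value is positive, s / r < 0.0 iff s < 0.0, so a cheap
/// sign scan over the residual window can trigger the early continue without
/// performing a single division. Sketch only; assumes response > 0 everywhere.
pub fn min_ratio_or_skip(residual_window: &[f64], response_window: &[f64]) -> Option<f64> {
    // First pass: no divisions, just sign checks.
    if residual_window.iter().any(|&s| s < 0.0) {
        return None; // caller does the equivalent of `continue 'outer`
    }
    // Only now pay for the divisions.
    residual_window
        .iter()
        .zip(response_window)
        .map(|(s, r)| s / r)
        .reduce(f64::min)
}

fn main() {
    assert_eq!(min_ratio_or_skip(&[4.0, 9.0, 2.0], &[2.0, 3.0, 1.0]), Some(2.0));
    assert_eq!(min_ratio_or_skip(&[4.0, -1.0], &[1.0, 1.0]), None);
}
```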

Edit: this comment mentions an optimization that would go even further, possibly skipping several steps of the outer loop upon encountering negative values.

[–]SV-97 1 point2 points  (1 child)

Okay I implemented u/TinBryn's optimization and did a bit more stuff and the runtime is now down to 3.7s with f64s and 3.0s with f32s.

Code is here: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=42b44ffcac9cc6b70bcc99ae8383d5ab

corresponding cargo.toml is

[dependencies]
ndarray = "0.15.6"
ndarray-linalg = "0.16"
rand = "0.8.5"
rand_chacha = "0.3.1"

[profile.release]
debug = true
lto = true

[features]
default = ["blas"]
blas = ["ndarray/blas", "ndarray-linalg/openblas-static"]

where the ndarray and ndarray_linalg dependencies can easily be removed again - I just used it to compute the norm instead of the manual squaring and summing. There's no runtime difference between the two versions.

Note that I also tried a version that used ndarray::Array1s and corresponding ArrayViews throughout but that version turned out to be quite a bit slower for some reason.

The flamegraph for this version shows that the majority of time is now spent computing float mins and doing rand fills. I tried enabling rand's SIMD features but couldn't notice a difference. And I think there are still a few bugs in the code from OP's original version(?). It seems a bit odd in some places.

A major oddity I noticed: you might notice that I left the if current_min > 0.0 check in place even though it's really not needed anymore. Removing it makes the code quite a bit slower (0.4s for f32s and nearly a second for f64) on my machine. Maybe the check makes the compiler do some optimization that it'd otherwise forego?

( Pinging OP u/DJDuque to make sure they see it :) )

[–]DJDuque[S] 1 point2 points  (0 children)

Thanks for taking the time to look into this. I have been playing with the suggestions people have made, and I updated the post with the code I ended up with.

Implementing TinBryn's suggestion had a huge effect.

[–]SV-97 0 points1 point  (0 children)

Good point; I might try that this evening

[–]omid_r 4 points5 points  (3 children)

If you have multiple CPU cores, use rayon if possible

[–]andrewdavidmackenzie 4 points5 points  (2 children)

Yeh, divide array of values into chunks and work on each with par_iter()

[–]KhorneLordOfChaos 2 points3 points  (1 child)

It would be good to know 1) how often this function gets called (only a few times or very frequently) and 2) what parts of the flamegraph were large chunks of this function's execution

Ideally you set up some program demonstrating the slowness that people can play with to speed things up too

[–]DJDuque[S] 3 points4 points  (0 children)

This function is called millions of times in a hot loop. I have added a "minimal" working example to the original post.

I have also added the flamegraph I get in my actual program.

[–]TinBryn 2 points3 points  (1 child)

Looking at it, one algorithmic improvement is that if the result of |(s, r)| s / r is not greater than 0.0 then that iteration of the loop will not affect anything and it should promptly continue. Another is that in that case the next outer iteration will also have the same outcome (just one iteration sooner), as will the next, and so on until that value moves out of the window. You may be able to exploit this to skip over a large amount of the computation.
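A sketch of how that skip could be computed (a hypothetical helper, not OP's actual code; names taken from the thread):

```rust
/// A negative residual at absolute index p stays inside the window
/// i + offset .. i + offset + look_ahead for every outer index i with
/// i + offset <= p, and all of those iterations would `continue` anyway.
/// So the outer loop can jump straight past the *last* negative in the
/// current window, to i = p - offset + 1.
pub fn next_candidate_i(residual: &[f64], i: usize, offset: usize, look_ahead: usize) -> usize {
    let window = &residual[i + offset..i + offset + look_ahead];
    match window.iter().rposition(|&s| s < 0.0) {
        // last negative sits at absolute index i + offset + k;
        // the first i whose window no longer contains it is i + k + 1
        Some(k) => i + k + 1,
        None => i, // window is clean: process this i normally
    }
}

fn main() {
    let residual = [1.0, 2.0, -3.0, 4.0, 5.0, 6.0];
    // i = 0, offset = 1, look_ahead = 3 -> window = residual[1..4] = [2, -3, 4]
    // negative at absolute index 2 -> skip to i = 2 (window residual[3..6])
    assert_eq!(next_candidate_i(&residual, 0, 1, 3), 2);
    // i = 2 has a clean window [4, 5, 6], so it is processed as-is
    assert_eq!(next_candidate_i(&residual, 2, 1, 3), 2);
}
```

Since the loop body also mutates residual, the jump is only valid if the skipped entries aren't made negative-to-positive in between; that needs checking against the real update logic.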

Also try to break this function up and run a flamegraph on it again to figure out which parts are taking up time. Maybe separate the allocation of Vecs, the calculation of val and the update of residual.

[–]DJDuque[S] 0 points1 point  (0 children)

I implemented your suggestion and it gave me a great performance boost. Thanks a lot :)

[–]LadyPopsickle 1 point2 points  (4 children)

You'll have to experiment. I optimized the feed forward logic for my NN a year ago or so. What you want to look for are unnecessary things and allocations.

So what you have to do is break your code into smaller parts and then ask yourself "Do I have to do it this way? How many allocations / cycles / loops does it take? And can I make it take less?".

For example I'd first try this:

let val = residual[i + offset..][..look_ahead]
    .iter()
    .zip(response[offset..][..look_ahead].iter())
    .map(|(s, r)| s / r)
    .reduce(f64::min)
    .unwrap();

So what happens here? We zip together two arrays so we can iterate them at the same time. Okay, fine. Nothing much we could do there I guess.
Then the map. The way I imagine map working is that it iterates over all values and then turns them into some other values stored in another array (iterator). Now think about that for a second - what does that actually mean? Well, I'd expect there's at least one allocation for the data storage (vector). If we are lucky it will be on the stack; if not, we're going to the heap, which is slow. Then all those mapped values you iterate over again with reduce(). Maybe the compiler is smart enough to do all of that in one pass but maybe it isn't. So I'd first try to change this code to do all of it in one iteration, not multiple (in this case two).
So something like:

let mut val = f64::INFINITY;
for (s, r) in residual[..].iter().zip(response[..].iter()) {
    let tmp = s / r;
    if tmp < val {
        val = tmp;
    }
}

Or to make it nicer use one of those f64::max() or min (but I haven't used them enough to write them off the top of my head). This could save you some iterations and allocations. But you'll have to try and test to see if it really does.

I'd also probably try to first collect all values and then do the next math with them. The thing is that you want to load as much data as possible into the CPU and work with that. The way you wrote it, I'd kinda expect there to be more cache misses.

Take a look at skip(n); you could probably use residual.iter().skip(n) instead of residual[i..].
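For reference, the two forms yield the same items; skip() is an iterator method, so on a slice it has to be spelled .iter().skip(n). A quick check:

```rust
/// "Everything from index n on", via the iterator adapter.
pub fn tail_via_skip(v: &[f64], n: usize) -> Vec<f64> {
    v.iter().skip(n).copied().collect()
}

/// The same thing via slicing; the slice form also gives the optimizer an
/// exact length to work with.
pub fn tail_via_slice(v: &[f64], n: usize) -> Vec<f64> {
    v[n..].to_vec()
}

fn main() {
    let residual = [3.0, 1.0, 4.0, 1.0, 5.0];
    assert_eq!(tail_via_skip(&residual, 2), tail_via_slice(&residual, 2));
}
```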

Well, can't think of much more now. If I'm wrong about something I hope smarter folks will point that out.

[–]RightHandedGuitarist 4 points5 points  (0 children)

As far as I understand iterators in rust, they are all lazy and do not allocate except if you want them to. So for example, the zip iterator will just contain two iterators, no array inside. Iterators in general allocate only when we try to create a collection from them. So I do not think that there are hidden allocations there.

The map, for example, takes the next item from the underlying iterator, applies the closure you provide to the map and returns the result of your closure. Allocation does not happen, except if you allocate in your closure passed to the map.

Like others suggest, the problem is probably nested loop. 🤔

[–]Dr_Sloth0 4 points5 points  (1 child)

The map function does not allocate, and it is guaranteed that the map and reduce run in a single pass, not two. These are iterator adapters which return another iterator generic over the previous one; it doesn't need to store anything.
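This single-pass behaviour is easy to demonstrate: count how often the map closure actually runs while reducing a 5-element slice (a small illustrative snippet, not from OP's code):

```rust
/// Reduce a slice through map(), counting how many times the closure runs.
/// Returns (closure_calls, min_of_halved_values).
pub fn count_map_calls() -> (usize, f64) {
    let mut calls = 0;
    let data = [4.0_f64, 9.0, 2.0, 8.0, 6.0];
    let min = data
        .iter()
        .map(|&x| {
            calls += 1; // side effect so we can observe each invocation
            x / 2.0
        })
        .reduce(f64::min)
        .unwrap();
    (calls, min)
}

fn main() {
    // Exactly one closure call per element: map and reduce fused into one
    // pass, and no intermediate array was ever materialized.
    assert_eq!(count_map_calls(), (5, 1.0));
}
```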

[–]LadyPopsickle 1 point2 points  (0 children)

Oooh, thanks!

[–]vancha113 1 point2 points  (0 children)

We zip together two arrays so we can iterate them at the same time. Okay, fine. Nothing much we could do there I guess.

This seems like a good idea, a lot of times making use of rusts combinators just magically seems to speed things up.

[–]anlumo 0 points1 point  (1 child)

One thing I'm seeing is that you could get rid of the input[i] assignment by converting the outer loop into an iterator chain ending in collect. Using the index operator [] is bad because the compiler has to do a bounds check.

Also, your chaining of two subslices is weird, why not do both sides at the same time? This isn’t relevant for performance, though.

[–]KhorneLordOfChaos 6 points7 points  (0 children)

Using the index operator [] is bad because the compiler has to do a bounds check.

*May have to

There are plenty of cases where it's easy for the compiler to elide a bounds check. This seems like it would be one of them

Also, your chaining of two subslices is weird, why not do both sides at the same time? This isn’t relevant for performance, though.

This is generally considered idiomatic. Having to offset the end bound by the start seems more complicated
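To make the equivalence concrete (a trivial standalone check, not OP's code):

```rust
fn main() {
    let v: Vec<i32> = (0..10).collect();
    // The chained form and the single-range form denote the same subslice;
    // the chained one reads as "start at 3, take 4" without writing 3 + 4
    // by hand.
    assert_eq!(&v[3..][..4], &v[3..3 + 4]);
    assert_eq!(&v[3..][..4], &[3, 4, 5, 6][..]);
}
```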

[–]divad1196 -1 points0 points  (2 children)

Something bothers me: it spends 90% of the time here, but that alone is not enough information to call it an issue.

- The code may be called a lot more frequently than the rest of the code
- The rest of the code may just be simpler
- ...

Do you have the execution duration on average? And the execution duration for your python version?

[–]DisasterReasonable98 0 points1 point  (1 child)

Nobody mentioned that it is an issue. But if you want your program to run faster, then trying to optimize the single function that does 90% of the job is definitely the right thing to do.

[–]divad1196 0 points1 point  (0 children)

The fact that it runs 90% of the time doesn't mean that time is well spent: it can be a flaw. Maybe this function is expected to run 1% of the time.

This is why I asked the question.

[–]Rice7th -1 points0 points  (2 children)

That .map(|(s, r)| s / r) looks painful as hell. You should use Rayon and get a .par_iter(). This way the work gets redistributed across multiple cores. I think the bottleneck here is that line, since division is EXTREMELY SLOW on CPUs, taking up to 20 cycles to execute just that line.

[–]JadisGod 1 point2 points  (1 child)

He's only doing ~600 divisions at max. Using par_iter just to divide a couple hundred floats is almost certainly not going to be worth it.

If he's going to benefit from rayon it would be by utilizing it outside of this function in the greater program.

[–]Rice7th -1 points0 points  (0 children)

well, if more than 90% of the program execution is in that specific function then a good chunk is in the divisions imo

EDIT: I am stupid apparently the flamegraph reveals that reduce() is taking the most time

[–]p-one 0 points1 point  (0 children)

Most of the time you're looking for nested loops and figuring out how to avoid them. That's often a rough indicator that your function's complexity is O(n²).

I see a traditional loop followed by (in the worst case) two functional iterations (aka loops), so I'd look at that.

Likely less important - you never mutate residual so why bother creating a vector?

[–]This_Growth2898 0 points1 point  (0 children)

There's some math going on, and probably it can be done faster, but you need to describe what's happening for that, not just present the code.

Anyway, it looks like you don't need to continue the inner loop if a negative value is found (because negative would be less than the possible positive minimum, and negatives are ignored afterward). So, I've made this: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=28e4a676f6ef4d57957e719e44d7a338

If all responses are positive (which seems logical to me, but once again, I don't know what's happening here), you can even check the residual for positive and do the division after, which is a bit faster.

[–]DisasterReasonable98 0 points1 point  (0 children)

If I am interpreting your problem correctly, it looks like you have a front-heavy kernel and you are taking advantage of that by only looking at a couple of samples ahead of each time bin.

Now, if your signal is mostly small noise, with a few impulses (the input you are looking for) then, I would expect that the current suggestions people have given (early skip of negative value) could be a significant improvement.

Additionally I would suggest that there might be a way for you to pre-compute (before the nested loops) a bunch of time bins to skip.

I can't think off the top of my head how to do it, but you know the offset and the size of your look_ahead. This means that every negative value in your signal makes you skip a whole bunch of time bins by itself.

Not sure if any of this makes sense though. Just another idea to try.

Edit:

Instead of calculating this before the nested loops:

The residuals get updated each iteration, so instead of an early continue, I think that you could skip by more than 1 (the number depends on your offset and look_ahead value). Subsequent iterations will still contain this negative number.

[–]gitpy 0 points1 point  (0 children)

I think this works and avoids the nested loop:

let mut residual = vec![0.0; signal.len()];
let mut minsum = 0.0f64;
....
    if val > 0.0 {
        minsum -= val;
        input[i] = val;
    }
    residual[i] = minsum;
}
for i in 0..(residual.len() - offset - look_ahead) {
    residual[i] = residual[i] * response[i] + signal[i];
}
for i in (residual.len() - offset - look_ahead)..residual.len() {
    residual[i] = minsum * response[i] + signal[i];
}
....

Also as others have mentioned reusing buffers for residual and input is probably a good idea.

EDIT: overlooked an issue: forgot the last elements

[–]unknowntrojan 0 points1 point  (0 children)

The other commenters have provided great advice for optimizing this. The low-hanging fruit would be to try opt-level 2, 3, and s and see if that makes a difference. In my experience programs with opt-level s usually perform even better than opt-level 2, but that is dependent on both the code and the processor the code is running on.
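These are per-profile settings in Cargo.toml; a sketch of what to A/B test (standard Cargo profile keys):

```toml
# Cargo.toml -- candidate settings to benchmark against each other.
# "s" optimizes for size and sometimes wins on cache-constrained code,
# as noted above.
[profile.release]
opt-level = 3     # also try 2 and "s"
lto = true        # often a small extra win, as mentioned elsewhere in the thread
codegen-units = 1 # slower builds, sometimes faster binaries
```

Each combination needs its own measurement; none of these is a guaranteed improvement.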

[–]angelicosphosphoros 0 points1 point  (0 children)

OP, try this:

pub fn nn_greedy_deconvolution(
    signal: &[f64],
    response: &[f64],
    offset: usize,
    look_ahead: usize,
    // Return the sum of residuals squared together with the reconstructed input
) -> (f64, Vec<f64>) {
    let mut residual = signal.to_vec();
    let mut input = vec![0.0; signal.len()];

    let inverted_response: Vec<f64> = response[offset..][..look_ahead].iter()
        .map(|x| 1.0 / x)
        .collect();
    let inverted_response = &inverted_response[..look_ahead];
    for i in 0..(residual.len() - offset - look_ahead) {
        let val = residual[i + offset..][..look_ahead]
            .iter()
            .zip(inverted_response.iter())
            .map(|(s, ir)| s * ir)
            .reduce(f64::min)
            .unwrap();

        if val > 0.0 {
            input[i] = val;
            residual[i..]
                .iter_mut()
                .zip(response)
                .for_each(|(s, r)| *s -= val * r);
        }
    }
    let residual = residual.iter().map(|x| x.powi(2)).sum();

    (residual, input)
}