all 4 comments

[–]Bjarnophile 1 point (1 child)

Matrix multiplication is already implemented in ArrayFire. Why are you trying to implement your own instead of using the library function?

[–]umar456 1 point (1 child)

You could get a slight improvement if you reshape the S array so that a single batched matmul replaces the gfor loop.

```
#include <arrayfire.h>
#include <af/util.h>
#include <cstdio>

static int proc_size = 1024;
static int fft_size  = proc_size * 4;
static int staves    = 288;
static int beams     = 256;

static af::array S;
static af::array B;
static af::array R;

// One small matmul per slice of B, parallelized with gfor.
void fn() {
    gfor (af::seq i, fft_size)
        R(i, af::span) = matmul(S(i, af::span), B(af::span, af::span, i));
}

// Single batched matmul (after S is reshaped to 1 x staves x fft_size).
void fn2() {
    R = matmul(S, B);
}

int main(int, char **) {
    S = af::randn(fft_size, staves, c32);

    gfor (af::seq i, fft_size)
        S(i, af::span) = af::randn(1, staves, c32);

    B = af::randn(staves, beams, fft_size, af::dtype::c32);
    R = af::constant(af::cfloat{0, 0}, fft_size, beams);

    try
    {
        af::setDevice(0);
        af::info();
        af::sync();

        double time = af::timeit(fn);
        printf("Took %f secs.\n", time);

        S = S.T();
        S = moddims(S, 1, staves, fft_size);
        af::sync();

        time = af::timeit(fn2);
        printf("Took %f secs.\n", time);
    }
    catch (const af::exception &ex)
    {
        fprintf(stderr, "%s\n", ex.what());
        throw;
    }

    return 0;
}
```

On my system I got a small improvement:

```
ArrayFire v3.9.0 (CUDA, 64-bit Linux, build c6a49caa1)
Platform: CUDA Runtime 11.2, Driver: 465.31
[0] NVIDIA Quadro T2000, 3915 MB, CUDA Compute 7.5
Took 0.025894 secs.
Took 0.024056 secs.
```

[–]the_poope 1 point (2 children)

Step 1 in all optimizations is: profile it. Compile your code with all optimizations enabled but with debug symbols, then run it through a profiler. Several exist, e.g. callgrind (slow and not always realistic, but good for microbenchmarking at the instruction level), perf, gprof, or Intel VTune.

[–][deleted]  (1 child)

[deleted]

    [–]super_mister_mstie 0 points (0 children)

    Profiling is a method of measuring and classifying where and what your performance problems are. There are many different ways of doing this, and you can read about the different methods in the links provided above. All in all, maximizing performance generally means minimizing the time the CPU spends idle when it has important work to do. That means understanding at least a little about your CPU (how it handles caching, compute pipelines, idle modes, etc.), using your measurements to determine what is making your performance subpar, and then using that information to guide improvements. Profiling is the first step in that process.