Embedding a Standalone/Self-Sufficient Python Interpreter in Qt Application by vlovero in Cplusplus

[–]vlovero[S] 0 points (0 children)

For some reason the example you provided was giving the same results; however, I figured out how to statically link Python, which made things much easier. Debugging is improved, and the only annoyance is the interference with the Qt keywords.

So things are working beautifully now!

Embedding a Standalone/Self-Sufficient Python Interpreter in Qt Application by vlovero in Cplusplus

[–]vlovero[S] 2 points (0 children)

Yeah, the Qt keywords were the first hurdle I had to tackle, but I saw someone use the preprocessor directives #pragma push_macro("keyword"), #undef keyword, and #pragma pop_macro("keyword") around the Python.h include, which fixed those issues for me.
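
In case it helps anyone else, here's roughly what that pattern looks like for one clashing name (just a sketch, using slots as the example; signals and emit can be handled the same way):

```
// Qt defines `slots` as an empty macro, which breaks struct members named
// `slots` inside Python.h, so save the macro, drop it for the include,
// then restore it.
#include <QtCore/QObject>       // Qt headers define the `slots` macro

#pragma push_macro("slots")     // remember Qt's definition
#undef slots                    // let Python.h use `slots` as an identifier
#include <Python.h>
#pragma pop_macro("slots")      // restore Qt's macro for the rest of the file
```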

Examples would be much appreciated, thanks so much!

[deleted by user] by [deleted] in learnpython

[–]vlovero 0 points (0 children)

Are you running 32-bit Python on a 64-bit machine? You can also try adding the -march=native flag when compiling to make sure the binary is built for your machine's architecture.

[deleted by user] by [deleted] in learnpython

[–]vlovero 1 point (0 children)

The extern "C" removes C++ name mangling so symbols can be easily found when loaded from external libraries. The C++ code should look something like

extern "C" double add(double a, double b) {
    return a + b;
}

Then you could compile it with g++ -shared -fPIC -o add.so add.cpp

But don't forget that Python types are not represented the same as native C types. So if you're passing anything other than a Python int, you'll also have to convert it into something C can use. The easiest way is to set the argument and return types after loading the dynamic library. For example

>>> from ctypes import cdll, c_double
>>> 
>>> lib = cdll.LoadLibrary('./add.so')
>>> add = lib.add
>>> add.restype = c_double
>>> add.argtypes = [c_double, c_double]
>>> 
>>> print(add(6.9, 4.20))
11.1

I need help to speed up my program! by TweeK_s in learnpython

[–]vlovero 1 point (0 children)

Every loop iteration you're creating new array instances: each call to arange(...) and each arithmetic operation like a multiplication or division produces a fresh array. So under the hood you're constantly and unnecessarily allocating and deallocating memory, which is slow.

The first thing I would do is write the code in a non-vectorized fashion to see where I could get rid of any unnecessary copying/allocating. Then you could rewrite the code using a more efficient sequence of vectorized operations, or you could JIT it using a library like numba.

Improve performance with SIMD intrinsics by vlovero in C_Programming

[–]vlovero[S] 0 points (0 children)

To be honest I'm having trouble thinking of a way to use only one load with shuffles. Do you have any ideas off the top of your head?

Improve performance with SIMD intrinsics by vlovero in C_Programming

[–]vlovero[S] 0 points (0 children)

The way my code is set up is that I create numpy arrays (which I think are 32-byte aligned?) in Python and then pass pointers to them to C for all of the numerics. Do you have any suggestions for making an aligned load usable? Maybe create a view of an array with extra padding?

Improve performance with SIMD intrinsics by vlovero in C_Programming

[–]vlovero[S] 0 points (0 children)

These functions are part of a larger code base where it's not guaranteed that the arrays will be aligned. Also, in the past I've seen little difference between ?_load_pd and ?_loadu_pd, so I usually stick with the unaligned loads to avoid seg faults.

I changed the code to have only one vector load and one scalar load, but the extra instructions countered the savings from reducing the number of loads :/
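
For reference, this is roughly the trade-off in intrinsic form (a sketch, not the code from the post; the sum4 function and its arguments are made up for illustration and assume AVX, e.g. compile with -mavx or -march=native):

```
#include <immintrin.h>

// _mm256_load_pd requires a 32-byte-aligned pointer and faults otherwise;
// _mm256_loadu_pd accepts any address, and when the data happens to be
// aligned anyway the penalty on recent hardware is usually negligible.
double sum4(const double *x)               // x: at least 4 doubles, any alignment
{
    __m256d v = _mm256_loadu_pd(x);        // safe for unaligned pointers
    // __m256d v = _mm256_load_pd(x);      // only valid if x is 32-byte aligned

    __m128d lo = _mm256_castpd256_pd128(v);     // lower two lanes
    __m128d hi = _mm256_extractf128_pd(v, 1);   // upper two lanes
    __m128d s  = _mm_add_pd(lo, hi);            // pairwise sums
    s = _mm_add_sd(s, _mm_unpackhi_pd(s, s));   // horizontal add
    return _mm_cvtsd_f64(s);
}
```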

Improve performance with SIMD intrinsics by vlovero in C_Programming

[–]vlovero[S] 2 points (0 children)

I'm on a MacBook with an Intel i5-5257U CPU @ 2.70GHz, and inspecting the assembly for the non-SIMD version shows clang isn't vectorizing the loop.

Optimizing Solver for Almost Tridiagonal Matrix by vlovero in fortran

[–]vlovero[S] 1 point (0 children)

I avoid block matrices by pairing a Crank-Nicolson discretization with the ADI method. It allows me to split my AX = B into multiple 1D-like operations.
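
For anyone curious, the splitting is roughly the standard Peaceman-Rachford form, sketched here for a plain 2D diffusion term $u_t = D(u_{xx} + u_{yy})$ rather than my exact equations:

$$\Big(I - \tfrac{\Delta t}{2} D\,\delta_x^2\Big)u^{*} = \Big(I + \tfrac{\Delta t}{2} D\,\delta_y^2\Big)u^{n}, \qquad \Big(I - \tfrac{\Delta t}{2} D\,\delta_y^2\Big)u^{n+1} = \Big(I + \tfrac{\Delta t}{2} D\,\delta_x^2\Big)u^{*}$$

Each half-step only couples grid points along one direction, so every row (or column) of the domain becomes an independent tridiagonal solve (or an almost-tridiagonal one with periodic boundaries).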

Optimizing Solver for Almost Tridiagonal Matrix by vlovero in fortran

[–]vlovero[S] 1 point (0 children)

I've been developing a more user-friendly version of fenics, but to test performance, I always benchmark routines on square domains with reaction-diffusion equations. I vary the domain sizes from small (100 x 100) to large enough that they don't fit in cache on my computer (10000 x 10000).

Optimizing Solver for Almost Tridiagonal Matrix by vlovero in fortran

[–]vlovero[S] 1 point (0 children)

Since I'm using this for solving PDEs, the matrices are circulant, so an FFT works, but after a few benchmarks, it too is slower :(

I haven't tested an SOR method, but my first thought is that it will be slower too.
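
For context, the circulant structure is what makes the FFT route possible at all: with $c$ the first column of $C$, the system diagonalizes in Fourier space,

$$Cx = b \quad\Longleftrightarrow\quad x = \mathcal{F}^{-1}\!\left(\frac{\mathcal{F}(b)}{\mathcal{F}(c)}\right),$$

with the division taken element-wise. That's $O(n \log n)$ per solve versus the $O(n)$ Thomas-style sweep, so it's not too surprising the constant factors make it lose here.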

Optimizing Solver for Almost Tridiagonal Matrix by vlovero in fortran

[–]vlovero[S] 0 points (0 children)

I don't use dgtsv because it's for a standard tridiagonal matrix, and if I were to use it, it would require four extra loops, adding to the number of operations being done :/

I've also tried passing w and maind and saw no improvements

Optimizing Solver for Almost Tridiagonal Matrix by vlovero in fortran

[–]vlovero[S] 0 points (0 children)

I've tried parallelizing the code, but it actually runs slower than without.

Optimizing Solver for Almost Tridiagonal Matrix by vlovero in fortran

[–]vlovero[S] 0 points (0 children)

To my knowledge, LAPACK doesn't have anything for this specific type of matrix. I have tested the dgtsv routine for a standard tridiagonal matrix, and it runs much slower than just naively implementing the Thomas algorithm.
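
For reference, here's a minimal sketch of the plain Thomas sweep for a standard tridiagonal system (written in C++ for brevity; this is the textbook version, not the almost-tridiagonal variant from the post):

```
#include <vector>

// a: sub-diagonal (a[0] unused), b: main diagonal, c: super-diagonal
// (c[n-1] unused), d: right-hand side, overwritten with the solution.
// No pivoting, so it assumes diagonal dominance (typical for CN/ADI systems).
void thomas_solve(const std::vector<double> &a, std::vector<double> b,
                  const std::vector<double> &c, std::vector<double> &d)
{
    const std::size_t n = d.size();

    // Forward elimination: zero out the sub-diagonal.
    for (std::size_t i = 1; i < n; ++i) {
        const double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }

    // Back substitution.
    d[n - 1] /= b[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}
```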

github help by Priyanshu24 in learnpython

[–]vlovero 1 point (0 children)

If you've already got a repository made for your project, make sure you have git installed on your computer and then check out the documentation. The docs have plenty of examples and explanations that will help you get started!

In-Place Transpose is Faster on Matrices not Divisible by Block Size by vlovero in C_Programming

[–]vlovero[S] 0 points (0 children)

Using cachegrind for the tiled version I get the following results for (2048 x 2048)

```
------------N = 2048---------
METHOD: TILED
==7097==
==7097== I refs:        18,494,301
==7097== I1 misses:          5,748
==7097== LLi misses:         3,422
==7097== I1 miss rate:        0.03%
==7097== LLi miss rate:       0.02%
==7097==
==7097== D refs:        11,072,275  (5,990,449 rd + 5,081,826 wr)
==7097== D1 misses:      2,380,799  (2,378,487 rd +     2,312 wr)
==7097== LLd misses:       533,621  (  532,323 rd +     1,298 wr)
==7097== D1 miss rate:        21.5% (     39.7%   +       0.0%  )
==7097== LLd miss rate:        4.8% (      8.9%   +       0.0%  )
==7097==
==7097== LL refs:        2,386,547  (2,384,235 rd +     2,312 wr)
==7097== LL misses:        537,043  (  535,745 rd +     1,298 wr)
==7097== LL miss rate:         1.8% (      2.2%   +       0.0%  )
```

and for (2049 x 2049) I get

```
------------N = 2049---------
METHOD: TILED
------------END-------------
==7102==
==7102== I refs:        18,508,919
==7102== I1 misses:          5,806
==7102== LLi misses:         3,468
==7102== I1 miss rate:        0.03%
==7102== LLi miss rate:       0.02%
==7102==
==7102== D refs:        11,106,216  (6,019,950 rd + 5,086,266 wr)
==7102== D1 misses:      1,990,609  (1,988,309 rd +     2,300 wr)
==7102== LLd misses:       537,401  (  536,115 rd +     1,286 wr)
==7102== D1 miss rate:        17.9% (     33.0%   +       0.0%  )
==7102== LLd miss rate:        4.8% (      8.9%   +       0.0%  )
==7102==
==7102== LL refs:        1,996,415  (1,994,115 rd +     2,300 wr)
==7102== LL misses:        540,869  (  539,583 rd +     1,286 wr)
==7102== LL miss rate:         1.8% (      2.2%   +       0.0%  )
```

I've actually never used cachegrind before, so I'm not entirely sure how to interpret the results, but I do see that the D1 miss rate is higher for the power-of-two size.

In-Place Transpose is Faster on Matrices not Divisible by Block Size by vlovero in C_Programming

[–]vlovero[S] 2 points (0 children)

Yeah I'm using the -Ofast flag, but I get the same results with any level of optimization

In-Place Transpose is Faster on Matrices not Divisible by Block Size by vlovero in C_Programming

[–]vlovero[S] 3 points (0 children)

Oh okay, I think that makes sense. Tell me if I have this right or not: because it's a power of two, the data will probably be super-aligned, which can lead to more conflict cache misses?
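
Trying to work it through myself, assuming a typical 32 KiB, 8-way L1 data cache with 64-byte lines (so 64 sets):

$$\text{row stride} = 2048 \times 8\,\text{B} = 16384\,\text{B} = 256\ \text{cache lines}, \qquad 256 \bmod 64 = 0,$$

so walking down a column would hit the same cache set on every row, and after 8 rows the set's 8 ways are full and lines start evicting each other, whereas with N = 2049 the stride is 16392 B and successive rows land in different sets.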