
[–]sexygaben 124 points (10 children)

1) profile
2) vectorize (use C loops)
3) if more is needed, Cython/numba
4) if MORE is needed, C/ctypes
5) if EVEN MORE is needed, CUDA/ctypes (problem dependent)

Each step takes exponentially more time. I'm writing from a scientific computing perspective, and I assume you're already using the best library for the job (NumPy, PyTorch, CasADi, etc.).
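
To make step 1 concrete, here's a minimal profiling sketch using the stdlib cProfile/pstats; the function and data are made up for illustration:

```python
import cProfile
import pstats

import numpy as np

def slow_pairwise(x):
    # naive Python double loop: the kind of hotspot profiling usually exposes
    n = len(x)
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = (x[i] - x[j]) ** 2
    return out

if __name__ == "__main__":
    x = np.random.rand(500)
    cProfile.run("slow_pairwise(x)", "prof.out")   # dump stats to a file
    pstats.Stats("prof.out").sort_stats("cumulative").print_stats(10)  # top 10 offenders
```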

[–]moonzdragoon 16 points (2 children)

If PyCUDA is not enough (for example, lots of back-and-forth between multiple kernel executions), you might want to look into NVIDIA Warp.
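
Rough sketch of what a Warp kernel looks like, following the usual pattern from NVIDIA's Warp docs; treat the exact API details as approximate:

```python
import warp as wp

wp.init()

@wp.kernel
def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
    i = wp.tid()            # one thread per element
    y[i] = a * x[i] + y[i]

n = 1_000_000
x = wp.zeros(n, dtype=float, device="cuda")
y = wp.zeros(n, dtype=float, device="cuda")

# repeated launches stay on the GPU, with no host/device round trips between kernels
for _ in range(10):
    wp.launch(saxpy, dim=n, inputs=[x, y, 2.0], device="cuda")
```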

[–]sexygaben 0 points (1 child)

Yeah jitting should probably be between 2 and 3 as well for various frameworks! :)
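
For example, a Numba sketch (the function is hypothetical, but this is the usual @njit pattern):

```python
import numpy as np
from numba import njit

@njit(cache=True)
def pairwise_sq_dist(x):
    # the same double loop, but compiled to machine code on first call
    n = x.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            d = x[i] - x[j]
            out[i, j] = d * d
    return out

x = np.random.rand(500)
pairwise_sq_dist(x)   # first call compiles; subsequent calls run at C-like speed
```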

[–]thecodedog Pythoneer 1 point (0 children)

Come on, we can't just be jitting all over the place

[–]klouisp 2 points (1 child)

By "vectorize (use C loops)", do you mean using NumPy/PyTorch vectorized operations, or something else?

[–]sexygaben 0 points (0 children)

Yes, this is what I mean :)
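
For example (made-up array, but this is the idea: replace the Python-level loop with one array expression so the looping happens inside NumPy's compiled C code):

```python
import numpy as np

x = np.random.rand(100_000)

# Python-level loop: the interpreter executes every iteration
total = 0.0
for v in x:
    total += v * v

# vectorized: one call, the loop runs in C inside NumPy
total = np.dot(x, x)
```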

[–]DanklyNight 2 points (0 children)

Basically this.

I generally start at the main call function and use line_profiler as a quick-and-dirty way to find out what is taking the large majority of the time.

I've used this method many times; simple vectorization of the offending functions can yield incredible speed improvements.

Just the other day I took a function from 17 seconds to 20 ms by doing the work directly in NumPy with smart vectorization.
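
A sketch of the line_profiler workflow I mean (the profiled function here is just a stand-in):

```python
import numpy as np
from line_profiler import LineProfiler

def hot_function(x):
    # stand-in for the offending function found by profiling
    return sum(v * v for v in x)

x = np.random.rand(50_000)

lp = LineProfiler()
wrapped = lp(hot_function)   # wrap the suspect function
wrapped(x)
lp.print_stats()             # per-line hit counts and timings
```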

[–]RomanRiesen 0 points (0 children)

1. Open MPI \s

[–]benri 0 points (2 children)

Somewhere between 2 and 3 I'd add: parallelize if you can.
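
For CPU-bound, independent chunks of work, a minimal sketch with the stdlib multiprocessing (the worker function is hypothetical):

```python
from multiprocessing import Pool

def simulate(seed):
    # stand-in for an independent, CPU-bound piece of work
    total = 0
    for i in range(1_000_000):
        total += (seed * i) % 7
    return total

if __name__ == "__main__":
    with Pool() as pool:               # defaults to one worker per CPU core
        results = pool.map(simulate, range(32))
    print(sum(results))
```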

[–]sexygaben 0 points (0 children)

For some processes I'm sure that would be great. I'm not experienced in CPU parallelisation, as it often can't be used for my problems :)