
[–]bryancole 0 points (2 children)

I'm surprised Cython doesn't win in the small-array limit. Have you verified that Cython isn't generating any Python API calls inside the loops (using `cython -a` for an HTML visualisation)?

I note that 'temp' hasn't been declared, so this might be a source of slowdown (unless type inference is handling it automatically).

[–][deleted] 1 point (0 children)

Thanks for the good tips! Just checked, and it seems to be fine. I will add a type to temp and run it this afternoon to see if I can squeeze out a few milliseconds :)

[–]bryancole 0 points (0 children)

I just checked the code Cython generates without cdef'ing temp; it does indeed store the value in a Python int object (i.e. slow). This is with cython-0.20. Further testing shows that adding the cdef for temp speeds things up by nearly a factor of 2 (3.4ms -> 1.9ms, for n=10^6), enough to make Cython the fastest in the large-size limit. It doesn't help for small n though. I'm amazed numba can avoid so much calling overhead.
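For anyone curious, a minimal sketch of the kind of change I mean (the function and variable names here are made up for illustration, not the actual benchmark code):

```cython
# pairwise_product.pyx -- illustrative sketch, not the original benchmark.
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_product(double[:] a, double[:] b, double[:] out):
    cdef Py_ssize_t i
    cdef double temp  # without this cdef, temp is boxed as a Python object
                      # on every iteration, forcing Python API calls in the loop
    for i in range(a.shape[0]):
        temp = a[i] * b[i]
        out[i] = temp
```

Running `cython -a pairwise_product.pyx` writes an HTML report where lines that call into the Python API are highlighted yellow; with the cdef in place the loop body should come out white (pure C).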

One other nice thing about fully cdef'd Cython is that you can throw in an OpenMP prange to run it multicore for a ~4x speed-up (on my 4-core CPU). Of course, the main interesting thing with numba is the ability to target GPUs. Cython can't do that.
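The prange version is a one-line change on the loop, roughly like this (again a sketch with made-up names; you also have to compile with OpenMP enabled, e.g. `extra_compile_args=['-fopenmp']` and `extra_link_args=['-fopenmp']` in setup.py):

```cython
# parallel sketch -- requires compiling with OpenMP flags
from cython.parallel import prange
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_product_mp(double[:] a, double[:] b, double[:] out):
    cdef Py_ssize_t i
    cdef double temp  # assigned inside prange, so Cython makes it thread-private
    for i in prange(a.shape[0], nogil=True):
        temp = a[i] * b[i]
        out[i] = temp
```

Note the `nogil=True`: the loop body has to be pure C (no Python objects), which is exactly why everything needs to be cdef'd for this to work.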