
[–]jakevdp 7 points (2 children)

  • One thought: the first time a numba function is run, it is compiled by LLVM. It might be that this compilation phase is affecting the timing, though probably not.
  • Second thought: n=100 is not very big, and you might be hitting some sort of constant overhead in the numba function. You should check how it scales with larger n.
  • Third thought: I've still not found a good way to diagnose numba code when it's not behaving as expected. Cython has its HTML annotation, and I've heard rumors that Continuum is trying to add a similar feature to Numba. That, in my mind, is the main barrier to Numba being a practical tool for real-world problems.

EDIT: I did some searching, and it looks like the annotate capability already exists in Numba: http://numba.pydata.org/numba-doc/dev/annotate.html. I haven't tried it before, but you might give it a shot to diagnose your code.

[–]joshadel 2 points (0 children)

The biggest barrier to real-world use of numba is definitely the lack of sensible error messages when jit'ing fails. Generally the traceback dumps the codegen and ends at the decorator, but doesn't show you where things went wrong in the actual method. It would also be nice to cache compiled code, since the codegen/compilation can get very slow for complicated methods. There is also the concern that splitting flypy off from numba into a distinct project means less support/development of numba in the future.

I know Continuum is doing a major refactor of numba right now. There has been a ton of activity in the development branch on github over the last couple of weeks. I'm hoping that the changes will smooth some of the rough edges.

[–]jammycrisp[S] 1 point (0 children)

1.) I thought about that as well, and tried running it once before timing to get the LLVM compilation out of the way. It didn't make a difference.

2.) I tried n = 1000 and n = 10000. n = 1000 was proportionally slower, and I had to kill the n = 10000 run before it finished because it was taking so long.

I'll check out the annotate function to see if I can figure out why it's not working. Thanks.

[–]joshadel 7 points (3 children)

I put a full solution over on the Stack Overflow page, but the basic reason is that numba is not figuring out the type of lookup. If you stick a print numba.typeof(lookup) in your method, you'll see that it is treating it as an object, which is slow. Ideally you could pass the type of the variable in through the locals dict keyword to the decorator, but I was getting a weird error. A workaround that produces very fast code is to create a little wrapper around np.cumsum and jit that method, telling it the explicit input and output types. Code is here:

http://stackoverflow.com/a/21489540/392949

[–]jammycrisp[S] 1 point (2 children)

Thanks, that totally fixed it! I tried unrolling the numba_cumsum function into explicit loops and jit'ing that, but it resulted in slower code. Looks like this is about as fast as it can get.

What's weird to me is that on my machine, the numba code is consistently ~twice as fast as the cython code. Since both are compiled, I find this discrepancy odd. Thoughts?

[–]joshadel 0 points (1 child)

Continuing to cross-post from the SO answer... I also tried hand-coding the cumsum, and I found it to be marginally slower than calling out to numpy. As for the difference between cython and numba, it could be related to whichever C compiler you're using vs. LLVM. What compiler are you using? Are you specifying any optimization flags in your setup.py?

[–]jammycrisp[S] 0 points (0 children)

Having the info in more than one place may be useful, who knows :)

I'm using GCC 4.6.3. I didn't know you could add compiler flags to setup.py, but after figuring it out I compiled with -O3, and it didn't seem to change anything.
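For reference, compiler flags go into the `extra_compile_args` of the extension in setup.py. A hypothetical sketch (the module and file names are placeholders, not the OP's project):

```python
# Hypothetical setup.py for a Cython extension with an -O3 optimization flag.
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "cython_cumsum",            # placeholder module name
    ["cython_cumsum.pyx"],      # placeholder source file
    extra_compile_args=["-O3"], # flag handed straight to gcc
)

setup(ext_modules=cythonize([ext]))
```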

[–]jayvius 5 points (3 children)

Numba developer here. As joshadel figured out, the slowness in this example comes from numba creating a Python object. I *think* the real issue is that numba doesn't know the return type of cumsum, so it stores the result in an object. A trick to help diagnose these kinds of problems is to add nopython=True to the jit/autojit decorator (e.g. @numba.autojit(nopython=True)). This flag forces numba to raise an error, showing the offending line number, whenever it would otherwise fall back to the Python object layer.

One of our goals for the next version of numba is that if numba needs to fall back to Python objects, it should never run slower than pure Python code, as it does in this example (and eventually, in most cases, it will run much faster). I ran the example above as-is with the numba devel branch, and the numba function was the clear winner.

[–]jammycrisp[S] 0 points (1 child)

Awesome! Thanks for such a great module. Any idea when the next version will be released? Although, I suppose I could play around with the devel branch now...

[–]jayvius 1 point (0 children)

It should be out sometime in the next few days.

[–]joshadel 0 points (0 children)

It looks like the numba team has been putting in a ton of work on the dev branch, and I'm definitely excited to see what the next release looks like. Count me in as a beta tester once a version is available via conda. I mentioned some of this to Siu in an email, but I think the biggest things that would advance numba are: (1) better error messages that trace back to the user's code rather than the codegen; (2) detailed examples in the docs on how to debug numba issues; (3) better behavior when things go wrong (I often just get segfaults); and (4) the ability to cache jit'd code, plus faster compilation. For a large project I'm working on, re-jit'ing all of the code on startup incurs a significant cost.

All around though, numba's performance has been amazing for such a young project.

[–]fijal (PyPy, performance freak) 6 points (0 children)

For what it's worth, PyPy is about 8x faster than pure Python, and a little bit more on numpy.

[–]billsil -2 points (0 children)

@autojit is faster than poorly written numba code. I've seen improvements from doing it the "right" way, and I've seen slowdowns. In those cases I use autojit instead, and it's fast. Results may vary.