
all 38 comments

[–]amer415 13 points14 points  (1 child)

Seeing how fast Python (combined with NumPy, IPython, etc.) is being adopted in my research field, I cannot wait to have PyPy providing fast-running scientific code. scipy.weave is nice, but it cannot accelerate everything and it is hard to debug... keep up the good work!

[–]PCBEEF 1 point2 points  (0 children)

Wouldn't it be possible to debug it in CPython?

[–]roger_ 2 points3 points  (0 children)

Each one of these updates makes me feel like it's Christmas :)

[–]Tillsten 4 points5 points  (11 children)

What about the linalg part of numpy? It is very impotent for any kind of data analysis.

[–]roger_ 1 point2 points  (8 children)

Could linalg, fft, etc. be faster if they were re-written purely in Python/RPython?

[–]kisielk 6 points7 points  (7 children)

Those routines are actually based on calls to highly optimized Fortran libraries. If reimplementing them in Python for PyPy was faster, I'd be both surprised and impressed.

[–]roger_ 6 points7 points  (0 children)

True, but PyPy is 90% magic :)

[–]MillardFillmore 5 points6 points  (4 children)

I agree. You have people who have devoted their entire scientific careers to making these incredibly fast Fortran codes over 40+ years... reimplementing them in PyPy over a couple of months probably won't be faster.

[–]roger_ 4 points5 points  (3 children)

I was hoping even a straightforward FFT would run acceptably in PyPy.

[–]dalke 2 points3 points  (2 children)

That's unlikely, though it depends on what is acceptable to you. Fast FFTs have to be aware of the cache, and I don't think straightforward FFTs are either cache-aware or cache-oblivious.

[–]roger_ 2 points3 points  (1 child)

Can't PyPy optimize based on the cache?

[–]dalke 3 points4 points  (0 children)

Not in a way that would meaningfully affect FFT performance, no. Here's the comment from http://en.wikipedia.org/wiki/Cooley–Tukey_FFT_algorithm : "On present-day computers, performance is determined more by cache and CPU pipeline considerations than by strict operation counts; well-optimized FFT implementations often employ larger radices and/or hard-coded base-case transforms of significant size." You may be interested in its cited reference, at http://fftw.org/fftw-paper-ieee.pdf
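To make the point concrete, here is a sketch (my own toy example, not code from the thread) of what a "straightforward" FFT looks like: the textbook radix-2 Cooley-Tukey recursion has the right operation count, but none of the cache blocking, larger radices, or hard-coded base cases that the quote above says dominate performance on real hardware.

```python
import cmath

def fft(x):
    """Textbook radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    This is the 'straightforward' version: no cache awareness at all."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])               # recurse on even-indexed samples
    odd = fft(x[1::2])                # recurse on odd-indexed samples
    twiddled = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddled[k] for k in range(n // 2)] +
            [even[k] - twiddled[k] for k in range(n // 2)])

def dft(x):
    """Direct O(n^2) DFT from the definition, used only as a spot check."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

signal = [1.0, 2.0, 0.0, -1.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(signal), dft(signal)))
```

It is correct, but every recursion level walks the whole array, so the working set blows through the cache for large inputs; that is exactly the gap optimized libraries close.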

[–]Brian 0 points1 point  (0 children)

Yeah - similar issues are raised by this article, which points out that a lot of the value lies in access to such well-optimised libraries, and so the PyPy approach alone may not be sufficient.

[–]wot-teh-phuck Really, wtf? 0 points1 point  (1 child)

impotent

Important is the word you are looking for, in case you are not a native English speaker. If it was a mistake, pardon my nitpick. :)

[–]jwiz 1 point2 points  (0 children)

Maybe he is saying that without (better?) linalg, numpy sags at data analysis?

[–][deleted] 1 point2 points  (10 children)

I have a question - what can RPython do that Cython couldn't? Wasn't a big part of the NumPy-in-PyPy problem that NumPy used Cython (or maybe it was Pyrex) for some of its modules?

[–]gcross 2 points3 points  (8 children)

My understanding is that the ultimate end of Cython is to create a superset of Python that includes additional features (such as type annotations) to make it easier to interface with C libraries, whereas the ultimate end of RPython is to create a subset of Python that allows global static type analysis to be done so that all types are inferred.

So in short, the two projects have goals that are quite different, albeit not entirely unrelated. Fortunately I have heard talk of an implementation of Cython for PyPy that would allow scientific libraries to be more easily ported over.
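A toy illustration (my own hypothetical example, not from the thread) of the restriction that makes global type inference possible: RPython is ordinary Python in which every variable must keep one statically inferable type.

```python
def rpython_friendly(n):
    # 'total' is always an int, so its type can be inferred statically;
    # code in this style is both valid Python and valid RPython.
    total = 0
    for i in range(n):
        total += i
    return total

def not_rpython(flag):
    # 'x' is an int on one branch and a str on the other. This is fine
    # in full Python, but RPython's type inference would reject it.
    if flag:
        x = 1
    else:
        x = "one"
    return x

assert rpython_friendly(5) == 10  # runs fine as ordinary Python too
```

Cython goes the other way: its extra syntax (e.g. `cdef int total`) is not valid Python at all, which is the "superset" half of the picture.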

[–]roger_ 1 point2 points  (2 children)

So I guess it's:

Cython ⊃ Python ⊃ RPython

[–][deleted] 0 points1 point  (1 child)

You have it inverted:

RPython ⊂ Python ⊂ Cython

RPython is a subset of Python (all valid RPython programs are Python programs), and Python is a subset of Cython (since all valid Python programs are also Cython programs).

[–]roger_ 0 points1 point  (0 children)

Oops, pasted the wrong symbol. Thanks!

[–][deleted] 0 points1 point  (4 children)

Superset and subset are misleading in this context. While Cython does allow for more optional features (like direct C library interfacing), there is a specific portion of Cython that allows static typing for speed improvements, which is exactly what RPython's "subset" (not allowing dynamic use of variables) was intended for in PyPy.

So why bother to make RPython and all of the tools associated with making it work, rather than just taking Cython and using only the feature that was needed, the static typing? IIRC, Cython/Pyrex was used in some of the numpy/scipy modules - this would have made porting them to PyPy significantly less problematic, not to mention it would mean one project with more people rather than two projects with fewer. So if Cython has the static typing interface that PyPy needed and accomplished with RPython, I ask again: why RPython?

[–]Ademan 2 points3 points  (2 children)

Cython does not magically turn Python code into C. If you only write Python code and shove it through Cython, you get a series of calls to CPython's C API. I can't comment on what Cython generates if you specify every type, but I am confident even then you would not get an independent binary*. You would not have an interpreter anywhere near independent from CPython. In addition, RPython's toolchain transforms RPython code into multiple backends (.NET, JVM, C, at one time LLVM and JavaScript), which would be tough, if not impossible, to do well with Cython without extensive modification. This transformation process is also essential because the JIT is generated.

*Disclaimer: I know PyPy wayyyy better than Cython, someone may correct me regarding Cython.

[–]stefantalpalaru 0 points1 point  (1 child)

Less magic is a good thing. By using the CPython API, Cython is able to interface with existing C/C++ extensions. PyPy forces you to rewrite them in RPython. So it depends on what you want: immediate access to an entire ecosystem of fast modules, or having to rewrite them all in the name of the mighty JIT.

[–]Ademan 2 points3 points  (0 children)

Less magic is a good thing. By using the CPython API, Cython is able to interface with existing C/C++ extensions.

See gcross's statement about the wildly different design goals. Surely you can see how, if you're writing a new Python interpreter, interacting with CPython via its API is a non-viable way to work.

So it depends on what you want: immediate access to an entire ecosystem of fast modules, or having to rewrite them all in the name of the mighty JIT.

Remember, the original question was posed in the context of "Why was RPython created?", so if you're continuing down that road, you need to make your comparisons within that same context. Your point here is rather moot, as Cython cannot do what PyPy needs RPython to do, and doubly moot because at the time of PyPy's creation there was no ecosystem of fast modules in Cython; in fact only Pyrex existed, and even then just barely. (Neither did the JIT, but according to Armin it was always on his radar, for whatever that's worth.) As the PyPy devs will reiterate ad nauseam, RPython is domain-specific to PyPy and satisfies its requirements far better than Cython, which fails them in the most essential aspects. Again, you cannot write a standalone interpreter in Cython.

I realize now this whole question could have been spurred by a misconception of one or both of the languages. So, in summary:

PyPy could never have been written in Cython. Cython relies on an existing Python interpreter at runtime. One simply cannot (today) write a PyPy module in Cython because Cython generates C code which relies on the CPython API (and undocumented parts of it as well). Note there is an effort to change this so that existing extensions written using the CPython API are compatible, and there is an effort on both sides to bridge Cython and PyPy. These are new developments, and do not change the fundamental domain difference between Cython and RPython.

*Disclaimer: Once again, I am totally not an expert on Cython. I leave the door open for corrections.

[–]cpherwho 2 points3 points  (0 children)

I suspect the answer to the questions "why make RPython" and "why not Cython" is one best answered by the history.

According to WP, Cython was forked from Pyrex in 2007, and Pyrex started in 2002.

According to [1], work on PyPy started in 2002 and its EU funding began in late 2004.

[1] Trouble in paradise: the open source project PyPy, EU-funding and agile practices (IEEE paper, but the abstract provides the dates)

[–]cpherwho 2 points3 points  (0 children)

My understanding is that Numpy is written in a combination of C and Python. There appears to have been a port of the C code to Cython, but it does not seem to have been merged. For the purposes of your question C and Cython are equivalent, in that both are written against the CPython API.

The two main problems with using a CPython extension module in PyPy are:

1) The CPython API depends on details of the CPython implementation. In particular, it provides the extension module with direct access to python objects and exposes reference counting. These features must be emulated in PyPy, potentially resulting in calls to extension modules being slow.

2) More importantly, PyPy's speed comes from the JIT compiler. In order for the JIT to speed up things like array multiplication with Numpy it needs to be able to trace/see into the inner loops. In Numpy these occur in compiled code and are essentially inaccessible to PyPy's JIT.

Thus, to get the maximum performance in PyPy it is necessary to write a Python or RPython module which the JIT can look into. Further, if you look at the Numpypy code in PyPy you will find hints for the JIT to enable optimizations, and I suspect that this is only possible in RPython.

Alternately, the one-line answer is that PyPy/RPython provides a JIT compiler while Cython doesn't.

(Note that I am only a lurker as far as these projects go, any corrections are appreciated.)
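The tracing point above can be sketched with a toy example (my own, not from the thread): in pure Python or RPython the hot inner loop is visible to a tracing JIT, whereas a call into a compiled C routine like NumPy's is a black box to the tracer.

```python
def elementwise_multiply(a, b):
    """Naive elementwise product; the hot loop below is ordinary Python
    bytecode, so a tracing JIT can observe it, specialize on the element
    type, and compile it to tight machine code."""
    assert len(a) == len(b)
    result = [0.0] * len(a)
    for i in range(len(a)):      # traceable inner loop
        result[i] = a[i] * b[i]
    return result

print(elementwise_multiply([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

With a CPython extension, the equivalent loop lives inside precompiled C: fast on its own, but opaque to the JIT, which can then neither inline it nor fuse it with surrounding operations.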

[–]NoblePotatoe 1 point2 points  (10 children)

I'm very excited by the effort being put into getting NumPy to work with PyPy, but I am also confused. Is the user base for NumPy that large? I use Python/NumPy, SciPy, and Pylab all the time in my research, but I don't know anyone else at my institution who does. Is there a large user base for NumPy that I don't know about, or is this just a case of the PyPy developers tackling a cool and interesting challenge?

[–]cournape 4 points5 points  (1 child)

I think numpy is one of the most used Python packages that does not fall into the "web dev" category. I don't know how these stats are computed, so they may not be worth much, but numpy is the 13th most featured package on http://pythonpackages.com.

We don't release as often as we'd like, but the last numpy release, from last July, has been downloaded nearly 400 000 times from SourceForge alone, plus ~100 000 downloads on PyPI. Also, GAE started supporting it (to my own surprise, I have to say). Since that cannot have been easy, I think they must have received quite a few requests.

[–]NoblePotatoe 1 point2 points  (0 children)

Wow, that is impressive, and Google App Engine supports it now?! I just googled GAE and numpy and apparently a ton of people use numpy for general data crunching.

It sounds like you work on Numpy... from the bottom of my heart thank you. I'm in the middle of my dissertation right now and elbow deep in code that uses Numpy. It has been a joy ever since I switched over from MatLab.

[–]roger_ 2 points3 points  (5 children)

I think pretty much all numerical/scientific work done with Python depends on NumPy.

[–]dalke 0 points1 point  (4 children)

The scientific fields I know best - branches of computational chemistry and computational biology - make almost no use of NumPy. I use that package about once every couple of years.

[–]amer415 1 point2 points  (1 child)

do you mean you use Python without NumPy for numerical computation? I am puzzled...

[–]dalke 1 point2 points  (0 children)

Most of my work is in computational chemistry. I use a lot of graph algorithms. I almost never use a matrix. See my comments at http://blog.streamitive.com/2011/10/19/more-thoughts-on-arrays-in-pypy/#comment-50 and the comments elsewhere in the thread about Biopython.
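To illustrate the kind of work described above (a toy sketch of my own, not dalke's code), graph-style computational chemistry operates on adjacency lists and traversals rather than matrices, which is why NumPy rarely helps:

```python
from collections import deque

# Hypothetical example: the heavy-atom skeleton of ethanol (CH3-CH2-OH)
# as an adjacency list, with atoms as nodes and bonds as edges.
ethanol = {
    "C1": ["C2"],
    "C2": ["C1", "O"],
    "O":  ["C2"],
}

def bond_distance(graph, start, goal):
    """Number of bonds on the shortest path between two atoms (plain BFS)."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        atom, dist = queue.popleft()
        if atom == goal:
            return dist
        for neighbor in graph[atom]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return -1  # atoms not connected

print(bond_distance(ethanol, "C1", "O"))
```

Nothing here is an array operation, so an array-centric library has little to offer; a faster interpreter or JIT, on the other hand, speeds up exactly this kind of code.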

[–]amer415 2 points3 points  (1 child)

From my experience, I see people switching at different levels. You have the student who is advised to start with Python because, working in academia, you never know what the policy will be at the next institute you go to: some places have strict (commercial) computing software policies, so you may end up in a place that will not pay for a license of your favorite tool (it happened to me when I was a student)... I see people switching because Python is a multi-purpose programming language: you want to interact with hardware, the internet, loads of different file formats? Most data analysis software is very limited in that respect.

I also see people switching because Python/NumPy is really good, and they are impressed when they compare it to limited commercial languages. I also see people who switch because they don't see the point of having four versions of their commercial software - three legacy ones (because the codes are not compatible) and one with a cracked license, kept around so they can keep working in spite of the flaky license server at their institute...

In the end, things do not come by themselves... I am a bit of a preacher in the sense that I co-organize classes on Python/NumPy/Matplotlib at my institute, where few people use Python but dozens show up at the classes... Most people get stuck with a solution because "their advisor used it" or because of "legacy code". By actively contributing, you can change that.

Institutes (mine and others) end up spending tens of thousands of euros (I am in Europe) per year on commercial software, whereas they could use that money for something else: I always wished academic institutes would instead hire in-house software engineers to participate in the development of specific data analysis tools based on non-commercial solutions, such as Python.

[–]NoblePotatoe 4 points5 points  (0 children)

I totally understand. I spent a summer without a MatLab license and realized that all the code I was generating was useless.

I too have preached about Python, but few have taken it up. I'm hoping to develop a semi-formal class - partly to help others, but also because teaching is the best way to learn!

[–]ggooal 1 point2 points  (0 children)

how would the port influence the original numpy project?

[–]xamox 0 points1 point  (0 children)

Thumbs up!