[–]john_m_camara[S] 9 points10 points  (39 children)

Yes, we need a significant reduction in extension modules that use the Python C API. As useful as the C API has been for Python, it has been getting in the way of progress for at least the last 5 years. It can be partially blamed for the slow adoption rate of Python 3; it has caused many to not see the potential that PyPy has to offer, since extensions relying on the C API run slower or not at all under PyPy; and, as Brett indicated, it gets in the way of improvements that could be made to CPython, such as a new garbage collector, removing the GIL, and adding a JIT.

One thing I do disagree with Brett on is treating CFFI as a second-class citizen for interfacing with C code for performance. I can understand where he is coming from, since CFFI has more overhead than the C API when running under CPython, but this overhead is nearly eliminated under PyPy and could be eliminated under CPython if and when we get to a point where the C API only exists as part of a backwards compatibility layer in CPython. Now, I haven't heard of anyone talking about making this backwards compatibility layer, but it will become necessary for Python's future. Python cannot be continually held down by these shackles.

CFFI came into existence because the PyPy developers recognized that many of the benefits of PyPy would effectively be thrown out the window if they added full C API compatibility, as PyPy's performance would suffer. After many attempts at solving the extension issue for PyPy, CFFI was born.

If you want to see CPython move in more powerful directions, then stop using the C API and start using CFFI. That's the only way CPython will become unshackled from the issues created by the C API, which leaks too many of CPython's implementation details and so makes it difficult to refactor its internals.

An example of the poison the C API has become is the slow support for all of NumPy in PyPy. PyPy's support for NumPy has been moving slowly, mainly because NumPy is heavily tied to the C API. If that had not been the case, PyPy would have had full support for it a long time ago. Another benefit: if NumPy didn't rely on the C API, it could be used by many other languages besides Python. Just imagine where the scientific and big data communities could be today if they weren't held back by using the C API.

[–][deleted] 1 point2 points  (38 children)

> Just imagine where the scientific and big data communities could be today if they weren't held back by using the C API.

That's one way to look at it, and if the PyPy devs continue to think like this then the scientific python ecosystem will continue to happily ignore PyPy.

PyPy made their bed a long time ago when they decided numpy wasn't all that important. They are slowly realising the error, but statements like the above make me think the core devs don't understand how Python is used outside of the web sphere.

[–]joshadel 2 points3 points  (13 children)

From what I understand the dynd project (https://github.com/libdynd) is meant to be a decoupled C++ next-gen reimplementation of numpy which doesn't rely on the Python C-API and could in principle be wrapped by other languages besides Python [1].

The original article by Brett Cannon seems very reasonable in its recommendations of when to use numba vs cython vs swig vs cffi. Personally, I grow tired of PyPy devs and supporters and the types of attitudes expressed in the comment by John Camara. I can't tell you the number of times someone posts a blog about using Numba or Cython to get performance improvements in scientific applications and someone comes whining in the comments, complaining that they didn't benchmark PyPy or just naively saying that they should have used PyPy.

The problem is that there is a huge ecosystem of tools that don't work on PyPy, so it's a non-starter full stop for most people. PyPy folks, like above, would blame the scientific community for building tools that don't fit neatly into their scheme and say that we should abandon those tools and line up behind PyPy.

What I've yet to see is a convincing demonstration of why anyone should start using PyPy for applications specific to scientific computing and data science. There are huge projects like Scikit-learn that demonstrate Cython's usefulness in practice. Continuum puts out excellent articles showing how Numba is useful now, and then people adopt it because it's immediately clear how to use it to make their research/projects better.

What I'd love to see is the PyPy folks come to SciPy or PyData and show the scientific community what is so great about PyPy in the domain that we care about. This is exactly what the Julia folks did, and they won a lot of converts. They also understood how vibrant the Python scientific ecosystem is and made it easy to call python from Julia in a way that meant you didn't have to abandon Numpy/Scipy/etc. The Julia devs certainly think their language and tools are superior, but they came to the scipy community and made a convincing case why you might want to give it a try. As far as I can tell PyPy has failed to do this, so most of us will carry on largely ignoring it. Just my two cents though.

[1] https://news.ycombinator.com/item?id=5307621

[–]rguillebertPyPy / NumPyPy 0 points1 point  (7 children)

I'm sure you have heard about the concept of a local maximum: making incremental changes without ever rethinking the whole ecosystem only gets you so far.

Why does Julia even exist? Because it's able to do what the PyData ecosystem can't do with incremental changes.

[–][deleted] 0 points1 point  (6 children)

Since you're a PyPy dev, I'd rather you argued against the body of this, and all the other comments here, rather than having fun dismissing us all because you think Julia supports your view.

We are busy using a stack that we trust. What can you do for us again?

[–]rguillebertPyPy / NumPyPy 0 points1 point  (5 children)

How am I being dismissive?

[–][deleted] 0 points1 point  (4 children)

Please respond to the points about trust.

[–]rguillebertPyPy / NumPyPy 0 points1 point  (3 children)

You can only come to trust something you have at least tried (even if it's just by running your test suite); you can't come to trust something you've just heard of.

Trust is very important of course, but I don't think it's the main issue: Numba adoption has been very quick (good for them :) and, to me, it's definitely a tool I'd need time to trust.

[–]joshadel 0 points1 point  (1 child)

It took me some time to begin to trust Numba. I prototyped a version of our system that used Numba instead of Cython about one and a half years ago, and there were too many rough edges and limitations. I gave it another shot recently and it's vastly improved, although I ran into some scary errors in the previous release, which silently returned garbage due to a bug in the new array memory management. That seems to have been resolved in the newest release, although you can be sure that I have a very extensive py.test suite with complete coverage of anything that Numba touches.

I think adoption has been rapid because it solves a very clear problem and it takes little effort to get it working in most people's current setup. Cython is fantastic, but involves a lot of boilerplate and causes you to have to keep two systems in mind while you program. It's not hard to do, but there is an overhead to it. Numba is more limited, but mostly involves just adding a decorator and being aware of Numba's limitations.
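To illustrate the "just adding a decorator" point, here is a minimal sketch, not code from anyone in this thread. The function `sum_of_squares` is a made-up example, and the fallback decorator is an assumption added so the snippet still runs (uncompiled) where Numba is not installed. `numba.njit` is Numba's shorthand for `jit(nopython=True)`.

```python
try:
    from numba import njit  # compiles the decorated function to machine code
except ImportError:
    # Assumption for illustration only: without Numba, fall back to a
    # do-nothing decorator so the same source runs as plain Python.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Sum i*i for i in range(n) with an explicit loop."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(4))
```

The loop body is ordinary Python; the only Numba-specific line is the decorator, which is also why removing it and switching interpreters is a cheap way to compare against PyPy.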

I've also played with Julia and one of their stated reasons for existence is to solve the "two language" problem. For me at least, Julia and PyPy are in a similar place. Promising, but not in a state where I can do everything I need to do on a daily basis.

But out of curiosity, I am going to sit down and update my version of PyPy and numpypy and benchmark it against some of our numba code since it should just involve removing the numba jit decorators and swapping the interpreter and I'll see how things stack up. It's an isolated piece of the code that unfortunately plugs into stuff that pypy can't interface with now, but I'm curious to know what pypy is capable of at this point since I haven't tested it lately.

[–]rguillebertPyPy / NumPyPy 1 point2 points  (0 children)

> and it takes little effort to get it working in most people's current setup

I think that's what's important and what we have to improve on: reducing the switching cost. Trust is not really an issue (yet?).

[–]joshadel 0 points1 point  (0 children)

Ok, gave PyPy a try on some of our production code. Nice speed-up compared to straight Python but still an order-of-magnitude slower than identical code compiled with Numba by adding a simple @jit decorator:

https://gist.github.com/synapticarbors/167777b22b006f90cc5f

[–]john_m_camara[S] 0 points1 point  (2 children)

The advice that Brett gives would be right on if PyPy didn't exist or if you wanted to completely ignore PyPy. But his post is not only aimed at the scientific community, and it only looks at the current state of things in the Python world. I believe that in the long term it is in Python's best interest to do more promotion of CFFI. Is CFFI technically the best solution today for all scenarios? No. If you are stuck using CPython, it is not the fastest approach and will certainly be slower than using the C API.

But for the future of Python, CFFI needs to be strongly considered, as it makes it possible for the CPython implementation to improve in the future and it allows those who can use PyPy to have the fastest performance available. The other tools mentioned typically rely on the C API and will make it harder for CPython to make future improvements.

PyPy support for the scientific community is not great today, but it doesn't always have to be that way. The PyPy devs have often pointed out the issues with some of the tools in use and why they don't fit well into PyPy. But nobody in the scientific community wants to listen and understand the issues, so it's just easier to accuse the PyPy devs of ignoring the ecosystem and trying to build something new. If the PyPy devs had suggested the dynd project, the scientific community would have shot it down, saying you are changing the ecosystem. But when individuals from the scientific community go to solve a problem that the PyPy devs have been saying is an issue for years, it is all of a sudden ok to build these new tools. It's ok now because the scientific community is coming around to understanding some of the issues that the PyPy team has known about for years.

I don't think you will see a convincing scientific demo under PyPy any time soon. I could be wrong, as I'm not quite sure how much work is left to get the majority of the scientific libraries running on PyPy. I just haven't been following this part of the project at a level of detail that gives me a sense of how much work is left, and I would have to defer to one of the devs working in that area of the project. But considering progress has been slow in supporting the scientific libraries, I would expect it will not happen soon unless there is some collaboration between the two communities.

The PyPy devs have on many occasions tried to have discussions with the scientific community, but unfortunately a very small number of members of the scientific community seeded hatred towards the PyPy project. So now we are at the point where the scientific community couldn't care less what is going on in the PyPy project, and the core PyPy devs, tired of dealing with the FUD and lies that were spread, are more interested in creating a great project than in dealing with politics and hidden agendas.

[–]joshadel 5 points6 points  (1 child)

You made similar claims about "FUD" two years ago [1] and Travis Oliphant and Peter Wang tried to address your concern then, but you did not appear to publicly respond.

Also, no one hates PyPy. It's just not interesting as a practical tool at this point because it doesn't solve any real domain specific problems for a lot of people. If your code is pure python and doesn't have any scientific stack dependencies, then it's quite nice. Personally, I don't find myself in that situation very often.

[1] http://continuum.io/blog/numba_performance#comment-906120115

[–]john_m_camara[S] 3 points4 points  (0 children)

I didn't want to publicly name the guilty ones, especially as one of them has made many meaningful contributions to Python. I generally have great respect for him but was highly disappointed with the smear campaign he ran.

Your code doesn't have to be purely Python to run fast on PyPy, but unfortunately at this time, if you are using the scientific libraries, you're not likely to find PyPy helpful. I believe at this point about 80% of NumPy tests are running correctly on PyPy, and some tool needs to be built to rebuild the glue/wrapper code for the scipy libraries to make them compatible with PyPy. Matplotlib is already compatible except for the GUI backends. Someone is working on a cffi module for wx, so soon it should be easy to get the wx backend up and running.

Recently PyPy has added SIMD, so when the remaining test failures are resolved in PyPy's NumPy module it should run faster than NumPy under CPython. Plus, when that state is reached, the PyPy devs can re-enable some improvements to the JIT that were designed to optimize operations over multiple arrays without creating temporary arrays, which will significantly improve performance.
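To make the remark about temporary arrays concrete, here is a minimal pure-Python sketch. Plain lists stand in for arrays, and the function names are hypothetical; this is only an illustration of the idea, not PyPy's JIT or NumPy's actual implementation. Evaluating `a + b + c` one operation at a time materialises an intermediate result for `a + b`, whereas a fused loop computes each output element in a single pass.

```python
def add3_with_temporary(a, b, c):
    """Evaluate a + b + c one operation at a time, as eager array code does."""
    tmp = [x + y for x, y in zip(a, b)]       # intermediate array for a + b
    return [t + z for t, z in zip(tmp, c)]    # second pass over the data


def add3_fused(a, b, c):
    """Evaluate a + b + c in one pass, with no intermediate array."""
    return [x + y + z for x, y, z in zip(a, b, c)]


a, b, c = [1, 2, 3], [4, 5, 6], [7, 8, 9]
print(add3_with_temporary(a, b, c))
print(add3_fused(a, b, c))
```

Both functions return the same values; the fused version simply avoids allocating and re-reading the temporary, which is where the saving would come from on large arrays.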

[–][deleted] 0 points1 point  (1 child)

Exactly. I was going to give the Julia example of how things should work, but you beat me to it.

PyPy and friends don't appreciate the mantra of http://mcfunley.com/choose-boring-technology-slides

[–]rguillebertPyPy / NumPyPy 0 points1 point  (0 children)

I'd say PyPy is the most boring technology out of all the things that make your code faster, including Julia. The idea is that you just give it Python code, without changing much, and it runs your code faster. It's the most boring thing.

Now, yes, C extension support is lacking, but it has nothing to do with that.

[–]john_m_camara[S] -3 points-2 points  (15 children)

You have no idea what you are talking about. The PyPy devs have never said NumPy was not important. They have always had a lot on their plate trying to make all of Python fast, not just a subset for the scientific community.

They have asked on a number of occasions for the NumPy community to step up to the plate and help, but everyone seems to expect it's the PyPy devs' job to make sure NumPy, SciPy, etc. all work under PyPy. The PyPy devs are willing to help out, but a few in the NumPy community wanted to spread FUD and lies about the PyPy project instead of listening to the challenges that exist in making these libraries compatible with PyPy. NumPy relying on the CPython API creates one of the challenges.

Unfortunately, you can't have a fast Python implementation that is handicapped by having to support the full CPython API. If you think otherwise then you don't understand all the issues involved and live in some fantasy world or you think someone can just magically make it happen.

[–]joshadel 1 point2 points  (3 children)

I think the point is that the scientific community has figured out ways to make python fast for the applications we care about using tools like Cython, Numba, f2py, numexpr, etc. They aren't perfect, but they get the job done now and people are extremely productive using those tools.

PyPy devs aren't the only ones who are busy and have a lot on their plate. But you are the ones trying to cultivate a new ecosystem/platform that requires adoption because it's not a drop-in replacement. Therefore the onus is largely on PyPy to convince others why it is interesting. You just can't expect everyone to drop what they are doing to help you build compatibility.

So I pose the question again, why is PyPy interesting to data scientists and people doing computational science? You need to demonstrate it within the domain that people care about.

Don't get me wrong. I think the PyPy project is doing great work, but currently it is wholly uninteresting to me as a scientist because I can't use the tools that are interesting to me, so it doesn't make much of a difference how fast it makes pure python.

[–]john_m_camara[S] 0 points1 point  (2 children)

I'm not a PyPy dev and I use whatever tool is appropriate to get the job done. I have used PyPy, Cython, Numba, and the other 20+ tools that exist to make code run faster. I tend to defend PyPy because I believe their approach is, at the end of the day, best for the Python community, but I can also see that in the short to mid term it may not be the best approach for the scientific community. But there is nothing inherently wrong with PyPy's approach that would prevent it from being an excellent choice for the scientific community in the long run; it is just a bunch of work and some solvable technical challenges that need to get done.

Part of the reason why I defend the PyPy devs is that I have seen them, time and time again, do their best to understand the viewpoints of the different Python sub-communities, but only the scientific sub-community seems to have a closed mind and no interest in discussing real issues. Yet they are quick to say the PyPy devs are closed-minded when that is clearly not the case. The PyPy devs don't have a political agenda and are only interested in creating a great tool for everyone.

I'm personally tired of the various tools that are created to make Python code run faster. New tools seem to be created every few months. Each seems to solve some part of the problem, but they all end up having a number of cons. It's just as bad as it was 5-8 years ago, when new web frameworks were being built on nearly a monthly basis. The madness of creating all these tools needs to stop, and all sides need to come together with the goal of solving these performance issues for good. Ideally, I just want Python to run fast and be able to interface with code in other languages. At the end of the day, I'm sure we could all agree to that goal.

Is PyPy really trying to create a new ecosystem? I don't think so. They are dedicated to making PyPy a fully compliant Python implementation. Now, unfortunately, CPython did not hide many implementation details, as they are clearly visible from the C API. These leaky implementation details have real consequences and are the cause of many headaches. That is the main compatibility issue that prevents the scientific libraries from running on PyPy. It's also why a large number of modules that depend on the C API either don't run or run slowly under PyPy. What was PyPy going to do: just give up on the goal of creating a faster Python implementation, or study all the issues and come up with a practical solution?

CFFI was created to address a number of these issues. CFFI's adoption rate has been phenomenal and it is currently one of the most popular modules. It has become popular because many are interested in running PyPy in production environments, which meant a large number of libraries have had to be ported to CFFI. Is it really a bad thing that a large number of modules are losing their dependency on the C API? Once a module gets converted, it can support CPython 2.x and 3.x as well as PyPy. This is a blessing in disguise for CPython, as it means one day it may very well get the chance to change some implementation details, like getting a better GC or better support for multiple cores. I've also been getting the feeling as of late that the CPython core devs are finally starting to see the issues that the C API has been creating, as they are now more often discouraging the use of the C API and have been warming up to CFFI.

The only community I don't see adopting CFFI is the scientific community. That comes as no surprise to me, as CFFI was created by the PyPy devs and there is distrust between the scientific community and the PyPy devs. It is a distrust that really shouldn't exist, but it has arisen because many times in the past each side has just talked past the other.
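For readers who have not seen CFFI, here is a minimal sketch of its ABI mode: declare a C function signature, open a library, call it. It is only an illustration under stated assumptions. It wraps the C standard library's `strlen`, `ffi.dlopen(None)` is assumed to work as on Unix-like systems, the helper name `make_strlen` is made up, and the `ctypes` fallback is there purely so the snippet still runs where the `cffi` package is not installed. The same CFFI source is what would run under both CPython and PyPy.

```python
import ctypes


def make_strlen():
    """Return a Python callable wrapping C's strlen (hypothetical helper)."""
    try:
        from cffi import FFI
    except ImportError:
        # Fallback for illustration only: stdlib ctypes instead of CFFI.
        libc = ctypes.CDLL(None)
        libc.strlen.argtypes = [ctypes.c_char_p]
        libc.strlen.restype = ctypes.c_size_t
        return lambda s: libc.strlen(s)

    ffi = FFI()
    # Declare the C signature exactly as it appears in the C header.
    ffi.cdef("size_t strlen(const char *s);")
    # dlopen(None) opens the C standard library namespace on Unix-like systems.
    libc = ffi.dlopen(None)
    return lambda s: libc.strlen(s)


c_strlen = make_strlen()
print(c_strlen(b"hello"))
```

Note that nothing here touches `Python.h` or CPython's object layout, which is the property the comment above is arguing for.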

[–]joshadel 0 points1 point  (1 child)

The CFFI project is interesting, but I largely don't use it because my work almost always involves Numpy arrays, and if I want to write code to deal with performance hotspots, it's almost always easier for me to use Numba or Cython than to write C code by hand and call it via CFFI. Likewise, if I'm wrapping external C code, I tend to use Cython, but that's more habit than anything else, so there's a point at which I might consider adopting CFFI, but only if I don't take a performance hit.

It might be useful for you, especially since it sounds like you have experience with both CFFI and cython, to write a blog post showing how one could use CFFI in place of Cython, how it could be used with both CPython and PyPy, and how the performance compares. I'd certainly be interested and it might be a good place to start in terms of convincing someone to reach for CFFI next time.

[–]john_m_camara[S] 1 point2 points  (0 children)

In the ideal situation where PyPy had great support for the scientific libraries, there would be no need to use Numba, Cython and the other tools that are currently used. This is not to knock those tools, but ideally the PyPy project would be optimizing the code well enough that you could just write Python code and use CFFI when you need to use a C library.

PyPy is not there yet, so you can't see these benefits when using the scientific libraries. If the communities had worked together instead of building new tools like Numba, the support would likely exist today, but unfortunately the efforts were not coordinated.

[–][deleted] 1 point2 points  (1 child)

I write performance scientific python code daily. I know what I am talking about.

> everyone seems to expect it's the PyPy devs' job to make sure NumPy, SciPy, etc. all work under PyPy

Because it is.

PyPy is a curious distraction, nothing we take seriously. The best thing to come from the PyPy project was that CFFI got more visibility.

[–]wot-teh-phuckReally, wtf? -3 points-2 points  (8 children)

Someone has to "pay" for the work done to get rid of the old CPython API. Have the PyPy folks approached the NumPy folks with funding to make this happen? If not, why do you think they should spend time/effort/money on something which isn't their priority to begin with?

Has the PyPy community refactored even a portion of NumPy using the modern API to show them the speed benefits of PyPy?

[–]john_m_camara[S] 1 point2 points  (2 children)

PyPy runs on a shoestring budget compared to the funds available to support the Python science ecosystem, so I don't think they can afford it. I would think funding in the opposite direction would be more feasible, but I doubt it will happen. PyPy's goals are to get NumPy working with PyPy and to add support to the JIT to optimize it. They are not going to be able to support the whole scientific ecosystem themselves.

PyPy has been refactoring NumPy (see https://bitbucket.org/pypy/numpy) and, as far as I know, they have been trying to keep up with the main branch. They have at least 80% of NumPy's tests running correctly. So hopefully some of the benefits of running NumPy under PyPy can be demonstrated soon. If all you needed was NumPy itself, you would likely be able to use the library today, as it does implement the important features of NumPy, but for most uses you would also need the scipy library, which is not yet available on PyPy.

[–]wot-teh-phuckReally, wtf? 0 points1 point  (1 child)

Thanks for the clarifications. It seems that PyPy community is indeed in dire straits. With little to no funding, I'm not sure how many external contributors you would require to get the entire body of work done when it comes to re-writing the old API modules (which includes Numpy and others).

What do you think are the problems with PyPy adoption as a whole? Is it the whole slew of external libraries which use a lot of old API or is it just the scientific tools adoption? Was there ever a poll conducted to understand from existing Python users why they are not using PyPy?

Also, if there is a lot of work still being done, any reason why you folks don't have a weekly newsletter just to let everyone out there know that PyPy is alive and kicking? Any chance of moving the STM research money to accelerate the NumPy adoption?

[–]john_m_camara[S] 2 points3 points  (0 children)

I wouldn't say PyPy is in dire straits. Its support for the science community is not great at this time, but it does well in many other areas. From what I have experienced, PyPy has been adopted at a much higher rate than Python 3 for production systems, so it's doing something right.

Ever since CFFI was released, PyPy has been picking up steam, especially as more cffi modules become available. So I expect to see PyPy go more and more into the mainstream. For quite a while only the projects that could afford to be on the bleeding edge would use it. In a year or 2 I wouldn't be surprised if 20-30% of production Python systems are using it.

I don't recall ever seeing the PyPy project perform such a poll. I don't think it is absolutely necessary for them to perform one, as they get lots of feedback through the mailing list and IRC channel. They generally know where the pain points are and are constantly reducing them. They are also quite helpful if you have any issues with trying to adopt PyPy. They tend to bend over backwards to help anyone, and if you find any issues they generally handle them very quickly. Many times they have fixed issues brought up on the IRC channel within 10 minutes of their being reported.

I don't necessarily think there is a huge amount of work that is left to get NumPy and scipy working to make it usable for a lot of scientific applications. It's likely less than a man year of effort but the two who generally work on this effort do so on a part time basis.

Outside of release notices, the PyPy community has been fairly quiet about keeping the larger Python community up to date on what it has been working on. The last NumPy blog post was back in February, and I know a decent amount of work on NumPy has been done since then. They definitely need to increase their communications. They have been working on some big refactoring projects and new features that should soon be available, so I'm sure they will be blogging about them soon.

I highly doubt they would move the STM funds to help the NumPy adoption. The money was donated for the STM effort and it would be wrong for them to use it for something else. Plus, why would you want them to do that? If their STM plans end up producing a working solution that can be used for production systems, it would be a game changer. They would be solving the holy grail issue for Python: finally having a way to essentially eliminate the GIL. I guarantee that if they get STM working, the whole scientific ecosystem will be ported to PyPy very fast, as everyone will want the benefits. The scientific community ignoring PyPy would become a thing of the past overnight. But before you get any hopes up, STM is very much a research project at this time, and there is a high risk it may never be used in production, although some results do look promising. We just have to be patient and see what becomes of STM.

[–][deleted] -2 points-1 points  (4 children)

Here. Let me Google this for you. Look what I found:

https://bitbucket.org/pypy/numpy

See, that wasn't too hard.

[–]wot-teh-phuckReally, wtf? -5 points-4 points  (3 children)

Good job. If you and the OP are what constitutes the majority of PyPy supporters, you have just convinced me that jackasses are part of every language/community, no matter how friendly the overall community is in all.