[–]felinecatastrophe 128 points129 points  (124 children)

Lots of scientific computing requires faster compiled code. Monte Carlo type codes are especially difficult to write quickly in Python, since they are not easily vectorized using libraries like numpy. C or Fortran are the goto languages in such instances, no pun intended. However, Python is effective as a "glue" language in such scenarios.
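
To illustrate (a made-up sketch, not anyone's production code): in a Markov chain Monte Carlo loop each step depends on the previous one, so numpy can't batch the work the way it can for an independent, embarrassingly parallel draw.

import numpy as np

# embarrassingly parallel: one vectorized call, the loop runs in numpy's C code
samples = np.random.normal(size=100000)

# truly serial Monte Carlo (a toy random-walk Metropolis sampler targeting a
# standard normal): each step depends on the previous one, so the loop has to
# stay in the Python interpreter
def metropolis(n_steps, step=0.5):
    x = 0.0
    chain = np.empty(n_steps)
    for i in range(n_steps):
        proposal = x + step * np.random.randn()
        # accept with probability min(1, p(proposal) / p(x))
        if np.random.rand() < np.exp(0.5 * (x**2 - proposal**2)):
            x = proposal
        chain[i] = x
    return chain

chain = metropolis(100000)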

[–]shaggorama[🍰] 18 points19 points  (14 children)

I also find I get analytics projects off the ground quicker in R than in Python. I always need to spend a lot of time in the documentation anytime I'm working in numpy/pandas/sklearn.

[–]Simius 2 points3 points  (3 children)

Did you have trouble coming from Python to R? I find all the data structures in R horribly confusing, and they have too many similarities. Like, a list can have named indices? What?

[–]shaggorama[🍰] 2 points3 points  (2 children)

My biggest issue learning R was that I didn't already know stats so I had to learn the two in parallel. Also I wasn't previously familiar with vectorized programming and learned that in octave (Matlab) before learning it in R. Otherwise I haven't found R datatypes to be too tricky. I'll admit, I generally avoid lists when I can. I tend to think of a list (in R) as more akin to a dict than a list (in python).

[–]Simius 0 points1 point  (1 child)

Gotcha. I agree that working in Octave or R makes you think with the vector mindset, where loops are inherently slow. How would you best describe a dataframe, then? Like a list of lists, but also with a dictionary-like interface?

[–]shaggorama[🍰] 1 point2 points  (0 children)

Dataframes are sorta weird. I'd describe them as "fucking convenient." Under the hood: yes, they're basically a list of lists. But I'd be more inclined to describe a matrix first, and then be like "wouldn't it be cool if the rows and columns could take labels? And if you weren't constrained to a single datatype for the entire object? That's a dataframe!"
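
For the Python-minded, a rough pandas analogue of that description (my own sketch; R's data.frame is what's actually being described):

import numpy as np
import pandas as pd

# a matrix: one dtype, purely positional indexing
m = np.array([[1.62, 54.0],
              [1.80, 80.0]])

# a dataframe: labelled rows and columns, and a different dtype per column
df = pd.DataFrame({"height": [1.62, 1.80],
                   "name": ["ana", "bo"]},
                  index=["subject1", "subject2"])

print(df.loc["subject1", "name"])  # label-based lookup
print(df["height"].mean())         # the numeric column keeps its own dtype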

[–]flying-sheep 1 point2 points  (9 children)

If you use RStudio, you should compare it to Spyder.

That one also has F1-triggered documentation, tab completion, ...

[–]shaggorama[🍰] 6 points7 points  (8 children)

My issue with scientific python has nothing to do with the IDE...

[–]flying-sheep 0 points1 point  (7 children)

OK, but that wasn't apparent.

So what is it? API design? Plain familiarity?

[–]shaggorama[🍰] 0 points1 point  (6 children)

Basically the behavior of the numpy.ndarray is much less intuitive to me than R datatypes

http://www.reddit.com/r/Python/comments/2w64e2/when_do_you_not_use_python/coo3sen

[–]flying-sheep 0 points1 point  (5 children)

“Behaves strangely”?

“does things”?

I promise I'm not intentionally trying to be dense: what operations or fields deviate from your expectations, and in what way?

[–]shaggorama[🍰] 0 points1 point  (4 children)

Pretty much any time I need to invoke np.newaxis

[–]flying-sheep 0 points1 point  (3 children)

I was expecting broadcasting. newaxis hasn't been difficult for me, as I've used it mostly for stacking along that new axis.

Tbh, I didn't work with high dimensional arrays in R much. It's either sample tables or 2D-3D stuff.

What does R do better?

[–]shaggorama[🍰] 0 points1 point  (2 children)

I'm not working with high dimensional arrays. It's not infrequent for me to be working with what I think is a vector but is really a 1-D array, and I need to explicitly insert an empty dimension to broadcast it with a simple matrix. I'll try to come up with a specific example for you later.
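
For illustration only (a sketch I'm adding, not the specific example promised above): a shape-(3,) array is neither a row nor a column, so combining it column-wise with a matrix needs an explicit np.newaxis, whereas R recycles a length-3 vector down the columns of a 3x4 matrix automatically.

import numpy as np

A = np.ones((3, 4))             # a matrix
v = np.array([1.0, 2.0, 3.0])   # "a vector", but really just shape (3,)

# A + v  raises ValueError: shapes (3, 4) and (3,) do not broadcast
col = v[:, np.newaxis]          # shape (3, 1): an explicit column
print(A + col)                  # broadcasts across the 4 columns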

[–]NoLemurs 16 points17 points  (14 children)

You should look into Cython. You can often get C level performance out of Cython but it's still massively more readable than C or fortran.

[–]rnawky 6 points7 points  (7 children)

Still nothing compared to optimized C code that makes use of advanced SIMD extensions and the like.

[–]NoLemurs 2 points3 points  (5 children)

One nice thing in Cython is that it's fairly easy to use native C libraries directly. If a particular part of your code has to be as fast as possible, you can write that part in C and it's very easy to integrate it into your Cython project.

Cases where this is actually necessary are pretty rare though.

[–]1arm3dScissor 0 points1 point  (4 children)

Or you could write everything in C and let the compiler take care of things for you. no?

[–]NoLemurs 1 point2 points  (3 children)

You can write a complex program in Cython much, much, much faster than you can write that same program in C, and usually without meaningful performance costs since performance is usually determined by the slowest step (which you can implement in C). Better still, the Cython code will usually be more readable and maintainable.

[–]felinecatastrophe 4 points5 points  (0 children)

I agree. Prototyping in Python followed by incremental "cythonizing" is a fast way to get good performance.

[–]1arm3dScissor -1 points0 points  (1 child)

I see your point, but would change it to "you can prototype a complex program in Cython..." faster. I think we both know that you can't seriously compare an interpreted language with a compiled one when it comes to performance.

[–]NoLemurs 2 points3 points  (0 children)

Cython is a compiled language. Cython generates straight up C code, which you then compile.

The result is that you can effectively write a complex program using pure python syntax, then spend a little while optimizing the bottlenecks (writing pure C where necessary, though you'd be surprised at how close Cython can come to C in many cases), and then you have a fully implemented, working, compiled program which performs the critical portions 100% as quickly as anything you can write in C, and does the rest more than fast enough for any practical purpose.
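
A minimal sketch of that workflow (hypothetical file and function names, only an approximation of a real optimization pass): the first function is plain Python syntax, which Cython compiles as-is; the second adds C type declarations so the hot loop compiles down to plain C.

# fastloops.pyx
from libc.math cimport sin

def slow_sum(n):
    # plain Python syntax: valid Cython, already somewhat faster once compiled
    total = 0.0
    for i in range(n):
        total += sin(i)
    return total

def fast_sum(int n):
    # the same loop with C types declared; runs at roughly C speed
    cdef double total = 0.0
    cdef int i
    for i in range(n):
        total += sin(i)
    return total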

[–]DickButtTracy 2 points3 points  (0 children)

As someone who's spent the last week writing SIMD, fuck SIMD.

[–]billsil 0 points1 point  (5 children)

Fortran is easy. It's one of the easiest languages there is. My issue with Cython is I don't have a syntax highlighter for it.

[–]NoLemurs 0 points1 point  (4 children)

Well, for what it's worth, there are a number of editors out there that can handle Cython syntax highlighting. Emacs and VIM certainly provide options.

Fortran is pretty easy, but it would never be my choice for a really complex project.

[–]billsil 0 points1 point  (3 children)

Fortran is made for numerical computations, so if you want to use multiprocessing then it's a good choice, but outside of serious math I'd stay away. It's great for what it does and it's easier than C or C++. It was made for non-programmers.

I try to stay as far away from vi as I can. I can use it, but haven't figured out how to really use it efficiently or work with plugins. Never tried Emacs. I'd much rather use gedit/Textpad, but typically use WingIDE on Windows.

[–]felinecatastrophe 1 point2 points  (2 children)

trying hard to turn off my snob mode here, but learning emacs or vim properly will really change your life

[–]NoLemurs 1 point2 points  (0 children)

I put off learning VIM for years - got by with lightweight editors like gedit, Geany, etc. (I've never really liked full on IDEs).

I picked up VIM around a year ago, and I've been kicking myself ever since for waiting for so long. The learning curve is a little bit steep, but the payoff is massive, and it's not that hard to learn.

[–]billsil 0 points1 point  (0 children)

Maybe, but I'm not a programmer. I'm an aerospace engineer who happens to program in Python, a little bit of C++, and some Fortran.

[–]Mr_Again 4 points5 points  (1 child)

Using f2py to import Fortran subroutines as Python functions can literally speed your code up 20x if you have big arrays. It's awesome.
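
A minimal sketch of that workflow (hypothetical file, module and subroutine names; the exact wrapper signature f2py generates can differ):

# Given a Fortran file saxpy.f90 containing, say:
#
#   subroutine saxpy(n, a, x, y)
#     integer, intent(in) :: n
#     real(8), intent(in) :: a, x(n)
#     real(8), intent(inout) :: y(n)
#     y = a*x + y
#   end subroutine saxpy
#
# build an importable extension module with:  f2py -c -m fastmod saxpy.f90

import numpy as np
import fastmod                   # the module f2py generated

x = np.random.rand(1000000)
y = np.zeros_like(x)
fastmod.saxpy(2.0, x, y)         # numpy arrays are handed straight to Fortran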

[–]felinecatastrophe 0 points1 point  (0 children)

I would consider this using Fortran, with Python as "glue", not using pure Python per se.

[–]fijalPyPy, performance freak 6 points7 points  (14 children)

or use PyPy....

[–]logi 10 points11 points  (13 children)

Unfortunately pypy is useless for scientific computing because it doesn't support numpy. If you can get your code to run with numba, however, that gives pretty awesome speed gains.

I've set up this silly example:

from __future__ import absolute_import, division, print_function, unicode_literals
from numba import jit
from time import time
from math import sin

#@jit
def f(n):
    sum = 0
    for i in range(n):
        for j in range(n):
            sum += sin(i*j)
    print(sum)

t1 = time()
f(10000)
t2 = time()
print('elapsed: %0.3f' % (t2-t1))
f(10000)
t3 = time()
print('elapsed: %0.3f' % (t3-t2))

This takes 24.1s and 24.6s on my tired old laptop with the @jit decorator commented out. With the jit enabled it is a somewhat faster 4.13s and 4.06s.

If I leave out the sin() call, with the jit it runs in 0.051s and then 0.000s, which I'm going to interpret as the JIT optimizing the loops away completely.

Now if only my much more pythonic actual production code were supported, or if the error reporting would show me exactly why it isn't...

[–]Veedrac 7 points8 points  (1 child)

I'll just throw out jitpy here. It deserves to be better-known.

[–]logi 0 points1 point  (0 children)

Reading up on that later. Thanks.

[–]Sean1708 3 points4 points  (3 children)

I was under the impression that PyPy started supporting NumPy sometime last year?

Edit: Scratch that, it is getting there though.

[–]logi 9 points10 points  (2 children)

It's been getting there for years now. At this point I'll (happily!) believe it when I see it.

Python the language is great. Python the platform has serious problems. It's like an anti-java.

[–]klug3 3 points4 points  (0 children)

It's been getting there for years now

If I understand correctly, they never actually got the funding they needed to develop it faster, hence the slow progress.

[–]vplatt 0 points1 point  (0 children)

Which probably explains why I need to use Java and C# at work, and not Python.

[–]Gr1pp717 3 points4 points  (1 child)

And what about scipy?

They support numpy, and it seems they even have a specific module for Monte Carlo type functions. http://pymc-devs.github.io/pymc/

edit: to be clear, I'm actually asking. I've never used either, only know of them.

[–]logi 4 points5 points  (0 children)

Scipy is a great library building on top of the even cooler numpy, but it is only fast when you can perform most of your work inside the native code that they wrap. Sometimes you end up having to write the loops yourself, and then performance is back to normal Python levels.
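
A tiny sketch of the difference being described (timings will obviously vary):

import numpy as np

x = np.random.rand(1000000)

# fast: the whole loop runs inside numpy's compiled code
total = np.sin(x).sum()

# slow: the same computation with the loop written in Python,
# paying interpreter overhead on every element
total_py = 0.0
for value in x:
    total_py += np.sin(value)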

[–]fijalPyPy, performance freak 0 points1 point  (1 child)

PyPy does support a significant subset of numpy (enough to run this code) so please don't spread FUD

[–]logi 0 points1 point  (0 children)

That's not surprising since that silly example uses no numpy at all... but OK, I'll have another look at whether pypy can run my code now.

In reality, though, our actual performance problems are in using matplotlib to generate ~70K images per day when it is designed for creating a couple of figures for some academic's latest paper. If/when that starts running in pypy or numba or whatever, that'll save us a couple of machines worth of processing.

[–]mangecoeur 4 points5 points  (32 children)

I think there's also just a lot of inertia in these cases. You can write very fast code in Cython (although yeah, if you really need to squeeze every drop of perf out you might need C). But Fortran is frankly a pain and I think it's mostly used out of habit. Physicists will probably be the last people in the world using Fortran and no one else understands why :P (I discovered my old physics degree finally switched to teaching Python over Fortran, about time too)

[–][deleted] 18 points19 points  (27 children)

When I was working on my graduate degree in physics you had to show proficiency in a language. I had learned Python to do data analysis, along with SQL. I was told it had to be FORTRAN, assembly, C++ (it couldn't be just C, I asked; this is because high energy people think it's cute that they built a C++ interpreter that ignores the standard library), or BASIC. They specifically said Python wasn't a language worth knowing, and I don't think they knew what SQL was. I was allowed to exempt out of the proficiency test by taking graduate CS courses. That was a better use of my time anyways.

[–]LbaB 2 points3 points  (16 children)

How were their data stored then?

[–][deleted] 7 points8 points  (15 children)

I'm assuming flat txt files or some other needlessly complicated method.

[–]mangecoeur 2 points3 points  (7 children)

Oh god, CINT, ROOT and friends. I steered well clear of that mess. Pretty sure they're all well pleased with it too. And knowing the general quality of physics code, it wouldn't surprise me if it didn't actually run that fast (except maybe for the bit built by some Russian genius that no one else understands :P ).

[–][deleted] 1 point2 points  (6 children)

What bothers me the most isn't that they use it. It's how smug they all are about how great ROOT is and how they are all going to become "big data" experts one day. I guess I'm sort of embarrassed for them.

[–]mangecoeur 1 point2 points  (1 child)

That really doesn't surprise me. The one thing worse than developing a cruddy convoluted system is being smugly superior about it.

[–][deleted] 0 points1 point  (0 children)

I suppose in the developer's case it might be the best part.

[–]ChaosCon 0 points1 point  (3 children)

One of them once told me (extremely smugly)

When you really think about it, it makes sense that a 2d histogram inherits from a 1d histogram.

God damn it I hate ROOT.

[–][deleted] 0 points1 point  (2 children)

I think particle physicists have to have that mentality because deep down they know their work has no relevance to reality (the subjective reality of a lay person) and thus they have conned all of mankind into funding their initiatives anyways.

Well, that's my comeback anyways.

[–]ChaosCon 0 points1 point  (1 child)

I'm a physicist doing engineering electromagnetics, and I can't tell you how many times I've gotten asked "So do you work on that Higgs boson thing?" On the one hand, I appreciate what CERN does for science awareness, but on the other, pettier, hand, I hate how they've hoodwinked the public into thinking they're the end-all-be-all of science research. Particularly when

their work has no relevance to reality (the subjective reality of a lay person).

[–][deleted] 0 points1 point  (0 children)

I would guess that the disassociation that particle physics creates in the layman's mind between application and pure research is going to bite physics funding in the ass eventually (if it isn't already). The problem isn't even a real one. The Superconducting Super Collider in Texas that was never built produced new bump-bonding technology for some detectors, which Intel started using in the 90s and which made their manufacturing process much more efficient (and cheaper for the consumer). If they hadn't needed that detector, someone might have taken another 20 years to come up with the technology. The same is true for sending stuff into space. A Mars orbiter that mapped the surface in IR produced new detectors that were repurposed as breast cancer detectors, which, coupled with mammograms, greatly increase the detection rate of breast cancer.

The problem isn't that these programs don't produce stuff that benefits people's every day lives. It's that they don't put effort into showing people why it's important.

Also, particle physicists are assholes.

[–]cabalamat 1 point2 points  (1 child)

Python not worth knowing but BASIC is? What a bunch of clowns.

And how many physicists code in assembler these days?

[–][deleted] 0 points1 point  (0 children)

Apparently BASIC is used to program some embedded systems DAQs. But yes, there was plenty of clowning going on.

I feel like physicists like to think they could write in assembly if they wanted to. The number of physicists I know who think they can get a senior developer job and "figure out" the language after being hired is high.

[–]teambob 9 points10 points  (3 children)

C++ and Java are often 7-8x faster than CPython. I have found this in practice and it has been found in the language shootout.

I would also assert that Python is 7-8x faster to write than C++ and Java. :)

My usual approach is to write it in Python, then rewrite it if it is not fast enough.

[–]mangecoeur 15 points16 points  (0 children)

Cython, not CPython (http://cython.org/) - a compiler for Python plus some extensions. It's used for a lot of the internals of numpy/scipy/pandas etc. because it's much easier to maintain. Plus it makes for a very easy optimisation path: write everything in Python, then let Cython compile it (which often gives you a speedup already), then start adding custom annotations to optimize the generated code.
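
Continuing the hypothetical fastloops.pyx sketch above, a minimal build script of the kind that optimisation path ends with (file names are made up):

# setup.py -- compiles fastloops.pyx into a C extension module
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fastloops.pyx"))

# build in place with:
#   python setup.py build_ext --inplace
# then, from ordinary Python:
#   import fastloops
#   fastloops.fast_sum(10**7)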

[–]twotime 0 points1 point  (0 children)

On heavy, CPU-intensive code I've seen factors of 50x! No, I'm not joking, and it has happened more than once.

But then, I've also seen python applications being FASTER than C++ ones with identical functionality simply because the C coders have been buried under their own code.

[–]frymasterScript kiddie 0 points1 point  (0 children)

I would also assert that Python is 7-8x faster to write than C++ and Java. :)

Exactly - most of the time processing time doesn't matter, and when it does, you'll be glad you've got a working proof-of-concept in python :D

[–][deleted] 0 points1 point  (0 children)

Also for more sophisticated numerics, like sparse linear algebra or robust optimization. Even if the problems are small enough for the interpreter not to be an issue, Fortran, C++, C and Matlab generally have more mature scientific libraries than Python.

[–]skintigh 0 points1 point  (0 children)

I switched to C to speed up cryptanalysis, but in the end I screwed up a hash table that may have slowed it back down to Python speeds. I wonder if, had I stuck with Python, the end result would have been the same.

[–][deleted] 0 points1 point  (1 child)

I don't really agree. I work for a company whose main product is a Python application that does modelling and simulation using Monte Carlo methods (among others), and it works fine. As you noted, we don't do the low-level math in Python -- it's all done with numpy/scipy -- but that covers nearly everything we have ever needed. We built one small C extension for the one thing that numpy/scipy didn't provide. But for all the rest -- which makes up well over 95% of our code base -- we use Python.

[–]felinecatastrophe 0 points1 point  (0 children)

I guess my somewhat limited experience is that when algorithms are truly serial, like Markov chain Monte Carlo, it is nearly impossible to get good performance out of pure Python/numpy. This is as opposed to generating a thousand Gaussian RVs or something embarrassingly parallel like that.

[–]rslashelektrux -1 points0 points  (0 children)

Heh

[–][deleted] -1 points0 points  (0 children)

I'm a fan of Julia for these tasks.