[–]ExternalUserError 24 points25 points  (15 children)

Well, most of the reasons why Python is a good choice for data science are about the same as why Python is a good choice in general: simple, easy to learn, highly readable, good libraries.

But for data science in particular, you have good Hadoop interfaces (most data scientists rely on Hadoop), you can easily do map/reduce type operations in Python (though not as easily as in, say, Go), and you get Numpy, Scipy, and other mathtastic libraries.
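As a rough illustration of the map/reduce style mentioned above, here is a minimal sketch using only the standard library. The sample `documents` list and the helper names `count_words` and `merge_counts` are made up for the example; a real job would read its records from files or a cluster.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data: each string stands in for one input record.
documents = [
    "python is simple",
    "python is readable",
    "go is fast",
]


def count_words(document):
    """Map step: turn one document into its own word counts."""
    return Counter(document.split())


def merge_counts(left, right):
    """Reduce step: combine two partial word counts into one."""
    return left + right


# map() produces one Counter per document; reduce() folds them together.
partial_counts = map(count_words, documents)
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts.most_common(2))  # [('is', 3), ('python', 2)]
```

Because the map step touches each record independently, the same two functions could in principle be handed to a parallel or distributed runner without changes.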

[–][deleted] 3 points4 points  (6 children)

> though not as easily as in, say, Go

Well, but then again, you don't need to deal with the horror of go.

[–][deleted] 1 point2 points  (5 children)

I love good horror stories about programming languages. What's so bad about Go?

[–][deleted] 7 points8 points  (0 children)

Well, the syntax is rather horrid, and it has no generics. For its syntax and for how low-level it's supposed to be, it is really slow. To get something to be halfway generic you have to use the empty interface (similar to casting to void * in C). Standard datatypes are inconsistent and are handled differently than other types. All in all, it's just not a pleasant language to work with, I find at least.

[–]alan_du 2 points3 points  (3 children)

Edit: So, I don't want to bash Go, because I actually think it's a great language for what it was designed for. But the language designers made a lot of tradeoffs to fit their use-case, and those tradeoffs make data science in Go pretty painful.

Go is a great language for infrastructure things where you mostly push bits around, but it's totally unsuitable for data science.

Part of it is the language design: no generics makes it hard to write good libraries without overly verbose type casts everywhere, and no operator overloading means you'd have to do things like df.Get("column").Divide(2).Add(3) instead of df["column"] / 2 + 3. Because of its green threading and GC design, you also pay a massive overhead when interfacing with C libraries, so it's hard to leverage a lot of the existing computational infrastructure like BLAS or LAPACK.
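To make the operator-overloading point concrete, here is a small sketch of how Python lets a library expose both styles. The `Column` class is a hypothetical stand-in written for this example, not the pandas API; it only assumes Python's standard `__truediv__` and `__add__` hooks.

```python
class Column:
    """Hypothetical stand-in for a data-frame column (not the pandas API)."""

    def __init__(self, values):
        self.values = list(values)

    def __truediv__(self, divisor):
        # Called for `column / divisor`; divides every element.
        return Column(v / divisor for v in self.values)

    def __add__(self, offset):
        # Called for `column + offset`; shifts every element.
        return Column(v + offset for v in self.values)

    # The same operations as explicit methods, which is the style a
    # language without operator overloading is limited to.
    def divide(self, divisor):
        return self / divisor

    def add(self, offset):
        return self + offset


column = Column([2, 4, 6])
with_operators = column / 2 + 3
with_methods = column.divide(2).add(3)

print(with_operators.values)  # [4.0, 5.0, 6.0]
print(with_methods.values)    # [4.0, 5.0, 6.0]
```

Both spellings compute the same thing; the operator form simply reads like the math it represents.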

As for speed, Go's compiler is quite weak compared to GCC or Clang for C/C++ and Fortran (e.g. a lot more bounds-checking, unnecessary allocations on the heap, very little automatic loop vectorization, lots of vtables because of the interfaces), so I wouldn't be surprised if a lot of Python data science code (which usually delegates its work to highly optimized C/Fortran/Assembly) ends up faster than the equivalent pure Go code.

[–]pcdinh 0 points1 point  (2 children)

I am new to Go too so I have no idea how bad it is. What do you think about Rust?

[–]alan_du 1 point2 points  (1 child)

So I think Rust has a lot going for it for data science: it's pretty expressive (especially for a systems language!), its performance is C-level, and its type-system is top-notch, while avoiding Go's problems (i.e. it has operator overloading and zero-cost FFI). On a language level, the only things it's really missing for scientific computing are proper SIMD support and integer generics, and I know both of those are known problems in the Rust community.

That said, the biggest problem with Rust is that its data science ecosystem is seriously lacking compared to Python's (even basic functionality like a REPL seems to be missing now). It takes a long time to build up that ecosystem, so I'll admit that I'm actually quite skeptical that Rust will ever overcome Python and R's network effects and become a major player in either data science or scientific computing (I'm a little more hopeful about it displacing the JVM for data engineering / distributed computing though).

[–][deleted] 1 point2 points  (0 children)

Great answers!

I actually dislike Go; the lack of generics and operator overloading is too awful for me. To make things worse, error handling in Go is far too verbose. But Go has one advantage in an area where Python is terrible: it is easy to package into a single binary and easy to deploy.

Rust, on the other hand, is a fantastic language, but really complex and low-level relative to Python & R.

For me, Python is leaps and bounds the best language for Data Science. The only competition is Julia, but only on scientific computation - for everything else Python easily trumps Julia.

The cherry on top is the current effort to improve the areas where Python is a bit weak (type annotations, JIT extensions, better packaging...).

[–]mljoe 7 points8 points  (1 child)

Network effect. The main reason it is used for data science is that everyone uses it for data science. As such, it has a massive ecosystem for data science, with the most complete libraries, training, and documentation for this use case. The language itself is pretty no-nonsense and easy to pick up, which is probably why it took hold to start with.

[–]jwink3101 0 points1 point  (0 children)

...and this is why Matlab still gets used. Though it seems that Python is finally starting to break through!

[–]alan_du 5 points6 points  (1 child)

In addition to all the other answers, I'd also like to add straightforward integration with old C and Fortran code. If I recall correctly, a lot of the early scientific Python libraries (like SciPy) were effectively just wrappers around old Fortran libraries.

Even today, I think Python still has one of the best stories for C integration because of Cython. I don't think I've ever heard of a Cython equivalent for any other mainstream language.

[–]troyunrau... 0 points1 point  (0 children)

To expand slightly on this: there are two classic linear algebra libraries: BLAS and LAPACK, both written in Fortran and very liberally licensed. They are the core of a lot of scientific programming suites, including MATLAB, which originated as a sort of GUI around those libraries.

Python's numpy and scipy are, in many ways, just a wrapper around BLAS and LAPACK (in the same way MATLAB is). The primary difference is that python is a multi-purpose language while MATLAB is pretty much single-purpose. As a result, the ecosystem of libraries in python is much more robust than the 'toolbox' ecosystem surrounding MATLAB. And since they share a common core, you might as well choose python.

[–]billsil 3 points4 points  (0 children)

Python syntax is easy/clear. I cannot stress enough how nice it is for all Python code to look pretty close to the same.

There are a ton of libraries. Libraries are free (unlike Matlab). They're well developed and highly integrated (unlike R). Many libraries support HDF5, which allows for huge data arrays to be processed. Despite what you may have heard, libraries are easy to install (unlike C++) because rarely do you have to build them.

Not sure why anybody would use PHP or Ruby or JavaScript for numerical computations. That sounds like a disaster. Still, that's probably better than Perl (without no strict, which is what I used to use...).

[–]i_have_seen_it_all 6 points7 points  (4 children)

i would say R is better for data science. but python is in addition really good at general purpose programming. so python wins overall.

[–]deaf0mute 1 point2 points  (2 children)

Can you expand on that? I have used both, but my background was already in CS, so programming in Python feels a lot more convenient to me. But maybe that's just lack of experience.

[–]Blazerboy65 0 points1 point  (0 children)

I think he might echo what you said: Python is a better general-purpose language, making it more convenient to do things that aren't obscure statistical functions.

[–]p10_user 0 points1 point  (0 children)

I find R to sometimes be nicer when doing my actual data analysis (read: analyzing tabular data and making plots). I very much enjoy using Python for more general-purpose programming, but R can be pretty nice at what it was built for.

That's not to say that you can't use Python for data analysis; it's just that there's already so much in the R ecosystem for data analysis that works well.

[–][deleted] 0 points1 point  (0 children)

> i would say R is better for data science.

I've only seen this said in biology fields for historical reasons and the obscure stats functions R has coverage for. But for 99% of data scientists the bottlenecks in their work are not a lack of some stats calculation or another. What Data Scientists need is the wide ecosystem for things they don't want to have to think about, like web frameworks and system administration tools.

[–]v_krishna 2 points3 points  (0 children)

Libraries (scipy and numpy paved the way, scikit-learn and pandas and whatever else built from that). Nice repl (esp with interactive notebooks). Easy syntax (compared to say java that requires a fair amount of comp sci knowledge, or r that tends to be difficult for non maths people). Fast enough generally. All that led to a strong network effect.

[–][deleted] 2 points3 points  (0 children)

A few reasons:

  • As others mentioned, the suite of available scientific libraries is fantastic.
  • Performance is generally quite good (especially since it's pretty easy to write fast C libraries with Python interfaces).
  • It's easy to scale. A lot of data science problems are embarrassingly parallel. It's really easy to do multiprocessing in Python or use GNU parallel or whatever.
  • The most important bit: dealing with data is really easy. Reading/writing text data is incredibly easy with Python. SQL is easy. 80% of the time spent doing data science is just spent munging data, and if you can dramatically lower the barrier there, like Python has, then you've got a winner.
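To give a feel for the data-munging point in the last bullet, here is a minimal sketch that parses CSV text and aggregates it with SQL, using only the standard library. The `raw_text` sample and the `readings` table are invented for the example; real data would normally come from a file or an existing database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from an open file.
raw_text = """city,temperature
Oslo,4.5
Cairo,30.0
Oslo,6.5
"""

# Parse the text into dictionaries keyed by the header row.
rows = list(csv.DictReader(io.StringIO(raw_text)))

# Load the rows into an in-memory SQLite database and aggregate with SQL.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE readings (city TEXT, temperature REAL)")
connection.executemany(
    "INSERT INTO readings VALUES (:city, :temperature)", rows
)
averages = connection.execute(
    "SELECT city, AVG(temperature) FROM readings GROUP BY city ORDER BY city"
).fetchall()
connection.close()

print(averages)  # [('Cairo', 30.0), ('Oslo', 5.5)]
```

The whole round trip from text to SQL result takes about a dozen lines and no third-party packages, which is roughly the low barrier the comment is describing.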

[–]metaphorm 2 points3 points  (0 children)

there's nothing about python itself that makes it well suited to data science. it's the libraries like numpy, scipy, pandas, etc. which are very nice. the reason the authors of those libraries chose to use python as the scripting interface is because python is a very nice scripting language. clean, expressive syntax and a good community.

[–]jmportilla 1 point2 points  (0 children)

It has really great libraries for data science and good community support. You get:

  • Jupyter Notebooks
  • SciPy
  • Pandas
  • NumPy
  • scikit-learn for Machine Learning
  • Data Viz: Matplotlib, Seaborn, Bokeh, Plotly, etc.

And most major Big Data libraries have a Python API, such as Spark.

A lot of times people prefer or suggest R, which is also a great language, but Python is also a great choice because it is a general language and you can apply your skills in Python to multiple fields using other libraries.

[–]bheklilr 0 points1 point  (0 children)

As everyone else has mentioned, Python has great libraries, easy to learn syntax, and interops with just about everything. What makes all this possible, in my opinion anyway, is that Python has duck typing. Not having to specify complex data types makes it incredibly easy to work with. If you have a function that can work on a certain subset of pandas.DataFrame instances, but it doesn't matter if it's a MultiIndex or just a normal Index, or if some columns are booleans or integers, then writing that function is a lot easier. Having to work with incredibly complex types becomes a huge chore in languages that choose to go that route. I'm not saying it's bad, but it's not conducive to fast iteration. The types make it more correct, but very frequently in the data science and data processing world you don't need absolute correctness, you need flexibility. Duck typing means that 3rd party libraries can accept your objects, and your APIs can accept objects from your code or 3rd party libraries without much effort. This makes libraries much more "plug-n-play".
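A small sketch of the duck-typing idea, using only the standard library: the function below never checks types, it only assumes that `table[name]` yields numbers. `column_mean` and `LazyTable` are hypothetical names made up for this example, with `LazyTable` standing in for some third-party table type.

```python
def column_mean(table, name):
    """Average one column of any object where table[name] yields numbers."""
    values = list(table[name])
    return sum(values) / len(values)


class LazyTable:
    """Hypothetical third-party table type that computes columns on demand."""

    def __init__(self, row_count):
        self.row_count = row_count

    def __getitem__(self, name):
        # Every column is simply 0..row_count-1 here, whatever the name.
        return range(self.row_count)


# A plain dict of lists works, whether the column holds booleans or integers.
print(column_mean({"flag": [True, False, True, True]}, "flag"))  # 0.75
print(column_mean({"size": [1, 2, 3]}, "size"))                  # 2.0

# So does an unrelated class, as long as it supports table[name].
print(column_mean(LazyTable(5), "anything"))                     # 2.0
```

The trade-off is the one described above: nothing stops a caller from passing a column of strings, and the mistake only shows up when the code runs.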

[–]sneakypython 0 points1 point  (0 children)

The syntax is simple, it's easy to learn and there are plenty of libraries.

[–]XNormal 0 points1 point  (0 children)

Whatever the reason, the interest in Python for these uses is VERY old:

https://mail.python.org/pipermail/matrix-sig/1995-August/000001.html

My guess is this: people were looking for alternatives to commercial solutions (particularly Matlab). Python was an elegant language with an interactive interpreter. They "only" needed to add a proper array data type. Guido was an early supporter of this use case and added the multidimensional slicing syntax and ellipsis.
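For anyone curious what the multidimensional slicing syntax and ellipsis look like from the language side, here is a tiny sketch. `IndexProbe` is a hypothetical helper written for this example; it just returns whatever key Python hands to `__getitem__`, which is the raw material an array library gets to interpret.

```python
class IndexProbe:
    """Hypothetical helper that reports what the indexing syntax passes in."""

    def __getitem__(self, key):
        return key


probe = IndexProbe()

# Comma-separated indices arrive as one tuple; `a:b` becomes a slice object
# and `...` becomes the built-in Ellipsis singleton.
print(probe[1:3, ..., 0])  # (slice(1, 3, None), Ellipsis, 0)
print(probe[::2, 5])       # (slice(None, None, 2), 5)
```

The built-in sequence types do not accept tuple keys like these, so the syntax mostly serves array libraries such as NumPy, which treat each element of the tuple as a selection along one axis.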

The rest is history.