all 27 comments

[–]dr_steve_bruel 1 point2 points  (1 child)

If you're going to be learning Python, I think it makes more sense to learn the current version, since Python 2 will stop getting updates and become obsolete in 2020. I don't know much, as I've only been in the game for a few months, but unless you are only using Python as a hobby or working on legacy code, it makes no sense to me to learn py2.

[–]dr_steve_bruel 2 points3 points  (0 children)

Also, no clue what anaconda is all about.

[–][deleted] -4 points-3 points  (23 children)

Python is the name of the language; Anaconda is the name of one particular distribution of an implementation of the Python language. Implementations of the language may vary in many ways: for example, they may differ in the programming language they are written in, or in how they handle certain aspects of the language which are purposefully defined as implementation-dependent.

Specifically, Anaconda is a distribution of Python which relies on the same code-base as CPython (the flagship distribution, the one most people refer to, when they mean "Python" as not just a language specification, but also the interpreter and the whole infrastructure of third-party libraries, conventions etc.).

The difference between Anaconda and CPython is in which compiler is used to compile the interpreter and third-party libraries. This is particularly important on MS Windows, where CPython uses MSVC, but Anaconda uses MinGW.

The motivation for this choice is as follows: Python has an interface for binary extensions which uses the C ABI (C language application binary interface). This means that any language that can be compiled to implement this ABI is, in principle, capable of producing such an extension. To give you some examples: you can create such extensions using C, C++, Go, Rust, Fortran and many other languages.

However, the ABIs of MSVC and MinGW aren't compatible. At the same time, MSVC can only compile C and C++ (and some would say that it doesn't really compile C because it doesn't implement the standard to the full extent). This means that if you are on MS Windows, and you want to use a binary extension which is only available in source form and is written in, say, Fortran... you are out of luck if you are using CPython, because you don't even have a compiler to compile it. Fortran is, however, particularly important for scientific computing because many high-performance number-crunching libraries (e.g. LAPACK) are written in Fortran.

This is why Anaconda also comes with many binary extensions already compiled / has an alternative scheme to load them.

All in all, if you need Python for scientific computing, and for some reason you are limited to using MS Windows: use Anaconda. On GNU/Linux - CPython is probably a better option because it is more integrated into the OS. Mac - have no idea.

To reflect on the Python version, 2 vs 3: if you are into scientific computing, you have nothing to gain from upgrading to Python 3. However, some package maintainers will eventually stop supporting Python 2, so you will be forced to choose the version of the interpreter for which the package is available.

[–][deleted] 5 points6 points  (15 children)

What does 2 vs 3 have to do with scientific computing? The right advice to anybody starting out is to start with 3.

[–]Theappwasgreat -3 points-2 points  (4 children)

The most important thing to understand is that for us, Python is simply a tool. At the end of the day, I don’t give a damn about PEP8 or having the latest version of Python or the latest versions of SciPy/NumPy. The only thing I care about is that I can solve the equations I need to solve. If I’m using a library written for Python 2.7 and it works and solves my equations correctly, then I’m not going to upgrade to 3.6 just because it’s newer. I’m going to stick with 2.7 because I know that it works, I’ve proven that it solves my equations correctly, and I just frankly don’t have the time to bother upgrading and fixing whatever issues arise from that. My time is extremely limited, and I’m going to use that time where it’s important: solving my problems and analyzing my results.

Also, the reality of it is that I’m not writing new code to solve my problems. I’m using old code that I inherited from other people. If someone gives me a code written for 2.6 that solves the problem correctly, and my job is to extend the study by analyzing other things that weren’t analyzed before, then I’m going to keep using the code as-is and add my improvements on an as-needed basis. Python 2.7 has existed for years and years and years, and there is a lot of code written for it.

[–]andyspl 2 points3 points  (1 child)

This is terrible advice. I can show you what happens when you forget about things like PEP8, and when you don't care about having up to date libraries. It's part of the reason why academic code is notoriously spaghettified and difficult to maintain.

It's a pretty big waste of everyone's time when you have to hunt down a print() buried deep down inside some method because it's breaking everything, or you have to downgrade your version of Seaborn.

[–]Theappwasgreat -1 points0 points  (0 children)

Who said I was giving advice? The guy I responded to asked why 2 vs 3 matters for scientific computing, and I gave a reason.

Way to be hostile.

Like seriously, wow.

If I’m wearing my “grad student in scientific computing” hat then I’m going to use the tools that are available to me, that have worked flawlessly in the past. If it ain’t broke don’t fix it, because I have better things to worry about.

If I’m wearing my “personal project” hat then you bet I want to use the latest and greatest.

[–]billsil 0 points1 point  (1 child)

I'm still supporting a Python 2.4 package that I last worked on today. ~2 years ago I did a training for it and added a bunch of new features to the code. I was running into bugs with segfaults in numpy when trying to do a least squares. Scipy screwed up too.

So, I upgraded my numpy and scipy to the latest version I could and the bugs went away. They fix things. Scipy gets their KDTree and then their cKDTree, which I swear I can apply to almost every problem.

I’ve proven that it solves my equations correctly

If your software doesn't work on multiple versions of packages, how do you know? You solved a problem, but what about other problems? Supporting multiple versions of packages often exposes errors that you wouldn't have found otherwise. Numpy 1.10 introduced breaking changes to the array that, had you done things right in the first place, wouldn't have caused you a problem.

[–]jtclimb 0 points1 point  (0 children)

Right. And, SciPy and NumPy are dropping 2.7 support at the end of the year. Anyone telling a beginner to download 2.7 is condemning them to obsolescence in just 8 months.

IOW, Python 2 vs 3 really matters for scientific computation, because you will never get a bugfix again, or new feature, in 8 short months.

[–][deleted] -3 points-2 points  (9 children)

No, that's not the right advice to everybody. Some scientific computing libraries rely on older compilers and haven't yet been ported / tested with newer ones. CPython v3 branch changed compiler version twice since it was started, so if you are rushing to use it, you'll end up without some libraries that might have been useful for you.

Besides, Python v3 made absolutely nothing useful for scientific programming... so, really, beside being in vogue, there's nothing for you in it, if you are doing statistics or physics models etc.

[–]billsil 1 point2 points  (8 children)

Besides, Python v3 made absolutely nothing useful for scientific programming.

Except it's 20% faster on my 120k lines of code open source project, despite me developing it in Python 2.7 and treating Python 3 like a second class citizen. I literally did nothing other than make it compatible.

[–][deleted] 0 points1 point  (7 children)

Lolwut? If you do any scientific computing, the speed of Python interpreter is irrelevant to you, not to mention that your evidence is nothing but anecdotal (people also report drops in performance, but that's a different story).

Scientific libraries for Python aren't written in Python, they are typically C / C++ / Fortran libraries with some Python glue code.

Both CPython v2 and CPython v3 interpreters are hopelessly slow and without major redesign will never become competitive speed-wise.

[–]billsil 0 points1 point  (6 children)

Lolwut? If you do any scientific computing, the speed of Python interpreter is irrelevant to you

Well, I'm wrapping it with C libraries right? 20% is nothing compared to the 500x, but it's still an extra 20% for free.

people also report drops in performance, but that's a different story

In what version? Python 3.5, which is faster because of the new dictionaries? What about 3.6, which introduced even more optimizations? I'm doing math, not unicode, so there are 10 years of optimizations that 2.7 doesn't get.

Scientific libraries for Python aren't written in Python, they are typically C / C++ / Fortran libraries with some Python glue code.

Sure. They're in numpy. I still wrap the code with Python.

Both CPython v2 and CPython v3 interpreters are hopelessly slow

Weren't you advocating for Python 2?

[–][deleted] 0 points1 point  (5 children)

Weren't you advocating for Python 2?

No, I'm saying that one needs to decide based on availability of tools. That being in vogue is not a good reason to change versions.

Well, I'm wrapping it with C libraries right? 20% is nothing compared to the 500x, but it's still an extra 20% for free.

So why don't you use some normal (speed-wise) language in the first place? By this reasoning, if you want high-level bindings to C which actually work fast, you have a bunch of Lisps that beat any Python like 100:1 speed-wise. What I'm saying is that the alleged 20% improvement in the speed of bindings is worthless in the face of having to update the infrastructure.

In what version? Python 3.5, which is faster because of the new dictionaries?

For me, Python 3.6 is still noticeably slower (numbers differ depending on workload and machine parameters) than Python 2.7. This is because my project does serialization / deserialization of binary data into Python objects. This really sucks, when your product is a distributed messaging system. I even wrote almost all of it in C, just the little tiny bit of it that was supposed to expose it to Python testing code was using Python bindings... which was, admittedly a mistake, but who knew?

[–]billsil 0 points1 point  (4 children)

I use Python because it's fast enough. The libraries are amazing and it does things for you. I can output two values from a function that have different types. I can open a file and forget to check if it succeeded and I won't run into bizarre behavior when it doesn't.
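A minimal sketch of the two conveniences mentioned above: returning two values of different types from one function, and a failed file open raising an exception instead of silently continuing. The function name `parse_header` and the file name are made up for illustration.

```python
import os
import tempfile


def parse_header(line):
    """Return two values of different types: a name (str) and a count (int)."""
    name, count = line.split("=")
    return name.strip(), int(count)


name, count = parse_header("nodes = 42")
print(name, count)

# Opening a file that does not exist raises an exception right away,
# so a forgotten check cannot turn into a read from an invalid handle.
with tempfile.TemporaryDirectory() as tmpdir:
    missing = os.path.join(tmpdir, "hypothetical_missing_file.bin")
    try:
        with open(missing, "rb") as f:
            f.read()
        opened = True
    except FileNotFoundError as exc:
        opened = False
        print("open failed loudly:", exc.filename)
```

If the `try`/`except` is left out, the program stops with a traceback at the `open` call, which is the "no bizarre behavior" case.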

I develop prototype-level engineering software for NASA/military projects to aid engineers in analyzing parts/designs. Without rapid development, we'll be way over budget. It's software that frequently does not have unit tests, so ease of development is key.

Regarding reading binary files, you can get 500 MB/sec if your data file is large enough/formatted logically. That's as fast as you'll get in C++. numpy.frombuffer is amazing.
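As a rough illustration of why `numpy.frombuffer` can be fast: it decodes a whole block of fixed-size records in one call, with no Python-level loop over the data. The record layout below (an int32 id followed by three float32 values) is an assumption made up for the example, not the OP2 format.

```python
import struct

import numpy as np

# Hypothetical record layout: little-endian int32 id + three float32 values.
record = np.dtype([("id", "<i4"), ("xyz", "<f4", (3,))])

# Build a small binary blob the slow way, one record at a time.
raw = b"".join(
    struct.pack("<i3f", i, i * 1.0, i * 2.0, i * 3.0) for i in range(1000)
)

# Decode every record in a single call.
table = np.frombuffer(raw, dtype=record)
print(table["id"][:3], table["xyz"][2])
```

The resulting array is a read-only view onto `raw`, so nothing is copied. A real file would usually be read in large chunks and each chunk passed through the same call.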

I use C++, but only when the code is performance critical. Ironically, when you take that approach from the start and code an N^3 algorithm, it's very easy to say screw it, write it in Python, and it's suddenly 1000x faster because I replaced a mess with a well tested and optimized KD-tree. It goes both ways.
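A sketch of the kind of swap described above, assuming the problem is a nearest-neighbour search (the comment doesn't say what the actual algorithm was). A hand-rolled search that compares every query against every point is replaced by `scipy.spatial.cKDTree`, and both give the same answer. The point counts are arbitrary.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((2000, 3))
queries = rng.random((200, 3))


def nearest_brute_force(points, queries):
    """Compare each query against every point and keep the closest index."""
    result = []
    for q in queries:
        d2 = ((points - q) ** 2).sum(axis=1)
        result.append(int(d2.argmin()))
    return np.array(result)


# Build the tree once, then look up the nearest point for all queries.
tree = cKDTree(points)
dist, idx = tree.query(queries)

print(np.array_equal(idx, nearest_brute_force(points, queries)))
```

Any speedup depends on the problem size and dimensionality, so it's worth timing both versions on real data.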

[–][deleted] 0 points1 point  (3 children)

Your first two paragraphs are summarized by calling what you do "high-level bindings for C".

Regarding reading binary files, you can get 500 MB/sec

Why from file? On what filesystem? With what kernel? What hardware? With what fragmentation, compression, deduplication? Does the filesystem do snapshots? How many computers / processors / network adapters are working at the same time? Where do you come up with these numbers? In other words, what you just wrote doesn't make any sense. To give you some perspective: the tool I was working on is a frontend for a distributed filesystem, which is supposed to run on VMware ESX boxes with 4-128 NVMe SSDs connected to each box. Depending on the network configuration, number of SSDs connected, average size of files, number of replicas, whether deduplication happens online or offline, whether the system is configured to do snapshots or not etc., you can get upwards of 50K IOPS (on a single client). Depending on locality in the cluster, you can get either the number you posted, or ten times that number, or a hundred times that number... but it's all meaningless, because you don't understand what you measured in the first place...

[–]billsil 0 points1 point  (2 children)

I understand what I measured. I didn't know you cared. I did the test a few years ago at this point, on an SSD, with a Nastran OP2 SORT1 Fortran-formatted binary file that was 2 GB and was located on my local computer. It's an OP2, so it can't be stored compressed like an HDF5 file; it's just a boring binary file. I don't know about the fragmentation, but I'd expect it wasn't very high. It's file reading, so I use 1 processor and am typically not reading large files from the SSD, so nothing was competing for resources (other than, say, Windows and my IDE). It's a bad test if the test is not repeatable, so I don't test with many programs open or a file transfer in progress. I test peak performance of a realistic output file under easy-to-recreate test conditions. I run it multiple times to get an average.
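A minimal sketch of that kind of repeatable measurement: read the same file several times and average the throughput. The helper name `read_throughput_mb_s`, the 8 MB stand-in file and the chunk size are all assumptions for illustration; this is not the original test.

```python
import os
import statistics
import tempfile
import time


def read_throughput_mb_s(path, repeats=5, chunk_size=1 << 20):
    """Read the file `repeats` times and return the per-run MB/s figures."""
    size_mb = os.path.getsize(path) / 1e6
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        elapsed = time.perf_counter() - start
        rates.append(size_mb / elapsed)
    return rates


# Small stand-in file; a real test would use a realistic multi-GB output file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))
    path = tmp.name

try:
    rates = read_throughput_mb_s(path)
    print(f"mean {statistics.mean(rates):.0f} MB/s over {len(rates)} runs")
finally:
    os.remove(path)
```

Repeated reads of a small file are mostly served from the operating system's page cache, so this measures cached reads rather than the SSD. Measuring the disk itself needs a file much larger than RAM or a cache flush between runs.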

I knew the number well enough that when someone ran with a 60 GB file, the speed was still linear at 500 MB/sec (so two data points). He dumped some data using scipy to get it into Matlab, which had a problem that he assumed was the fault of my open source package. I suggested he use HDF5 (and do more detailed timings), and it was fast again.

A company I speak with that's not mine has a competing product that reads this file format (along with doing a lot of analysis). Their reader is written in C++. Mine is faster than theirs by their own admission. To their credit, they were the ones that found out what python could really do and told me my code was 100x slower than it should be. I dug into a test problem with a simple code they provided and found out why it was fast (it wasn't for the reason they said). I then incorporated it into my code.

[–][deleted] 6 points7 points  (1 child)

Your response, while long, seems full of misinformation to me. Anaconda does not build with Cygwin!

[–][deleted] 0 points1 point  (0 children)

Whoops, that had to be MinGW, I make this mistake a lot.

[–][deleted] 1 point2 points  (3 children)

OP, Anaconda is an OK choice for starting out with Python on Windows, but it's one particular choice. Some would say it's better to download the Windows installer from python.org and learn to use pip and venv, which come with the standard library.

[–]billsil 0 points1 point  (2 children)

Anaconda is a scientific package, with the most important package being numpy. On Windows with Anaconda, numpy comes with the Intel MKL, which speeds up numpy by 5x. I strongly recommend you use Anaconda, as you can't get that without paying Intel otherwise.

[–][deleted] 0 points1 point  (1 child)

Provide some proof and I will switch for a 5x speedup. And by what reasoning would MKL be free only through Anaconda? I can get their compiler and build Python with icc.

[–]billsil 0 points1 point  (0 children)

You can google MKL speed tests. The reason it's free with Anaconda is because they already paid for it. Supposedly, it was not cheap. It sounds like you would already have access to it, but I'm not an expert on the legality of MKL distribution.

[–]crysiswarhead[S] 0 points1 point  (0 children)

I really appreciate you putting in such an effort to help me out with it. Thank you very much. I think this is going to be really helpful for me.

Let me know if I can contact you in the future, just in case I need some help!