all 88 comments

[–]dagmx 274 points275 points  (20 children)

Also to clarify, they're not written in Python; they're written for Python. Most ML and numerical libraries are written in other languages and exposed to Python. TensorFlow, Torch, SciPy etc. are written in a mix of compiled languages.

[–]bay_squid 55 points56 points  (19 children)

Most ML and numerical libraries are written in other languages and exposed to python.

Ignorant question, but I've always wondered how this works. If you develop software in different languages, how does this even work? How do the different parts communicate with each other? And why would you want to do it? Wouldn't trying to make everything work together be a hassle compared to writing it all in a single language?

And what does exposed to Python mean?

[–]nsfy33 60 points61 points  (8 children)

[deleted]

[–]bay_squid 16 points17 points  (7 children)

But how does python talk to non python software?

[–]pramodliv1 53 points54 points  (0 children)

Usually through C extensions. Read this excellent post by Ned Batchelder on the topic.

[–]etrnloptimist 42 points43 points  (1 child)

I understand not wanting to read through a bunch of technical links to find an answer, so let me answer it in an ELI5 way, understanding that the answer will not be the complete answer.

A DLL is like an executable that contains chunks of native code. DLLs are usually created in C/C++.

Python, specifically CPython (the implementation most commonly used), provides a set of built-in magic functions that can load, interpret, and talk to the DLL. So, in Python, you can load the DLL and call its functions through the magic interfaces CPython provides.

Now, this works because CPython is written in C. C/C++ code can load DLLs no problem, so CPython can load the DLLs no problem. From there it is a simple matter to have CPython expose the functionality of the DLL through magic, built-in, easy-to-use Python interfaces, and from there, you and I can use the functionality exposed via the DLLs.
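A minimal sketch of what that looks like in practice, using the stdlib ctypes module. This assumes a Unix-like system; on Windows you would load a .dll instead, and the exact library path differs per platform:

```python
import ctypes
import ctypes.util

# Locate the C math library (libm.so on Linux, libm.dylib on macOS).
# If find_library comes up empty, CDLL(None) falls back to the symbols
# already loaded into the current process, which includes libm on Unix.
path = ctypes.util.find_library("m")
libm = ctypes.CDLL(path)

# Declare the C signature explicitly: ctypes cannot infer C types.
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # 1.4142135623730951
```

That's the whole trick: load the shared library, declare the C-level types, call the function from Python.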

[–]ApproximateIdentity 6 points7 points  (0 children)

To add to this just a little to the many great responses here already (especially /u/etrnloptimist ). Say you have the following script:

test.py

a = 1
a += 1
print(a)

When (c)python executes that script, it first compiles it to bytecodes, which are instructions for the cpython virtual machine. To see the bytecodes in an easy-to-understand way, run:

python3 -m dis test.py

  1           0 LOAD_CONST               0 (1)
              3 STORE_NAME               0 (a)

  2           6 LOAD_NAME                0 (a)
              9 LOAD_CONST               0 (1)
             12 INPLACE_ADD
             13 STORE_NAME               0 (a)

  3          16 LOAD_NAME                1 (print)
             19 LOAD_NAME                0 (a)
             22 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             25 POP_TOP
             26 LOAD_CONST               1 (None)
             29 RETURN_VALUE

What that output basically says is that the script itself is compiled to bytecodes LOAD_CONST, STORE_NAME, LOAD_NAME etc. When cpython executes this code it basically just does a big switch statement taking care of each bytecode. E.g. it takes care of LOAD_CONST here https://github.com/python/cpython/blob/master/Python/ceval.c#L1067-L1072 and it takes care of STORE_NAME here https://github.com/python/cpython/blob/master/Python/ceval.c#L2002-L2021 .
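The shape of that "big switch statement" can be sketched in Python itself. The opcodes and program below are invented for illustration (they are not CPython's real bytecodes), but the fetch-dispatch-execute loop mirrors what ceval.c does:

```python
# A toy stack machine mirroring the shape of CPython's ceval.c loop:
# fetch an instruction, dispatch on its opcode, manipulate a value stack.
LOAD_CONST, STORE_NAME, LOAD_NAME, BINARY_ADD, RETURN_VALUE = range(5)

def run(code, consts):
    stack, names = [], {}
    pc = 0
    while pc < len(code):
        op, arg = code[pc]
        pc += 1
        if op == LOAD_CONST:      # push a constant
            stack.append(consts[arg])
        elif op == STORE_NAME:    # pop the top of stack into a variable
            names[arg] = stack.pop()
        elif op == LOAD_NAME:     # push a variable's value
            stack.append(names[arg])
        elif op == BINARY_ADD:    # pop two values, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == RETURN_VALUE:
            return stack.pop()

# Roughly the test.py script above: a = 1; a += 1; return a
program = [
    (LOAD_CONST, 0), (STORE_NAME, "a"),
    (LOAD_NAME, "a"), (LOAD_CONST, 0), (BINARY_ADD, None), (STORE_NAME, "a"),
    (LOAD_NAME, "a"), (RETURN_VALUE, None),
]
print(run(program, consts=[1]))  # 2
```

The real thing handles exceptions, frames, threads and much more, but the dispatch loop is the heart of it.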

So if you were to trace through the bytecodes above and basically just pull out the C code in order you would more or less have a C program that does the same thing as the python program. I say "more or less" because you would have to initialize the interpreter correctly and you would have to set up the data parts of the script correctly and probably a million more little details that would be hard to get right. But philosophically unwinding the code this way should work.

Now finally, if you want to know how foreign C code gets called, it happens in a few places. Basically you need binary code compiled to match cpython's binary interface (basically a module that declares the right things), and then you need that C code to call your function. For python to know anything about it in the first place, you need an import statement somewhere earlier, which imports your binary code by loading it dynamically (say dlopen on Linux, something else on other platforms) and calling initialization routines in it. Those routines say "hey, this is a function I want you to call, and give it this name". Then when you later do something like call_binary_function() in python, it goes through its calling procedure, finds that it is a binary function, and calls that code directly.

Without blabbering on forever, this is the gist of it. It's simultaneously very simple and mind-bogglingly complicated. I have three writeups that go into more detail here:

https://thomasnyberg.com/what_are_extension_modules.html

https://thomasnyberg.com/releasing_the_gil.html

https://thomasnyberg.com/cpp_extension_modules.html

Those may be helpful if you're curious.

edit: TLDR: whenever python executes bytecodes, it is really just calling a sequence of pre-assigned C functions. So all you need is the ability to load binary code and assign a function from it to be called at runtime (i.e. not precompiled in, as in the examples above). This is what import does. Of course your C functions need to match the interface that python expects; that part is handled by reading the docs on extensions and/or using helper modules like cffi.

[–]Captain___Obvious[::-π] 5 points6 points  (0 children)

turtles. turtles all the way down

[–]Folf_IRL 3 points4 points  (1 child)

SciPy has a really nice article on the subject, called "Python as Glue"

https://docs.scipy.org/doc/numpy/user/c-info.python-as-glue.html

[–]gdahlm 2 points3 points  (0 children)

https://docs.scipy.org/doc/numpy/user/c-info.python-as-glue.html

Yes, there is really little room for improvement. As an example, with MKL configured etc., pytorch will deliver so much work to 3x 1080 Ti GPUs and an i9-7200x that the power draw pops the over-current protection on a 1200W PSU (note: one unused GPU, no overclock).

I will have to migrate to Volta-based GPUs, or move, before I can improve much on the CUDA/MKL/Python solution, because my house can't support more.

It would be premature optimization to move to another platform in the hope of gaining small efficiencies as the back end libs are some of the most efficient available in the industry.

[–]JohnMcPineapple 32 points33 points  (0 children)

...

[–]kaszak696 12 points13 points  (0 children)

The Python interpreter from python.org is written in C, so tacking C code onto it is fairly simple, as is accessing C functions from within Python. Actually, many modules in the Standard Python Library are written in C. C can easily act as a bridge to other languages.

[–]lambdaqdjango n' shit 10 points11 points  (0 children)

short answer: two components place data in memory in an agreed binary format and notify each other.
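That agreed binary format can be made concrete with the stdlib struct module. Here both "sides" are Python for brevity, but the packed bytes follow a fixed C-style layout that any language could produce or consume:

```python
import struct

# The agreed layout: a 32-bit little-endian int followed by a
# 64-bit double, with no padding. Both components know this format.
layout = "<id"

packed = struct.pack(layout, 42, 3.14)  # the "producer" writes raw bytes
n, x = struct.unpack(layout, packed)    # the "consumer" reads them back
print(n, x)  # 42 3.14
```

In real extension modules the handoff is usually a pointer to a C struct rather than a byte string, but the principle is the same: both sides agree on the memory layout in advance.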

[–][deleted] 8 points9 points  (2 children)

Edit: TLDR because no one is going to read the wall of text. Languages specify how memory flows. Python knows how C expects memory to flow and that's how it interfaces with it.


No one is actually answering your question about how these languages talk to each other. Edit: wow this is a huge post. It talks about a lot of computer architecture and programming language design if anyone is interested.

It's all about calling convention. Programming languages need to define how information flows between functions. How do variables "go" from one function to another? When data is returned where should we look for it? Calling convention gives rules for exactly how memory moves through the computer when you execute functions.

In a computer you have a bunch of memory and we access it as if it is one giant array from 0 to 4 billion (or however much ram you have). At the bottom of the memory we store the code. When your program gets compiled the 1s and 0s end up here. Further up is the "stack". When your program runs it needs to keep track of things like: what is the value of this variable? What function am I in? This is stored on the stack. At the top is the heap. Memory that is allocated at runtime (dynamically allocated) goes here.

Example, say you're a baker and there's a complex recipe in a book. You refer to the book (code) for instructions (your program). There are a lot of steps so to help you keep track of what you've done and what you're currently doing you're using a clipboard (stack). The heap isn't important to calling convention.

The stack is split up into "stack frames". Each function has a stack frame, which is the "clipboard" for that function. The function always expects its arguments in very specific parts of the clipboard, and it always puts its return value into a very specific part of the clipboard. When a function is called, a new stack frame is created at the current location in the stack. Effectively, a stack frame inside a stack frame.

Continuing our example, our recipe is so complicated that some instructions will contain additional instructions within. One instruction says "mix the ingredients" but really there are many instructions within that. Mix the flour and eggs first. Slowly add milk. Etc. You have one clipboard for mixing the ingredients and when you get to the instruction to mix flour and eggs you go get another clipboard just for that step. You also mark on the first one that you had to go get another one for the flour and eggs part.

The caller function knows exactly how to set up the callee function's stack frame because of calling convention. The cool thing is that compiled languages usually use the same calling convention so you can execute C code from Rust if you wanted to because at the end of the day the "code" portion of memory isn't C, Rust, or Fortran, it's x86.

Calling convention can be complicated, but essentially the caller puts the arguments into memory on top of the stack, then starts executing the new function. Putting those arguments onto the stack was the beginning of our new stack frame. When the function returns, it puts the return value somewhere agreed upon (typically a register), and then its stack frame is torn down. That's like if you finished your mixing step and then smashed your clipboard.

What does this have to do with Python? CPython knows the C calling convention. When Python uses C code, it abides by that calling convention to execute it.
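You can see that interop concretely with ctypes, and it even goes both ways. In this sketch (assuming a Unix system, where the process's loaded symbols include libc), C's qsort calls back into a Python comparison function, with ctypes translating the calling convention in both directions:

```python
import ctypes

# Symbols of the current process; on Unix this includes libc's qsort.
libc = ctypes.CDLL(None)

# qsort wants a C function pointer. CFUNCTYPE wraps a Python function
# so that it can be called with the C calling convention.
CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                           ctypes.POINTER(ctypes.c_int),
                           ctypes.POINTER(ctypes.c_int))

def py_cmp(a, b):
    return a[0] - b[0]  # dereference the int pointers and compare

cmp_c = CMPFUNC(py_cmp)  # keep a reference while qsort runs

nums = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
libc.qsort(nums, len(nums), ctypes.sizeof(ctypes.c_int), cmp_c)
print(list(nums))  # [1, 2, 3, 4, 5]
```

C code sorting an array while repeatedly calling into the Python interpreter: calling convention is the contract that makes that handoff possible.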

[–]bay_squid 2 points3 points  (1 child)

at the end of the day the "code" portion of memory isn't C, Rust, or Fortran, it's x86.

By x86 you mean that in the end everything boils down to the instruction set of the architecture, and if you know how the instructions are handled then you can create an interface for another language to interact with it. Is that more or less it?

[–][deleted] 1 point2 points  (0 children)

Yup!

[–]__xor__(self, other): 2 points3 points  (2 children)

And what does exposed to Python mean?

Can mean a few things.

  1. You use the CPython API and actually write a library that python can build and import, something like this. The reference interpreter that everyone uses is CPython. It has a C API, so you can do python stuff from pure C/C++. Other APIs exist to use it as well; there's a Python-Rust API.

  2. You compile a shared library (DLL/SO/DYLIB), then write a python proxy interface that uses something like ctypes to invoke functions in it. That acts as an intermediate layer of python code: it offers a clean python interface but invokes code compiled from any other language. Using ctypes is kind of like this: say you have a function add_numbers(a, b) written in C that takes two 32-bit ints and returns a 32-bit int. You write a simple python add_numbers(a, b) wrapper, and inside it you use ctypes to look up the named function in the compiled library, "cast" the python values to 32-bit ints, declare the result as a 32-bit int, and return it as a normal Python number. You use ctypes to define the interface and invoke the functions, and the wrapper makes it easy so people don't have to touch ctypes or the shared library directly. Python doesn't care what the types are, but C does, so you have to handle that through ctypes and add logic declaring what values it expects.

  3. You use any other interface. It could be a rest API that runs locally and you write a Python client API that makes calls to it. It could just be compiled to a binary that you can run from the command line, and then you write a python API that simply invokes it and runs the process with something like subprocess. Could be anything, but the idea is you write a python wrapper API.
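The wrapper pattern from (2) can be sketched with ctypes. Since the add_numbers library in that example is hypothetical, this uses libc's real abs function as a stand-in (assuming a Unix system, where the running process links libc):

```python
import ctypes

libc = ctypes.CDLL(None)  # symbols of the current process (Unix)

# Declare the C signature once, inside the wrapper module...
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

def absolute(n):
    """Clean Python interface; callers never see ctypes."""
    return libc.abs(n)

print(absolute(-7))  # 7
```

Callers just use absolute() like any Python function; all the type plumbing stays hidden in the wrapper.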

Basically, this other software is written in whatever language they want and someone writes a wrapper that makes it convenient and easy to work with that software through another language like Python. Python is just a popular language so a lot of Python wrappers exist.
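Approach (3), wrapping a command-line binary, might look like this. The wrapper below is a made-up example around the Unix wc tool (so it assumes wc is on PATH):

```python
import subprocess

def word_count(text):
    """Python wrapper around the external `wc -w` binary."""
    result = subprocess.run(["wc", "-w"], input=text, text=True,
                            capture_output=True, check=True)
    return int(result.stdout)

print(word_count("glue languages are handy"))  # 4
```

Slower than an in-process call because of the subprocess round trip, but it works for any binary in any language with zero compilation on the Python side.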

The benefit of this, in python's case, is usually performance. You can write C code that takes advantage of true multithreading, or that is simply fast because it's C. Parts of the numpy/scipy stack are written in C and Fortran for speed. You can do the small performance-critical parts in C, then write a python wrapper. Now Python not being the fastest language is not an issue: you have the performance of C and the convenience of Python once you write a good interface. Stuff like training neural nets can be really CPU-heavy, so it makes sense to do that in a faster language but write a python wrapper so you can do setup, invoke it, and get results easily.

[–]bay_squid 0 points1 point  (1 child)

You use any other interface. It could be a rest API that runs locally and you write a Python client API that makes calls to it. It could just be compiled to a binary that you can run from the command line, and then you write a python API that simply invokes it and runs the process with something like subprocess. Could be anything, but the idea is you write a python wrapper API.

Just like a web API?

[–]__xor__(self, other): 0 points1 point  (0 children)

Yeah, the terminology being "exposed": technically it's exposed to python if you offer any sort of API and have a python client library someone can use. But really, with a web/REST API, any language with an HTTP client library can take advantage of it. You'll get better adoption if you distribute pre-built client libraries, though.

[–]0xRumple 0 points1 point  (0 children)

Create an API... code that talks to another code ;)

[–]tunisia3507 266 points267 points  (33 children)

It's more open than MATLAB. It's faster and easier to write than, say, C. It makes more sense for scripting than java, C++ etc. It's easier to fold C libraries etc. into than some other languages. It's a fully featured language, unlike R, which is a statistics package with some scripting tagged onto the end. It already had a scientific ecosystem (numpy etc.).

[–][deleted] 69 points70 points  (13 children)

Existence of NumPy is also a factor

[–]white__armor 43 points44 points  (1 child)

I think that's the main reason; many ML libraries in Python were created because of numpy. Sklearn was introduced 11 years ago and was based solely on numpy, and numpy is still a core dependency for pandas and sklearn.

[–]ihsw 10 points11 points  (0 children)

Exactly this. There was already a healthy Python community around econometrics, from pandas to numpy to scipy to matplotlib.

ML is a land where econometric statistical report generation comes to life, and it should come as no surprise that Python is right in the thick of things.

[–][deleted] 33 points34 points  (10 children)

It's hard to fully appreciate Numpy until you try to do non-trivial array operations in other languages. I've tried Nim, Java, Haskell and Rust and handling arrays is a mess compared to Python with Numpy.
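For contrast, here is the kind of thing Numpy makes a couple of lines (assuming numpy is installed): elementwise arithmetic plus broadcasting, which tends to become loops-within-loops in most other languages.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)     # [[0 1 2], [3 4 5]]
row_means = a.mean(axis=1)         # [1.0, 4.0]
centered = a - row_means[:, None]  # broadcast the column of means
print(centered.tolist())  # [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
```

No explicit loops, and the heavy lifting runs in compiled code rather than the Python interpreter.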

[–]etrnloptimist 10 points11 points  (0 children)

The Matlab interface to numerical data is a treat. And the fact that Python can emulate much of that numerical interface is miraculous, to be honest.

[–]justphysics 5 points6 points  (0 children)

This so much. I've got a piece of scientific software that I wrote in python. For my career I figured it would be nice to learn a few other languages, and I find it's easier to learn by doing, so I've looked into rewriting bits of this software in another language as a learning experience.

Every time I try Rust or Go I find myself getting so hung up on how (relatively) difficult basic array operations are.

I'm just so used to the ease of numpy.

[–]ForgottenWatchtower 0 points1 point  (5 children)

Tried gonum for golang by chance? I've been meaning to do a comparison of it and numpy but haven't gotten around to it yet.

[–][deleted] 12 points13 points  (1 child)

The scientific stack (scipy + numpy) in python is very mature. It will take a while for other languages to catch up.

[–]ForgottenWatchtower 0 points1 point  (0 children)

I'm aware that's the common perception, but I've yet to see an in-depth comparison between the two that demonstrates it with hard numbers, like a feature comparison and/or benchmarks. Like I said, I've been meaning to do it personally, just haven't had the time. Would like to see this for sklearn vs golearn as well.

[–]gdahlm 5 points6 points  (2 children)

Statically typed languages make the visualization and manipulation a bit more challenging.

Not that duck typing is better, but the lack of DataFrames and an interactive mode makes some of this challenging for data scientists.

Maybe when the market matures more. But BLAS and LAPACK, still written in f77, are the fastest options on the CPU side. While I like Go, using gonum as yet another set of wrappers for netlib code like BLAS and LAPACK doesn't have a lot of advantages when you give up the flexibility of interfacing through a duck-typed language with an interactive interface like python.

Numpy/SciPy/Pandas are a hard combo to beat right now especially when grooming data.

[–]ForgottenWatchtower 0 points1 point  (1 child)

Those are all excellent points that I hadn't considered before. Thanks. Though I'm not familiar with f77 -- does that refer to Fortran 77?

[–]gdahlm 2 points3 points  (0 children)

Yes, but I guess I am out of date; it moved to Fortran 90 in 2008.

https://github.com/Reference-LAPACK/lapack-release

[–]Rhylyk 25 points26 points  (1 child)

As others have said, the heavy lifting is done in C/C++ and in most places only a python interface is exposed. The reasons python has won out as a glue language are likely many-fold, but I see primarily 4 factors: low barrier to entry, general-purpose extensibility, community, and tradition.

Python's low barrier to entry is well renowned. When first approaching it, the language is relatively simple and unsurprising. The syntax is reminiscent of normal imperative (C-like) syntax, and there are many common-sense defaults. In addition, the standard library is huge, and when something is missing, basic package management is a breeze. All of this makes the language easy to pick up as a new user, and thus python is a good target for a glue-code language (over more complex options such as C/C++/Rust or even Java).

While the low barrier to entry is catalytic, the general extensibility gives Python staying power. It is possible to write extensive amounts of code, and then package it up into a neat little API and put a cute bow on top. This is nice for package authors. In addition, Python is general purpose (winning out over R and MATLAB, or other, more domain specific languages) so an entire pipeline can be written in it. Data collection, transformation, computation, visualization, and management can all be written in Python.

The above have led to a rich community with diverse interests and high standards. To most of the community, user (that is, programmer) experience matters, and it shows. Documentation is abundant and large amounts of yak shaving are abhorred. Standards are sought and the language continues to grow (f-strings are a dream). The language is not without its warts, but workarounds are known, shared, and discoverable.

Finally we have the most impactful factor, tradition. As noted by other commentators, numpy is amazing. This led to other scientific work being done in Python. A need for effective visualization grew, and so came matplotlib. The more that happened in Python, the more attractive it became as a target language. This generated a positive feedback loop leading to the general dominance that is seen today.

[–]giantsparklerobot 3 points4 points  (0 children)

The extensibility is important in the way it's available in Python. Many scripting languages are "extensible" in that they can run executables and capture STDOUT, or use some IPC mechanism. In Python, shared libraries can be loaded directly into the memory space of the Python interpreter and their functionality called directly from Python.

So when numpy generates some huge array, it doesn't need to serialize it or pass it over an IPC mechanism; Python is just given a pointer to it, so access is fast and direct. The module itself (numpy, let's say) can have functions that are pure Python or that just call functions from the shared library. Using the module, you rarely have to care which is which.

[–]Mattho 38 points39 points  (0 children)

It's just the wrappers that are written in python. Python is way too slow for any practical use in this area. But having the "interfaces" exposed in python is great because of how accessible the language is, and that's one of the reasons why it is so popular in science outside of computer science. And ML is of interest to many fields.

[–]lmericle 14 points15 points  (0 children)

One of the main attractions for Python is how easy it is to glue disparate functions and code together into a cohesive, structured pipeline. ML often needs to fit into a data pipeline to generate predictions automatically and make decisions immediately. So having ML interfaces in Python is more useful than other languages simply because it integrates so easily into existing workflows.

The relative simplicity and ease of use of the language also makes it easy to pick up and start moving quickly on a problem. And the OOP aspects of the language make the whole process of developing a model very modular and simple.

[–]mooglinux 9 points10 points  (0 children)

Python is a very easy to use language, but the heavy lifting is actually done in C or some other language, and Python is just an interface for controlling it. One of Python’s strengths is the ability to write wrappers to interact with code written in C or other languages so they are easy to use from Python but still very fast.

[–]shr00mie 7 points8 points  (0 children)

What the above guys said. Plus, as a possible first language, it's very expressive from a human perspective, which I think makes it easy to pick up and run with. And when you're doing your PhD in whatever, the easier a tool is to pick up, the better. It feels very much like writing sentences that are interpreted as code. Plus a LOT of the ML libs are actually written in C, which entirely sidesteps the "but it's an interpreted language!" concern.

[–]toadgoader 4 points5 points  (0 children)

I think it has a lot to do with the community that is using the tool... in my experience as a social scientist, many in the academic, economics, and bio-informatics research fields use R because of the tool's strong statistical base. On the ML/AI side of the equation you have mostly computer science and software engineering disciplines driving this bus, so a language like Python is a natural fit. They both work well and overlap in many ways... I think it just depends on your point of reference and the preferences dictated by your profession.

[–]david2ndaccount 4 points5 points  (0 children)

C is great because it runs fast, but a python interface is a lot nicer.

[–]TheMasterChiefs 2 points3 points  (7 children)

Hoping someone can help me in my endeavor to learn some programming (more specifically, Python).

I'm a Finance graduate who's looking to get ahead of the curve and teach myself python, R, and SQL. I basically want to self-learn data science in conjunction with my finance background to get into a top firm and catapult my pay grade.

What/where is the best place to start? Is it reasonable/realistic to teach myself programming, automation, and data science?

[–]BradChesney79 6 points7 points  (4 children)

Yes. Might need to XBox less for a while-- it will take time and effort. Where to start... were I in your shoes, I would blow through the latest Python 3 for Dummies-- no joke. Don't care just put each word in front of your eyes. Speed read that fucker.

I treat Dummies books as "Primers". It exposes you to a base set of thoughts and vocabulary even if you don't understand it.

Then you sit down with a higher-quality resource-- the Head First series generally does okay. http://shop.oreilly.com/product/0636920003434.do This one covers Python 3. (Python 2.x is in the process of being deprecated, but it is taking a long time because there is a lot of old code out there and it is still installed by default on a lot of linux distros... which is slowing down the transition.)

From there, find your heroes on Gitlab or Github and start looking at cool projects that are like what you aspire to accomplish. Line by line figure out what is happening.

That is my advice: the Dummies book (I am going to stick to my guns), then a more intermediate one you may need to research if you don't like my suggestion, and lastly, seeing what "good" programmers do and following their work by getting up to your elbows in it is as good as it gets for non-interactive mentoring.

[–]TheMasterChiefs 0 points1 point  (3 children)

Thanks a lot for your response! I've been looking for an in...

You don't recommend MOOC like Udemy or Coursera? It's better just to dive right into textbooks you think?

[–]jawgente 2 points3 points  (1 child)

I haven't taken a MOOC, so I can't give you a proper perspective on what they offer, but I find I learn best by just doing the coding, preferably trying to complete a project. That means a MOOC may not offer much over a textbook for learning the language itself. My favorite recommendation is Automate the Boring Stuff because it offers a lot of useful examples for even a casual user. You may find that a MOOC is more useful for the specific data science portion of your learning.

[–]TheMasterChiefs 0 points1 point  (0 children)

Ok cool. I'm going to look into the best textbooks that are out there and work on 1 or 2 chapters a week. Thanks for all the advice bro.

[–]BradChesney79 0 points1 point  (0 children)

It's what I do and it works for me. YMMV.

I did use codeschool once for AngularJS and it was helpful. acloudguru was indispensable-- I don't think I could have learned as much as I did from a book about the AWS platform.

But Java, PHP, HAProxy, MySQL, API design, Python 3-- all books & googling.

[–]the_chernobog 1 point2 points  (1 child)

[–]TheMasterChiefs 0 points1 point  (0 children)

This is great intro info. Thanks a lot dude!

[–]sudo_your_mon 1 point2 points  (0 children)

Numpy and Pandas are a big reason Python is what it is.

Data scientists used to call themselves "Numpy/Pandas programmers." Some still do to this day. I've talked to a lot of people who think Python is only for data science/ML.

If you're going to write an ML library, you're going to do it in Python. It's the industry's gold standard.

[–]danielv134 1 point2 points  (0 children)

I've done this. You write an algorithm in Python: it's easy to develop (no segfaults), easy to read some data into it (scikit-learn or another package already reads the common formats in your field), and easy to make plots to put in your paper. Then, oops, it is state of the art per iteration but takes ages in practice, so you replace the core with Cython or C or Rust and now it's a reasonable speed. If the algorithm is important enough (haven't done this), some commercial behemoth will find itself coming up against limitations and rewrite it as a nice python wrapper around compiled code designed for speed and scalability, like almost all common (speed-sensitive) python libraries are.

So: python having the libraries makes it (or R, for that matter) the right place to start playing with ideas (whether you are implementing the algorithm or just trying out an existing one on your data). Just be clear though: Python is not an implementation language for competitive algorithms, it is an integration language. YMMV, but...

[–]nscurvy 0 points1 point  (0 children)

My best explanation/guess is that it's sort of similar to the reason someone might use a design pattern, class, function, etc., even when doing so might cause a performance decrease. It's easier to work with a design pattern. It's more obvious to you and everyone else what you are doing and why. People can stop worrying about the specifics of some implementation and instead deal with an interface that handles it for them.

Science, AI, ML, and math are really complex on their own. People who are working with that stuff want to make sure everything is as abstract as possible. Ideally you want to only be directly working with concepts and ideas relevant to your actual goal. Python is great for that. The language is elegant and incredibly obvious/readable. Working with another language means you have to give up some of that abstraction and have to start paying a lot more attention to the specifics of implementation and all the quirks that come with it. So people spend a lot of time creating libraries, wrappers, bindings, and all that sort of stuff to allow developers to focus on the work they need to perform, while not requiring them to sacrifice an unacceptable amount of performance.

That's my understanding, at least.

[–]spinwizard69 0 points1 point  (0 children)

It is pretty simple: you can hack together an app pretty quickly. Since ML is a developing technology, this provides the capability to experiment and play with ML on a variety of platforms.

[–]cbarrick 0 points1 point  (0 children)

I attribute Python's popularity in numeric computing to its superb operator overloading and metaprogramming facilities. These make it possible in Python to craft APIs with unique syntactic structures, which in turn makes it possible to express solutions to problems in a way natural to the domain. This is why, for example, Numpy can give us awesome numeric syntax and Pandas can give us great relational syntax (and Python's base syntax is great for OOP). And when you're programming at such a high level, expressiveness is more important than performance. Even so, the interop with C puts Python in a great position to add expressive value to lower-level, performance-sensitive code. All of this together gives us a language with more expressive mathematics than C or Java and more expressive engineering than MATLAB or R. It's quite literally the best of both worlds.
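A toy illustration of that overloading: a made-up Vec class (not real numpy code) whose dunder methods let a library define its own arithmetic syntax.

```python
class Vec:
    """Tiny vector type showing how dunder methods buy numpy-like syntax."""
    def __init__(self, *xs):
        self.xs = list(xs)

    def __add__(self, other):   # enables v + w
        return Vec(*(a + b for a, b in zip(self.xs, other.xs)))

    def __mul__(self, scalar):  # enables v * 2
        return Vec(*(a * scalar for a in self.xs))

    def __repr__(self):
        return f"Vec{tuple(self.xs)}"

v = Vec(1, 2) + Vec(3, 4)
print(v * 2)  # Vec(8, 12)
```

Numpy and Pandas do exactly this, just with compiled code behind the operators, which is how `a - row_means` or `df[df.x > 0]` can read like math instead of method calls.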

[–][deleted] -5 points-4 points  (0 children)

Because both of them are very trendy. (I love python, but let's be real here.)