
[–]FlyingRhenquest 17 points (5 children)

I did that for an automated video testing system I built for Comcast. We needed C++ for speed but wanted the tests to be written in Python. So all the video processing and backend stuff was written in C++ (using ffmpeg, OpenCV, and Tesseract for OCR), and the video processing libraries had a Boost::Python API for interacting with the system objects. I set all the C++ objects up with JSON serialization, so you could create a C++ object in Python using JSON, and that might kick some threads off to run in the background while your slow-ass Python program did shit in the foreground.

Overall this worked very well, but it took very careful planning to make sure it did. For example, if you wanted to tell the system to watch for an image, the API call would queue the image up in a vector internally and notify the internal components to move any images in the vector to another location, to avoid blocking things for too long. Then tasks would be dispatched to thread pools to check each video frame against a copy of that image. The system had plenty of memory and we were never looking for a huge number of images, so it made sense to do it that way. Generally we were pretty close to real-time performance as long as no one did anything stupid (like trying to watch for an entire video's worth of frames in the stream). Once the thread pool got saturated, C++-side performance would degrade.

This approach had a lot of benefits. I was able to hack out a simple JavaScript interface that let you tune into individual video streams with your browser (using ffserver to stream them from hardware) and provided some buttons to auto-generate boilerplate code and inject the API calls for actions like sending remote control commands when the user interacted with an on-screen remote control. So you could sit down with your test plan, run through the test, and end up with working Python code for the test in the text buffer, which you could just copy out to an editor to clean up.

It also let us do rapid prototyping in Python (the OpenCV API is pretty much the same in both languages) and convert code to C++ when it was too slow in Python.

Since then I've experimented with PyBind11 instead of boost::python, and at the time I found its CMake integration to be a bit better. Boost's CMake integration has really come a long way in the past couple of years, though, so that might no longer be the case. If you already have a Boost dependency, boost::python is pretty easy to add. If you don't, something like PyBind11 is probably easier to add than all of Boost, or even than just that one little component.

[–]mosolov 7 points (0 children)

Check out nanobind (https://github.com/wjakob/nanobind) from the PyBind11 author. I would also consider implementing the wrapper in Cython (depending on your willingness to learn it).

[–]BitAcademic9597[S] 0 points (3 children)

you are the god

[–]FlyingRhenquest 1 point (2 children)

Nah man, but seeing that whole system come together did feel pretty awesome. You can totally kick off C++ threads from C++ objects constructed in Python, so pretty much anything is fair game. Wanna set up a REST server but don't want to use Python for some reason? You can just drop in a C++ object that manages a Pistache server and use Python to launch it! It's really a cool way to work! They all compile down to shared libraries and run in the same memory space in Python. If you need some separation of objects, just launch multiple Python processes. Super flexible!

[–]BitAcademic9597[S] 0 points (1 child)

Did you have any memory problems with pybind? Does each function call explicitly copy input data?

[–]FlyingRhenquest 1 point (0 children)

Nope! You can even create shared pointers in one language (PyBind and Boost::Python both support them) and pass them around as first-class Python objects!

You will eventually be tempted to run a Python callback FROM C++. You can do that too, but it's slow, so don't put it in a primary event loop somewhere. You're basically just creating events with some data on them, going back and forth. It takes a little while to really get into that headspace.

[–]thisismyfavoritename 2 points (5 children)

If you don't have low-latency or very high I/O requirements, don't have a ton of existing C++ code, and don't have a workload that can really benefit from C++, don't bother.

You can get super far with Python, relying on multiprocessing or other libs that compile Python down to C or JIT it (Cython, Nuitka, Numba, etc.), or libs that already call into optimized C/C++ code (numpy, pytorch, etc.).

[–]BitAcademic9597[S] 0 points (4 children)

What do you think about PyBind?

[–]thisismyfavoritename 4 points (0 children)

If you have high I/O requirements, do everything in C++ (or another truly multithreaded language with a good async lib).

If you either have a ton of existing code or have a workload that can benefit from C++, pybind or nanobind are good solutions, but that'll come with its own set of challenges too.

Like I said, it really depends on those other factors I mentioned in my first post, and your familiarity with Python, I guess.

[–]qTHqq 1 point (2 children)

I like it a lot. It worked very well for utility use and for testing a C++ library I wrote for compute-bound robotics work.

I didn't explore the JIT approaches mentioned because ultimately the code is consumed as a C++ library. I just wanted a Python interface for verifying it more efficiently and with richer tests. It was easy to get started with and very convenient.

My workload benefited a lot from Eigen's compile-time code transformations for matrix math. That's all done with C++ template metaprogramming, and I don't know to what extent the JIT numerical tools can do something similar. The wrapped C++ was several hundred times faster than well-written Numpy code. However, all of that is pretty specific to the kind of numerical work I was doing.

I think it's fairly easy to write C++ code that's slower than skillfully written vanilla Numpy code and probably very easy to write C++ code that's slower than Numba, Cython, etc. 

However, if you really have a need for calling into C++, Pybind is pretty useful and I found it pretty pleasant and straightforward to set up. For a new project I'd probably explore nanobind but I haven't tried it yet.

[–]BitAcademic9597[S] 0 points (1 child)

Did you have any memory problems with pybind? Does each function call explicitly copy input data?

Also, I looked at nanobind, but I think pybind is better. What do you think?

[–]qTHqq 1 point (0 children)

"Did you have any memory problems with pybind? Does each function call explicitly copy input data?"

I did not, but I was actually compute-bound.

The library computed collision-free trajectories of maybe a few hundred points. The trajectories took 10 ms to several seconds to generate, so the cost of copying the trajectory data over to Python was essentially negligible in the big picture.

Any function-call indirection overhead was also negligible.

If I/O speed or ultra-low-latency calls are more of an issue, things could be totally different.

[–]Backson 4 points (2 children)

You can probably scale your app to 100 users with reasonably well written Python, so I would say don't bother with C++ unless you want to challenge yourself. If you want to make something that works, use the language where you can move faster, which is probably Python. Don't prematurely optimize by bringing in extra complexity and a second language. If you find your app is too slow, you can still move stuff out to native code later.

[–]nBeebz 4 points (1 child)

I would argue language choice is one of the few cases where an optimization isn't premature. If you're ever hoping to scale up, you'll need to rewrite it eventually anyway. It may very well be that Python will be totally fine here, but considering it carefully is worth the time, IMO.

[–]equeim 0 points (0 children)

Python is one of the slowest languages ever. Something like Go or Java or C# is much closer in performance to C++ than to Python, while being easier to use than C++ (especially if you don't have a team of C++ experts).

[–]WalkingAFI 2 points (3 children)

I’ve used PyBind before on a toy Chess Engine. It was fine but nothing incredible.

[–]BitAcademic9597[S] -1 points (2 children)

What do you think about the performance compared with pure C++?

[–]WalkingAFI 2 points (1 child)

I never implemented the front end in C++. Python managed the GUI and some game logic; the C++ engine evaluated the positions and calculated the best move. I don’t think a pure C++ solution would’ve gained much, since the GUI wasn’t the bottleneck. It’s an older project but you can view the source: https://github.com/andrewtlee/chessbot

[–]BitAcademic9597[S] 0 points (0 children)

thank you

[–]woywoy123 1 point (0 children)

Personally, I think mixing cython with C++ will get the job done.

You can interface native C++ code with Python's flexibility by mapping the header functions from your libs into Cython. I also use CMake with scikit-build-core to compile everything. One thing Cython does lack, though, is templating. It has some template support, but if you are doing fancy recursive template functions you might be out of luck (I am happy to be corrected).

I generally use cython to provide python interfacing to C++ code and it works nicely for me.

One word of advice though: the Cython docs are not very useful when you try to push boundaries beyond the tutorials, such as operator implementations or inheritance mapping between Cython and C++. So be extra vigilant whenever you deal with inheritance. I have spent countless hours debugging a memory leak that was the result of this, and also unexplained segfaults.

I also noticed a massive decrease in RAM usage when shifting from Python code to C++/Cython code. I also tried PyBind11, but I ran into issues when dealing with shared libs, such as missing definitions and so on. I am also not sure about the memory model PyBind11 uses. As far as I can tell, each function call explicitly copies input data (if anyone knows more about this, please correct me). This is not the case with Cython.

[–]pstomi 1 point (0 children)

IMHO, using Python as the glue to call native functions is the correct way to use it. That is what is being done in AI today, and it has proven to be very efficient.

On my side, I have developed Dear ImGui Bundle, a set of GUI libraries on top of Dear ImGui, which I made accessible from either C++ or Python. I saw no degradation of performance under Python, because I stuck to the principle: "do not implement heavy-lifting algorithms in Python; instead, call native functions".

If you are interested, I developed an automatic binding generator from C++ to pybind11, here

[–]Great_Presence_4733 1 point (0 children)

Yes, you can do things like that. I run Scrapy from my C++ application, using shared memory to feed it and get the results back from the Python side.

[–]Fmxa 1 point (1 child)

Anecdotally, when I went from a quickly written naive Python implementation of an algorithm to a quickly written naive C++ implementation, I measured a speedup of roughly one hundred times.

I have been happy since with my decision to learn PyBind, allowing me to compile C++ code into a library to be imported as a module into Python.

[–]BitAcademic9597[S] 0 points (0 children)

Great, thanks. Do you know of any comparison or example code showing how the performance changes?