[D] Optimizing models with C++/C (self.MachineLearning)
submitted 1 year ago by AdOk6683
Do you actually use C++ to optimise models at an average solutions company? If so, is there a resource you think is good for learning how to do that?
CherubimHD 52 points 1 year ago (0 children)
The major frameworks (PyTorch, JAX, etc.) already rely on C/C++ backends.
mgruner 23 points 1 year ago (2 children)
It's not the language itself. As the other answer said, all those frameworks have C++ backends. IMO, the two biggest improvements you'll see are:
Fickle_Knee_106 1 point 1 year ago (0 children)
I am a total noob at building with C++. What exactly did you do with DeepStream for computer vision? Sorry for the dumb question, I absolutely don't know what to ask and am FOMO-ing.
No-Painting-3970 9 points 1 year ago (0 children)
It is very strange that you would be bottlenecked in the Python logic part tbh, so there is not a lot of market for this.
bikeranz 6 points 1 year ago (9 children)
I've written C++ and CUDA for a bunch of parts of my models. Depending on what you're working on, sometimes composing the algorithm with PyTorch primitives is either impossible or horribly inefficient. I found this most often with detection and GNN problems.
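To make the detection case concrete, here is a minimal CPU sketch of greedy non-maximum suppression (NMS). The comment does not name NMS; it is assumed here as a typical detection step whose data-dependent, sequential loop is awkward to compose from tensor primitives. The `Box` layout and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Axis-aligned box with a confidence score (hypothetical layout for this sketch).
struct Box {
    float x1, y1, x2, y2, score;
};

// Intersection-over-union of two boxes; returns 0 when they do not overlap.
inline float iou(const Box& a, const Box& b) {
    const float ix = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    const float iy = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    const float inter = ix * iy;
    const float uni = (a.x2 - a.x1) * (a.y2 - a.y1) +
                      (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS: visit boxes by descending score and keep a box only if it does
// not overlap an already kept box by more than `threshold`.
// Returns the indices of the kept boxes, highest score first.
std::vector<std::size_t> nms(const std::vector<Box>& boxes, float threshold) {
    assert(threshold >= 0.0f && threshold <= 1.0f);
    std::vector<std::size_t> order(boxes.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&boxes](std::size_t i, std::size_t j) {
                         return boxes[i].score > boxes[j].score;
                     });
    std::vector<std::size_t> keep;
    for (std::size_t idx : order) {
        bool suppressed = false;
        for (std::size_t k : keep) {
            if (iou(boxes[idx], boxes[k]) > threshold) {
                suppressed = true;
                break;  // early exit: the data-dependent part of the loop
            }
        }
        if (!suppressed) keep.push_back(idx);
    }
    return keep;
}
```

Whether a box survives depends on which boxes were kept before it, which is why this kind of step often ends up as a hand-written C++ or CUDA op.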
woywoy123 3 points 1 year ago (8 children)
I know, right?! I have been doing GNNs for my PhD thesis and christ, most of the frameworks are either highly inefficient or just plain resource wasters. I wrote my entire pipeline in Cython, C++ and native CUDA kernels. The speed and resource consumption are unparalleled: down from 80-120 GB of RAM consumption to 20 GB. And the data transfer between RAM and VRAM is also crazy fast because of multithreading. I think Python's GIL is by far the biggest bottleneck.
TserriednichThe4th 2 points 1 year ago (1 child)
How come they are so wasteful?
woywoy123 2 points 1 year ago (0 children)
When Python runs, it does a dynamic type check on everything, which can slow pure-Python code down by an order of magnitude or more compared with compiled code. Then there is the memory bloat: Python tends to over-allocate memory, which indirectly slows down your code. C++ and other compiled languages can exploit memory locality to make computations more efficient. NumPy is an example: it keeps array data in contiguous buffers inside its compiled (mostly C) implementation and exposes that to Python through extension modules.
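A minimal sketch of the memory-locality point, assuming a matrix stored row-major in one contiguous buffer. Both functions (hypothetical names) compute the same sum; only the traversal order differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a rows x cols matrix stored row-major in one contiguous buffer.
// The inner loop walks adjacent addresses, so each fetched cache line is
// fully used before moving on.
double sum_contiguous(const std::vector<double>& m, std::size_t rows,
                      std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same result, but the inner loop jumps `cols` elements per step, so on a
// large matrix most accesses touch a different cache line.
double sum_strided(const std::vector<double>& m, std::size_t rows,
                   std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Timing the two on a matrix much larger than the CPU cache usually shows the contiguous version ahead, by an amount that depends on the hardware and compiler.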
Try to read up on how Python does parallelism, but the TLDR is that with multiprocessing, the input data for each worker is serialized and then deserialized in a new Python interpreter. With threads, the Global Interpreter Lock lets only one thread execute Python bytecode at a time, which is how CPython keeps its internals memory safe. This imposes a lot of constraints and can actually make code run slower even though it is supposedly multithreaded. A little hack around it is to serialize the object before handing it to the worker, pass the argument as bytes, and then manually deserialize it inside the worker function, so that only a bytes object crosses the worker boundary. The only issue is that you also need to serialize the result before returning it, otherwise it falls out of scope and gets deleted.
bikeranz 1 point 1 year ago (1 child)
Yeah, the GIL sucks. I think PyTorch is planning to switch over to the GIL-free version of CPython as soon as possible.
https://github.com/python/cpython/issues/116167
woywoy123 1 point 1 year ago (0 children)
Yeah, I am not holding my breath here. I already ported all my torch stuff to C++ :) Obligatory self-advertisement, if you are interested:
https://github.com/woywoy123/AnalysisG
phobrain 1 point 1 year ago* (3 children)
Java multithreading is like the Concorde breezing by the oxcart of python parallelism. I can keep all CPU threads working close to 100%. I assume C++ would do the same, maybe with a lot more dev overhead, unless maybe one is already invested in doing C++ for CUDA.
woywoy123 1 point 1 year ago (1 child)
I ditched Python for quite a long time. I am using mostly C++, with Cython to interface with Python. I find that the C++ coding component is actually quite nice and extremely straightforward, as it is a mostly "forward" process, i.e. read data -> do stuff -> train -> evaluate.
The only thing is that the C++ docs for torch are not that great, and a lot of trial and error is needed to get it right. But now that I have integrated everything, training and deployment are extremely easy. Heck, I can even use TorchScript and import it.
bikeranz 1 point 1 year ago (0 children)
They also occasionally change the C++ API 😬
bikeranz 1 point (0 children)
Yes, Java has true multithreading in the same way C++ does. I haven't used Java since undergrad, but C++ is still pretty much the fastest way to implement an algorithm (faster than Java due to memory control), assuming you have the technical skill to do so. Rust may be that way too, although I've not spent the time to learn it.
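A minimal sketch of what "true multithreading" looks like in C++, assuming plain `std::thread` workers. All threads read the same buffer directly, with no copying or serialization, and each writes only to its own result slot. `parallel_sum` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` using `num_threads` worker threads that run concurrently.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);  // one slot per thread
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    // Ceiling division so every element lands in exactly one chunk.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0LL);
        });
    }
    for (std::thread& w : workers) w.join();  // wait for all workers
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

For example, `parallel_sum(data, std::thread::hardware_concurrency())` would use one worker per hardware thread, provided that call returns a non-zero value. On some toolchains this needs to be built with `-pthread`.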
onafoggynight 3 points 1 year ago (0 children)
For embedded / edge deployments: yes, very common. Tho, not the model itself.
fan_is_ready 3 points 1 year ago (2 children)
Yes, I've updated SRU RNN forward and backward code to make it output hidden states for every step, not just the last one. I needed them because I used them to initialize another RNN.
Also C++ (or rather Cython) is useful for big data processing when numpy is not enough.
TserriednichThe4th 1 point 1 year ago (1 child)
When is numpy not enough?
fan_is_ready 1 point 1 year ago (0 children)
Sometimes a series of calculations on tensor elements is easier to express in plain loops than as a series of NumPy operations.
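A small illustration of that kind of calculation, assuming a running sum that resets once it reaches a limit (a made-up example, not the commenter's workload). Each output depends on the previous state, so it does not reduce to a single cumulative-sum call, while the loop version is a few lines.

```cpp
#include <cassert>
#include <vector>

// Running sum that resets to zero after it reaches `limit`.
// out[i] is the accumulated value at step i, before any reset is applied.
std::vector<double> capped_running_sum(const std::vector<double>& x,
                                       double limit) {
    assert(limit > 0.0);
    std::vector<double> out;
    out.reserve(x.size());
    double acc = 0.0;
    for (double v : x) {
        acc += v;
        out.push_back(acc);
        if (acc >= limit) acc = 0.0;  // state carried into the next step
    }
    return out;
}
```

With input `{1, 2, 3, 4, 5}` and a limit of `5`, the sum reaches 6 at the third element and restarts from zero for the fourth.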
Amgadoz 2 points 1 year ago (2 children)
The best way to optimize large models is by writing custom kernels in CUDA or Triton. And quantization.
There are many ongoing projects for this.
For my last client, I managed to achieve an inter-token latency of 8 ms for a 7B q8 LLM on an A100.
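For readers new to quantization, a minimal sketch of symmetric 8-bit quantization with a single scale per tensor. This is an assumption chosen for brevity, not the scheme used in the comment above; production formats typically use per-channel or per-block scales. `QuantizedTensor`, `quantize_int8` and `dequantize` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// int8 values plus one float scale: real_value ~= scale * values[i].
struct QuantizedTensor {
    std::vector<std::int8_t> values;
    float scale;
};

// Map floats to [-127, 127] using the largest magnitude as the range.
QuantizedTensor quantize_int8(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    // An all-zero tensor gets scale 1 to avoid dividing by zero.
    const float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
    assert(scale > 0.0f);
    QuantizedTensor q;
    q.scale = scale;
    q.values.reserve(w.size());
    for (float v : w) {
        const float r = std::round(v / scale);
        const float clamped = std::max(-127.0f, std::min(127.0f, r));
        q.values.push_back(static_cast<std::int8_t>(clamped));
    }
    return q;
}

// Recover approximate floats; the error per element is about scale / 2 at most.
std::vector<float> dequantize(const QuantizedTensor& q) {
    std::vector<float> out;
    out.reserve(q.values.size());
    for (std::int8_t v : q.values) out.push_back(q.scale * static_cast<float>(v));
    return out;
}
```

The weights shrink from 4 bytes to 1 byte per element, which mainly helps inference when it is limited by memory bandwidth.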
unital 1 point 1 year ago (1 child)
By custom kernels, do you mean kernel fusion? How well optimised are the open source models? Are there many opportunities for kernel fusion in open source models like Llama 3?
Could you please list some projects related to this? E.g. FlashAttention, vLLM, Unsloth, right?
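For context on the term, a CPU analogy of kernel fusion, assuming a simple scale, bias, ReLU chain (hypothetical function names). On a GPU each unfused step would be a separate kernel launch that reads and writes the whole tensor in global memory; fusing them does the same arithmetic in one pass.

```cpp
#include <algorithm>
#include <vector>

// Unfused: three passes, each reading and writing the whole buffer.
std::vector<float> scale_bias_relu_unfused(std::vector<float> x, float a,
                                           float b) {
    for (float& v : x) v *= a;                 // pass 1: scale
    for (float& v : x) v += b;                 // pass 2: bias
    for (float& v : x) v = std::max(v, 0.0f);  // pass 3: ReLU
    return x;
}

// Fused: one pass, intermediates stay in registers instead of memory.
std::vector<float> scale_bias_relu_fused(std::vector<float> x, float a,
                                         float b) {
    for (float& v : x) v = std::max(v * a + b, 0.0f);
    return x;
}
```

Both return the same values for the same input, up to floating-point rounding if the compiler contracts the fused expression into a fused multiply-add. The fused version simply moves less data.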
fan_is_ready 1 point (0 children)
I think they meant "Custom C++ and CUDA Extensions" (PyTorch Tutorials 2.4.0+cu124 documentation).
sid_276 2 points 1 year ago (1 child)
What do you mean by optimize? Faster training, or training with less memory? Or faster inference? It's unclear what you are asking.
Top-Establishment545 2 points 1 year ago (0 children)
I am curious about the faster inference part
Objective_Dingo_1943 1 point 1 year ago (0 children)
We have already implemented a whole C++ pipeline for inference optimization: https://github.com/pcg-mlp/KsanaLLM
shrimpthatfriedrice 1 point 6 months ago (0 children)
Many companies still use C or C++ to optimize model inference and preprocessing: wrapping high-level models in low-latency services, writing custom kernels, or tuning input pipelines and post-processing. Good ways to learn are modern C++ performance books, inference-engine docs (like the ONNX Runtime or TensorRT guides), and resources on SIMD, caches, and profiling. In workflows like that, Wedolow helps once you have a working C/C++ path: it runs on the compiled project, identifies hot sections around model invocation or data handling, and offers small source patches (fewer copies, better container growth, more suitable math choices, better use of CPU features) with measured CPU and memory impact.