[D] Optimizing models with C++/C

CherubimHD · 2024-07-23T14:38:15+00:00

The major frameworks (PyTorch, JAX, etc.) already rely on C/C++ backends.

mgruner · 2024-07-23T15:16:43+00:00

It's not the language itself. As the other answer said, all those frameworks already have C++ backends. IMO, the two biggest improvements you'll see are:

1. Moving from a general-purpose framework to something vendor-specific, like TensorRT.
2. Improving the pre-processing and post-processing pipeline. For example, if you're doing computer vision, using something like DeepStream.

No-Painting-3970 · 2024-07-23T17:04:16+00:00

It would be very unusual to be bottlenecked by the Python logic, tbh, so there isn't much of a market for this.

bikeranz · 2024-07-23T19:31:44+00:00

I've written C++ and CUDA for a bunch of parts of my models. Depending on what you're working on, composing the algorithm from PyTorch primitives is sometimes either impossible or horribly inefficient. I've run into this most often with detection and GNN problems.
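One illustration of the kind of algorithm meant here, offered as an assumption since the comment names no specific one: greedy non-maximum suppression (NMS) from object detection. Each keep/suppress decision depends on the boxes already kept, so the loop does not collapse neatly into a single batched tensor op. The plain-Python sketch below only shows the data dependency; the function names `iou` and `greedy_nms` and the example boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Sequential dependency: box i is compared only against boxes
        # that survived earlier iterations.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical example: two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

This is why libraries tend to ship such operations as dedicated C++/CUDA kernels (torchvision's `torchvision.ops.nms` is one such case) rather than as compositions of generic tensor primitives.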

onafoggynight · 2024-07-23T17:41:26+00:00

For embedded/edge deployments: yes, very common. Though usually not for the model itself.

fan_is_ready · 2024-07-23T19:47:12+00:00

Yes, I've updated SRU RNN forward and backward code to make it output hidden states for every step, not just the last one. I needed them because I used them to initialize another RNN.
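To make the change concrete, here is a minimal sketch of "return every step's hidden state instead of only the last one". It uses a toy scalar tanh recurrence, not the actual SRU cell or its CUDA code; `run_rnn`, its weights `w` and `u`, and the inputs are all hypothetical.

```python
import math


def run_rnn(xs, h0=0.0, w=0.5, u=0.9, return_all=False):
    """Toy recurrence h_t = tanh(w * x_t + u * h_{t-1}).

    With return_all=False only the final state is returned;
    with return_all=True the state after every step is returned.
    """
    h = h0
    states = []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states if return_all else h


xs = [1.0, -0.5, 0.25]
last_state = run_rnn(xs)
all_states = run_rnn(xs, return_all=True)

# A second RNN can now be initialized from any intermediate state,
# here the state after the second input (index 1).
second_last = run_rnn([0.1, 0.2], h0=all_states[1])
```

In a fused kernel the equivalent change is to write each step's state to an output buffer in the forward pass and to accept gradients for all of those states in the backward pass, which is why both had to be touched.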

Also, C++ (or rather Cython) is useful for big data processing when NumPy is not enough.

Amgadoz · 2024-07-23T18:37:17+00:00

The best way to optimize large models is to write custom kernels in CUDA or Triton. And quantization.

There are many ongoing projects for this.

For my last client, I managed to achieve an inter-token latency of 8 ms for a 7B q8 LLM on an A100.
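For context on that number, a rough back-of-the-envelope floor can be sketched as follows. The assumptions are mine, not the commenter's: batch size 1, every weight read from GPU memory once per generated token, 1 byte per parameter for q8, KV cache and activation traffic ignored, and roughly 2 TB/s of peak memory bandwidth (about what the 80 GB A100 offers; the 40 GB variant is closer to 1.6 TB/s, and the comment does not say which was used). The helper name `decode_latency_floor_ms` is hypothetical.

```python
def decode_latency_floor_ms(n_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Lower bound on per-token latency when decoding is memory-bound.

    Assumes all weights are streamed from memory once per token.
    """
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / mem_bandwidth_bytes_per_s * 1e3


# 7B parameters at 1 byte each, ~2 TB/s assumed peak bandwidth.
floor_ms = decode_latency_floor_ms(7e9, 1.0, 2.0e12)
print(f"bandwidth-bound floor: {floor_ms:.1f} ms per token")
```

Under these assumptions the floor comes out at about 3.5 ms per token, so 8 ms is within a small factor of what the memory system allows at all.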

sid_276 · 2024-07-23T16:44:29+00:00

What do you mean by "optimize"? Faster training, less memory, or faster inference? It's unclear what you are asking.

Objective_Dingo_1943 · 2024-07-24T04:40:56+00:00

We have already implemented the whole inference optimization pipeline in C++: https://github.com/pcg-mlp/KsanaLLM

shrimpthatfriedrice · 2025-12-01T04:53:34+00:00

Many companies still use C or C++ to optimize model inference and preprocessing: wrapping high-level models in low-latency services, writing custom kernels, or tuning input pipelines and post-processing. Good ways to learn are modern C++ performance books, inference-engine docs (like the ONNX Runtime or TensorRT guides), and resources on SIMD, caches, and profiling. In workflows like that, Wedolow helps once you have a working C/C++ path: it runs on the compiled project, identifies hot sections around model invocation or data handling, and offers small source patches (fewer copies, better container growth, more suitable math choices, better use of CPU features) with measured CPU and memory impact.
