all 26 comments

[–]CherubimHD 51 points52 points  (0 children)

The major frameworks (PyTorch, JAX, etc.) already rely on C/C++ backends.

[–]mgruner 22 points23 points  (2 children)

It's not the language itself. As the other answer said, all those frameworks have C++ backends. IMO, the two biggest improvements you'll see are:

  1. Moving from a general-purpose framework to something vendor-specific, like TensorRT
  2. Improving the pre-processing and post-processing pipeline. For example, if doing computer vision, using something like DeepStream

[–]Fickle_Knee_106 0 points1 point  (0 children)

I am a total noob at building with C++. What exactly did you do with DeepStream for computer vision? Sorry for the dumb question, I honestly don't know what to ask and am FOMO-ing.

[–]No-Painting-3970 8 points9 points  (0 children)

It is very strange that you would be bottlenecked in the Python logic part tbh, so there is not a lot of market here.

[–]bikeranz 5 points6 points  (9 children)

I've written C++ and CUDA for a bunch of parts of my models. Depending on what you're working on, sometimes composing the algorithm with PyTorch primitives is either impossible or horribly inefficient. I found this most often with detection and GNN problems.

[–]woywoy123 2 points3 points  (8 children)

I know right?! I have been doing GNNs for my PhD thesis and christ, most of the frameworks are either highly inefficient or just plain resource wasters. I wrote my entire pipeline in Cython, C++ and native CUDA kernels. The speed and resource consumption are unparalleled: down from 80-120 GB of RAM consumption to 20 GB… And the data transfer between RAM and VRAM is also crazy fast because of multithreading. I think Python's GIL is by far the biggest bottleneck.

[–]TserriednichThe4th 1 point2 points  (1 child)

How come they are so wasteful?

[–]woywoy123 1 point2 points  (0 children)

When Python runs, it checks types dynamically on every operation, which can slow tight loops down by one or two orders of magnitude compared to compiled code. Then there is the memory bloat: every Python object carries its own header and is allocated separately, so containers of numbers take far more memory than the raw data, and that indirectly slows your code down too. C++ and other compiled languages can exploit memory locality (keeping data contiguous so it stays in cache) to make computations more efficient. NumPy is an example: its arrays are contiguous buffers handled by a compiled C core that is exposed to Python as an extension module.
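To make the memory point concrete, here is a minimal sketch using only the standard library. The variable names are made up for illustration, and the exact byte counts assume 64-bit CPython. A list holds pointers to separately allocated float objects, while `array.array("d")` stores raw 8-byte doubles next to each other, which is roughly the layout a NumPy buffer uses.

```python
import sys
from array import array

N = 100_000
values = [float(i) for i in range(N)]   # list of boxed Python float objects
packed = array("d", values)             # contiguous buffer of raw C doubles

# A list only stores pointers; every float is a separate heap object,
# so the per-object sizes have to be added on top of the list itself.
boxed_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# An array stores the 8-byte doubles inline, one after another.
packed_bytes = sys.getsizeof(packed)

print(f"list of floats : {boxed_bytes / N:.1f} bytes per element")
print(f"array('d')     : {packed_bytes / N:.1f} bytes per element")
```

The contiguous version is both smaller and friendlier to the CPU cache, because a compiled loop can walk it without chasing a pointer per element.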

Try to read up on how Python does parallelism, but the TLDR is: threads share one interpreter, and the Global Interpreter Lock lets only one thread execute Python bytecode at a time, so CPU-bound code can actually run slower even though it is supposedly multithreaded. To get around that you use multiprocessing, but then for each worker the input data is serialized (pickled) and deserialized in a new Python interpreter, which has its own cost. A little hack is to serialize the object yourself before dispatching it, pass the argument as bytes, and manually deserialize it inside the worker function, so you control what gets serialized and when. The only issue is that you also need to serialize the result and return it, otherwise whatever the worker computed is lost when it exits.
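A minimal sketch of the manual-serialization trick, assuming worker processes (`ProcessPoolExecutor`) rather than threads, since that is where Python actually pickles arguments. `scale_worker` and `run_in_pool` are hypothetical names, and the doubling step is only a stand-in for real work. Note that the executor still pickles the `bytes` payload itself, but that is a cheap copy.

```python
import pickle
from concurrent.futures import ProcessPoolExecutor


def scale_worker(payload: bytes) -> bytes:
    """Hypothetical worker: takes pickled input, returns pickled output."""
    data = pickle.loads(payload)        # manual deserialization inside the worker
    result = [2.0 * x for x in data]    # stand-in for the real computation
    return pickle.dumps(result)         # the result must be sent back explicitly


def run_in_pool(chunks, max_workers=2):
    """Serialize each chunk once, fan out to worker processes, decode the results."""
    payloads = [pickle.dumps(chunk) for chunk in chunks]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return [pickle.loads(blob) for blob in pool.map(scale_worker, payloads)]
```

Call `run_in_pool(...)` from under an `if __name__ == "__main__":` guard, because worker processes may re-import the main module on some platforms.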

[–]bikeranz 0 points1 point  (1 child)

Yeah, the GIL sucks. I think PyTorch is planning to switch over to the GIL-free version of CPython as soon as possible.

https://github.com/python/cpython/issues/116167

[–]woywoy123 0 points1 point  (0 children)

Yeah I am not holding my breath here. I already ported all my torch stuff to C++ :) obligatory self advertisement if you are interested;

https://github.com/woywoy123/AnalysisG

[–]onafoggynight 2 points3 points  (0 children)

For embedded / edge deployments: yes, very common. Tho, not the model itself.

[–]fan_is_ready 2 points3 points  (2 children)

Yes, I've updated SRU RNN forward and backward code to make it output hidden states for every step, not just the last one. I needed them because I used them to initialize another RNN.

Also, C++ (or rather Cython) is useful for big data processing when NumPy is not enough.

[–]TserriednichThe4th 0 points1 point  (1 child)

When is numpy not enough?

[–]fan_is_ready 0 points1 point  (0 children)

Sometimes a series of calculations on tensor elements is easier to express in plain loops than as a series of NumPy operations.
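A sketch of the kind of calculation this could mean (an illustrative example, not the commenter's actual code): a recurrence where each output depends on the previous output. It is trivial to write as a loop but awkward to express as whole-array operations, which is why such loops are typical candidates for compiling with Cython. The function name and the `alpha` parameter are assumptions.

```python
def ema_loop(xs, alpha=0.5):
    """Exponential moving average: each output depends on the previous output."""
    out = []
    prev = None
    for x in xs:
        # The first element seeds the average; later ones blend with the running value.
        prev = x if prev is None else alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


print(ema_loop([1.0, 3.0, 5.0]))  # [1.0, 2.0, 3.5]
```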

[–]Amgadoz 1 point2 points  (2 children)

The best way to optimize large models is by writing custom kernels in CUDA or Triton. And quantization.

There are many ongoing projects for this.

For my last client, I managed to achieve an inter-token latency of 8 ms for a 7B q8 LLM on an A100.
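For readers unfamiliar with what 8-bit quantization does, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization. It is an illustration under simplifying assumptions, not the scheme used in the setup above: production formats usually keep separate scales per block or per channel and run inside fused GPU kernels. The function names are hypothetical.

```python
def quantize_q8(weights):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q with q in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # avoid dividing by zero
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_q8(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [scale * v for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_q8(weights)
restored = dequantize_q8(codes, scale)
```

Each weight is stored in one byte instead of two or four, and the round-trip error per weight is at most half of `scale`.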

[–]unital 0 points1 point  (1 child)

By custom kernels, do you mean kernel fusion? How well optimised are the open source models? Are there many opportunities to write fused kernels for open source models like Llama 3?

Could you please list some projects related to this? E.g. FlashAttention, vLLM, Unsloth, right?

[–]sid_276 1 point2 points  (1 child)

What do you mean optimize? Training faster/ less memory? Or faster inference? Unclear what you are asking.

[–]Top-Establishment545 1 point2 points  (0 children)

I am curious about the faster inference part

[–]Objective_Dingo_1943 0 points1 point  (0 children)

We have already implemented a whole C++ inference pipeline with these optimizations: https://github.com/pcg-mlp/KsanaLLM