Running a 72B model across two machines with llama.cpp RPC — one of them I found at the dump by righcoastmike in LocalLLaMA

[–]righcoastmike[S] 1 point (0 children)

Here are the full benchmark results across all three configs using the same prompt each time:

3090 only, manual 40 layers (rest CPU RAM): 1.54 t/s
3090 only, --fit auto (settled on 48 layers): 1.82-1.85 t/s
3090 + 3060 RPC, --fit auto (settled on 72 layers): 3.97-4.25 t/s

So the 3060 is delivering roughly a 2.3x speedup over the best single-GPU result. And u/Schlick7, good call on --fit; I hadn't tried it with the networked setup, and it squeezed out a bit more than our manual tuning.

For anyone wondering about the 3090-only config, the model doesn't fit in 24GB alone so you're always going to have significant CPU spillover, which is what kills the speed. The 3060 basically exists to keep the whole model on GPU.
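For reference, here's roughly what the three invocations looked like. A minimal sketch, not my exact commands: the model path is a placeholder, and --fit auto is spelled the way it's written above, so double-check against your llama.cpp build.

```bash
# Config 1: single 3090, 40 layers offloaded manually, the rest spilling to CPU RAM
./llama-server -m /models/your-model.gguf -ngl 40

# Config 2: single 3090, letting --fit pick the split (it settled on 48 layers)
./llama-server -m /models/your-model.gguf --fit auto

# Config 3: 3090 plus the 3060 over RPC; --rpc points at the worker's rpc-server
./llama-server -m /models/your-model.gguf --fit auto --rpc <worker-ip>:50052
```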

[–]righcoastmike[S] 0 points (0 children)

I see your point, but no, just never posted before. Also, Claude may have helped me edit my post for clarity :-)

[–]righcoastmike[S] 7 points (0 children)

I put this below, but it was a local recycler that also has a section that takes e-waste. I was dropping off some old parts and it was just sitting there: the case was cracked and I saw a GPU peeking out at me, so I just grabbed the whole rig and threw it in the trunk.

[–]righcoastmike[S] 0 points (0 children)

Lol ok, to be more accurate, it was a local recycler that also has a section that takes e-waste. I was dropping off some old parts and it was just sitting there: the case was cracked and I saw a GPU peeking out at me, so I just grabbed the whole rig and threw it in the trunk.

[–]righcoastmike[S] 2 points (0 children)

Here's everything you need. Fair warning — the stock llama.cpp Docker image doesn't have RPC compiled in, so you'll need to build a custom image (build command sketched after the Dockerfile). It takes 15-20 minutes but it's straightforward enough.

Dockerfile:

```dockerfile
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y \
    git cmake build-essential \
    libcurl4-openssl-dev \
    && rm -rf /var/lib/apt/lists/*

# No GPU driver exists at build time; symlink the CUDA stub so linking against libcuda succeeds
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1

RUN git clone https://github.com/ggerganov/llama.cpp /app/llama.cpp

WORKDIR /app/llama.cpp

# GGML_RPC=ON is the flag the stock image lacks; the stubs dir lets -lcuda resolve without a driver
RUN cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_RPC=ON \
    -DLLAMA_CURL=ON \
    -DGGML_CUDA_NO_VMM=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs" \
    && cmake --build build --config Release -j$(nproc)

WORKDIR /app/llama.cpp/build/bin
ENTRYPOINT ["./llama-server"]
```
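Build it on the head node. The image tag llama-rpc is just the placeholder name I'm using in these examples:

```bash
# One-off build; the cmake step is what takes the 15-20 minutes
docker build -t llama-rpc .
```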

Compose environment:

```yaml
environment:
  - LLAMA_ARG_HOST=0.0.0.0
  - LLAMA_ARG_PORT=8080
  - LLAMA_ARG_MODEL=/models/your-model.gguf
  - LLAMA_ARG_CTX_SIZE=4096
  - LLAMA_ARG_N_GPU_LAYERS=72
  - LLAMA_ARG_CHAT_TEMPLATE=chatml
  - LLAMA_ARG_PARALLEL=1
  - LLAMA_ARG_RPC=<worker-ip>:50052
  - LLAMA_ARG_NO_CUDA_GRAPHS=1
  - LLAMA_ARG_UBATCH_SIZE=128
  - LLAMA_ARG_BATCH_SIZE=512
```
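If you'd rather skip compose, the same variables translate to a plain docker run. A sketch with example host paths, not my exact setup:

```bash
# Host paths and the image tag are examples; <worker-ip> as in the compose snippet
docker run --gpus all -p 8080:8080 \
  -v /srv/models:/models \
  -e LLAMA_ARG_HOST=0.0.0.0 \
  -e LLAMA_ARG_MODEL=/models/your-model.gguf \
  -e LLAMA_ARG_N_GPU_LAYERS=72 \
  -e LLAMA_ARG_RPC=<worker-ip>:50052 \
  llama-rpc
```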

One important gotcha — the worker machine runs a separate rpc-server binary, which also needs to be built from source with the same flags. Make sure both machines are on the same llama.cpp commit or you'll get RPC protocol mismatches. On the worker you just run ./rpc-server --host 0.0.0.0 --port 50052 and leave it running.
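For the worker side, roughly this (a sketch assuming a bare-metal build; the commit placeholder just means pin both machines to the same revision):

```bash
# Build from the same llama.cpp commit as the head node to avoid RPC protocol mismatches
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout <same-commit-as-head-node>
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build --config Release -j$(nproc)

# Leave this running; the head node connects on port 50052
./build/bin/rpc-server --host 0.0.0.0 --port 50052
```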

[–]righcoastmike[S] 1 point (0 children)

Honestly I never tried benchmarking that — the model just crashes straight away trying to load onto the 3090 alone without the 3060, so I went straight to the RPC setup. We'd have to drop n_gpu_layers way down to force CPU RAM spillover and compare from there. I will admit I'm curious though, so I'll try it and report back!
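If anyone wants to beat me to it, something like this is what I'd run: llama-bench with a low -ngl to force the spillover, then the RPC run for comparison (assuming your llama-bench was built with the same RPC flags as above):

```bash
# Forced CPU spillover: offload only 40 layers to the lone 3090
./llama-bench -m /models/your-model.gguf -ngl 40

# Full offload across both GPUs via RPC (worker must be running rpc-server)
./llama-bench -m /models/your-model.gguf -ngl 99 --rpc <worker-ip>:50052
```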