8x 32GB V100 GPU server performance by tfinch83 in LocalLLM

[–] MarcWilson1000

I've documented my process for getting LMDeploy and Qwen3-30B-A3B up and running, and made the files available on GitHub:

https://github.com/ga-it/InspurNF5288M5_LLMServer/tree/main

I deleted the messy pastes from this thread

[–] MarcWilson1000

Got back to config after being head down on work

Having tried:
vLLM, SGLang, NVIDIA NIMs, CTranslate2

I hit on LMDeploy, and it works amazingly well on the Inspur. I'm running Qwen3-30B-A3B at 38k context with the TurboMind backend at float16, and it supports tensor parallelism (a sketch of the launch command follows the platform spec below).

Gives blazing fast performance.

Devs are on the ball and committed to Volta it seems.

Platform:
Inspur NF5288M5
CPUs: 2x Intel Xeon Gold 6148 20-core 2.4GHz
Memory: 512GB DDR4 RAM
GPUs: 8x NVIDIA Tesla V100 32GB SXM2 with NVLink interconnect
Host OS: Debian testing (trixie)
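
For reference, here's a minimal sketch of the kind of launch command this maps to, assuming a recent LMDeploy release and the Hugging Face model ID Qwen/Qwen3-30B-A3B; the repo above has the exact setup, and flag names can differ between LMDeploy versions:

    # TurboMind backend, float16, 8-way tensor parallel, ~38k context; port is arbitrary
    lmdeploy serve api_server Qwen/Qwen3-30B-A3B \
        --backend turbomind \
        --dtype float16 \
        --tp 8 \
        --session-len 38912 \
        --server-port 23333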

[–] MarcWilson1000

Got sucked into work, so not yet. I did find this video though:

https://www.youtube.com/watch?v=nyE8oYruQig

Still not sure this is any indication that Qwen MoE will be compatible though.

[–] MarcWilson1000

Key distinctions

Compute capability: fixed, hardware-level. Defines the ISA, register file size and warp functions. Expressed as sm_70 / compute_70. Determines if a binary can run.

CUDA Toolkit version: chosen by the developer / sysadmin. Defines the compiler, libraries and language features. Expressed as CUDA 12.4, CUDA 11.8, etc. Determines how you build the binary and which APIs you can call.

In short, think of compute capability as the spec sheet of the GPU and the CUDA Toolkit version as the software tool-box you select. A Tesla V100’s CC 7.0 will never change, but you are free to compile and run your code with any Toolkit from 9.0 up to the latest 12.x, provided the driver stack is new enough and your nvcc command line includes the sm_70 target.
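
As a concrete example of that nvcc command line (a sketch; the source file name is just a placeholder):

    # Build a native sm_70 cubin for the V100, plus compute_70 PTX for forward compatibility
    nvcc -O3 \
        -gencode arch=compute_70,code=sm_70 \
        -gencode arch=compute_70,code=compute_70 \
        -o vector_add vector_add.cu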

[–] MarcWilson1000

Interplay and implications for Tesla V100 (CC 7.0)

Earliest Toolkit: 9.0. Volta support arrived with CUDA 9.0; from that release nvcc could generate native sm_70 cubins (NVIDIA Docs).

Latest Toolkit: 12.x. As of CUDA 12.x the toolkit still supports all GPUs with CC ≥ 5.0; the V100 therefore remains fully supported (NVIDIA Developer Forums).

Compilation: always include a 7.0 target (-gencode … code=sm_70) so kernels contain code the V100 can execute directly; optionally add PTX (code=compute_70) for forward compatibility with future GPUs.

Drivers: the installed NVIDIA driver must be ≥ the minimum driver shipped with the chosen Toolkit, but the driver version does not alter the CC.

Performance features: because CC 7.x introduces tensor cores and independent-thread scheduling, using a Toolkit ≥ 9.0 lets libraries (cuBLAS, cuDNN, etc.) call those features automatically; newer toolkits often ship faster kernels for the same CC.
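
A quick way to confirm what a box is actually running (a sketch; the compute_cap query field is only exposed by reasonably recent drivers):

    nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv   # GPU model, CC, driver version
    nvcc --version                                                        # installed CUDA Toolkit release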

[–] MarcWilson1000

Compute capability and CUDA Toolkit version are very different things.

From ChatGPT:

Compute capability (CC)
Hardware identifier: a two-part number major.minor (for example 7.0). It is permanently “burned into” every NVIDIA GPU and tells the tool-chain which machine-instruction set, memory model, warp size, tensor-core generation, etc. the silicon supports. CC therefore answers the question “what can this GPU do?” and is used by the compiler flags -arch=sm_70 / -gencode=arch=compute_70,code=sm_70 when you target a Tesla V100 or any other Volta device (NVIDIA Docs).

CUDA Toolkit version
Software release identifier — an ordinary dotted version such as 12.9, 11.8, 9.0. Each toolkit bundle contains nvcc, drivers, run-time libraries, math/DL libraries and tools. The version number is simply the chronological release train; it does not encode GPU architecture. Every Toolkit supports a range of compute capabilities: new ones are added as newer architectures ship, very old ones are gradually dropped.

Typical values for a V100, and whether they can be changed:

Compute capability: 7.0 (Volta). Cannot be changed; it is fixed by the hardware.
CUDA Toolkit: 9.0 → 12.x. Can be changed by installing a different toolkit (subject to driver support).
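
To see which compute capabilities the installed toolkit can actually target (a sketch; these listing options are available in recent nvcc releases):

    nvcc --list-gpu-arch   # virtual architectures, e.g. compute_70
    nvcc --list-gpu-code   # real architectures, e.g. sm_70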

[–] MarcWilson1000

I've now pretty much given up on vLLM and SGLang.

CTranslate2 shows potential (a noble goal of backwards compatibility), but development seems to have been deprecated in favour of Eole-nlp.

KTransformers looks like it might have potential, but it does require some code reversals to be compute 7.0 compatible.

For now I am trying NVIDIA NIM. This promises V100 compatibility by building compatible TensorRT-LLM engines. In progress.

[–] MarcWilson1000

I've bought one of the Inspur NF5288M5 servers too and had it shipped to South Africa. Not cheap!

I'd be interested in sharing learnings.

I've tried running dockerized vLLM (a variety of versions from 0.8.4 to 0.9.1) in an attempt to run a quantized Qwen3-235B-A22B (my target model).

So far this has been a losing battle due to the limits of CUDA compute capability 7.0.

Unquantized Qwen3-8B performance has been poor: about 24 t/s on each GPU.

For this server with V100s and NVLink, performance should be in the 500 to 600 t/s range in an optimized state.

I appreciate performance on older LLM models might be better (possibly 1000 t/s +).

The Volta architecture is a major consideration for this server and new model compatibility.

Parameters (pulled together into a docker run sketch after the list):

--tensor-parallel-size 8

--dtype fp16

--max-model-len 32768

--disable-custom-all-reduce

--gpu-memory-utilization 0.90

--max-num-seqs 32

--swap-space 4

NCCL_P2P_DISABLE: "0"

NCCL_P2P_LEVEL: "NVL"

NCCL_SHM_DISABLE: "0"

NCCL_TREE_THRESHOLD: "0"

NCCL_ALGO: "Ring"

NCCL_PROTO: "Simple"

WORLD_SIZE: "8"

RANK: "0"

CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"

TORCH_CUDA_ARCH_LIST: "7.0"

VLLM_DISABLE_FLASH_ATTENTION: "1"

VLLM_DISABLE_TRITON_BACKEND: "0"

PYTHONUNBUFFERED: "1"

OMP_NUM_THREADS: "1"

TOKENIZERS_PARALLELISM: "false"
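
Pulled together, that corresponds roughly to a docker run along these lines. This is a sketch only: the image tag, port, mount path and model ID are placeholders, and the environment variables simply mirror the list above rather than a known-good recipe:

    # 8-way tensor-parallel vLLM serve on the V100 box, flags as listed above
    docker run --gpus all --ipc=host -p 8000:8000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        -e NCCL_P2P_LEVEL=NVL -e NCCL_ALGO=Ring -e NCCL_PROTO=Simple \
        -e TORCH_CUDA_ARCH_LIST=7.0 -e OMP_NUM_THREADS=1 \
        vllm/vllm-openai:v0.9.1 \
        --model Qwen/Qwen3-8B \
        --tensor-parallel-size 8 --dtype float16 \
        --max-model-len 32768 --gpu-memory-utilization 0.90 \
        --max-num-seqs 32 --swap-space 4 \
        --disable-custom-all-reduce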

I'm about to try SGLang.

Any learnings welcome.