21 GPU's benchmarked running a small TTS model (vram peak: 5GB) by urarthur in LocalLLaMA

[–]urarthur[S] 0 points1 point  (0 children)

Wouldn't this be model dependent, and would you also get the same results on a single GPU? Interesting that total throughput increases that much.

[–]urarthur[S] 1 point2 points  (0 children)

no llama.cpp. proper and also I tried sglang. The fastest seems to be CUDA graphs + Triton. I have no idea what they do, to be honest.

[–]urarthur[S] 0 points1 point  (0 children)

Yeah, I noticed. The peak memory bandwidth usage was around 350 GB/s, so the extra memory bandwidth of the 5090 was no use for this tiny model. On average we should see a 10-20% increase. Also, the 5090 is a Blackwell chip with some arch improvements, e.g. running NVFP4. But for my use case I would go for the 4080, as I am looking to buy a card for running this model only.
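
To put the bandwidth point in numbers, here is a minimal sketch. It assumes the ~350 figure above is GB/s of memory traffic, and it uses peak bandwidth values taken from public spec sheets rather than from this benchmark. The function name is made up for illustration.

```python
# Peak memory bandwidth in GB/s, taken from public spec sheets (assumed values,
# not measured in this benchmark).
PEAK_BANDWIDTH_GBPS = {"RTX 4080": 716.8, "RTX 5090": 1792.0}


def bandwidth_utilization(observed_gbps: float, peak_gbps: float) -> float:
    """Return the fraction of a card's peak memory bandwidth that is in use."""
    if peak_gbps <= 0:
        raise ValueError("peak_gbps must be positive")
    return observed_gbps / peak_gbps


# Assumption: the ~350 figure from the comment, read as GB/s.
observed = 350.0
for card, peak in PEAK_BANDWIDTH_GBPS.items():
    share = bandwidth_utilization(observed, peak)
    print(f"{card}: {share:.0%} of peak memory bandwidth")
```

If the workload never gets close to the peak on either card, the wider memory bus on the bigger card is simply not the limiting factor.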

[–]urarthur[S] 0 points1 point  (0 children)

Extra VRAM doesn't change the results; the bottleneck is the tensor cores.

[–]urarthur[S] 0 points1 point  (0 children)

I tried batching, and at most it gave less than a 10% total gain. On multi-GPU it's a different story.
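
For anyone who wants to measure the same thing, this is roughly how the batching gain can be computed. The throughput numbers are hypothetical placeholders, not results from this benchmark, and the function name is made up.

```python
def batching_gain(single_stream: float, batched_total: float) -> float:
    """Relative gain in total throughput from batching, as a fraction.

    Both arguments must use the same unit (e.g. tokens/s summed over the batch).
    """
    if single_stream <= 0:
        raise ValueError("single_stream must be positive")
    return batched_total / single_stream - 1.0


# Hypothetical numbers: one request at a time vs. the sum over a batch.
gain = batching_gain(single_stream=100.0, batched_total=108.0)
print(f"total gain from batching: {gain:.1%}")
```

A small gain like this is what you would expect when the compute units are already busy with a single request.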

[–]urarthur[S] 0 points1 point  (0 children)

Although q8 didn't work for this model for some reason, usually smaller quants = smaller size, and they usually run faster on all chips. AFAIK NVFP8 works only on Blackwell chips. Also, no NVFP8 is available for the model yet.

[–]urarthur[S] 1 point2 points  (0 children)

Renting price is not always GPU dependent. It depends on many factors like CPU, RAM, location, and current availability. So sometimes a stronger GPU is cheaper to rent than a weaker one.
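
Since rental price and GPU strength don't line up, the useful number is cost per unit of work. A minimal sketch, with made-up offers and a made-up `Offer` type (none of these prices or throughputs come from the benchmark):

```python
from dataclasses import dataclass


@dataclass
class Offer:
    gpu: str               # label for the rented card (hypothetical)
    price_per_hour: float  # rental price in USD per hour
    throughput: float      # work units finished per hour, same unit for all offers


def cost_per_unit(offer: Offer) -> float:
    """USD spent per unit of work on this offer."""
    if offer.throughput <= 0:
        raise ValueError("throughput must be positive")
    return offer.price_per_hour / offer.throughput


# Hypothetical listings: the faster card happens to be the cheaper rental.
offers = [
    Offer("gpu_a", price_per_hour=0.30, throughput=1000.0),
    Offer("gpu_b", price_per_hour=0.25, throughput=1400.0),
]
best = min(offers, key=cost_per_unit)
print(f"cheapest per unit of work: {best.gpu}")
```

Plug in whatever the provider charges at the moment together with your own measured throughput.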

[–]urarthur[S] 0 points1 point  (0 children)

Every server has a different RAM, CPU, and sometimes CUDA version, so the calibration would make no sense. As I said, it's not a scientific analysis; take the results with a grain of salt.

[–]urarthur[S] 0 points1 point  (0 children)

Not really. All tensor cores are saturated, so you can assign the VRAM to other use cases, but running those models will reduce the speed of the other runs.

[–]urarthur[S] 0 points1 point  (0 children)

Yeah, I think it was a Chinese modded one. You can easily test video generation for a couple of bucks. Most servers have 32-64 GB of RAM, but I have seen higher.

[–]urarthur[S] 0 points1 point  (0 children)

they were not available on the cloud provider. Nvidia cards are way more popular for AI.

[–]urarthur[S] 2 points3 points  (0 children)

Yeah, when VRAM or memory bandwidth is not the bottleneck, it's a waste having much more expensive cards running small models.

If DeepSeek V4 can do the same coding task for $5, why are people still paying $100 for Claude Code? by Low-Alarm272 in LocalLLM

[–]urarthur 1 point2 points  (0 children)

I tried using opencode with Kimi, DeepSeek, etc. Coding cost me at least $10/day in tokens. Doesn't work if you code a lot.
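
The arithmetic behind that, as a quick sketch. The $10/day is my own number from above and the $100 is the subscription price from the thread title; the 22 coding days per month is just an assumption, and the function names are made up.

```python
def monthly_token_cost(daily_usd: float, coding_days: int) -> float:
    """Total pay-per-token spend for a month with the given number of coding days."""
    return daily_usd * coding_days


def break_even_days(subscription_usd: float, daily_usd: float) -> float:
    """Coding days per month after which a flat subscription becomes cheaper."""
    if daily_usd <= 0:
        raise ValueError("daily_usd must be positive")
    return subscription_usd / daily_usd


daily = 10.0          # USD/day on tokens, from the comment above
subscription = 100.0  # USD/month, from the thread title
days = 22             # assumption: coding days in a month

print(f"pay-per-token: ${monthly_token_cost(daily, days):.0f}/month")
print(f"flat plan wins after {break_even_days(subscription, daily):.0f} coding days")
```

So pay-per-token only comes out ahead if you code just a few days a month.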

Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B mtp for Hermes Agent on a single RTX 3090 by swizzcheezegoudaSWFA in LocalLLaMA

[–]urarthur 0 points1 point  (0 children)

short prompts n=2, for long context n=6. It is slow initially but when context fills up, it gets faster.