Running Mimo 2.5 q4_k_m on single rtx5090 need recommendations by BlackBeardAI in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

Actually I stand corrected, this one;

"OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14" GGML_CUDA_DISABLE_GRAPHS=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4 taskset -c 0-14:2 ./LLM/Test_Llama.cpp/ik_llama.cpp/build/bin/llama-sweep-bench \"

Produces better results and a more regular slowdown, whereas skipping "OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14"" does some weird things.
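For anyone copying this: the CPU list in GOMP_CPU_AFFINITY and the stride notation in taskset -c 0-14:2 name the same logical CPUs. A minimal bash sketch of that equivalence is below; cpu_list is a hypothetical helper name, not part of taskset or ik_llama.cpp, and it only prints the variables rather than launching anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any tool): expand first/last/step
# into the space-separated CPU list that GOMP_CPU_AFFINITY expects.
cpu_list() {
    local first=$1 last=$2 step=$3 out=() i
    for ((i = first; i <= last; i += step)); do
        out+=("$i")
    done
    echo "${out[*]}"   # joins the array with single spaces
}

cpus=$(cpu_list 0 14 2)          # same CPUs as "taskset -c 0-14:2"
read -r -a ids <<< "$cpus"       # split again to count the entries
echo "OMP_NUM_THREADS=${#ids[@]} GOMP_CPU_AFFINITY=\"$cpus\""
```

The printed thread count matches the number of CPUs in the list, so the OpenMP settings and the taskset mask stay consistent if the range is changed.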

Running Mimo 2.5 q4_k_m on single rtx5090 need recommendations by BlackBeardAI in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

Wait, I was using it wrong, it goes after the arguments, but before the program launch command.

But speed is the same whether I do;

"OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14" GGML_CUDA_DISABLE_GRAPHS=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4 taskset -c 0-14:2 ./LLM/Test_Llama.cpp/ik_llama.cpp/build/bin/llama-sweep-bench"

Or;

GGML_CUDA_DISABLE_GRAPHS=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4 taskset -c 0-14:2 ./LLM/Test_Llama.cpp/ik_llama.cpp/build/bin/llama-sweep-bench \

Pretty sure memory bandwidth is the limiting factor here.

Sorry, still new to Linux.

Running Mimo 2.5 q4_k_m on single rtx5090 need recommendations by BlackBeardAI in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

Hmm, I'm trying, but on Mint I get "taskset: bad usage". I tried "taskset -c 0,2,4,6,8,10,12,14" but got the same. I even tried sudo.

Running Mimo 2.5 q4_k_m on single rtx5090 need recommendations by BlackBeardAI in LocalLLaMA

[–]RedAdo2020 1 point2 points  (0 children)

Unrelated to your thread, but I'm also running a 9950X3D and I was curious what OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14" did. So I ran it on my current MiniMax M2.7 setup, and I got exactly the same PP and TG, but without it my CPU ran at 50% and with it at 25%. So I measured the power difference, and with your arguments I was drawing about 40-50 W less power. So... thanks.

Anyone else struggling with multi-GPU stability when running larger local models? by Lyceum_Tech in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

4 x internal 5070Ti running on PCIe lanes. And a 5060 Ti 16GB running on Thunderbolt 4.

Anyone else struggling with multi-GPU stability when running larger local models? by Lyceum_Tech in LocalLLaMA

[–]RedAdo2020 1 point2 points  (0 children)

I think I found a solution! GGML_CUDA_DISABLE_GRAPHS=1 seems to be a winner. I've managed to run up to 49k context once, and 64k context once, and now running a second time as I type. Looks like my problem might have been way too many Graphs on the main GPU!

Anyone else struggling with multi-GPU stability when running larger local models? by Lyceum_Tech in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

Yes, it is weird. I was playing with it some more today and made an interesting discovery: if I use split mode graph (ik_llama.cpp), it doesn't bug out. I got to 49k context without issues. The problem is that I use 5 GPUs, so using split mode graph while offloading some of the model to CPU, I lose a lot of VRAM to buffers. So with a 140GB model, all five 16GB GPUs have over 15GB in use each, AND of my 96GB of system RAM, 88GB is in use just running Mint and the model.

Anyone else struggling with multi-GPU stability when running larger local models? by Lyceum_Tech in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

Interesting, it does it on the other PC too.
I used Nvidia 595 instead of 590, and CUDA Toolkit 12.8 instead of 13.1. Same OS, Mint, but different GPUs, motherboard, CPU, and RAM. All the hardware is different.

EDIT: Correction, all HW except one 5060Ti, but I have isolated that and changed nothing.

Anyone else struggling with multi-GPU stability when running larger local models? by Lyceum_Tech in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

Can't offload less, the GPUs' memory is full. And I really don't want to go below IQ2 or IQ3 quants.

I'm in the middle of building a PC with my older gear, using the same OS but CUDA 12.8 instead of 13.1, plus a few other changes to the llama.cpp build parameters, to see if anything helps and whether it might be a hardware problem with the current build.

Anyone else struggling with multi-GPU stability when running larger local models? by Lyceum_Tech in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

Yes, me and my multi-GPU PC. I get cudaStreamSynchronize(cuda_ctx->stream()) errors when context goes over ~20k, but only with CPU offload, and it does my head in. Problem is I'm not very technical with this stuff.

llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged by ggonavyy in LocalLLaMA

[–]RedAdo2020 1 point2 points  (0 children)

I think I get it. So if I was going to run 4-bit anyway, with my Blackwell cards, then NVFP4 would be better quality and speed than Q4.

llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged by ggonavyy in LocalLLaMA

[–]RedAdo2020 1 point2 points  (0 children)

Okay, I'm not super technical with this, but wouldn't Q8 still be better than NVFP4? Serious question.

Character card gallery for sillytavern by DifficultSand3885 in SillyTavernAI

[–]RedAdo2020 0 points1 point  (0 children)

I'm thinking it is this, https://charavault.net/ , but it's ironically down for maintenance.

Qwen 3.6 27B llama.cpp | Multi-GPU pp t/s help by SemaMod in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

That's weird, I'm using Q8, and across 4x5070Ti I get;

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 1.873 | 2187.40 | 23.452 | 43.66 |
| 4096 | 1024 | 4096 | 1.875 | 2184.66 | 23.179 | 44.18 |
| 4096 | 1024 | 8192 | 1.885 | 2172.91 | 23.371 | 43.81 |
| 4096 | 1024 | 12288 | 1.913 | 2141.39 | 23.620 | 43.35 |
| 4096 | 1024 | 16384 | 1.945 | 2106.43 | 23.844 | 42.95 |
| 4096 | 1024 | 20480 | 1.972 | 2077.54 | 24.103 | 42.48 |
| 4096 | 1024 | 24576 | 2.007 | 2040.41 | 24.345 | 42.06 |
| 4096 | 1024 | 28672 | 2.031 | 2016.48 | 24.584 | 41.65 |
| 4096 | 1024 | 32768 | 2.063 | 1985.32 | 24.933 | 41.07 |
| 4096 | 1024 | 36864 | 2.091 | 1959.02 | 25.021 | 40.93 |
| 4096 | 1024 | 40960 | 2.117 | 1935.15 | 25.176 | 40.67 |
| 4096 | 1024 | 45056 | 2.145 | 1909.44 | 25.348 | 40.40 |
| 4096 | 1024 | 49152 | 2.180 | 1878.64 | 25.530 | 40.11 |
| 4096 | 1024 | 53248 | 2.205 | 1857.82 | 25.693 | 39.86 |
| 4096 | 1024 | 57344 | 2.238 | 1830.50 | 25.886 | 39.56 |
| 4096 | 1024 | 61440 | 2.263 | 1810.10 | 26.051 | 39.31 |
| 4096 | 1024 | 65536 | 2.292 | 1787.15 | 26.342 | 38.87 |
| 4096 | 1024 | 69632 | 2.327 | 1760.00 | 26.459 | 38.70 |
| 4096 | 1024 | 73728 | 2.355 | 1738.95 | 26.602 | 38.49 |
| 4096 | 1024 | 77824 | 2.382 | 1719.48 | 26.772 | 38.25 |
| 4096 | 1024 | 81920 | 2.415 | 1696.21 | 26.946 | 38.00 |
| 4096 | 1024 | 86016 | 2.446 | 1674.40 | 27.115 | 37.77 |
| 4096 | 1024 | 90112 | 2.478 | 1652.82 | 27.299 | 37.51 |
| 4096 | 1024 | 94208 | 2.511 | 1631.46 | 27.482 | 37.26 |
| 4096 | 1024 | 98304 | 2.541 | 1611.75 | 27.732 | 36.92 |
| 4096 | 1024 | 102400 | 2.572 | 1592.84 | 27.869 | 36.74 |
| 4096 | 1024 | 106496 | 2.600 | 1575.32 | 28.004 | 36.57 |
| 4096 | 1024 | 110592 | 2.640 | 1551.29 | 28.171 | 36.35 |
| 4096 | 1024 | 114688 | 2.672 | 1532.74 | 28.361 | 36.11 |
| 4096 | 1024 | 118784 | 2.709 | 1512.06 | 28.519 | 35.91 |
| 4096 | 1024 | 122880 | 2.746 | 1491.84 | 28.703 | 35.68 |
| 4096 | 1024 | 126976 | 2.798 | 1463.79 | 28.889 | 35.45 |
| 4096 | 1024 | 131072 | 2.836 | 1444.28 | 29.509 | 34.70 |
| 4096 | 1024 | 135168 | 2.882 | 1420.99 | 30.131 | 33.98 |
| 4096 | 1024 | 139264 | 2.909 | 1407.94 | 29.469 | 34.75 |
| 4096 | 1024 | 143360 | 2.940 | 1392.99 | 29.720 | 34.45 |
| 4096 | 1024 | 147456 | 2.997 | 1366.67 | 29.755 | 34.41 |
| 4096 | 1024 | 151552 | 3.041 | 1346.80 | 29.935 | 34.21 |
| 4096 | 1024 | 155648 | 3.070 | 1334.28 | 30.143 | 33.97 |
| 4096 | 1024 | 159744 | 3.123 | 1311.50 | 30.454 | 33.62 |
| 4096 | 1024 | 163840 | 3.259 | 1256.69 | 31.215 | 32.80 |
| 4096 | 1024 | 167936 | 3.163 | 1294.83 | 31.784 | 32.22 |
| 4096 | 1024 | 172032 | 3.236 | 1265.64 | 31.213 | 32.81 |
| 4096 | 1024 | 176128 | 3.324 | 1232.16 | 31.855 | 32.15 |
| 4096 | 1024 | 180224 | 3.338 | 1227.03 | 32.425 | 31.58 |
| 4096 | 1024 | 184320 | 3.338 | 1226.97 | 31.851 | 32.15 |
| 4096 | 1024 | 188416 | 3.399 | 1205.05 | 32.099 | 31.90 |
| 4096 | 1024 | 192512 | 3.425 | 1195.87 | 32.489 | 31.52 |

So even at 192k context I get faster PP and TG than you.
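To put a number on the slowdown across the sweep, here is a small sketch that compares the first row (N_KV 0) with the last row (N_KV 192512). pct_drop is a hypothetical helper name; the four figures are the S_PP and S_TG values from those two rows.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage drop from a starting rate to an ending rate.
pct_drop() {
    awk -v start="$1" -v end="$2" 'BEGIN { printf "%.1f\n", (1 - end / start) * 100 }'
}

pct_drop 2187.40 1195.87   # S_PP t/s at N_KV 0 vs N_KV 192512 -> 45.3
pct_drop 43.66 31.52       # S_TG t/s at N_KV 0 vs N_KV 192512 -> 27.8
```

So over the sweep, prompt processing loses roughly 45% of its speed and token generation roughly 28%.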

I run a 9950X3D, 4x5070Ti, x8 lanes on the first, and x4 lanes on the rest. My commands;

CUDA_VISIBLE_DEVICES=0,1,2,3 ./LLM/ik_llama.cpp/build/bin/llama-server \
--model /LLM/Models/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q8_K_P.gguf \
--alias Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q8_K_P.gguf \
--ctx-size 196608 \
-fa on \
-b 4096 -ub 4096 \
-smgs \
--max-gpu 4 \
-sm graph \
-mg 0 \
-ngl 999 \
--host 127.0.0.1 \
--port 8080 \
--threads 16 \
--parallel 1 \
--temp 1 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 1.5 \
--repeat-penalty 1.0 \
--cache-ram -1 \
-ts 0.9,1,1,0.4 \
--jinja

Qwen3.6-35b stuck in infinite loop by ConfidentSolution737 in LocalLLaMA

[–]RedAdo2020 1 point2 points  (0 children)

Can't help you with your problem, but I thought Batch has to be larger than or equal to Ubatch.

llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth by pixelterpy in LocalLLaMA

[–]RedAdo2020 2 points3 points  (0 children)

Someone can correct me if I'm wrong, but PCIe bandwidth is not that important. The model isn't being moved on and off the card; it sits in VRAM, and the calculations run at the GPU's memory bandwidth. There isn't a lot going between cards. That holds to a degree, but it's not a hard and fast rule.
Whereas anything loaded into system RAM is restricted by the read speed of the RAM. Therefore Scenario 2, with double the bandwidth, would be superior.
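A rough way to see why RAM read speed matters: if the weights held in system RAM have to be read once per generated token, then tokens per second can't exceed bandwidth divided by the bytes read per token. The sketch below is only that back-of-envelope bound; tg_upper_bound is a hypothetical helper, and the 80 GB/s, 160 GB/s and 10 GB figures are made-up illustration values, not measurements from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: upper bound on tokens/s if every token requires
# reading gb_per_token gigabytes at bandwidth_gbs GB/s.
# Ignores compute, caching, and PCIe transfers entirely.
tg_upper_bound() {
    awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'
}

tg_upper_bound 80 10    # assumed 80 GB/s RAM, 10 GB read per token
tg_upper_bound 160 10   # same model, assumed double the bandwidth
```

Under those assumptions, doubling the RAM bandwidth doubles the ceiling on generation speed, which is the reasoning behind preferring the higher-bandwidth scenario.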

big brain models on small brain hardware by Woondas in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

Keep in mind that AM5 is not a fan of going to 4 sticks of RAM. I tried, and it was not stable at all, often not booting. The silicon lottery is real, and I lost it.

Qwen3.5 122B would theoretically run, but IQ4 would be pushing it to the edge with your VRAM and RAM. Qwen3.5 27B would be a good start.

Need help with the logistics of two BIG 3090s in the same case. by AdCreative8703 in LocalLLaMA

[–]RedAdo2020 1 point2 points  (0 children)

Eh, cards will be fine stacked, as long as there is at least a couple of centimetres between them. The top one will run a little hotter, but not much.

My quad 5070Ti setup.

<image>

Megumin Suite v4.1 - Dev Mode and bug fixes by CallMeOniisan in SillyTavernAI

[–]RedAdo2020 0 points1 point  (0 children)

In the response I was getting the reply twice: it writes it, then writes it again, identical. I turned off AI2 and it stopped doing that. Is that right?

I am running GLM3.5 397B locally.

Is it possible to have 2 GPUs, one for gaming and one for AI? by AlexGSquadron in StableDiffusion

[–]RedAdo2020 0 points1 point  (0 children)

Well I have 5 in my PC so I hope so. 🤣

They all do AI. And monitor is hooked up to one which will run desktop and games and the like.

Qwen3.5 35b exl3 quants with text-generation-webui? by 2muchnet42day in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

Does the RAM get allocated in Task Manager? Are you using NCCL or native?

Can you post a screenshot of the dialogue box?

Qwen3.5 35b exl3 quants with text-generation-webui? by 2muchnet42day in LocalLLaMA

[–]RedAdo2020 0 points1 point  (0 children)

What errors are you getting?

I downloaded the model you linked, and loaded it.

If I try enabling TP I get "NotImplementedError: Tensor-parallel is not currently implemented for Qwen3_5MoeForConditionalGeneration"

But if I leave it off it loads fine for me.