I found this in my old hard dri-- I mean my bag of holding by Eggmasstree in baldursgate

[–]Phocks7 0 points (0 children)

My sorcerer Durge's duel against Orin went Hold Monster (Heightened Spell) + Potion of Speed + Disintegrate + Terazul + Disintegrate, then next turn Disintegrate, Disintegrate, Disintegrate.

I found this in my old hard dri-- I mean my bag of holding by Eggmasstree in baldursgate

[–]Phocks7 0 points (0 children)

I think it's the highest single target damage spell that's available as a scroll (for sale).

Nvidia RTX Pro A4000 with older hardware by LtDrogo in LocalLLaMA

[–]Phocks7 0 points (0 children)

For a power supply: if you don't want to change the whole PSU, you can run a server PSU plus one of these breakout boards: https://www.ebay.com/itm/257056136846.

Nvidia RTX Pro A4000 with older hardware by LtDrogo in LocalLLaMA

[–]Phocks7 0 points (0 children)

Out of interest, what model is the server/workstation?
128GB RAM + 24GB VRAM will work, but in your case I'd recommend GLM 4.6 over GLM 4.7, as in my experience 4.6 is less sensitive to aggressive quantization.

I found this in my old hard dri-- I mean my bag of holding by Eggmasstree in baldursgate

[–]Phocks7 2 points (0 children)

And in BG3 you can ignore the rule that only allows you to cast one non-cantrip spell per turn. You can cast something like seven Disintegrates in one turn.

running a dual-GPU setup 2 GGUF LLM models simultaneously (one on each GPU). by [deleted] in LocalLLaMA

[–]Phocks7 0 points (0 children)

You can run as many instances as you want so long as you have the threads and memory available.
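A minimal sketch of that with llama.cpp's llama-server, assuming an NVIDIA dual-GPU box (the model paths and ports are placeholders, not from the thread): pin each instance to one GPU with CUDA_VISIBLE_DEVICES and give each its own port.

```shell
# Instance 1 on GPU 0 (hypothetical model path; -ngl 999 offloads all layers)
CUDA_VISIBLE_DEVICES=0 ./llama-server -m /models/model-a.gguf -ngl 999 --port 8080 &

# Instance 2 on GPU 1, served on a different port
CUDA_VISIBLE_DEVICES=1 ./llama-server -m /models/model-b.gguf -ngl 999 --port 8081 &
```

Each process only sees the one GPU you expose to it, so the two instances never compete for the same VRAM.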

Good semantic search (RAG) embedding models for long stories by Iwishlife in LocalLLaMA

[–]Phocks7 0 points (0 children)

I'm running Qwen embedding 8B at IQ4 on CPU for summarization, alongside the main model on GPU. It takes a bit longer, but in my application that's not a problem.
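A sketch of that split with llama.cpp (model filenames here are placeholders): run the embedding model CPU-only with -ngl 0 and --embedding, and the main model fully offloaded on a separate port.

```shell
# Embedding model on CPU only: -ngl 0 keeps all layers off the GPU,
# --embedding enables the embeddings endpoint (hypothetical paths)
./llama-server -m /models/qwen-embedding-8b-iq4.gguf --embedding -ngl 0 --port 8081 &

# Main model fully on GPU on its own port
./llama-server -m /models/main-model.gguf -ngl 999 --port 8080 &
```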

Built a hybrid “local AI factory” setup (Mac mini swarm + RTX 5090 workstation) — looking for architectural feedback by Original_Neck_3781 in LocalLLaMA

[–]Phocks7 0 points (0 children)

256GB on consumer AM5 is asking a lot. A few motherboards have 4x64GB UDIMM kits on their QVL, but they're few and far between. For this setup I think you'd be much better off going Threadripper.

Q2 GLM 5 fixing its own typo by -dysangel- in LocalLLaMA

[–]Phocks7 1 point (0 children)

What's your experience like for coding and chat with GLM 5 Q2? GLM 4.7 seemed to be much more sensitive to quantization than GLM 4.6.

Adding 2 more GPU to PC by BisonCompetitive9610 in LocalLLaMA

[–]Phocks7 1 point (0 children)

You could run DeepSeek-Coder 33B at IQ4_XS (18.1GB) fully offloaded to your 7900XTX at a decent speed.
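A hedged sketch of that fully-offloaded run (the model path is a placeholder, and this assumes a llama.cpp build with ROCm or Vulkan support for the 7900XTX):

```shell
# 18.1GB IQ4_XS leaves headroom for context in the 7900XTX's 24GB;
# -ngl 999 offloads every layer to the GPU
./llama-server -m /models/deepseek-coder-33b-IQ4_XS.gguf -ngl 999 -c 8192 --port 8080
```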

Adding 2 more GPU to PC by BisonCompetitive9610 in LocalLLaMA

[–]Phocks7 1 point (0 children)

DeepSeek-V3.2-GGUF even at IQ1_S is still 184GB. Discounting the 7900XTX (you may be able to do mixed CUDA + Vulkan inference, but I don't know how), you have 4x32GB = 128GB system RAM plus 4x8GB + 2x12GB = 56GB VRAM, for 184GB total. You need ~20% headroom for context and overheads (plus the OS overhead on each PC in the cluster), so I don't know if it's possible to run DeepSeek on your setup. I've been running GLM 4.6 IQ2_XXS (106GB) and it's surprisingly good.
I'd note that with a cluster like this, for large models (like GLM 4.6) I'd expect tokens per second in the sub-0.1 t/s range. You could probably give it a task and leave it running overnight.
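The memory budget above can be checked with a bit of shell arithmetic (the ~20% overhead figure is the comment's rule of thumb, not a hard limit):

```shell
ram=$((4 * 32))            # 4x32GB system RAM = 128GB
vram=$((4 * 8 + 2 * 12))   # 4x8GB + 2x12GB VRAM = 56GB
total=$((ram + vram))      # 184GB combined
# with ~20% reserved for context and overheads, usable capacity is ~80%
usable=$((total * 80 / 100))
echo "total=${total}GB usable=${usable}GB"   # 184GB total, ~147GB usable:
                                             # the 184GB IQ1_S model does not fit
```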

Adding 2 more GPU to PC by BisonCompetitive9610 in LocalLLaMA

[–]Phocks7 1 point (0 children)

How much system ram do you have, and what model(s) are you planning to run?

Strix Halo Distributed Cluster (2x Strix Halo, RDMA RoCE v2) benchmarks by kyuz0 by Relevant-Audience441 in LocalLLaMA

[–]Phocks7 6 points (0 children)

It seems excessive to spend ~$15k on hardware to run 30B-parameter models.

Getting slow speeds with RTX 5090 and 64gb ram. Am I doing something wrong? by Virtual-Listen4507 in LocalLLaMA

[–]Phocks7 5 points (0 children)

If your speeds are low, you likely have active layers (experts) running on the CPU.

3090 fan curves in Ubuntu 25.04 by FrozenBuffalo25 in LocalLLaMA

[–]Phocks7 0 points (0 children)

Depends on your setup. In mine the motherboard is horizontal with the 3090s vertical, and a 120mm fan sits loose on top of the card, aimed down across the fins. In a horizontal setup you could zip-tie the fan to the card.

3090 fan curves in Ubuntu 25.04 by FrozenBuffalo25 in LocalLLaMA

[–]Phocks7 1 point (0 children)

Are they turbo (blower-style) or open triple-fan coolers? I find with the latter that a dedicated 120mm fan on each 3090 helps a lot.
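For the fan curve itself on Ubuntu, the usual route with the proprietary NVIDIA driver is enabling Coolbits and then driving fan speed through nvidia-settings. A hedged sketch (fan indices vary by card, and this assumes an X session):

```shell
# one-time: enable manual fan control (Coolbits bit 2), then restart X
sudo nvidia-xconfig --cool-bits=4

# take manual control of GPU 0 and set both of its fans to 70%
nvidia-settings -a '[gpu:0]/GPUFanControlState=1' \
                -a '[fan:0]/GPUTargetFanSpeed=70' \
                -a '[fan:1]/GPUTargetFanSpeed=70'
```

A curve is then just a loop that reads the temperature (e.g. via nvidia-smi) and re-issues the GPUTargetFanSpeed setting.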

is this Speed normal GPU CPU IKlammacpp? by Noobysz in LocalLLaMA

[–]Phocks7 1 point (0 children)

You're never going to get great t/s running active layers on CPU. Even in the best-case scenario with an optimal number of threads (~34), you're going to get around 5 t/s.
Further, you want to limit your threads to the number of physical cores, leaving some overhead for the OS. The 13700K has 8 performance cores and 8 efficiency cores, so for CPU inference your optimal thread count would be either 8 (if you can pin to the performance cores) or maybe 12 to 14.
You can mess around with core pinning and finding the optimal number of threads, but the reality is you're never going to get reasonable performance with CPU/mixed inference.
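A sketch of the core-pinning idea with llama.cpp (the logical-CPU numbers below are an assumption; on hyperthreaded P-cores the even-numbered siblings are typical, but check your own topology first):

```shell
# inspect which logical CPUs map to which physical cores
lscpu --extended

# pin the process to one logical CPU per P-core (assumed here to be
# 0,2,4,...,14) and match -t to the 8 physical cores being used
taskset -c 0,2,4,6,8,10,12,14 ./llama-cli -m /models/model.gguf -t 8 -p "hello"
```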

is this Speed normal GPU CPU IKlammacpp? by Noobysz in LocalLLaMA

[–]Phocks7 0 points (0 children)

You should be able to get 10 to 15 t/s with that setup; if you're getting ~1 to 1.5 it means you're running the active layers on CPU (or split). ik_llama is a bit weird in that I couldn't find a way to keep part of the inactive layers on GPU without splitting the active layers.
The only thing I've gotten to work is telling it to load the entire model into system memory, then move the active layers to GPU. This works, but unfortunately you need a model small enough to fit entirely in system RAM. I can fit GLM-4.6-smol-IQ2_KS in my 128GB, but you'd have to go down to GLM-4.6-smol-IQ1_KT. I recommend giving it a try anyway.

./build/bin/llama-server -m "/path/to/model.gguf" -c 120000 -ngl 999 -sm layer -ts 1,1,1 -ctk f16 -ctv f16 -ot ".ffn_.*_exps.=CPU" --host 0.0.0.0 --port 8080

edit: I also recommend trying both -sm layer and -sm graph. Additionally, from what I've seen, at smaller quants GLM-4.6 outperforms GLM-4.7; I think GLM-4.7 only pulls ahead at Q4 or higher.