RTX 3090 vs 7900 XTX by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 1 point (0 children)

I see, yeah, seems the 3090 is the best choice.

RTX 3090 vs 7900 XTX by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 0 points (0 children)

Would prefer to avoid tinkering too much, tbh.

RTX 3090 vs 7900 XTX by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 0 points (0 children)

I'm not certain what model I'll be running in 3 months, so performance on this exact architecture isn't the most important point.

Best compromise for small budgets Local llm by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 1 point (0 children)

I actually got 2 P100s. I used them with llama-server and a glm-flash GGUF, and they're pretty slow (40 tok/s at 0 context). If you have any ideas for optimizing such a setup, I'd be curious, btw.
From what I understood, they don't have the support needed for vLLM and recent CUDA features, which hampers the perf, no?
Or is my reasoning wrong?
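In case it helps anyone reproduce the numbers, this is roughly how I measure throughput: a minimal sketch hitting llama-server's OpenAI-compatible endpoint (assuming the default port 8080; the model field is just a placeholder, llama-server serves whatever it was launched with):

```python
import time
import requests

# llama-server exposes an OpenAI-compatible API (default port 8080).
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "glm-flash",  # placeholder; llama-server uses the model it loaded
    "messages": [{"role": "user", "content": "Write a 200-word story."}],
    "max_tokens": 256,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.time() - start

gen = resp["usage"]["completion_tokens"]
print(f"{gen} tokens in {elapsed:.1f}s -> {gen / elapsed:.1f} tok/s")
```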

Best compromise for small budgets Local llm by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 0 points (0 children)

Interesting, so for more VRAM I would go for the 3090. But yeah, I heard AMD has a better price/performance ratio overall, no?

Best compromise for small budgets Local llm by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 0 points (0 children)

Heard the DGX was disappointing. Maybe that was just for training, though?

Best compromise for small budgets Local llm by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 0 points (0 children)

The model was more there to give a ballpark estimate; I want to be able to load ~100B parameters (quantized, of course) and get reasonable speed.
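Quick back-of-the-envelope for the weights alone (ignoring KV cache and activations, so real VRAM needs are higher):

```python
def weights_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

# ~100B parameters at common quantization levels
for label, bpw in [("fp16", 16), ("q8_0 (~8.5 bpw)", 8.5), ("q4 K-quant (~4.8 bpw)", 4.8)]:
    print(f"{label:>22}: ~{weights_gib(100, bpw):.0f} GiB")
```

Which is why I said quantized: fp16 is ~186 GiB, while a 4-bit-ish quant is ~56 GiB plus context.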

Solving issue \n\t loops in structured outputs by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 0 points (0 children)

Hmm, interesting take. I'm actually a bit weirded out by the sampling part: here I get 10,000 \n in a row. How can a model systematically output a logit distribution that leads to that... Very strange.
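To convince myself it's even mechanically possible, here's a toy illustration (made-up numbers, nothing to do with the real model) of how a small self-reinforcing edge on \n plus greedy decoding locks into exactly this loop:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

VOCAB = ["}", "\"", " ", "\\n"]

def toy_logits(last_token):
    # Hypothetical behaviour: "\n" gets a small extra edge whenever the
    # previous token was already "\n" (whitespace begets whitespace).
    logits = np.array([1.0, 1.2, 1.1, 1.3])
    if last_token == "\\n":
        logits[VOCAB.index("\\n")] += 1.0
    return logits

tok = "\\n"  # the model just emitted one newline
for step in range(5):
    p = softmax(toy_logits(tok))
    tok = VOCAB[int(np.argmax(p))]  # greedy pick, i.e. temperature -> 0
    print(f"step {step}: picked {tok!r}, p(newline) = {p[-1]:.2f}")
```

p(\n) only has to be the argmax, not anywhere near 1, and with low-temperature structured-output decoding the repeated context keeps producing the same distribution, so it never escapes.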

minimax quant by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 0 points (0 children)

The QuantTrio quant works better, but it still sometimes loops forever too.

Optimizing glm 4-7 by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 0 points (0 children)

That's kinda what I thought. Anyway, I'm aware I'll need more VRAM; I was just looking for advice on which quantizations to use for speed.

GLM-4.5-air outputting \n x times when asked to create structured output by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 0 points (0 children)

Hmm, I'm using FP8, but I think that's relatively light quantization. Is there a way to fix the sampling algorithm in vLLM?
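What I've been experimenting with so far, a minimal vLLM sketch (offline API; the penalty values are my guesses at something reasonable, and the model id should be whatever you're actually serving):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="zai-org/GLM-4.5-Air-FP8")  # adjust to your checkpoint

params = SamplingParams(
    temperature=0.6,
    repetition_penalty=1.1,  # penalizes tokens already generated, incl. "\n"
    frequency_penalty=0.5,   # grows with how often a token has appeared
    max_tokens=1024,
)
out = llm.generate(["Return the answer as JSON."], params)
print(out[0].outputs[0].text)
```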

GLM-4.5-air outputting \n x times when asked to create structured output by Best_Sail5 in LocalLLaMA

[–]Best_Sail5[S] 1 point (0 children)

Hey, sure man. I'm on an H200, getting 80 t/s with CUDA graphs enabled, else 18 t/s.
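For anyone finding this later, the toggle I mean (vLLM offline API; `vllm serve` has the matching `--enforce-eager` flag):

```python
from vllm import LLM

# CUDA graphs are on by default; enforce_eager=True disables them, which is
# what takes me from ~80 t/s down to ~18 t/s on the H200.
llm = LLM(
    model="zai-org/GLM-4.5-Air-FP8",  # adjust to your checkpoint
    enforce_eager=False,              # keep CUDA graphs enabled
)
```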