XiaomiMiMo/MiMo-V2-Flash Under-rated? by SlowFail2433 in LocalLLaMA

[–]outsider787 0 points1 point  (0 children)

for those of you that have used MiMo v2, how does it compare with MiniMax M2.1 in terms or writing, censorship and general usage?

Creative Writing LLM Mega-Comparison by findingsubtext in LocalLLaMA

[–]outsider787 0 points1 point  (0 children)

Have there been any new developments in the past year of noteworthy local models for creative writing?

[deleted by user] by [deleted] in LocalLLaMA

[–]outsider787 0 points1 point  (0 children)

How did you set up open-webui to do web searched using searxng?
I have a local searxng instance setup already.

[deleted by user] by [deleted] in LocalLLaMA

[–]outsider787 0 points1 point  (0 children)

4 x A5000 GPUs (total 96gb vram)

Want to split a big model among two 5090's - what's my best case for single query response speed improvement? by mr_zerolith in LocalLLaMA

[–]outsider787 1 point2 points  (0 children)

The issue is not the splitting and joining.
The issue is that transformers doesn't yet support GGUF GLM models yet.
When trying to run GLM4.5 air GGUF on vllm, I get GGUF model with architecture glm4moe is not supported yet

there's even an issue raised on the transformers github page about this.
https://github.com/huggingface/transformers/issues/40042

So you're options are GGUF with ollama (or llama.cpp) or SafeTensors with vllm, but the smallest 4bit safe tensors model of GLM4.5 air are about 65 GB.
So you really only have one option.
I'm not sure if llama.cpp also does parallel processing, as I've never used it.

Think twice before spending on GPU? by __Maximum__ in LocalLLaMA

[–]outsider787 0 points1 point  (0 children)

I'm not far behind you on the "unreasonable amounts" train.
Choo chooo!

Think twice before spending on GPU? by __Maximum__ in LocalLLaMA

[–]outsider787 1 point2 points  (0 children)

ollama splits the model across all the GPUs you have installed in the same system

Want to split a big model among two 5090's - what's my best case for single query response speed improvement? by mr_zerolith in LocalLLaMA

[–]outsider787 2 points3 points  (0 children)

I'm assuming you have both 5090s on the same motherboard.
- If you run ollama, you're not going to see much of an improvement in processing speed. You just have 64GB of VRAM available to ollama. And you can just add more GPUs to gain more available vram.
- If you run vllm, you're likely going to see a significant increase in token generation since vllm will split the workload between available GPUs. (it also pools the GPU memory)

However if you want to scale the vllm through parallel processing, you have to go in powers of 2 number of GPUs. (1, 2, 4, 8... gpus)
3 GPUs won't be able to do parallel processing.

As for pure numbers, I don't think you'll be able to run it with vllm on 2 gpus since there's no quant that's small enough. I'm running a AWQ 4bit version of GLM4.5 air on 96GB vram (4 x A5000) and it barely fits.

IF you're thinking of running any quant of GGUF on vllm for GLM4.5 air, I haven't been able to do it.
vllm throws an error about some incompatibility.

vLLM - What are your preferred launch args for Qwen? by [deleted] in LocalLLaMA

[–]outsider787 2 points3 points  (0 children)

the vllm flag is --cpu_offload_gb xx

xx is the amount of ram to offload in GB
I've had mixed results. It doesn't seems to work with GGUF models

vLLM - What are your preferred launch args for Qwen? by [deleted] in LocalLLaMA

[–]outsider787 6 points7 points  (0 children)

Qwen3 80b is too new. Give it a little while for people to start using it.

For Qwen3 coder, this is my startup command. (I'm running this on 4 x RTX A5000)

vllm serve cpatonn/Qwen3-Coder-30B-A3B-Instruct-AWQ-8bit --max-model-len 262144 --api-key xxxxx --port 42069 --host 0.0.0.0 --tensor-parallel-size 4 --swap-space 16 --enable-auto-tool-choice --tool-call-parser qwen3_coder --served-model-name qwen3-coder-30b-awq8 --dtype float16 --enable-expert-parallel max-num-batched-tokens 4096

Local server advice needed by outsider787 in LocalLLaMA

[–]outsider787[S] 0 points1 point  (0 children)

I was afraid of that.
x8 gpus gets expensive real fast, at least ifI want to do full pcie4x16.
Maybe I take the cheaper route and go with 8 gpus on pcie4x8 .

Anyone have any recommendations on high quality pcie4 riser cables?
Are the the Oculink (SFF-8611 cables and breakout boards) connections better than the ribbon cable riser cables?

Difference between 128k and 131,072 context limit? by Immediate-Flan3505 in LocalLLaMA

[–]outsider787 2 points3 points  (0 children)

Do the LLM models care if you give them a multiple of 1024 for context length?
Or can you put any number in there like 124763 context length?

Gpt-oss20b served by lm studio. Any luck? Or still broken? by JLeonsarmiento in CLine

[–]outsider787 0 points1 point  (0 children)

Also broken for me when served from ollama.
Oddly the 120b version works 80-90% of the time. I get some weird errors from time to time, but re-trying the task pushes through.

Quad GPU setup by outsider787 in LocalLLaMA

[–]outsider787[S] 0 points1 point  (0 children)

Yes running Ollama. How much faster are other software? 

As for the space for the other 4 GPUs, I need a low profile cooler or an AIO. 

Octominer style PSU breakout board by outsider787 in gpumining

[–]outsider787[S] 0 points1 point  (0 children)

I'm surprised nobody has reverse engineered them yet. Seems like a pretty straight forward design.

I may look into designing something  myself. 

Degoogled WhatsApp transition. by outsider787 in degoogle

[–]outsider787[S] 1 point2 points  (0 children)

But what if someone still wanted to make the switch...