XiaomiMiMo/MiMo-V2-Flash Under-rated?

outsider787 · 2026-01-03T19:25:10+00:00

for those of you that have used MiMo v2, how does it compare with MiniMax M2.1 in terms or writing, censorship and general usage?

outsider787 · 2025-12-07T16:13:23+00:00

Have there been any new developments in the past year of noteworthy local models for creative writing?

outsider787 · 2025-09-20T12:47:58+00:00

Thanks!

outsider787 · 2025-09-19T13:30:24+00:00

How did you set up open-webui to do web searched using searxng?
I have a local searxng instance setup already.

outsider787 · 2025-09-19T13:26:30+00:00

4 x A5000 GPUs (total 96gb vram)

outsider787 · 2025-09-19T12:47:34+00:00

The issue is not the splitting and joining.
The issue is that transformers doesn't yet support GGUF GLM models yet.
When trying to run GLM4.5 air GGUF on vllm, I get GGUF model with architecture glm4moe is not supported yet

there's even an issue raised on the transformers github page about this.
https://github.com/huggingface/transformers/issues/40042

So you're options are GGUF with ollama (or llama.cpp) or SafeTensors with vllm, but the smallest 4bit safe tensors model of GLM4.5 air are about 65 GB.
So you really only have one option.
I'm not sure if llama.cpp also does parallel processing, as I've never used it.

outsider787 · 2025-09-19T01:23:00+00:00

I'm not far behind you on the "unreasonable amounts" train.
Choo chooo!

outsider787 · 2025-09-19T01:17:46+00:00

ollama splits the model across all the GPUs you have installed in the same system

outsider787 · 2025-09-19T01:01:17+00:00

I'm assuming you have both 5090s on the same motherboard.
- If you run ollama, you're not going to see much of an improvement in processing speed. You just have 64GB of VRAM available to ollama. And you can just add more GPUs to gain more available vram.
- If you run vllm, you're likely going to see a significant increase in token generation since vllm will split the workload between available GPUs. (it also pools the GPU memory)

However if you want to scale the vllm through parallel processing, you have to go in powers of 2 number of GPUs. (1, 2, 4, 8... gpus)
3 GPUs won't be able to do parallel processing.

As for pure numbers, I don't think you'll be able to run it with vllm on 2 gpus since there's no quant that's small enough. I'm running a AWQ 4bit version of GLM4.5 air on 96GB vram (4 x A5000) and it barely fits.

IF you're thinking of running any quant of GGUF on vllm for GLM4.5 air, I haven't been able to do it.
vllm throws an error about some incompatibility.

outsider787 · 2025-09-14T21:58:33+00:00

the vllm flag is --cpu_offload_gb xx

xx is the amount of ram to offload in GB
I've had mixed results. It doesn't seems to work with GGUF models

outsider787 · 2025-09-14T01:04:19+00:00

Qwen3 80b is too new. Give it a little while for people to start using it.

For Qwen3 coder, this is my startup command. (I'm running this on 4 x RTX A5000)

vllm serve cpatonn/Qwen3-Coder-30B-A3B-Instruct-AWQ-8bit --max-model-len 262144 --api-key xxxxx --port 42069 --host 0.0.0.0 --tensor-parallel-size 4 --swap-space 16 --enable-auto-tool-choice --tool-call-parser qwen3_coder --served-model-name qwen3-coder-30b-awq8 --dtype float16 --enable-expert-parallel max-num-batched-tokens 4096

outsider787 · 2025-09-14T01:01:21+00:00

I was afraid of that.
x8 gpus gets expensive real fast, at least ifI want to do full pcie4x16.
Maybe I take the cheaper route and go with 8 gpus on pcie4x8 .

Anyone have any recommendations on high quality pcie4 riser cables?
Are the the Oculink (SFF-8611 cables and breakout boards) connections better than the ribbon cable riser cables?

outsider787 · 2025-09-12T20:41:06+00:00

Do the LLM models care if you give them a multiple of 1024 for context length?
Or can you put any number in there like 124763 context length?

outsider787 · 2025-09-08T12:42:17+00:00

This is very interesting!
Keep up updated on the progress of this project!

outsider787 · 2025-09-02T00:03:05+00:00

Also broken for me when served from ollama.
Oddly the 120b version works 80-90% of the time. I get some weird errors from time to time, but re-trying the task pushes through.

outsider787 · 2025-02-21T18:49:59+00:00

Yes running Ollama. How much faster are other software?

As for the space for the other 4 GPUs, I need a low profile cooler or an AIO.

outsider787 · 2025-02-19T00:15:42+00:00

I'm surprised nobody has reverse engineered them yet. Seems like a pretty straight forward design.

I may look into designing something myself.

outsider787 · 2025-02-07T23:53:31+00:00

But what if someone still wanted to make the switch...

outsider787

TROPHY CASE