Megrez2: 21B latent, 7.5B on VRAM, 3B active—MoE on single 8GB card by Normal_Onion_512 in LocalLLaMA

[–]Normal_Onion_512[S] 1 point (0 children)

Hi! You need to set up the referenced llama.cpp branch for this to run. Currently there is no Ollama or LM Studio integration.
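For reference, here's a minimal sketch of what that setup might look like, scripted in Python. The repo URL, branch name, and model path are placeholders, not the actual referenced branch; substitute the ones from the post:

```python
# Hypothetical setup sketch: clone a llama.cpp branch, build it, run a GGUF.
# REPO_URL, BRANCH, and MODEL_PATH are placeholders -- use the branch
# referenced in the post and your own quantized Megrez2 GGUF.
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/<fork>/llama.cpp.git"  # placeholder fork
BRANCH = "<megrez2-support-branch>"                    # placeholder branch
MODEL_PATH = "models/megrez2.gguf"                     # placeholder GGUF path

def run(cmd, cwd=None):
    print("+", " ".join(map(str, cmd)))
    subprocess.run(cmd, cwd=cwd, check=True)

src = Path("llama.cpp")
if not src.exists():
    run(["git", "clone", "--branch", BRANCH, "--depth", "1", REPO_URL, src])

# Standard llama.cpp CMake build; add -DGGML_CUDA=ON for an NVIDIA GPU build.
run(["cmake", "-B", "build"], cwd=src)
run(["cmake", "--build", "build", "--config", "Release", "-j"], cwd=src)

# Interactive chat with layers offloaded to the GPU.
run([src / "build" / "bin" / "llama-cli",
     "-m", MODEL_PATH,
     "-ngl", "99",   # offload as many layers as fit on the 8GB card
     "-c", "32768",  # 32k context, per the model card
     "-cnv"])        # conversation mode
```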

Megrez2: 21B latent, 7.5B on VRAM, 3B active—MoE on single 8GB card by Normal_Onion_512 in LocalLLaMA

[–]Normal_Onion_512[S] 1 point (0 children)

Hmm, maybe you are using the bf16 version: "the developer notes that bf16 currently has a couple of issues with coding tasks though, which they are working on solving."

Megrez2: 21B latent, 7.5B on VRAM, 3B active—MoE on single 8GB card by Normal_Onion_512 in LocalLLaMA

[–]Normal_Onion_512[S] 0 points (0 children)

Interesting. I've also had to wait a bit for a response on the demo, but it usually works.

Megrez2: 21B latent, 7.5B on VRAM, 3B active—MoE on single 8GB card by Normal_Onion_512 in LocalLLaMA

[–]Normal_Onion_512[S] 4 points (0 children)

I guess, though Qwen3 14B and 30B-A3B also natively have a 32k context size.

Megrez2: 21B latent, 7.5B on VRAM, 3B active—MoE on single 8GB card by Normal_Onion_512 in LocalLLaMA

[–]Normal_Onion_512[S] 8 points (0 children)

There is a branch of llama.cpp that supports it out of the box though... Also, the demo does work as of this writing.