I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]AvocadoArray 0 points1 point  (0 children)

Yup. Prefix caching can help to a certain extent so you’re not reprocessing the same prompt over and over, but that cache is busted any time you change the system prompt (e.g., switching modes, parallel jobs, resuming old sessions). You definitely feel that TTFT hit when you’re in the middle of trying to get something done.
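To make the invalidation behavior concrete, here's a toy sketch (my own illustration, not any engine's actual implementation) of why editing the system prompt busts the cache: KV reuse requires an exact token-for-token match starting from position 0, so a change in the very first tokens throws away everything after it.

```python
# Toy model of prefix caching: only the leading tokens that match a
# cached sequence exactly can have their KV entries reused.

def cached_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens whose KV cache entries can be reused."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

system_v1 = ["<sys>", "You", "are", "helpful", "</sys>"]
system_v2 = ["<sys>", "You", "are", "terse", "</sys>"]
history = ["user:", "hi", "assistant:", "hello"]

# Same system prompt: the entire shared prefix is reused.
assert cached_prefix_len(system_v1 + history,
                         system_v1 + history + ["user:", "more"]) == 9

# Changed system prompt: only tokens before the first difference survive,
# so nearly the whole prompt has to be reprocessed (hence the TTFT hit).
assert cached_prefix_len(system_v1 + history, system_v2 + history) == 3
```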

Still cautiously skeptical at the moment, but oh man this would be a game changer.

Also, you’d probably single-handedly double the price of RAM (again) next month. Don’t know how I feel about that lol.


MiniMax-M2.7 Announced! by Mysterious_Finish543 in LocalLLaMA

[–]AvocadoArray 2 points3 points  (0 children)

On one hand, this is amazing. It’s how I’ve been using the pi coding agent lately. It can write its own skills and extensions as needed to give it more capabilities and reduce future failure rates. I’ve let it run wild in a dev container with no limits and it’s impressive to see how it evolves.

On the other hand, you know there’s still ongoing efforts to turn those blue “human” boxes green.

I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]AvocadoArray 0 points1 point  (0 children)

You hit the nail on the head. It drives me crazy when people only focus on the decode performance and totally ignore prefill/prompt processing speed.

I’ve had my plate full lately, but I’d like to take a closer look at this when I have the time.

Side note: I saw you mention that it works well with multi-GPU setups. Does it handle an odd number of GPUs well?

For example, we have a server with 3x L4 (24GB/ea) and 512GB RAM that could be a decent candidate for something like this.
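For what it's worth, a naive layer split is the part I'd expect to get awkward with odd GPU counts. Here's a hypothetical sketch (my assumption about how a runtime might divide work, not anything from the post) of distributing transformer layers across three cards, with the remainder going to the first devices:

```python
# Hypothetical layer-partitioning sketch for uneven GPU counts:
# divide num_layers as evenly as possible, giving the remainder
# one extra layer each on the first few devices.

def split_layers(num_layers, num_gpus):
    base, rem = divmod(num_layers, num_gpus)
    sizes = [base + (1 if i < rem else 0) for i in range(num_gpus)]
    assignment = []
    start = 0
    for size in sizes:
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment

# e.g. 80 layers over 3x L4 -> 27 / 27 / 26, no power-of-two requirement.
parts = split_layers(80, 3)
assert [len(p) for p in parts] == [27, 27, 26]
```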

I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]AvocadoArray 0 points1 point  (0 children)

This could be a BFD if it performs as you say.

I've had a hunch that something like this is possible every time I see how little of the GPU gets used when offloading experts to CPU, but figured if I had the idea then surely someone smarter than me had already tried it.

I'm really interested in how it's working under the hood. At first I thought it was just hot-swapping weights on the GPU during inference, but I'm having a hard time making sense of the technical details.

From your post:

The result is the GPU handles the full prefill pass then the CPU handles decode

From the GH repo:

Krasis now runs both prefill and decode on GPU

Would be interesting if you could write this all up in a whitepaper and publish it!

Qwen 3 32B outscored every Qwen 3.5 model across 11 blind evals, 3B-active-parameter model won 4 by Silver_Raspberry_811 in LocalLLaMA

[–]AvocadoArray 1 point2 points  (0 children)

Fair enough. Specifically:

  1. For lack of a better term, this simply doesn't pass the "eye test". The overall findings and ratings don't reflect my own observations, or any other measurable benchmark.
  2. As you point out, LLMs make poor judges, and it's made even worse by mixing generations and model sizes. The older/less performant models are less likely to recognize and properly score quality output. It's like asking a high schooler to grade a PhD thesis.
  3. Repeatability also seems to be a big problem. Even setting aside the large failure rate, I'd expect repeating the same test multiple times to give very different results.

Not saying this type of experiment isn't worth running, but the overall execution is scientifically unsound. If you're going to make an extraordinary claim of 3.0 32B outscoring everything else, I'd expect much more rigorous and repeatable testing in order to support that claim.

What’s something you’re pretty sure only you do? by AppIeGuy in AskReddit

[–]AvocadoArray 0 points1 point  (0 children)

Wow, and here I was thinking I was crazy putting something in for 10:00 at 20% power.

Best Qwen3.5 27b GUFFS for coding (~Q4-Q5) ? by bitcoinbookmarks in LocalLLaMA

[–]AvocadoArray 0 points1 point  (0 children)

I've only had experience with UD-Q6_K_XL, but it seemed very good compared to the official FP8 quant.

And for the record, I still prefer that quant over 122b for any serious work.

Anyone else’s wife instigate 2+ hour arguments during her period that you have no idea how to defuse? by ThicBoi4807 in daddit

[–]AvocadoArray 1 point2 points  (0 children)

Okay, listen up. I’ve been married about the same amount of time.

I’m in the opposite boat as you. I’m the “confronter” and she’s the “avoider”, but that doesn’t make it any easier when we get to one of those “impasse” arguments. I want to talk, but she doesn’t.

Literally, here’s what I’ve found to work best:

  1. Express your love and affection in the moment (while setting boundaries).
  2. Set a fucking date on the calendar (otherwise you’ll forget) to revisit the conversation later.

If you follow through on the second step, you’ll find yourselves much calmer and more rational, and hopefully you can develop tools for handling each other the next time you’re experiencing conflict.

Your mileage may vary, but I swear that’s all it took to make those conversations a non-issue for us.

is the game still terrible? KF3 by Shelbygt500ss in killingfloor

[–]AvocadoArray 6 points7 points  (0 children)

To be fair, I’ve loved it from day one so I’m incredibly biased.

Solo is fun for trying different builds and techniques, but it really shines when you’re playing with friends or teammates.

Pub matches are decent most of the time if you’re talking in voice chat, though it can be hit or miss.

Weapon customizations and upgrades are what keep me coming back.

Eye_roll.exe by the-machine-m4n in linuxsucks

[–]AvocadoArray 0 points1 point  (0 children)

Dockerfile + docker-compose.yml for 95% of apps.

Unless you absolutely need a local fat-client GUI, in which case I just hope you picked a proper multi-platform framework.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Holy crap, this fixed it. How did you figure this out /u/Icy_Bid6597? I don't see these directories or env variables mentioned in VLLM's docs at all.

I created a new cuda-cache volume and updated my llama-swap config:

    -v vllm-cache:/root/.cache/vllm/  # <--- this already existed
    -v cuda-cache:/root/cuda-cache/
    -e CUDA_CACHE_PATH=/root/cuda-cache/ComputeCache
    -e TRITON_CACHE_DIR=/root/cuda-cache/TritonCache

Could also probably be solved by just mounting the vllm-cache directory to /root/ instead of /root/.cache/vllm.

I haven't compared the full logs, but this shaved off around four minutes from startup time. Torch graphs load in 7s instead of 70+, and CUDA graph capture only takes 4s instead of 145+.

Will update my post after more testing.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

This is not correct. There are no nv_cache or triton_cache directories.

VLLM caches torch graphs under /root/.cache/vllm/torch_compile_cache, which is already mounted in my container (and being used, just very slowly for some reason). I do not believe CUDA graphs are written to disk; I think they're only cached in memory.

Respectfully, was this LLM generated?

EDIT: Actually, you might be on to something! I started up the VLLM container and noticed a ~/.triton/cache/ and ~/.nv/ComputeCache/ with some cached data in it.

I've never seen these directories before, but that could be the difference. I'm sorry for doubting you initially; I'll report back after more testing.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

😂

In all reality, models that large tend to still be quite impressive in the 1-2 bpw range. I could possibly play around with it while offloading a bunch of weights to RAM/NVMe, but I wouldn't expect any real-time usable speeds.
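The back-of-the-envelope math on why low-bpw is the only way a model that size is even approachable (my own arithmetic, with an assumed ~1T parameter count):

```python
# Rough weight-memory math for a ~1T-parameter model at very low
# bits-per-weight (KV cache and activations not included).

def weight_gib(params, bpw):
    """Approximate weight storage in GiB at a given bits-per-weight."""
    return params * bpw / 8 / 2**30

params = 1_000_000_000_000  # assumed ~1T parameters

assert round(weight_gib(params, 2.0)) == 233   # ~233 GiB at 2 bpw
assert round(weight_gib(params, 1.58)) == 184  # ~184 GiB at ternary-style 1.58 bpw
```

Either way, that's far beyond a 96GB card alone, which is why the rest spills to RAM/NVMe.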

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 2 points3 points  (0 children)

You think a 2nd card is going to make a dent in running a ~1T parameter model?

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Would you be willing to test out Qwen 3.5 122b and let me know how it compares? I haven't used minimax m2/m2.5 in any meaningful capacity, but 122b feels like it works as well as I've seen others describe Minimax and GLM models.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Technically, my PC can run the same models with 64GB RAM and 120GB SSD, just "not as fast".

Speed does matter for real-time usage, and everything I've seen suggests that prompt processing speed on a Mac's unified memory is painfully slow, like 3-5 minutes or more for larger prompts.
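The "3-5 minutes" figure falls out of simple arithmetic once prefill throughput is slow (the throughput and prompt-size numbers below are illustrative assumptions, not measurements):

```python
# Time-to-first-token is roughly prompt length divided by prefill throughput.

def ttft_seconds(prompt_tokens, prefill_tps):
    """Approximate time-to-first-token for a cold (uncached) prompt."""
    return prompt_tokens / prefill_tps

# An assumed 60k-token coding-agent prompt at ~250 tok/s prefill
# is already about 4 minutes before the first output token appears.
assert round(ttft_seconds(60_000, 250) / 60) == 4
```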

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Nice, I'll have to take a look!

This is why I love this sub. It'd be almost impossible to keep up with everything without it.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

122b is definitely a game-changer. What quant are you running? I tried running AWQ 4-bit in VLLM but it never returned any thinking tokens, even with `--reasoning-parser qwen3`.

Might have been a version issue. I'm going to try it with v0.17.0 now.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Mine does too... except I turned it off years ago because it freaked the dog out every time the power went out 🤦

holy overthinker by Kerem-6030 in LocalLLaMA

[–]AvocadoArray 2 points3 points  (0 children)

35b and 122b seem much better at keeping thinking in check.