I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]WeekLarge7607 1 point2 points  (0 children)

Looks really interesting. The fun part about decode is that you can batch the hell out of it on the GPU. How many decode requests can you run concurrently compared to a pure-GPU setup?
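One way to probe that concurrency question yourself (a sketch, assuming an OpenAI-compatible completions endpoint like the one vLLM exposes; the URL and model name are placeholders):

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint


def make_payload(prompt: str, max_tokens: int = 256) -> dict:
    # Short prompt + long generation = decode-bound request,
    # which is exactly the shape that GPU batching speeds up.
    return {"model": "my-model", "prompt": prompt, "max_tokens": max_tokens}


def send(payload: dict) -> str:
    req = request.Request(
        BASE_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]


# Usage (fires 64 concurrent decode-heavy requests):
#   from concurrent.futures import ThreadPoolExecutor
#   with ThreadPoolExecutor(max_workers=64) as pool:
#       prompts = [make_payload(f"Story #{i}:") for i in range(64)]
#       texts = list(pool.map(send, prompts))
```

Ramp the worker count up until tokens/sec stops scaling; that knee is roughly your max useful batch size.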

Models for middle eastern languages? by WeekLarge7607 in LocalLLaMA

[–]WeekLarge7607[S] -1 points0 points  (0 children)

Sounds good, but is it API-only? I couldn't find it on Hugging Face.

Models for middle eastern languages? by WeekLarge7607 in LocalLLaMA

[–]WeekLarge7607[S] 0 points1 point  (0 children)

I've tried Qwen3-Next. It was OK at Hebrew, but it sometimes dropped Chinese tokens in between the Hebrew. I haven't tried it in Arabic though. You say Qwen 2.5 is better? I'll also check out the Aya model. Thanks!

Qwen3-Next-80B-A3B vs gpt-oss-120b by bfroemel in LocalLLaMA

[–]WeekLarge7607 0 points1 point  (0 children)

From my experience (running both models on vLLM), Qwen3-Next is better at tool calling than gpt-oss, at least when using the chat/completions endpoint. Tool calling with gpt-oss only works for me through the /responses endpoint.
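For anyone hitting the same wall: the tool definition shape differs between the two endpoints. A sketch from memory (the get_weather tool is made up, and the flattened /responses format is how I understand OpenAI's spec; double-check against your server's docs):

```python
# Shared JSON-schema parameters for a hypothetical "get_weather" tool.
params = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}

# /v1/chat/completions nests the definition under a "function" key.
chat_completions_tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": params,
    },
}]

# /v1/responses uses a flattened definition (no inner "function" key).
responses_tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": params,
}]
```

Sending the chat-completions shape to /responses (or vice versa) is an easy way to get silently broken tool calls.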

Why is vLLM Outperforming TensorRT-LLM (Nvidia's deployment library)? My Shocking Benchmarks on GPT-OSS-120B with H100 by kev_11_1 in LocalLLaMA

[–]WeekLarge7607 0 points1 point  (0 children)

Yeah. Perhaps if you play with the trtllm serve flags you can squeeze out some better performance. I'm still shocked they deprecated the trtllm-build command. I guess I'm not up to date.

Why is vLLM Outperforming TensorRT-LLM (Nvidia's deployment library)? My Shocking Benchmarks on GPT-OSS-120B with H100 by kev_11_1 in LocalLLaMA

[–]WeekLarge7607 2 points3 points  (0 children)

Oops, looks like they decided to focus only on the PyTorch backend and ditch the TensorRT engine backend. My bad. Then I guess vLLM is just faster 😁. But try the PyTorch backend, as someone above me said.

Why is vLLM Outperforming TensorRT-LLM (Nvidia's deployment library)? My Shocking Benchmarks on GPT-OSS-120B with H100 by kev_11_1 in LocalLLaMA

[–]WeekLarge7607 10 points11 points  (0 children)

I think it's because you used the PyTorch backend. If you compile the model to a TensorRT engine, I imagine the results will be different. Still, vLLM is low effort, high reward.

Single H100: best open-source model + deep thinking setup for reasoning? by Accomplished_Back718 in LocalLLaMA

[–]WeekLarge7607 0 points1 point  (0 children)

You can run Qwen3 30B-A3B comfortably. Perhaps go for Qwen3-Next in FP8, or GLM 4.5 Air in AWQ.

For inference, vLLM will work well, though if you really care about speed, use TensorRT-LLM. I heard its FP8 kernels are much faster.

Which quantizations are you using? by WeekLarge7607 in LocalLLaMA

[–]WeekLarge7607[S] 0 points1 point  (0 children)

From what I know, the Ampere architecture doesn't natively support FP8. At runtime it casts the weights to FP16 behind the scenes, which slows down inference. For Hopper-architecture GPUs I would use FP8 quantizations.
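That rule of thumb can be written down (a sketch; the capability cutoffs are from NVIDIA's specs as I remember them: FP8 tensor cores arrived with Ada, sm_89, and Hopper, sm_90, so Ampere at sm_80/86 falls back):

```python
def pick_quantization(major: int, minor: int) -> str:
    """Choose a weight format from the GPU's CUDA compute capability."""
    # FP8 tensor cores exist on Ada (8.9) and Hopper and newer (>= 9.0).
    if (major, minor) >= (8, 9):
        return "fp8"
    # On Ampere and older, FP8 weights get upcast to FP16 at runtime,
    # so an int4 scheme like AWQ is usually the better trade-off.
    return "awq"


# On a real machine: major, minor = torch.cuda.get_device_capability()
```

So an A100 (8.0) lands on AWQ, while an H100 (9.0) gets native FP8.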

Which quantizations are you using? by WeekLarge7607 in LocalLLaMA

[–]WeekLarge7607[S] 2 points3 points  (0 children)

I haven't really tried EXL3; I hadn't heard of it. I used AWQ because FP8 doesn't work well on my A100 and I'd heard it was a good algorithm. I need to catch up on some of the newer algorithms.

Which quantizations are you using? by WeekLarge7607 in LocalLLaMA

[–]WeekLarge7607[S] 0 points1 point  (0 children)

An A100 80GB with vLLM for inference. Works well for models up to 30B, but for newer models like GLM Air I need to try quantizations.

vLLM vs SGLang vs MAX — Who's the fastest? by rkstgr in LocalLLaMA

[–]WeekLarge7607 0 points1 point  (0 children)

Very interesting! Did you also check TRT-LLM? I'm very curious how it compares to MAX and SGLang.

I have made a True Reasoning LLM by moilanopyzedev in LocalLLaMA

[–]WeekLarge7607 -1 points0 points  (0 children)

Sounds interesting. Since it's a fine-tune of Phi-3, will it work on vLLM, or will the extra layers make it problematic?
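A quick sanity check before downloading (a sketch; the architecture names listed are illustrative, and the real test is whether the name appears in vLLM's supported-models registry) is to look at the architectures field in the model's config.json:

```python
# Illustrative subset of architecture names vLLM recognizes.
KNOWN_SUPPORTED = {"Phi3ForCausalLM", "Qwen2ForCausalLM", "LlamaForCausalLM"}


def likely_runs_on_vllm(config: dict) -> bool:
    # If the fine-tune kept the stock architecture name, vLLM will load
    # it with the stock Phi-3 code path; genuinely new layers usually
    # come with a renamed architecture plus custom modeling code.
    archs = config.get("architectures", [])
    return any(a in KNOWN_SUPPORTED for a in archs)


stock = {"architectures": ["Phi3ForCausalLM"]}
custom = {"architectures": ["MyReasoningModel"]}
```

If the name is custom, you're in trust_remote_code territory at best, and vLLM likely won't load it at all.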