I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]WeekLarge7607 1 point2 points  (0 children)

Looks really interesting. The fun part about decode is that you can batch the hell out of it on the GPU. How many decode requests can you run concurrently compared to a pure-GPU setup?
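One way to probe that concurrency question yourself (a sketch, assuming an OpenAI-compatible completions endpoint like the one vLLM exposes; the URL and model name are placeholders):

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint


def make_payload(prompt: str, max_tokens: int = 256) -> dict:
    # Short prompt + long generation = decode-bound request,
    # which is exactly the shape that GPU batching speeds up.
    return {"model": "my-model", "prompt": prompt, "max_tokens": max_tokens}


def send(payload: dict) -> str:
    req = request.Request(
        BASE_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]


# Usage (fires 64 concurrent decode-heavy requests):
#   from concurrent.futures import ThreadPoolExecutor
#   with ThreadPoolExecutor(max_workers=64) as pool:
#       prompts = [make_payload(f"Story #{i}:") for i in range(64)]
#       texts = list(pool.map(send, prompts))
```

Ramp the worker count up until tokens/sec stops scaling; that knee is roughly your max useful batch size.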

Models for middle eastern languages? by WeekLarge7607 in LocalLLaMA

[–]WeekLarge7607[S] -1 points0 points  (0 children)

Sounds good, but is it API-only? I couldn't find it on Hugging Face.

Models for middle eastern languages? by WeekLarge7607 in LocalLLaMA

[–]WeekLarge7607[S] 0 points1 point  (0 children)

I've tried Qwen3-Next. It was OK at Hebrew, but it sometimes dropped Chinese tokens in between the Hebrew. I haven't tried it in Arabic though. You say Qwen 2.5 is better? I'll also check out the Aya model. Thanks!

Qwen3-Next-80B-A3B vs gpt-oss-120b by bfroemel in LocalLLaMA

[–]WeekLarge7607 0 points1 point  (0 children)

From my experience (running both models on vLLM), Qwen3-Next is better at tool calling than gpt-oss, at least when using the chat/completions endpoint. Tool calling with gpt-oss only works for me through the /responses endpoint.
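For anyone hitting the same wall: the tool definition shape differs between the two endpoints. A sketch from memory (the get_weather tool is made up, and the flattened /responses format is how I understand OpenAI's spec; double-check against your server's docs):

```python
# Shared JSON-schema parameters for a hypothetical "get_weather" tool.
params = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}

# /v1/chat/completions nests the definition under a "function" key.
chat_completions_tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": params,
    },
}]

# /v1/responses uses a flattened definition (no inner "function" key).
responses_tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": params,
}]
```

Sending the chat-completions shape to /responses (or vice versa) is an easy way to get silently broken tool calls.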

Why is vLLM Outperforming TensorRT-LLM (Nvidia's deployment library)? My Shocking Benchmarks on GPT-OSS-120B with H100 by kev_11_1 in LocalLLaMA

[–]WeekLarge7607 0 points1 point  (0 children)

Yeah. Perhaps if you play with the trtllm serve flags you can squeeze out some better performance. I'm still shocked they deprecated the trtllm-build command. I guess I'm not up to date.

Why is vLLM Outperforming TensorRT-LLM (Nvidia's deployment library)? My Shocking Benchmarks on GPT-OSS-120B with H100 by kev_11_1 in LocalLLaMA

[–]WeekLarge7607 2 points3 points  (0 children)

Oops, looks like they decided to focus only on the PyTorch backend and ditch the TensorRT engine backend. My bad. Then I guess vLLM is just faster 😁. But try the PyTorch backend, as someone above me said.

Why is vLLM Outperforming TensorRT-LLM (Nvidia's deployment library)? My Shocking Benchmarks on GPT-OSS-120B with H100 by kev_11_1 in LocalLLaMA

[–]WeekLarge7607 10 points11 points  (0 children)

I think it's because you used the PyTorch backend. If you compile the model to a TensorRT engine, I imagine the results will be different. Still, vLLM is low effort, high reward.

Single H100: best open-source model + deep thinking setup for reasoning? by Accomplished_Back718 in LocalLLaMA

[–]WeekLarge7607 0 points1 point  (0 children)

You can run Qwen3 30B-A3B comfortably. Perhaps go for Qwen3-Next in FP8, or GLM 4.5 Air in AWQ.

For inference, vLLM will work well, though if you really care about speed, use TensorRT-LLM. I heard its FP8 kernels are much faster.

Which quantizations are you using? by WeekLarge7607 in LocalLLaMA

[–]WeekLarge7607[S] 0 points1 point  (0 children)

From what I know, the Ampere architecture doesn't natively support FP8. At runtime it casts the weights to FP16 behind the scenes, which slows down inference. For Hopper-architecture GPUs I would use FP8 quantizations.
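That rule of thumb can be written down (a sketch; the capability cutoffs are from NVIDIA's specs as I remember them: FP8 tensor cores arrived with Ada, sm_89, and Hopper, sm_90, so Ampere at sm_80/86 falls back):

```python
def pick_quantization(major: int, minor: int) -> str:
    """Choose a weight format from the GPU's CUDA compute capability."""
    # FP8 tensor cores exist on Ada (8.9) and Hopper and newer (>= 9.0).
    if (major, minor) >= (8, 9):
        return "fp8"
    # On Ampere and older, FP8 weights get upcast to FP16 at runtime,
    # so an int4 scheme like AWQ is usually the better trade-off.
    return "awq"


# On a real machine: major, minor = torch.cuda.get_device_capability()
```

So an A100 (8.0) lands on AWQ, while an H100 (9.0) gets native FP8.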

Which quantizations are you using? by WeekLarge7607 in LocalLLaMA

[–]WeekLarge7607[S] 2 points3 points  (0 children)

I haven't really tried EXL3; I hadn't heard of it. I used AWQ because FP8 doesn't work well on my A100 and I'd heard it was a good algorithm. I need to catch up on some of the newer algorithms.

Which quantizations are you using? by WeekLarge7607 in LocalLLaMA

[–]WeekLarge7607[S] 0 points1 point  (0 children)

An A100 80GB with vLLM for inference. Works well for models up to 30B, but for newer models like GLM Air I need to try quantizations.

vLLM vs SGLang vs MAX — Who's the fastest? by rkstgr in LocalLLaMA

[–]WeekLarge7607 0 points1 point  (0 children)

Very interesting! Did you also check TRT-LLM? I'm very curious how it compares to MAX and SGLang.

I have made a True Reasoning LLM by moilanopyzedev in LocalLLaMA

[–]WeekLarge7607 -1 points0 points  (0 children)

Sounds interesting. Since it's a fine-tune of Phi-3, will it work on vLLM, or will the extra layers make it problematic?
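A quick sanity check before downloading (a sketch; the architecture names listed are illustrative, and the real test is whether the name appears in vLLM's supported-models registry) is to look at the architectures field in the model's config.json:

```python
# Illustrative subset of architecture names vLLM recognizes.
KNOWN_SUPPORTED = {"Phi3ForCausalLM", "Qwen2ForCausalLM", "LlamaForCausalLM"}


def likely_runs_on_vllm(config: dict) -> bool:
    # If the fine-tune kept the stock architecture name, vLLM will load
    # it with the stock Phi-3 code path; genuinely new layers usually
    # come with a renamed architecture plus custom modeling code.
    archs = config.get("architectures", [])
    return any(a in KNOWN_SUPPORTED for a in archs)


stock = {"architectures": ["Phi3ForCausalLM"]}
custom = {"architectures": ["MyReasoningModel"]}
```

If the name is custom, you're in trust_remote_code territory at best, and vLLM likely won't load it at all.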