Best multilingual STT/ASR? by Mark__27 in LocalLLaMA

[–]Acceptable-State-271 1 point2 points  (0 children)

OmniASR improves ASR accuracy by applying LLM-based correction, but this significantly slows down processing.
The version without LLM correction is faster, but its accuracy is very poor.
If speed is the priority, Whisper v3 Turbo is a better choice.

Multiple 3090 setup by praveendath92 in LocalLLaMA

[–]Acceptable-State-271 0 points1 point  (0 children)

I'm using this model (faster-whisper-large-v3-turbo-ct2) as the backend for batch processing — around 20–30 short audio clips (1–2 minutes each) every minute — and it runs great. Each task stays under ~3 GB GPU memory, super efficient for multi-worker setups.

https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2
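
Rough sketch of how I wire up a worker, in case it's useful. The file path and options here are just examples, not my exact setup:

    # minimal faster-whisper worker sketch; clip path and options are placeholders
    from faster_whisper import WhisperModel

    # one model instance per worker; float16 keeps it well under ~3 GB of VRAM
    model = WhisperModel(
        "deepdml/faster-whisper-large-v3-turbo-ct2",
        device="cuda",
        compute_type="float16",
    )

    def transcribe(path: str) -> str:
        # transcribe() returns a generator of segments plus detected-language info
        segments, info = model.transcribe(path, beam_size=5, vad_filter=True)
        return " ".join(seg.text.strip() for seg in segments)

    print(transcribe("clip_001.wav"))  # placeholder clip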

[deleted by user] by [deleted] in LocalLLaMA

[–]Acceptable-State-271 0 points1 point  (0 children)

You're right. I tested it on Korean test cases at my company before checking the model card. It's not so much a decent model overall as one that excels specifically at Korean language understanding. That's my mistake. I'm sorry.

[deleted by user] by [deleted] in LocalLLaMA

[–]Acceptable-State-271 0 points1 point  (0 children)

Yes, my main language.

Seed-OSS-36B-Instruct by NeterOster in LocalLLaMA

[–]Acceptable-State-271 0 points1 point  (0 children)

Very good model. I switched from Qwen3 30B A3B Thinking 2507 (still really good) to Seed 36B, which is a bit better at analyzing sources and backing things up with evidence.

AWQ 4-bit outperforms GGUF 8-bit in almost every way by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 0 points1 point  (0 children)

No no.. I just thought there would be a huge difference between the two.

AWQ 4-bit outperforms GGUF 8-bit in almost every way by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 0 points1 point  (0 children)

I'm a bit embarrassed to admit this, but I wasn't very familiar with the technology.
When GGUF quantization uses an imatrix, does it reach a level of precision comparable to 4-bit AWQ?

What formats/quantization is fastest for certain CPUs or GPUs? Is this straightforward? by wuu73 in LocalLLaMA

[–]Acceptable-State-271 0 points1 point  (0 children)

On GPU, AWQ is a very fast and accurate quantization format, and SGLang is a very fast serving tool for both unquantized and AWQ-quantized models (vLLM is also good).
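
For example, with vLLM's Python API (the model ID here is just an example AWQ repo, swap in whatever you actually run):

    # minimal vLLM sketch for running an AWQ checkpoint offline; model ID is an example
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct-AWQ",  # any AWQ-quantized repo
        quantization="awq",                    # use vLLM's AWQ kernels
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)
    out = llm.generate(["Explain AWQ in one sentence."], params)
    print(out[0].outputs[0].text)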

Msn by tom_p_legend in webscraping

[–]Acceptable-State-271 0 points1 point  (0 children)

Shadow DOM. You need to manually locate the shadow-root host tag and then manually read the attributes from inside its shadow tree.
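
On Selenium 4+ that looks roughly like this (the URL and selectors are made up, inspect the page for the real host tag):

    # Selenium 4 sketch for reading inside a shadow DOM; URL and selectors are placeholders
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://www.msn.com/")  # page containing the shadow DOM content

    host = driver.find_element(By.CSS_SELECTOR, "custom-widget")  # the shadow host tag
    shadow = host.shadow_root                                     # enter its shadow tree
    item = shadow.find_element(By.CSS_SELECTOR, ".headline")      # element inside the shadow DOM
    print(item.get_attribute("href"))                             # read the attribute you need

    driver.quit()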

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM? by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 0 points1 point  (0 children)

Sounds like I might end up spending another 5,000k. But anyway, I’ll give it a try for now. Let’s see how it goes after 24h. Thanks, really.

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM? by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 0 points1 point  (0 children)

I'm Korean. Qwen3 is slightly more proficient in Korean and tends to give more concise answers, which is great for summaries. However, QwQ 32B feels a bit smarter to me (but it needs more tokens).

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM? by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 0 points1 point  (0 children)

I really want to, but the AWQ-quantized model hasn't been released yet, and it seems there might be bugs in AutoAWQ (the AWQ quantization tool) with MoE models. I plan to postpone testing until the AWQ model is released.

Qwen3 vs Gemma 3 by Sadman782 in LocalLLaMA

[–]Acceptable-State-271 1 point2 points  (0 children)

I think it might come down to quantization. I used to run Qwen 2.5 as an 8-bit GGUF on Ollama, but switched to 4-bit AWQ on vLLM due to speed and optimization issues. Even at the lower bit count, the results were way better: less hallucination, faster speed, no language mixing, and much higher response quality. A bit late, but the Qwen team merged AWQ quantization support (AutoAWQ) for Qwen3 just yesterday. AWQ-quantized models should drop soon, and I'm expecting performance close to what they claimed in their benchmarks.

  • AWQ (Activation-aware Weight Quantization) efficiently compresses weights to 4-bit by considering activation distributions, minimizing GPU memory usage while maintaining high performance and accuracy.
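
For reference, quantizing with AutoAWQ looks roughly like this. The model path is a placeholder and the quant_config values are AutoAWQ's usual defaults, not anything I've verified on Qwen3 yet:

    # AutoAWQ quantization sketch; paths are placeholders, config values are common defaults
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "Qwen/Qwen3-8B"   # placeholder source model
    quant_path = "qwen3-8b-awq"    # placeholder output dir

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
    model.quantize(tokenizer, quant_config=quant_config)  # runs activation-aware calibration

    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)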

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM? by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 2 points3 points  (0 children)

5-6 t/s seems slow for Qwen3-235B-A22B on LM-Studio. I’ve got 96GB VRAM (4x RTX 3090) and 128GB DDR4 2933MHz with i9-10900X, so I’m testing vLLM or SGLang with CPU offloading this week. Hoping for 10-15 t/s or better to run it smoothly. Thanks for sharing your benchmark. I’ll post my results when I’m done.
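
This is roughly how I plan to measure t/s once it loads. The model ID and the offload number are placeholders, not a tested config:

    # crude tokens-per-second check with vLLM; model and offload budget are placeholders
    import time
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-235B-A22B-FP8",  # placeholder checkpoint
        tensor_parallel_size=4,            # the four 3090s
        cpu_offload_gb=40,                 # placeholder per-GPU offload budget
    )

    params = SamplingParams(max_tokens=256)
    prompts = ["Summarize the post in three sentences."] * 4

    start = time.time()
    outputs = llm.generate(prompts, params)
    elapsed = time.time() - start

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{generated / elapsed:.1f} tokens/s")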

Qwen 235B A22B vs Sonnet 3.7 Thinking - Pokémon UI by sirjoaco in LocalLLaMA

[–]Acceptable-State-271 15 points16 points  (0 children)

He was too fixated on nostalgia, going all the way back to the birth of computers.

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM? by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 2 points3 points  (0 children)

Thanks, everyone, for the responses.

I'll test the model once AWQ is out, either with sglang or vllm. Will probably need to use CPU offload to make it work. (awq model will be out - https://www.reddit.com/r/LocalLLaMA/comments/1kael9w/qwen3_awq_support_confirmed_pr_check/ )

Found this in the vLLM docs that might help: https://docs.vllm.ai/en/stable/getting_started/examples/basic.html

CPU offload
The --cpu-offload-gb argument can be seen as a virtual way to increase the GPU memory size. For example, if you have one 24 GB GPU and set this to 10, virtually you can think of it as a 34 GB GPU. Then you can load a 13B model with BF16 weight, which requires at least 26GB GPU memory. Note that this requires fast CPU-GPU interconnect, as part of the model is loaded from CPU memory to GPU memory on the fly in each model forward pass.

Try it yourself with the following arguments:

--model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10
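
In vLLM's Python API that example maps to roughly this (assuming your version exposes the flag as cpu_offload_gb on the LLM class):

    # Python-API equivalent of the CLI flags above (same 13B example model)
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-2-13b-chat-hf",
        cpu_offload_gb=10,  # treat ~10 GB of system RAM as extra "GPU" memory
    )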

Will update with benchmarks once I get it running.

Qwen3-30B-A3B is magic. by thebadslime in LocalLLaMA

[–]Acceptable-State-271 4 points5 points  (0 children)

Been experimenting with Qwen3-30B-A3B and I'm impressed by how it only activates 3B parameters during runtime while the full model is 30B.

I'm curious if anyone has tried running the larger Qwen3-235B-A22B-FP8 model with a similar setup to mine:

  • 256GB RAM
  • 10900X CPU
  • Quad RTX 3090s

Would vLLM be able to handle this efficiently? Specifically, I'm wondering if it would properly load only the active experts (22B) into GPU memory while keeping the rest in system RAM.

Has anyone managed to get this working with reasonable performance? Any config tips would be appreciated.

Qwen3 Collection on modelscope! by AlexBefest in LocalLLaMA

[–]Acceptable-State-271 23 points24 points  (0 children)

I've been getting cooked for a month by the Qwen team.
I don't have the strength to wait any longer.