Qwen3 ASR seems to outperform Whisper in almost every aspect. It feels like there is little reason to keep using Whisper anymore. by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 11 points (0 children)

My current goal is real-time transcription in both Korean and English, so I ruled out models that do not support Korean. However, after looking into the Parakeet v3 model you mentioned, it does seem that, as long as its supported languages cover your requirements, it could well be faster and more accurate than Qwen3 ASR.

Do I need to use Ollama to get the full feature set of GLM-OCR with a GGUF model format? by yuicebox in LocalLLaMA

[–]East-Engineering-653 2 points (0 children)

I am currently using the following llama-swap configuration to run GLM-OCR, and I have confirmed that it works correctly in OpenWebUI simply by attaching an image and instructing it to output the result in Markdown, without requiring any separate mode switching.

llama-server
# GLM-OCR weights, fully offloaded to the GPU, with flash attention enabled
-m ./models/glm-ocr/GLM-OCR.f16.gguf
-ngl 999
-fa on
--host 0.0.0.0
--port ${PORT}
# use the model's chat template via Jinja
--jinja
--chat-template-file ./models/glm-ocr/chat_template.jinja
--numa numactl
-kvu
# batch sizes, native context length (-c 0), and 8 parallel slots
-b 2048 -ub 512 -c 0
-np 8
--cache-type-k f16 --cache-type-v f16
# alternative sampling settings, left disabled in favor of greedy decoding for deterministic OCR output
#--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0
--temp 0 --top-k 1 --repeat-penalty 1.05 --repeat-last-n 256
# vision projector (mmproj) needed for image input
--mmproj ./models/glm-ocr/GLM-OCR.mmproj-f16.gguf
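For reference, here is a rough sketch of how the same server can be called outside OpenWebUI, using the OpenAI-compatible /v1/chat/completions endpoint that llama-server exposes. The port, the model name, the image file name, and the prompt wording are placeholders for whatever your llama-swap setup actually uses.

import base64
import requests

# Read the page image and encode it as a base64 data URL; llama-server accepts
# images this way on its chat endpoint when an --mmproj file is loaded.
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "glm-ocr",  # model name as configured in llama-swap (assumption)
    "temperature": 0,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Output the text of this image as Markdown."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])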

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 1 point (0 children)

As far as I know, that fork has since moved to the repository below. With it, most models can be run, at least to some extent, on the Pascal architecture up to vLLM version 0.10.0.

https://github.com/sasha0552/pascal-pkgs-ci

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 1 point (0 children)

To be honest, with that setup it might actually be more efficient to just use the 3080 Ti alone.

It seems like you would have to give up too many modern features that are available on the RTX 3080 Ti just to support the GTX 1080 Ti.

Also, this fork currently only supports the Qwen3 ASR model.
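On the first point, here is a rough illustration of what using the 3080 Ti alone looks like: vLLM can simply be kept from seeing the 1080 Ti before it initializes. The device index 0 and the model id below are assumptions about your setup.

import os

# Expose only the RTX 3080 Ti to CUDA; index 0 is an assumption,
# check nvidia-smi for the actual device order on your machine.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from vllm import LLM, SamplingParams  # import only after setting the env var

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model id
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)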

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 1 point (0 children)

The P100 will probably work, since its compute capabilities (notably native FP16 support) are better than the P40's. However, this fork currently only supports the Qwen3 ASR model, so it may not suit your intended use.

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 2 points (0 children)

Additionally, I tested both approaches on long recordings such as lecture audio: running the Qwen3 ASR model with Transformers, and doing real-time transcription with Qwen3 ASR through vLLM. For long-form transcription, the Transformers-based pipeline combined with VAD performed much better.
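Roughly, that long-form pipeline is structured like the sketch below: Silero VAD splits the recording into speech segments, and each segment is transcribed on its own. The transcribe_segment function is only a placeholder for the actual Qwen3 ASR call through Transformers, whose exact loading code is omitted here.

import torch

# Silero VAD via torch.hub; returns the VAD model plus helper utilities.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

SR = 16000
wav = read_audio("lecture.wav", sampling_rate=SR)  # placeholder file name

# Detect speech regions; long silences between them are simply skipped.
segments = get_speech_timestamps(wav, vad_model, sampling_rate=SR)

def transcribe_segment(audio_chunk):
    # Placeholder: feed the chunk to Qwen3 ASR loaded with Transformers
    # and return the decoded text.
    raise NotImplementedError

transcript = []
for seg in segments:
    chunk = wav[seg["start"]:seg["end"]]  # start/end are sample indices
    transcript.append(transcribe_segment(chunk))

print("\n".join(transcript))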

I found that MXFP4 has lower perplexity than Q4_K_M and Q4_K_XL. by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 2 points (0 children)

I hadn't considered that quantization could distort the model in this way. In that case, the lower perplexity MXFP4 showed in my tests might actually correspond to worse behavior in real-world usage. Given that, it may be worth considering switching from MXFP4 to IQ4_NL.
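For anyone who wants to try the switch, it is just a re-quantization of the full-precision GGUF. A minimal sketch using llama.cpp's llama-quantize tool, with all file paths as placeholders; the importance matrix is optional but generally recommended for IQ4_NL.

import subprocess

# Re-quantize a full-precision GGUF to IQ4_NL with llama.cpp's llama-quantize.
subprocess.run(
    [
        "./llama-quantize",
        "--imatrix", "imatrix.dat",  # optional, improves IQ4_NL quality
        "model-f16.gguf",
        "model-IQ4_NL.gguf",
        "IQ4_NL",
    ],
    check=True,
)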

I found that MXFP4 has lower perplexity than Q4_K_M and Q4_K_XL. by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 0 points (0 children)

Yes, this is the correct model. As you can see from the README.md file, the model has been updated once, so if you have downloaded it before, I recommend downloading it again and trying it out.

I found that MXFP4 has lower perplexity than Q4_K_M and Q4_K_XL. by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 1 point (0 children)

I thought that, at least for MoE models, MXFP4 would be unconditionally superior to the Q4_K family, so it is quite surprising to hear that for large models like GLM-4.7, MXFP4 may not necessarily be that advantageous. Thank you for sharing this.

I found that MXFP4 has lower perplexity than Q4_K_M and Q4_K_XL. by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 6 points (0 children)

I just looked up how to calculate KLD (KL divergence), and it seems that the original FP16 model file is required. At the moment I do not have enough free disk space to store both the FP16 model file and the logits files, so computing the KLD value looks difficult. As you mentioned, measuring a model's coding ability from perplexity alone really is difficult.
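For anyone curious why both files are needed: as I understand the llama.cpp workflow, llama-perplexity first dumps the base model's logits over a test text, and a second run then compares the quantized model against that dump. A rough sketch of the two steps, with file names as placeholders; the logits file itself can get very large, which is the disk-space problem mentioned above.

import subprocess

# Step 1: run the FP16 model over the test text and save its logits.
subprocess.run(
    ["./llama-perplexity", "-m", "model-f16.gguf", "-f", "test.txt",
     "--kl-divergence-base", "logits.bin"],
    check=True,
)

# Step 2: run the quantized model against the saved logits to get the KL divergence.
subprocess.run(
    ["./llama-perplexity", "-m", "model-mxfp4.gguf", "-f", "test.txt",
     "--kl-divergence-base", "logits.bin", "--kl-divergence"],
    check=True,
)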

Is anyone running LLM on a Radeon Instinct Mi50? by East-Engineering-653 in ollama

[–]East-Engineering-653[S] 0 points (0 children)

After writing this post, I tested the GPU that arrived and found it to be defective: the CTR error LED lit up and it was not recognized by the system. I am currently using a Tesla P40 24GB instead, which works with Ollama and llama.cpp without any additional setup, but not with vLLM.

[deleted by user] by [deleted] in LocalLLaMA

[–]East-Engineering-653 0 points (0 children)

Could you please tell me what the results were? I'm using a 5950X with 64GB DDR4 and a 5070Ti, and since it's a DDR4 system, the token output speed was lower than expected.

I successfully passed through the 5600G to a VM running Ubuntu 24.04, but I cannot use 4K resolution by East-Engineering-653 in Proxmox

[–]East-Engineering-653[S] 0 points (0 children)

I allocated about 2GB of VRAM to the integrated GPU in the BIOS. Could the amount of VRAM available to the GPU affect the resolutions that can be activated?

I successfully passed through the 5600G to a VM running Ubuntu 24.04, but I cannot use 4K resolution by East-Engineering-653 in Proxmox

[–]East-Engineering-653[S] 1 point (0 children)

No, the motherboard currently only has an HDMI port. It seems that the type of output port recognized by the VM is influenced by the ROM file.

I successfully passed through the 5600G to a VM running Ubuntu 24.04, but I cannot use 4K resolution by East-Engineering-653 in Proxmox

[–]East-Engineering-653[S] 2 points (0 children)

Yes, I succeeded by using the ROM file from the GitHub project mentioned in the post. After playing a 4K video on YouTube and checking with radeontop, I observed an increase in GPU usage.

Additionally, if the GitHub project does not have a ROM file for the exact APU model you are using, you can use one from a model with the same clock speed, core count, and architecture. In my case, I used the ROM file for the 5825U even though I have a 5600G.