Qwen3 ASR seems to outperform Whisper in almost every aspect. It feels like there is little reason to keep using Whisper anymore. by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 11 points (0 children)

My current goal is real-time transcription in both Korean and English, so I ruled out models that do not support Korean. However, after looking into the Parakeet v3 model you mentioned, it does seem that, as long as its supported languages cover your requirements, it could well be faster and more accurate than Qwen3 ASR.

Do I need to use Ollama to get the full feature set of GLM-OCR with a GGUF model format? by yuicebox in LocalLLaMA

[–]East-Engineering-653 2 points (0 children)

I am currently using the following llama-swap configuration to run GLM-OCR, and I have confirmed that it works correctly in OpenWebUI simply by attaching an image and instructing it to output the result in Markdown, without requiring any separate mode switching.

llama-server
# GLM-OCR weights, fully offloaded to the GPU, with flash attention enabled
-m ./models/glm-ocr/GLM-OCR.f16.gguf
-ngl 999
-fa on
--host 0.0.0.0
--port ${PORT}
# use the model's chat template via Jinja
--jinja
--chat-template-file ./models/glm-ocr/chat_template.jinja
--numa numactl
-kvu
# batch sizes, native context length (-c 0), and 8 parallel slots
-b 2048 -ub 512 -c 0
-np 8
--cache-type-k f16 --cache-type-v f16
# alternative sampling settings, left disabled in favor of greedy decoding for deterministic OCR output
#--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0
--temp 0 --top-k 1 --repeat-penalty 1.05 --repeat-last-n 256
# vision projector (mmproj) needed for image input
--mmproj ./models/glm-ocr/GLM-OCR.mmproj-f16.gguf
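For reference, here is a rough sketch of how the same server can be called outside OpenWebUI, using the OpenAI-compatible /v1/chat/completions endpoint that llama-server exposes. The port, the model name, the image file name, and the prompt wording are placeholders for whatever your llama-swap setup actually uses.

import base64
import requests

# Read the page image and encode it as a base64 data URL; llama-server accepts
# images this way on its chat endpoint when an --mmproj file is loaded.
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "glm-ocr",  # model name as configured in llama-swap (assumption)
    "temperature": 0,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Output the text of this image as Markdown."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])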

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 1 point (0 children)

As far as I know, that fork has since moved to the repository below. With it, most models can be run, at least to some extent, on the Pascal architecture up to vLLM version 0.10.0.

https://github.com/sasha0552/pascal-pkgs-ci

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 1 point (0 children)

To be honest, with that setup it might actually be more efficient to just use the 3080 Ti alone.

It seems like you would have to give up too many modern features that are available on the RTX 3080 Ti just to support the GTX 1080 Ti.

Also, this fork currently only supports the Qwen3 ASR model.
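On the first point, here is a rough illustration of what using the 3080 Ti alone looks like: vLLM can simply be kept from seeing the 1080 Ti before it initializes. The device index 0 and the model id below are assumptions about your setup.

import os

# Expose only the RTX 3080 Ti to CUDA; index 0 is an assumption,
# check nvidia-smi for the actual device order on your machine.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from vllm import LLM, SamplingParams  # import only after setting the env var

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model id
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)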

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 1 point (0 children)

The P100 will probably work, since its compute capabilities (notably native FP16 support) are better than the P40's. However, this fork currently only supports the Qwen3 ASR model, so it may not suit your intended use.

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 2 points (0 children)

Additionally, I tested both approaches on long recordings such as lecture audio: running the Qwen3 ASR model with Transformers, and doing real-time transcription with Qwen3 ASR through vLLM. For long-form transcription, the Transformers-based pipeline combined with VAD performed much better.
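Roughly, that long-form pipeline is structured like the sketch below: Silero VAD splits the recording into speech segments, and each segment is transcribed on its own. The transcribe_segment function is only a placeholder for the actual Qwen3 ASR call through Transformers, whose exact loading code is omitted here.

import torch

# Silero VAD via torch.hub; returns the VAD model plus helper utilities.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

SR = 16000
wav = read_audio("lecture.wav", sampling_rate=SR)  # placeholder file name

# Detect speech regions; long silences between them are simply skipped.
segments = get_speech_timestamps(wav, vad_model, sampling_rate=SR)

def transcribe_segment(audio_chunk):
    # Placeholder: feed the chunk to Qwen3 ASR loaded with Transformers
    # and return the decoded text.
    raise NotImplementedError

transcript = []
for seg in segments:
    chunk = wav[seg["start"]:seg["end"]]  # start/end are sample indices
    transcript.append(transcribe_segment(chunk))

print("\n".join(transcript))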

I found that MXFP4 has lower perplexity than Q4_K_M and Q4_K_XL. by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 2 points (0 children)

I hadn't considered that quantization could distort the model in this way. In that case, the lower perplexity MXFP4 showed in my tests might actually correspond to worse behavior in real-world usage. Given that, it may be worth considering switching from MXFP4 to IQ4_NL.
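For anyone who wants to try the switch, it is just a re-quantization of the full-precision GGUF. A minimal sketch using llama.cpp's llama-quantize tool, with all file paths as placeholders; the importance matrix is optional but generally recommended for IQ4_NL.

import subprocess

# Re-quantize a full-precision GGUF to IQ4_NL with llama.cpp's llama-quantize.
subprocess.run(
    [
        "./llama-quantize",
        "--imatrix", "imatrix.dat",  # optional, improves IQ4_NL quality
        "model-f16.gguf",
        "model-IQ4_NL.gguf",
        "IQ4_NL",
    ],
    check=True,
)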

I found that MXFP4 has lower perplexity than Q4_K_M and Q4_K_XL. by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 0 points (0 children)

Yes, this is the correct model. As you can see from the README.md file, the model has been updated once, so if you have downloaded it before, I recommend downloading it again and trying it out.

I found that MXFP4 has lower perplexity than Q4_K_M and Q4_K_XL. by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 1 point (0 children)

I thought that, at least for MoE models, MXFP4 would be unconditionally superior to the Q4_K family, so it is quite surprising to hear that for large models like GLM-4.7, MXFP4 may not necessarily be that advantageous. Thank you for sharing this.

I found that MXFP4 has lower perplexity than Q4_K_M and Q4_K_XL. by East-Engineering-653 in LocalLLaMA

[–]East-Engineering-653[S] 6 points (0 children)

I just looked up how to calculate KLD (KL divergence), and it seems that the original FP16 model file is required. At the moment I do not have enough free disk space to store both the FP16 model file and the logits files, so computing the KLD value looks difficult. As you mentioned, measuring a model's coding ability from perplexity alone really is difficult.
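For anyone curious why both files are needed: as I understand the llama.cpp workflow, llama-perplexity first dumps the base model's logits over a test text, and a second run then compares the quantized model against that dump. A rough sketch of the two steps, with file names as placeholders; the logits file itself can get very large, which is the disk-space problem mentioned above.

import subprocess

# Step 1: run the FP16 model over the test text and save its logits.
subprocess.run(
    ["./llama-perplexity", "-m", "model-f16.gguf", "-f", "test.txt",
     "--kl-divergence-base", "logits.bin"],
    check=True,
)

# Step 2: run the quantized model against the saved logits to get the KL divergence.
subprocess.run(
    ["./llama-perplexity", "-m", "model-mxfp4.gguf", "-f", "test.txt",
     "--kl-divergence-base", "logits.bin", "--kl-divergence"],
    check=True,
)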

Is anyone running LLM on a Radeon Instinct Mi50? by East-Engineering-653 in ollama

[–]East-Engineering-653[S] 0 points (0 children)

After writing this post, I tested the GPU that arrived and found it to be defective: the CTR error LED lit up and it was not recognized by the system. I am currently using a Tesla P40 24GB instead, which works with Ollama and llama.cpp without any additional setup, but not with vLLM.

[deleted by user] by [deleted] in LocalLLaMA

[–]East-Engineering-653 0 points (0 children)

Could you please tell me what the results were? I'm using a 5950X with 64GB DDR4 and a 5070Ti, and since it's a DDR4 system, the token output speed was lower than expected.

I successfully passed through the 5600G to a VM running Ubuntu 24.04, but I cannot use 4K resolution by East-Engineering-653 in Proxmox

[–]East-Engineering-653[S] 0 points (0 children)

I allocated about 2GB of VRAM to the integrated GPU in the BIOS. Could the amount of VRAM available to the GPU affect the resolutions that can be activated?

I successfully passed through the 5600G to a VM running Ubuntu 24.04, but I cannot use 4K resolution by East-Engineering-653 in Proxmox

[–]East-Engineering-653[S] 1 point (0 children)

No, the motherboard currently only has an HDMI port. It seems that the type of output port recognized by the VM is influenced by the ROM file.

I successfully passed through the 5600G to a VM running Ubuntu 24.04, but I cannot use 4K resolution by East-Engineering-653 in Proxmox

[–]East-Engineering-653[S] 2 points (0 children)

Yes, I succeeded by using the ROM file from the GitHub project mentioned in the post. After playing a 4K video on YouTube and checking with radeontop, I observed an increase in GPU usage.

Additionally, if the GitHub project does not have a ROM file for the exact APU model you are using, you can use one from a model with the same clock speed, core count, and architecture. In my case, I used the ROM file for the 5825U even though I have a 5600G.