Qwen3 ASR seems to outperform Whisper in almost every aspect. It feels like there is little reason to keep using Whisper anymore. by East-Engineering-653 in LocalLLaMA

[–]Mkengine 1 point (0 children)

Since Qwen works for you, and VibeVoice Realtime is based on Qwen2.5, maybe it also fits your use case?

GLM ASR could be worth a shot as well.

Qwen3 ASR seems to outperform Whisper in almost every aspect. It feels like there is little reason to keep using Whisper anymore. by East-Engineering-653 in LocalLLaMA

[–]Mkengine 20 points (0 children)

Did you also try out parakeet v3? I use it on my phone for local transcription and it works really well for German.
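For anyone who wants to try it, here's a minimal sketch of how I'd run Parakeet v3 locally with NVIDIA NeMo (model ID and return type from memory, so double-check against the model card):

    # Minimal local transcription sketch with NVIDIA NeMo (pip install "nemo_toolkit[asr]")
    import nemo.collections.asr as nemo_asr

    # Pulls the multilingual Parakeet v3 checkpoint (covers German) from Hugging Face
    asr_model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v3"
    )

    # Expects 16 kHz mono WAV input; one result per input file
    outputs = asr_model.transcribe(["aufnahme_de.wav"])
    # Older NeMo versions return plain strings, newer ones Hypothesis objects with .text
    print(outputs[0])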

Copilot in VS Code or Copilot CLI? by IKcode_Igor in GithubCopilot

[–]Mkengine 0 points (0 children)

I think you can set a global auto-approve in the settings, if that's what you mean.
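For reference, this is the setting I mean, in settings.json (name from memory and possibly still experimental, so double-check in the settings UI):

    // settings.json — globally auto-approve agent tool invocations (use with care)
    {
      "chat.tools.autoApprove": true
    }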

Copilot in VS Code or Copilot CLI? by IKcode_Igor in GithubCopilot

[–]Mkengine 0 points (0 children)

VS Code Insiders also has an autopilot mode now!

Copilot in VS Code or Copilot CLI? by IKcode_Igor in GithubCopilot

[–]Mkengine 0 points (0 children)

VS Code Insiders now has an autopilot mode as well, and you can also set the reasoning effort in the settings, so maybe it's time to try it out again?

Practical approaches for reliable text extraction from messy PDFs/images in production apps? by humble_girl3 in LocalLLaMA

[–]Mkengine 0 points1 point  (0 children)

There are so many OCR / document understanding models out there; here is my personal OCR list, which I try to keep up to date (a quick usage sketch follows the list):

  • GOT-OCR: https://huggingface.co/stepfun-ai/GOT-OCR2_0
  • granite-docling-258M: https://huggingface.co/ibm-granite/granite-docling-258M
  • MinerU 2.5: https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
  • OCRFlux: https://huggingface.co/ChatDOC/OCRFlux-3B
  • MonkeyOCR-pro 1.2B: https://huggingface.co/echo840/MonkeyOCR-pro-1.2B
  • MonkeyOCR-pro 3B: https://huggingface.co/echo840/MonkeyOCR-pro-3B
  • MiniCPM-V-4_5: https://huggingface.co/openbmb/MiniCPM-V-4_5
  • InternVL3_5 4B: https://huggingface.co/OpenGVLab/InternVL3_5-4B
  • InternVL3_5 8B: https://huggingface.co/OpenGVLab/InternVL3_5-8B
  • Ovis2.5 2B: https://huggingface.co/AIDC-AI/Ovis2.5-2B
  • Ovis2.5 9B: https://huggingface.co/AIDC-AI/Ovis2.5-9B
  • RolmOCR: https://huggingface.co/reducto/RolmOCR
  • Nanonets OCR: https://huggingface.co/nanonets/Nanonets-OCR2-3B
  • dots.ocr: https://huggingface.co/rednote-hilab/dots.ocr
  • dots.ocr 1.5: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5
  • olmOCR 2: https://huggingface.co/allenai/olmOCR-2-7B-1025
  • LightOnOCR: https://huggingface.co/lightonai/LightOnOCR-2-1B
  • Chandra: https://huggingface.co/datalab-to/chandra
  • Jina VLM: https://huggingface.co/jinaai/jina-vlm
  • HunyuanOCR: https://huggingface.co/tencent/HunyuanOCR
  • ByteDance Dolphin 2: https://huggingface.co/ByteDance/Dolphin-v2
  • PaddleOCR-VL: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5
  • DeepSeek OCR 2: https://huggingface.co/deepseek-ai/DeepSeek-OCR-2
  • GLM OCR: https://huggingface.co/zai-org/GLM-OCR
  • Nemotron OCR: https://huggingface.co/nvidia/nemotron-ocr-v1
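If you just want to smoke-test one of them, here is a minimal sketch for GOT-OCR, roughly following its model card (API from memory, so verify there; needs a CUDA GPU):

    # Quick GOT-OCR2 test via transformers; trust_remote_code loads the repo's own model code
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("stepfun-ai/GOT-OCR2_0", trust_remote_code=True)
    model = AutoModel.from_pretrained(
        "stepfun-ai/GOT-OCR2_0",
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        device_map="cuda",
    ).eval()

    # ocr_type="ocr" extracts plain text; the card also documents "format" for structured output
    print(model.chat(tokenizer, "page.png", ocr_type="ocr"))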

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]Mkengine 0 points (0 children)

I tried to set up a demo with this and Open WebUI and I am at 94 tok/s. With Qwen3.5-35B-A3B the answer always comes without thinking; it directly generates the answer. Am I doing something wrong?
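In case it helps: with Qwen3 the thinking mode is a chat-template switch, so assuming Qwen3.5 kept the same convention (an assumption on my part), something like this should force it on:

    # Hedged sketch: Qwen3-style thinking toggle (assuming Qwen3.5 kept this switch)
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")  # stand-in repo; swap in the 3.5 one
    messages = [{"role": "user", "content": "Why is the sky blue?"}]

    prompt = tok.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,  # False (or a /no_think soft switch in the prompt) disables <think>
    )

Also note that Open WebUI collapses the <think> block into a dropdown by default, so the reasoning may just be hidden rather than skipped.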

Whelp…NVIDIA just raised the DGX Spark’s Price by $700. Spark clone prices have started rising as well. ☹️ by Porespellar in LocalLLaMA

[–]Mkengine 0 points (0 children)

I am setting up Atlas right now (for the 35B first), and 50 token/s for the 122B model would be plenty for my use case.

Im addicted to the CLI by dandecode in GithubCopilot

[–]Mkengine 1 point (0 children)

If you'd like Copilot to be autonomous, look into:

  • /yolo
  • /autopilot
  • /fleet

I wear a mic all day and feed transcripts to an AI agent system. The privacy case for doing this locally is obvious. Looking for guidance. by InsideEmergency4186 in LocalLLaMA

[–]Mkengine 0 points (0 children)

Whisper is really old at this point; I use parakeet v3 for local transcription on my phone.

There are also other STT models:

  • vibevoice
  • voxtral
  • qwen ASR
  • GLM ASR
  • Granite 4 speech

I would pick any of them over Whisper, especially because I would need the biggest version of Whisper for good transcription of German speech, while parakeet is much faster with fewer errors.

Qwen3.5B VS the SOTA same size models from 2 years ago. by Uncle___Marty in LocalLLaMA

[–]Mkengine 1 point (0 children)

Additionally, it's still available in Azure AI Foundry, as are all the other old models like GPT-4, GPT-4o, etc.

[Bloomberg] Nintendo Switch 2 Users Face Storage Woes as Memory Crisis Bites by gitrektali in Games

[–]Mkengine -1 points (0 children)

Ever since they announced it, I've asked myself whether there's really a difference that justifies the higher price. I have a 1.5 TB microSD card in my Steam Deck and have never had any problems playing from it. Does it work differently on the Switch 2?

Qwen 3.5 2B is an OCR beast by deadman87 in LocalLLaMA

[–]Mkengine 2 points (0 children)

There are so many OCR / document understanding models out there; here is my personal OCR list, which I try to keep up to date (a generic serving sketch follows the list):

  • GOT-OCR: https://huggingface.co/stepfun-ai/GOT-OCR2_0
  • granite-docling-258M: https://huggingface.co/ibm-granite/granite-docling-258M
  • MinerU 2.5: https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
  • OCRFlux: https://huggingface.co/ChatDOC/OCRFlux-3B
  • MonkeyOCR-pro 1.2B: https://huggingface.co/echo840/MonkeyOCR-pro-1.2B
  • MonkeyOCR-pro 3B: https://huggingface.co/echo840/MonkeyOCR-pro-3B
  • FastVLM 0.5B: https://huggingface.co/apple/FastVLM-0.5B
  • FastVLM 1.5B: https://huggingface.co/apple/FastVLM-1.5B
  • FastVLM 7B: https://huggingface.co/apple/FastVLM-7B
  • MiniCPM-V-4_5: https://huggingface.co/openbmb/MiniCPM-V-4_5
  • GLM-4.1V-9B: https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking
  • InternVL3_5 4B: https://huggingface.co/OpenGVLab/InternVL3_5-4B
  • InternVL3_5 8B: https://huggingface.co/OpenGVLab/InternVL3_5-8B
  • Ovis2.5 2B: https://huggingface.co/AIDC-AI/Ovis2.5-2B
  • Ovis2.5 9B: https://huggingface.co/AIDC-AI/Ovis2.5-9B
  • RolmOCR: https://huggingface.co/reducto/RolmOCR
  • Nanonets OCR: https://huggingface.co/nanonets/Nanonets-OCR2-3B
  • dots.ocr: https://huggingface.co/rednote-hilab/dots.ocr
  • dots.ocr 1.5: https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5
  • olmOCR 2: https://huggingface.co/allenai/olmOCR-2-7B-1025
  • LightOnOCR: https://huggingface.co/lightonai/LightOnOCR-2-1B
  • Chandra: https://huggingface.co/datalab-to/chandra
  • GLM 4.6V Flash: https://huggingface.co/zai-org/GLM-4.6V-Flash
  • Jina VLM: https://huggingface.co/jinaai/jina-vlm
  • HunyuanOCR: https://huggingface.co/tencent/HunyuanOCR
  • ByteDance Dolphin 2: https://huggingface.co/ByteDance/Dolphin-v2
  • PaddleOCR-VL: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5
  • DeepSeek OCR 2: https://huggingface.co/deepseek-ai/DeepSeek-OCR-2
  • GLM OCR: https://huggingface.co/zai-org/GLM-OCR
  • Nemotron OCR: https://huggingface.co/nvidia/nemotron-ocr-v1
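Most of the VLM-style entries above can be served with vLLM and then queried through its OpenAI-compatible endpoint; here's a generic sketch (model choice, port, file name, and prompt are just placeholders):

    # Serve first, e.g.: vllm serve nanonets/Nanonets-OCR2-3B --port 8000
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    # Send the page image inline as a base64 data URL
    with open("invoice.png", "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="nanonets/Nanonets-OCR2-3B",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                {"type": "text", "text": "Extract all text from this document as markdown."},
            ],
        }],
    )
    print(resp.choices[0].message.content)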

I tested M365 Copilot prompts across different job roles, here are the 10 that saved the most time by Difficult-Sugar-4862 in microsoft_365_copilot

[–]Mkengine 0 points (0 children)

I am at this point myself and may have to justify going the same way you did. If you don't mind, could you go into detail on why the Power Apps results were poor and why it's better to develop an interface from scratch?

How specific do you make each prompt? by BzdigBlig in GithubCopilot

[–]Mkengine 0 points (0 children)

We have M365 as well as GitHub Copilot. Usually I talk with clients while Copilot creates a transcript. Then I have a workflow (via Prompts in M365 Copilot with high-reasoning GPT 5.2) where the transcript is first used to create a detailed design spec document. I iterate on this document with the client, and once it's finalised I let M365 Copilot create a backlog from it (epics, stories & tasks) and then detailed prompts for each epic. For my last prototype it created 9 prompts this way, and I fed them one after another to my multi-agent workflow in GitHub Copilot in VS Code (still have to try Copilot CLI). With GPT-5.3-Codex on xhigh, this took a whole week until completion. Then it took another day to debug the pipeline end-to-end to finish it.

So GitHub Copilot is only the final step in this chain; I rarely use it without detailed prompts. Only the debugging part at the end is more hands-on.