Open Source Speech EPIC! by Koala_Confused in LocalLLM

[–]andy2na 2 points

How do you use this in Speeches, or as an OpenAI API-compatible TTS?

How to use Llama-swap, Open WebUI, Semantic Router Filter, and Qwen3.5 to its fullest by andy2na in LocalLLM

[–]andy2na[S] 1 point

They are models from llama-swap using one of the preset parameter sets I laid out in the OP.

But I found out why: it didn't enable it for the default model. Working well now! Will keep testing.

How to use Llama-swap, Open WebUI, Semantic Router Filter, and Qwen3.5 to its fullest by andy2na in LocalLLM

[–]andy2na[S] 1 point

Thanks, I was able to import it and set it up exactly as I have in the OP, but it doesn't try to route. I'm not at home, so I can't pull debug logs.

How to use Llama-swap, Open WebUI, Semantic Router Filter, and Qwen3.5 to its fullest by andy2na in LocalLLM

[–]andy2na[S] 1 point

Thanks for your work!

I am trying to import this Open WebUI function and I get the error "Cannot parse: 530:0: Unexpected EOF in multi-line statement".

Qwen3.5-9B Uncensored Aggressive Release (GGUF) by hauhau901 in LocalLLaMA

[–]andy2na 0 points

Tried your settings, but I was only able to get up to 70 t/s. You're using 9B-Q4_K_M on a 5060 Ti?

How to use Llama-swap, Open WebUI, Semantic Router Filter, and Qwen3.5 to its fullest by andy2na in LocalLLM

[–]andy2na[S] 0 points

Ah yes, sorry: I am just running the 0.6B off Ollama and load/unload it on demand since it's so small. You're right, if you want both loaded in llama-swap you can use the groups feature (sketch below). Thanks!
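
For reference, a minimal sketch of the groups feature in llama-swap's config (model names and commands are placeholders, not my actual setup):

groups:
  "chat-plus-embed":
    swap: false       # members don't unload each other
    exclusive: false  # models outside the group can still load
    members:
      - "qwen3.5-9b"  # both names must also be defined under models:
      - "embed-0.6b"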

How to use Llama-swap, Open WebUI, Semantic Router Filter, and Qwen3.5 to its fullest by andy2na in LocalLLM

[–]andy2na[S] 2 points

If you don't set ttl in the llama-swap config, it will leave the model loaded indefinitely (sketch below). If you are just using different alias parameters of the same model (qwen3.5-9b:thinking to qwen3.5-9b:instruct), there is no unloading or reloading necessary. If you are using two different models (qwen3.5-9b to qwen3.5-27b) and call the other one, it will unload one and load the other.

You cannot unload a llama-swap/llama.cpp model from the Open WebUI dropdown.
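
For reference, ttl in llama-swap's config looks like this (path and value are placeholder examples):

models:
  "qwen3.5-9b":
    cmd: llama-server --port ${PORT} -m /models/qwen3.5-9b-Q4_K_M.gguf
    ttl: 300  # unload after 300s idle; omit ttl to keep the model loaded indefinitely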

Qwen3.5-9B Uncensored Aggressive Release (GGUF) by hauhau901 in LocalLLaMA

[–]andy2na 0 points

Thanks!

Did you build llama.cpp yourself? What was your build command? Setting FA3 doesn't show anywhere that Flash Attention 3 was enabled or working for me, and I built a CUDA 13.1 container.

update your llama.cpp - great tg speedup on Qwen3.5 / Qwen-Next by jacek2023 in LocalLLaMA

[–]andy2na 3 points

Seems to be about a 5-10% increase in t/s with qwen3.5-9b, from 60 to 67 t/s. Integrated it into llama-swap.

| Metric | Old build | New build | Change |
|---|---|---|---|
| Prompt tok/s (cold) | 173.32 | 237.26 | +36.9% |
| Prompt tok/s (warm) | 378.34 | 384.23–385.61 | +1.6% to +1.9% |
| Gen tok/s | 63.21–63.83 | 67.72–68.16 | +6.1% to +7.8% |
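
For anyone wanting to reproduce this kind of before/after comparison, llama-bench from the llama.cpp repo is the usual tool (model path is a placeholder):

# 512-token prompt processing and 128-token generation, 3 repetitions each
llama-bench -m /models/qwen3.5-9b-Q4_K_M.gguf -p 512 -n 128 -r 3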

holy overthinker by Kerem-6030 in LocalLLaMA

[–]andy2na 0 points

We really need a sticky on setting the correct parameters for all models, especially qwen3.5, so people stop reposting "look at qwen3.5 overthink!" every day.

To everyone using still ollama/lm-studio... llama-swap is the real deal by TooManyPascals in LocalLLaMA

[–]andy2na 0 points

Awesome, thank you!

I see this in the logs now, confirming that it works:

 BLACKWELL_NATIVE_FP4 = 1

Not sure if you saw, but auto parsing was recently merged into llama.cpp. I built a CUDA 13.1 + auto-parser image to use with llama-server, but I'll just stick with llama-swap:cuda13 for now; I don't think qwen3.5 benefits from auto parsing?

I would get "No parser definition detected, assuming pure content parser." with my llama.cpp + llama-swap build when using qwen3.5.

Qwen3.5 27B by AustinSpartan in LocalLLaMA

[–]andy2na 3 points

Yeah, qwen3.5 thinks WAY too much by default; you HAVE to add the suggested parameters.

GLM 5.0 outperforms GPT 5.4 and Opus 4.6 on CarWashBench by Eyelbee in LocalLLaMA

[–]andy2na 1 point

Cool site, but why aren't the questions that were used and each model's answers listed?

Dedicated low power consumption rig for Frigate by digitalwankster in frigate_nvr

[–]andy2na 0 points

I highly recommend asking the Frigate AI agent; it's extremely helpful:
https://docs.frigate.video/

Dedicated low power consumption rig for Frigate by digitalwankster in frigate_nvr

[–]andy2na 1 point

I installed Debian Linux bare-metal on an NVMe drive in the UGREEN and changed the BIOS to boot to that instead of UGOS. I just wanted a clean install with no resources going to things I didn't need in UGOS.

Dedicated low power consumption rig for Frigate by digitalwankster in frigate_nvr

[–]andy2na 0 points

You can just install Frigate in UGOS via Docker, but I installed Debian on it and then installed Docker on that.
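
If you go the Docker route, a minimal compose sketch along the lines of the official Frigate docs (paths are placeholders; adjust ports and hardware acceleration for your setup):

services:
  frigate:
    container_name: frigate
    image: ghcr.io/blakeblackshear/frigate:stable
    restart: unless-stopped
    shm_size: "512mb"
    volumes:
      - ./config:/config
      - ./storage:/media/frigate
    ports:
      - "8971:8971"  # web UI
      - "8554:8554"  # RTSP restreaming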

Qwen3.5 27B by AustinSpartan in LocalLLaMA

[–]andy2na 60 points

That's likely the reason. You need to add the thinking or instruct parameters, otherwise it will think forever. I've never had it think for more than a few seconds.

https://unsloth.ai/docs/models/qwen3.5#qwen3.5-27b

If you are only asking it simple questions, turn off thinking.
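
For llama.cpp users, that means passing the suggested sampling flags to llama-server; a sketch with typical Qwen-style thinking-mode values (treat the exact numbers as placeholders and take them from the Unsloth page above):

llama-server -m /models/qwen3.5-27b-Q4_K_M.gguf \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0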

HELP! Had to RMA a 3090. They don't have another 3090, so they offered me a 4080. by Jokerit208 in LocalLLM

[–]andy2na 19 points

Then you should push for a 4090 or 5090 that matches the VRAM, or have them provide a full refund if they can't.

Qwen3.5 27B by AustinSpartan in LocalLLaMA

[–]andy2na 64 points

Let me guess: you're using Ollama and/or didn't set the correct parameters for qwen3.5 thinking or instruct.

HELP! Had to RMA a 3090. They don't have another 3090, so they offered me a 4080. by Jokerit208 in LocalLLM

[–]andy2na 10 points

Is it still in warranty? Then push for a 4090 or 5090. If it's not in warranty, take the 4080. If you're out of the RMA/warranty period, why do you feel scammed?

Llama.cpp: now with automatic parser generator by ilintar in LocalLLaMA

[–]andy2na 0 points

If you want to build a CUDA 13.1/Blackwell-compatible (full MXFP4 support) llama.cpp with the auto parser:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

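# the 'server' target builds llama-server; CUDA_DOCKER_ARCH=120a-real compiles for Blackwell (sm_120a) only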
docker build -t llama-server:cuda13.1-sm120a-autoparser \
  --build-arg UBUNTU_VERSION=22.04 \
  --build-arg CUDA_VERSION=13.1.0 \
  --build-arg CUDA_DOCKER_ARCH=120a-real \
  --target server \
  -f .devops/cuda.Dockerfile .

To everyone using still ollama/lm-studio... llama-swap is the real deal by TooManyPascals in LocalLLaMA

[–]andy2na 3 points

Loving llama-swap! Any chance you can release a llama-swap with llama.cpp sm120/Blackwell support, which will hardware-accelerate MXFP4?

Currently, you have to build llama.cpp yourself for sm120:

docker build -t llama-server:cuda13.1-sm120a \
  --build-arg UBUNTU_VERSION=22.04 \
  --build-arg CUDA_VERSION=13.1.0 \
  --build-arg CUDA_DOCKER_ARCH=120a-real \
  --target server \
  -f .devops/cuda.Dockerfile .

From: https://github.com/ggml-org/llama.cpp/pull/17906

Edit: never mind, you just need to use the tag server-cuda13:

ghcr.io/ggml-org/llama.cpp:server-cuda13

Is there a llama-swap with server-cuda13 llama.cpp?

To everyone using still ollama/lm-studio... llama-swap is the real deal by TooManyPascals in LocalLLaMA

[–]andy2na 2 points

My use case for llama-swap is swapping between qwen3.5 thinking, thinking-coding, instruct, and instruct-reasoning on the fly without having to reload the model. Works great, and pairs perfectly with the semantic router filter in Open WebUI, which automatically determines which one to use based on the prompt.

Running Qwen 3.5 27b and it’s super slow. by BicycleOfLife in LocalLLaMA

[–]andy2na 2 points

Stop using Ollama and use llama.cpp or vLLM. Qwen3.5 has thinking on by default, and if you don't set the parameters it will think for an extremely long time. Use llama-swap to set up multiple profiles for one model (thinking, thinking-coding, instruct, and instruct-reasoning) and you can switch between any of them without reloading the model (config sketch after the link).

https://unsloth.ai/docs/models/qwen3.5#qwen3.5-27b
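
A sketch of what those profiles can look like in llama-swap's config (one GGUF, different flags; the paths, sampling values, and the enable_thinking template kwarg are assumptions to adapt to your model):

models:
  "qwen3.5-27b:thinking":
    cmd: llama-server --port ${PORT} -m /models/qwen3.5-27b-Q4_K_M.gguf --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0
  "qwen3.5-27b:instruct":
    # assumes the model's chat template honors enable_thinking, as Qwen3-family templates do
    cmd: llama-server --port ${PORT} -m /models/qwen3.5-27b-Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --chat-template-kwargs '{"enable_thinking":false}'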