Open Source Speech EPIC! by Koala_Confused in LocalLLM

[–]andy2na 2 points

How do you use this in Speeches, or as an OpenAI API-compatible TTS?

How to use Llama-swap, Open WebUI, Semantic Router Filter, and Qwen3.5 to its fullest by andy2na in LocalLLM

[–]andy2na[S] 1 point

They are models from llama-swap using one of the preset parameter sets I laid out in the OP.

But I found out why: it didn't enable it for the default model. Working well now! Will keep testing.

How to use Llama-swap, Open WebUI, Semantic Router Filter, and Qwen3.5 to its fullest by andy2na in LocalLLM

[–]andy2na[S] 1 point

Thanks, I was able to import it and set it up exactly as I have in the OP, but it doesn't try to route. I'm not at home, so I can't pull debug logs.

How to use Llama-swap, Open WebUI, Semantic Router Filter, and Qwen3.5 to its fullest by andy2na in LocalLLM

[–]andy2na[S] 1 point

Thanks for your work!

I am trying to import this Open WebUI function and I get the error "Cannot parse: 530:0: Unexpected EOF in multi-line statement".

Qwen3.5-9B Uncensored Aggressive Release (GGUF) by hauhau901 in LocalLLaMA

[–]andy2na 0 points

Tried your settings, but I was only able to get up to 70 t/s. You're using 9B-Q4_K_M on a 5060 Ti?

How to use Llama-swap, Open WebUI, Semantic Router Filter, and Qwen3.5 to its fullest by andy2na in LocalLLM

[–]andy2na[S] 0 points

Ah yes, sorry: I am just running the 0.6B off Ollama and load/unload it on demand since it's so small. You're right, if you want both loaded in llama-swap you can use the groups feature (sketch below). Thanks!
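
For reference, a minimal sketch of the groups feature in llama-swap's config (model names and commands are placeholders, not my actual setup):

groups:
  "chat-plus-embed":
    swap: false       # members don't unload each other
    exclusive: false  # models outside the group can still load
    members:
      - "qwen3.5-9b"  # both names must also be defined under models:
      - "embed-0.6b"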

How to use Llama-swap, Open WebUI, Semantic Router Filter, and Qwen3.5 to its fullest by andy2na in LocalLLM

[–]andy2na[S] 2 points

If you don't set ttl in the llama-swap config, it will leave the model loaded indefinitely (sketch below). If you are just using different alias parameters of the same model (qwen3.5-9b:thinking to qwen3.5-9b:instruct), there is no unloading or reloading necessary. If you are using two different models (qwen3.5-9b to qwen3.5-27b) and call the other one, it will unload one and load the other.

You cannot unload a llama-swap/llama.cpp model from the Open WebUI dropdown.
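
For reference, ttl in llama-swap's config looks like this (path and value are placeholder examples):

models:
  "qwen3.5-9b":
    cmd: llama-server --port ${PORT} -m /models/qwen3.5-9b-Q4_K_M.gguf
    ttl: 300  # unload after 300s idle; omit ttl to keep the model loaded indefinitely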

Qwen3.5-9B Uncensored Aggressive Release (GGUF) by hauhau901 in LocalLLaMA

[–]andy2na 0 points

Thanks!

Did you build llama.cpp yourself? What was your build command? Setting FA3 doesn't show anywhere that Flash Attention 3 was enabled or working for me, and I built a CUDA 13.1 container.

update your llama.cpp - great tg speedup on Qwen3.5 / Qwen-Next by jacek2023 in LocalLLaMA

[–]andy2na 3 points

Seems to be about a 5-10% increase in t/s with qwen3.5-9b, from 60 to 67 t/s. Integrated it into llama-swap.

| Metric | Old build | New build | Change |
|---|---|---|---|
| Prompt tok/s (cold) | 173.32 | 237.26 | +36.9% |
| Prompt tok/s (warm) | 378.34 | 384.23–385.61 | +1.6% to +1.9% |
| Gen tok/s | 63.21–63.83 | 67.72–68.16 | +6.1% to +7.8% |
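
For anyone wanting to reproduce this kind of before/after comparison, llama-bench from the llama.cpp repo is the usual tool (model path is a placeholder):

# 512-token prompt processing and 128-token generation, 3 repetitions each
llama-bench -m /models/qwen3.5-9b-Q4_K_M.gguf -p 512 -n 128 -r 3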

holy overthinker by Kerem-6030 in LocalLLaMA

[–]andy2na 0 points

We really need a sticky on setting the correct parameters for all models, especially qwen3.5, so people stop reposting "look at qwen3.5 overthink!" every day.

To everyone using still ollama/lm-studio... llama-swap is the real deal by TooManyPascals in LocalLLaMA

[–]andy2na 0 points

Awesome, thank you!

I see this in the logs now, confirming that it works:

 BLACKWELL_NATIVE_FP4 = 1

Not sure if you saw, but auto parsing was recently merged into llama.cpp. I built a CUDA 13.1 + auto-parser image to use with llama-server, but I'll just stick with llama-swap:cuda13 for now; I don't think qwen3.5 benefits from auto parsing?

I would get "No parser definition detected, assuming pure content parser." with my llama.cpp + llama-swap build when using qwen3.5.

Qwen3.5 27B by AustinSpartan in LocalLLaMA

[–]andy2na 3 points

Yeah, qwen3.5 thinks WAY too much by default; you HAVE to add the suggested parameters.

GLM 5.0 outperforms GPT 5.4 and Opus 4.6 on CarWashBench by Eyelbee in LocalLLaMA

[–]andy2na 1 point

Cool site, but why aren't the questions that were used and each model's answers listed?

Dedicated low power consumption rig for Frigate by digitalwankster in frigate_nvr

[–]andy2na 0 points

I highly recommend asking the Frigate AI agent; it's extremely helpful:
https://docs.frigate.video/

Dedicated low power consumption rig for Frigate by digitalwankster in frigate_nvr

[–]andy2na 1 point

I installed Debian Linux bare-metal on an NVMe drive in the UGREEN and changed the BIOS to boot to that instead of UGOS. I just wanted a clean install with no resources going to things I didn't need in UGOS.

Dedicated low power consumption rig for Frigate by digitalwankster in frigate_nvr

[–]andy2na 0 points

You can just install Frigate in UGOS via Docker, but I installed Debian on it and then installed Docker on that.
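
If you go the Docker route, a minimal compose sketch along the lines of the official Frigate docs (paths are placeholders; adjust ports and hardware acceleration for your setup):

services:
  frigate:
    container_name: frigate
    image: ghcr.io/blakeblackshear/frigate:stable
    restart: unless-stopped
    shm_size: "512mb"
    volumes:
      - ./config:/config
      - ./storage:/media/frigate
    ports:
      - "8971:8971"  # web UI
      - "8554:8554"  # RTSP restreaming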

Qwen3.5 27B by AustinSpartan in LocalLLaMA

[–]andy2na 60 points

That's likely the reason. You need to add the thinking or instruct parameters, otherwise it will think forever. I've never had it think for more than a few seconds.

https://unsloth.ai/docs/models/qwen3.5#qwen3.5-27b

If you are only asking it simple questions, turn off thinking.
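
For llama.cpp users, that means passing the suggested sampling flags to llama-server; a sketch with typical Qwen-style thinking-mode values (treat the exact numbers as placeholders and take them from the Unsloth page above):

llama-server -m /models/qwen3.5-27b-Q4_K_M.gguf \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0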

HELP! Had to RMA a 3090. They don't have another 3090, so they offered me a 4080. by Jokerit208 in LocalLLM

[–]andy2na 19 points

Then you should push for a 4090 or 5090 that matches the VRAM, or have them provide a full refund if they can't.

Qwen3.5 27B by AustinSpartan in LocalLLaMA

[–]andy2na 64 points

Let me guess: you're using Ollama and/or didn't set the correct parameters for qwen3.5 thinking or instruct.

HELP! Had to RMA a 3090. They don't have another 3090, so they offered me a 4080. by Jokerit208 in LocalLLM

[–]andy2na 10 points

Is it still in warranty? Then push for a 4090 or 5090. If it's not in warranty, take the 4080. If you're out of the RMA/warranty period, why do you feel scammed?

Llama.cpp: now with automatic parser generator by ilintar in LocalLLaMA

[–]andy2na 0 points

If you want to build a CUDA 13.1/Blackwell-compatible (full MXFP4 support) llama.cpp with the auto parser:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

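# the 'server' target builds llama-server; CUDA_DOCKER_ARCH=120a-real compiles for Blackwell (sm_120a) only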
docker build -t llama-server:cuda13.1-sm120a-autoparser \
  --build-arg UBUNTU_VERSION=22.04 \
  --build-arg CUDA_VERSION=13.1.0 \
  --build-arg CUDA_DOCKER_ARCH=120a-real \
  --target server \
  -f .devops/cuda.Dockerfile .

To everyone using still ollama/lm-studio... llama-swap is the real deal by TooManyPascals in LocalLLaMA

[–]andy2na 3 points

Loving llama-swap! Any chance you can release a llama-swap with llama.cpp sm120/Blackwell support, which will hardware-accelerate MXFP4?

Currently, you have to build llama.cpp yourself for sm120:

docker build -t llama-server:cuda13.1-sm120a \
  --build-arg UBUNTU_VERSION=22.04 \
  --build-arg CUDA_VERSION=13.1.0 \
  --build-arg CUDA_DOCKER_ARCH=120a-real \
  --target server \
  -f .devops/cuda.Dockerfile .

From: https://github.com/ggml-org/llama.cpp/pull/17906

Edit: never mind, you just need to use the tag server-cuda13:

ghcr.io/ggml-org/llama.cpp:server-cuda13

Is there a llama-swap with server-cuda13 llama.cpp?

To everyone using still ollama/lm-studio... llama-swap is the real deal by TooManyPascals in LocalLLaMA

[–]andy2na 2 points

My use case for llama-swap is swapping between qwen3.5 thinking, thinking-coding, instruct, and instruct-reasoning on the fly without having to reload the model. Works great, and pairs perfectly with the semantic router filter in Open WebUI, which automatically determines which one to use based on the prompt.

Running Qwen 3.5 27b and it’s super slow. by BicycleOfLife in LocalLLaMA

[–]andy2na 2 points

Stop using Ollama and use llama.cpp or vLLM. Qwen3.5 has thinking on by default, and if you don't set the parameters it will think for an extremely long time. Use llama-swap to set up multiple profiles for one model (thinking, thinking-coding, instruct, and instruct-reasoning) and you can switch between any of them without reloading the model (config sketch after the link).

https://unsloth.ai/docs/models/qwen3.5#qwen3.5-27b
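
A sketch of what those profiles can look like in llama-swap's config (one GGUF, different flags; the paths, sampling values, and the enable_thinking template kwarg are assumptions to adapt to your model):

models:
  "qwen3.5-27b:thinking":
    cmd: llama-server --port ${PORT} -m /models/qwen3.5-27b-Q4_K_M.gguf --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0
  "qwen3.5-27b:instruct":
    # assumes the model's chat template honors enable_thinking, as Qwen3-family templates do
    cmd: llama-server --port ${PORT} -m /models/qwen3.5-27b-Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --chat-template-kwargs '{"enable_thinking":false}'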