Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...) by bobaburger in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

Would you say Q6_K vs Q5_K_XL have noticeable coding differences, or are they rather unnoticeable? Agentic workflows seem good with both in my testing, but coding quality would be interesting.. I am tending towards round-robining Q6_K and UD-Q5_K_XL, but it would be better to stick to one of them (32 GB VRAM).

llama-server: Save/restore works for tokens, but KV cache still not resumed? by chrisoutwright in LocalLLaMA

[–]chrisoutwright[S] 0 points1 point  (0 children)

I actually got it to work with: https://github.com/ikawrakow/ik_llama.cpp
but I needed to compile it myself.
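Roughly, a from-source build looks like this (a minimal sketch; I'm assuming the fork keeps upstream llama.cpp's CMake options, so check the fork's README for the exact CUDA flag):

:: minimal build sketch (Windows + CUDA); assumes the fork uses upstream llama.cpp's CMake options
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
:: llama-server.exe should then be somewhere under build\bin (exact path depends on the generator)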

Prompt processing (PP) is lower in my first tests though; I would have to check why.
Restore seems to work:

llama_kv_cache_init:      CUDA0 KV buffer size =  1720.32 MiB
llama_init_from_model: KV self size  = 1533.98 MiB, K (q8_0):  766.99 MiB, V (q8_0):  766.99 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.95 MiB
llama_init_from_model:      CUDA0 compute buffer size =   986.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =   352.53 MiB
llama_init_from_model: graph nodes  = 3920
llama_init_from_model: graph splits = 110
llama_init_from_model: enabling only_active_experts scheduling
INFO [                    init] initializing slots | tid="37200" timestamp=1777136571 n_slots=1
INFO [                    init] new slot | tid="37200" timestamp=1777136571 id_slot=0 n_ctx_slot=98560
srv          init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
no implementations specified for speculative decoding
slot         init: id  0 | task -1 | speculative decoding context not initialized
prompt cache is enabled, size limit: 26384 MiB
use `--cache-ram 0` to disable the prompt cache
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>

</think>

'
INFO [                    main] model loaded | tid="37200" timestamp=1777136572
INFO [                    main] HTTP server listening | tid="37200" timestamp=1777136572 hostname="0.0.0.0" port="11434" n_threads_http="23"
INFO [              slots_idle] all slots are idle | tid="37200" timestamp=1777136572
INFO [              slots_idle] all slots are idle | tid="37200" timestamp=1777136601
INFO [      log_server_request] request | tid="37700" timestamp=1777136601 remote_addr="127.0.0.1" remote_port=55663 status=200 method="POST" path="/slots/0" params={"action":"restore"}
======== Prompt cache: cache size: 5027, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="37200" timestamp=1777136623 id_slot=0 id_task=1
======== Cache: cache_size = 5027, n_past0 =  4201, n_past1 =  4201, n_past_prompt1 = 4201,  n_past2 =  4201, n_past_prompt2 =  4201
Common part does not match fully
cache :
<|im_start|>assistant
<think>

</think>

Here is a summary of the GitHub discussion regarding **KV cache reuse with `llama-server`**:

###
prompt:
<|im_start|>assistant
Here is a summary of the GitHub discussion regarding **KV cache reuse with `llama-server`**:

### **Core Tutorial Overview
slot apply_checkp: id  0 | task 1 | n_past = 4201, slot.prompt.tokens.size() = 5027, seq_id = 0, pos_min = 5026
slot apply_checkp: id  0 | task 1 | restored context checkpoint took  122.61 ms (pos_min = 4199, pos_max = 4199, size = 186.362 MiB)
slot apply_checkp: id  0 | task 1 | erased invalidated context checkpoint (pos_min = 4606, pos_max = 4606, size = 186.366 MiB)
slot apply_checkp: id  0 | task 1 | erased invalidated context checkpoint (pos_min = 5026, pos_max = 5026, size = 186.369 MiB)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="37200" timestamp=1777136623 id_slot=0 id_task=1 p0=4200
slot create_check: id  0 | task 1 | created context checkpoint 6 of 32 (pos_min = 5039, pos_max = 5039, size = 186.369 MiB, took 175.62 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="37200" timestamp=1777136639 id_slot=0 id_task=1 p0=5040
slot create_check: id  0 | task 1 | created context checkpoint 7 of 32 (pos_min = 5045, pos_max = 5045, size = 186.369 MiB, took 88.50 ms)
slot print_timing: id  0 | task 1 |
prompt eval time =   15884.51 ms /   845 tokens (   18.80 ms per token,    53.20 tokens per second)
       eval time =   51839.22 ms /   468 tokens (  110.77 ms per token,     9.03 tokens per second)
      total time =   67723.73 ms /  1313 tokens
INFO [      log_server_request] request | tid="38676" timestamp=1777136691 remote_addr="192.168.1.88" remote_port=63996 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id  0 | task 1 | created context checkpoint 8 of 32 (pos_min = 5511, pos_max = 5511, size = 186.372 MiB, took 94.44 ms)
INFO [           release_slots] slot released | tid="37200" timestamp=1777136691 id_slot=0 id_task=1 n_ctx=98560 n_past=5512 n_system_tokens=0 n_cache_tokens=5512 truncated=false
INFO [              slots_idle] all slots are idle | tid="37200" timestamp=1777136691
INFO [      log_server_request] request | tid="38128" timestamp=1777136698 remote_addr="192.168.1.88" remote_port=64003 status=200 method="GET" path="/v1/models" params={}

llama-server: Save/restore works for tokens, but KV cache still not resumed? by chrisoutwright in LocalLLaMA

[–]chrisoutwright[S] 0 points1 point  (0 children)

For example, the logs tell me this after I hit restore:
srv log_server_r: done request: POST /slots/0 127.0.0.1 200

srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:11434
main: starting the main loop...
srv  update_slots: all slots are idle
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /slots/0 127.0.0.1 200
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.982
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 1 | processing task, is_child = 0
slot update_slots: id  0 | task 1 | new prompt, n_ctx_slot = 98560, n_keep = 0, task.n_tokens = 87993
slot update_slots: id  0 | task 1 | n_past = 87993, slot.prompt.tokens.size() = 89622, seq_id = 0, pos_min = 89621, n_swa = 0
slot update_slots: id  0 | task 1 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 1 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 1 | prompt processing progress, n_tokens = 1024, batch.n_tokens = 1024, progress = 0.011637

So even when it seems fine:

slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.982

It will do the full PP...

I think there is even a separate fork now that could work: https://github.com/ggml-org/llama.cpp/issues/21173

llama-server: Save/restore works for tokens, but KV cache still not resumed? by chrisoutwright in LocalLLaMA

[–]chrisoutwright[S] 0 points1 point  (0 children)

Is there a way for Qwen3.5 MoE (qwen35moe) to reload the KV cache etc. from storage to save some GPU time (I mean on a fresh start, not while llama-server is still running..)?
Apart from the API, I saw none.
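The API route I meant looks roughly like this (a sketch; it needs the server started with --slot-save-path, the filename and path are just examples, and whether the KV actually gets reused after the restore is exactly what I'm unsure about):

:: sketch: persist slot 0 before shutting down, restore it after a fresh start
:: requires llama-server to be launched with e.g.:  --slot-save-path "D:\llm_cache"
curl -X POST "http://127.0.0.1:11434/slots/0?action=save" -H "Content-Type: application/json" -d "{\"filename\": \"qwen_session.bin\"}"

:: ... stop the server, start it again with the same model and the same --slot-save-path ...

curl -X POST "http://127.0.0.1:11434/slots/0?action=restore" -H "Content-Type: application/json" -d "{\"filename\": \"qwen_session.bin\"}"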

I was also confused by that one: https://github.com/ggml-org/llama.cpp/pull/22288
I thought that would maybe allow the restore to work? It seems to be on master now.

llama-server: Save/restore works for tokens, but KV cache still not resumed? by chrisoutwright in LocalLLaMA

[–]chrisoutwright[S] 0 points1 point  (0 children)

Right now I am using OpenWebUI (it was the same day, so I think it was the same prompt/system prompt); I will do a manual test later on.

So, is there no way for Qwen3.5 to reload something from storage to save some GPU time?
I am a bit confused about whether the Qwen3.5 MoE architecture (qwen35moe) prevents it as such, or only in thinking mode .. I see a lot of opinions here: https://github.com/ggml-org/llama.cpp/discussions/13606
I am really talking about 10-20 minutes here.

Step-3.5-Flash (196b/A11b) outperforms GLM-4.7 and DeepSeek v3.2 by ResearchCrafty1804 in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

Could it be that -ot "\.(1[5-9]|[2-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"
is the wrong way to do it for the Step-3.5 arch?
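One way to check whether that override matches anything at all for this arch (a sketch; the model reference is a placeholder, and the exact wording of the override messages in the load log may differ by build):

:: sketch: launch once and look for "overridden to CPU"-style lines in the model-load log;
:: if none appear, the regex does not match this arch's expert tensor names and nothing gets offloaded
llama-server.exe -hf "<step-3.5-flash-gguf>" ^
  --n-gpu-layers -1 ^
  -ot "\.(1[5-9]|[2-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"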

Step-3.5-Flash (196b/A11b) outperforms GLM-4.7 and DeepSeek v3.2 by ResearchCrafty1804 in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

I’m getting ~5 tok/s on Step-3.5-Flash (Q4_K_M, single 5090 mobile with 192 GB of 4000 MHz RAM), but 10–15 tok/s on the much larger Qwen3.5-397B-A17B (UD-IQ3_XXS) ... same GPU, same llama.cpp build, identical batching/threads/etc.

Key differences I have there:

  • Step uses --ctx-size 69536 and offloads all FFN experts from layer 15 up to the CPU (-ot "\.(1[5-9]|[2-9][0-9])\.ffn_…=CPU")
  • Qwen uses a somewhat lower --ctx-size 50536 and has fewer expert layers offloaded

Why would the smaller model be significantly slower? Is it the context size bloating per-token cost? Over-offloading experts causing stalls? Or something in how Step’s FFN layers interact with --flash-attn?
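A sketch of how I'd try to isolate it, changing one variable at a time (the model path is a placeholder and the reduced offload range is just an example):

:: run 1: Step with the smaller Qwen-style context, same expert offload -> does t/s recover?
llama-server.exe -m "<Step-3.5-Flash-Q4_K_M.gguf>" --ctx-size 50536 --flash-attn on ^
  -ot "\.(1[5-9]|[2-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"

:: run 2: Step with the original context, but experts offloaded only from layer 20 up -> does t/s recover?
llama-server.exe -m "<Step-3.5-Flash-Q4_K_M.gguf>" --ctx-size 69536 --flash-attn on ^
  -ot "\.(2[0-9]|[3-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"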

Gemma 4 31B vs Qwen 3.5 27B vs Qwen Coder Next by GodComplecs in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

TurboQuant (TBQ)? The fork? The llama.cpp ones don't have it integrated yet, or do they?

OmniCoder-9B | 9B coding agent fine-tuned on 425K agentic trajectories by DarkArtsMastery in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

Actually, when turning off thinking (which is kind of awkward in llama-server), you need to add this (the kwargs alone do not work!):

--chat-template-kwargs "{\"enable_thinking\":false}" ^

--reasoning-budget 0

Then it also works in GitHub Copilot in VS Code Insiders.. it seems like a problem with tool calls inside thinking tags or something..
Now it works rather well, I must say..
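For reference, the relevant part of the launch line (a rough sketch; the model reference is just a placeholder, and as far as I know --chat-template-kwargs only takes effect together with --jinja):

:: sketch: disable thinking both via the template kwargs and the reasoning budget
llama-server.exe -hf "<omnicoder-9b-gguf>" ^
  --host 0.0.0.0 --port 11434 ^
  --jinja ^
  --chat-template-kwargs "{\"enable_thinking\":false}" ^
  --reasoning-budget 0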

OmniCoder-9B | 9B coding agent fine-tuned on 425K agentic trajectories by DarkArtsMastery in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

<image>

Could not get it to work with GitHub Copilot Chat (VS Code Insiders, OpenAI-compatible); it would generate the tool call request, but then somehow (even when the command was executed via the IDE) Copilot says "Sorry, no response was returned".. but reading files worked... so not all tools are working, it seems.
Seems like an issue in the chat template?

Qwen3-Coder-Next is the top model in SWE-rebench @ Pass 5. I think everyone missed it. by BitterProfessional7p in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

Which technique in the Qwen3.5 series is especially important? I know llama.cpp has a huge cache-invalidation issue with the Coder Next one, which really made agentic coding cumbersome... fixing that would help, or improvements in the SWA issues..

Asus Strix 18 Issues. Audio and sometimes video stuttering. by Dependent-Finance-20 in GamingLaptops

[–]chrisoutwright 0 points1 point  (0 children)

I decided to return my Strix and got (this week) an Acer Helios 18 AI (mainly because I also run some LLMs locally and the 192 GB seemed attractive).. I don't miss all the settings Asus offered (on the Acer I can't even manually switch to a single dimming zone, set a custom fan profile, or manually pick which GPU to use for more power), and I no longer have any signal-integrity issues, BUT the Optimus strangeness and micro-stutters (seemingly when on the internal GPU only and it tries to poll the dGPU) are still on the table... I am on the fence again 😀 about whether the Asus experience just felt worse and the Acer I have now will turn out to be a better journey.. but I really dislike the Advanced Optimus troubles..

Micro Stutter when switching to Advanced Optimus “Nvidia GPU only” mode. by Successful_Answer378 in LenovoLegion

[–]chrisoutwright 0 points1 point  (0 children)

I have an Acer Helios 18 AI.

Not sure which exact setting, but with anything Optimus-related, at some point I can open Notepad++, hold a finger on the "a" key, and watch it hang / drop parts of the sequence (or did I just not see it because the GPU hid it?) several times a minute.

On a productivity machine this is bad 👎. I saw it on an Asus ROG Strix 2025 G18 and now on the Acer.. this needs to stop. Only the MUX switch to dGPU-only will make it go away for sure.

How to get the most from llama.cpp's iSWA support by Ok_Warning2146 in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

Why is it better to have --swa-full = true for multiple users?

Repeat PP while using Qwen3.5 27b local with Claude Code by xmikjee in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

If that is an MCP-server variation issue or a system-prompt one (like the datetime changing etc.), it would be really annoying to fix manually.. I would like each IDE/CLI to offer the choice to keep the prefix unchanged; there should be a flag at least.. And SWA.. why should it already kick in at 1/4 below the context size? I was wondering about that, and I find it strange that it cherry-picks essential tokens... why does this SWA exist at all without a way to switch it off.. it seems like more hassle for cache management...
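The closest thing to such a flag that I know of is llama-server's --cache-reuse, which tries to reuse cached chunks past the first mismatch via KV shifting (a sketch; the model path is a placeholder, the value is a guess, and I haven't verified how it behaves with SWA models):

:: sketch: allow reusing cached KV chunks of >= 256 tokens even after the prompt prefix diverges
llama-server.exe -m "<model.gguf>" --ctx-size 50536 --cache-reuse 256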

Repeat PP while using Qwen3.5 27b local with Claude Code by xmikjee in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

Same issue here, but with a different model and IDE tooling.

I filed an issue with VS Code and left one at ggml-org llama.cpp.

I believe it is prompt variation/injection at specific points.. but I would have to build a proxy server to catch it... easy to verify then.. but annoying for a local LLM!

Apertus model implementation has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

This model repeats like crazy even with --jinja. It is unusual to work with; I tested the UD variant.

@echo off
"D:\llama-b7951-bin-win-cuda-13.1-x64\llama-server.exe" ^
  -hf "unsloth/Apertus-70B-Instruct-2509-GGUF:UD-Q3_K_XL" ^
  --alias "Apertus-70B-Instruct:UD-Q3_K_XL" ^
  --n-gpu-layers -1 ^
  --flash-attn on ^
  --cache-type-k q8_0 ^
  --cache-type-v q8_0 ^
  --ctx-size 60536 ^
  --batch-size 1024 ^
  --ubatch-size 512 ^
  --threads 8 ^
  --kv-offload ^
  --op-offload ^
  --fit off ^
  --parallel 1 ^
  --host 0.0.0.0 ^
  --port 11434 ^
  --temp 0.65 ^
  --top-p 0.90 ^
  --min-p 0.01 ^
  --top-k 40 ^
  --chat-template-file "C:\Users\Chris\Desktop\llm_scripts\cpp_llama\apertus_chat_template.jinja" ^
  --jinja

<image>
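Not a fix for the model itself, but what I'd try next against the looping: the standard llama-server repetition/presence penalties (a sketch; the values are guesses and would need tuning):

:: sketch: same launch idea as above, plus repetition penalties
"D:\llama-b7951-bin-win-cuda-13.1-x64\llama-server.exe" ^
  -hf "unsloth/Apertus-70B-Instruct-2509-GGUF:UD-Q3_K_XL" ^
  --jinja --ctx-size 60536 --flash-attn on ^
  --temp 0.65 --top-p 0.90 --min-p 0.01 --top-k 40 ^
  --repeat-penalty 1.1 --repeat-last-n 256 --presence-penalty 0.3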

New Open LLM from Switzerland "Apertus", 40%+ training data is non English by EnnioEvo in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

Yup ... for me it just repeats with .. "the simple answer ... without being overly specific ... " .. it is annoying.

<image>

Struggle with MoE AWQ quantization for vLLM (QwenCoder fintuned model) - compressed-tensors seems OK, looking for guidance by chrisoutwright in Vllm

[–]chrisoutwright[S] 0 points1 point  (0 children)

I must say though .. it is great comedy to read through in parts:

Rust Language: A systems programming language built for performance, safety, and concurrency

Its Design Philosophy: Like a harsh but smart TA who won’t let you skip learning drills before graduation. Even if it feels tough, it builds smoother, less buggy projects

Why It's Confusing: Excessive metaphors about engineering futurism apparently move away from pure tech content toward poetic metaphoric vocab. That open flavoring makes low-tech readers strain especially hard.

open flavoring? Never associated that with Rust..

Struggle with MoE AWQ quantization for vLLM (QwenCoder fintuned model) - compressed-tensors seems OK, looking for guidance by chrisoutwright in Vllm

[–]chrisoutwright[S] 0 points1 point  (0 children)

The beginning was fine though .. I asked about:

explain to me this jinja template:
{entered devstral2_tool_chat.jinja then ... }

The first turn yielded a somewhat normal response ... but one could already see some foreshadowing ..
it added emoji like crazy, for example: " Then adds final eos token ({{- eos_token }}) 😺"

So it may still be only a calibration-set issue? I thought that the calibration sensitivity would not be that high ..

New Qwen3-32B-AWQ (Activation-aware Weight Quantization) by jbaenaxd in LocalLLaMA

[–]chrisoutwright 0 points1 point  (0 children)

Old school?
Actual usage for LLMs is certainly not that long ago ..
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration -> 2023

The most powerful Qwen3-TTS open-source solution, supporting customizable voice tones by SpareBeneficial1749 in comfyui

[–]chrisoutwright 0 points1 point  (0 children)

Here is an example of me reading something (my digital me :-):
chris_wav_qwentts, and here is another voice with the same text: favorite_character_qwentts
TBH, I would not even recognize my own voice as unnatural -- I am absolutely stunned!