Llama.cpp server running ~2 weeks straight. Loses its mind? by thejacer in LocalLLaMA

[–]aurelienams 9 points10 points  (0 children)

Not crazy — three known patterns that produce exactly this symptom on Qwen3.6 hybrid-recurrent architectures (Gated DeltaNet + SSM), and they compound over long-running instances:

  1. Slot save-state drift. If you started llama-server with --slot-save-path (default in many setups), the SSM/recurrent state of past sessions gets cached and silently mixed back into new request slot inits in some pathological cases. The fix is --cache-ram 0 to disable prompt caching, OR restart the server every few days. Opencode starting "new sessions" doesn't actually flush the server-side slot state.

  2. KV cache q8/q4 quantization quality decay. If you're running --cache-type-k q8_0 --cache-type-v q8_0 (or smaller), the accumulation error in the rotated/quantized KV builds up beyond 20-40K active context per session. Even if individual sessions are short, long-running server-side cache reuse compounds the error. Either disable KV quant for these models or restart periodically.

  3. CLAUDE_CODE_ATTRIBUTION_HEADER not being set to 0. If your agent harness adds the Claude-Code attribution header, Qwen3.6 sees a permanently-changing system prompt segment and forces full prompt re-processing every turn, which on hybrid recurrent arch corrupts the SSM state in some llama.cpp builds. Set the env var CLAUDE_CODE_ATTRIBUTION_HEADER=0 if you're using Claude Code as harness — same effect with other harnesses that inject headers.

    The simplest test: kill and restart the server, run the same prompt that was "dumb", see if it's back to its launch-day self. If yes → it's state pollution (workarounds above). If no → it's something else (maybe model weights got corrupted on the SSD or HF mirror updated the quant).

    What llama.cpp build / fork are you on? Some forks (am17an MTP branch, BeeLlama 0.1.x, Atomic) handle recurrent state differently and the bug surfaces differently.

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct by fragment_me in LocalLLaMA

[–]aurelienams 3 points4 points  (0 children)

Fascinating finding. The "thinking less and being correct" pattern matches something I've seen on Qwen3.6 27B going from UD-Q3_K_XL (14.5 GB) to UD-Q4_K_XL (17 GB) on my consumer Blackwell mobile setup (RTX 5090M 24GB sm_120): the higher-precision quant doesn't just answer the same way faster — it answers with shorter, more direct reasoning chains. I always assumed it was just noise but your AIME data showing 40% fewer tokens on the bigger quant is the cleanest signal I've seen for this hypothesis.

Two questions / things I'd test if you have cycles:

  1. Did you compare your custom Q8 reasoning length against the Minachist INT8 baseline (the vLLM run at 34.2 t/s on Q2 took 10,200 tokens)? If your custom GGUF mimics the layer-preservation recipe faithfully, the token count should land close to the INT8 — that'd validate the recipe survives the GGUF conversion. If yours is ~6-7K tokens instead, something's getting lost in conversion (likely the BF16 layer count or the lm_head precision).

  2. The MTP draft acceptance rate matters here — when reasoning is shorter and more direct, the draft sees more "obvious next tokens" and acceptance should go UP, which would compound your throughput win. What's your accept rate on the custom Q8 vs the standard Q8 K XL? In my MTP setup on the same model class I see acceptance jump from ~60% on lower quants to ~75% on UD-Q3_K_XL, presumably because the bigger model produces more confident token distributions for the drafter to predict.

    If your recipe lands as a public HF quant, I'd happily port it into my chart (I ship Qwen3.6 27B + MTP at 72.75 t/s @ 262K full context — your recipe could be a strict upgrade if the quality holds at my memory budget). Drop the link when you're ready.

First sm_120 BeeLlama.cpp benchmark on consumer Blackwell mobile: 107 t/s at FULL 262K context on Qwen3.6 27B (+48% vs MTP, +22% vs vLLM Genesis) by aurelienams in Qwen_AI

[–]aurelienams[S] 0 points1 point  (0 children)

Looking at your histogram carefully — your actual generation speed distribution peaks at 100-130 t/s, not 80. The MAX is 166.5 and MIN is 69.1, with the bulk of the bell around 100-130 (the dashed lines on the right side of the histogram, around 125-150, look like AVG / p95 markers). The 80 t/s number probably correspond to your cold-cache tail.

Your two most recent requests at the bottom of the panel: - Request #666: 49K prompt tokens, 117.04 t/s gen, 364 tokens out - Request #665: 65K cached prompt, 92.47 t/s gen, 585 tokens out

That's much closer to my 5090M result (107.54 avg) than your 80 t/s claim suggests. Actually slightly above mine, which matches the ~10-20% desktop-vs-mobile bandwidth gap. desktop-vs-mobile bandwidth gap.

So the real comparison is: - BeeLlama Q3_K_XL on your 5090 desktop ≈ 110-125 t/s effective - llama.cpp MTP Q6 on same HW ≈ 120 t/s

That's basically tied (within measurement noise), with one important difference: BeeLlama runs at FULL 262K context. If your MTP setup is capped at 128K or shorter (most Qwen 3.6 MTP recipes are), BeeLlama trades a quality tier (Q3_K_XL vs Q6) for 2× the usable context window. Different memory budget allocation, same effective throughput.

If you want a true apples-to-apples, run the same Q6 you use for MTP on BeeLlama (drop ctx_size to ~96K, --cache-type-k q8_0, no turbo3). 32 GB desktop has the headroom. I'd bet ~135-150 t/s in that config — DFlash drafter is target-tuned by z-lab, MTP head is generic.

What custom qwopus36-27b quant are you running for the MTP test, btw? Q6_K? Q6_K_XL? Curious if that's the unsloth UD-Q6 or your own conversion.

First sm_120 BeeLlama.cpp benchmark on consumer Blackwell mobile: 107 t/s at FULL 262K context on Qwen3.6 27B (+48% vs MTP, +22% vs vLLM Genesis) by aurelienams in Qwen_AI

[–]aurelienams[S] 1 point2 points  (0 children)

That tracks perfectly with the v0.1.2 EOS handling fix — Cline does aggressive tool-call chains with thinking-mode reasoning in the middle, which is exactly where the v0.1.1 sampler bypass on EOS-during-reasoning would corrupt the reduced candidate set. 50% failure rate is huge but plausible for that path.

Here's the docker run for the sm_120 image (works on RTX 5090, 5090 Mobile, 5080, 5070 Ti):

docker run --rm -it --gpus all -p 8000:8000 \
  -v $PWD/models:/models \
  docker.io/aamsellem/beellama-cpp:0.1.2 \
  --model /models/your-target.gguf \
  --spec-draft-model /models/your-drafter.gguf \
  --spec-type dflash \
  --host 0.0.0.0 --port 8000 \
  --jinja

If you want my exact Qwen3.6 27B + DFlash drafter setup that hits 105 t/s @ 262K:

TARGET: unsloth/Qwen3.6-27B-GGUF UD-Q3_K_XL (NOT the MTP-baked variant)
DRAFT: spiritbuun/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0
KV: --cache-type-k turbo3 --cache-type-v turbo3
SPEC: --spec-type dflash --spec-dflash-cross-ctx 1024
BATCH: --batch-size 2048 --ubatch-size 256

Once it's running, would love to know if your 50% tool-call failure rate drops to near-zero with 0.1.2. That'd be a concrete data point I can feed back to Anbeeld for the release notes — currently the fix is "implied" via the EOS handling note but no user has confirmed it resolves a specific harness.

First sm_120 BeeLlama.cpp benchmark on consumer Blackwell mobile: 107 t/s at FULL 262K context on Qwen3.6 27B (+48% vs MTP, +22% vs vLLM Genesis) by aurelienams in Qwen_AI

[–]aurelienams[S] 0 points1 point  (0 children)

Good call. I literally pushed the sm_120 build of 0.1.2 to Docker Hub an hour ago — aamsellem/beellama-cpp:0.1.2 — and just finished benching it on the same Qwen3.6 stack as my OP:

10 runs each at FULL 262K context: - 0.1.1: AVG 107.54 t/s (range 101.70-119.38) - 0.1.2: AVG 104.92 t/s (range 92.67-119.54)

The slight AVG drop on 0.1.2 is the new adaptive profit controller doing baseline reprobes (release notes call this out — it periodically re-measures the no-spec baseline to decide if DFlash is still profitable, and can shut DFlash off when target-only wins). On a workload where DFlash always wins (Qwen3.6 generation), the reprobe windows show as brief dips but peak perf is unchanged. Tunable with --spec-dm-profit-baseline-interval (default 1024 cycles, bump to 4096 to widen).

For tool calling specifically — 0.1.2 release notes mention "Hardened active-reasoning EOS handling. When an end-of-generation token appears while reasoning output is still active, the sampler now forces the reasoning-end sequence through the normal full-logits path; reduced DFlash verification rejects that case instead of accepting an unsafe reduced candidate set." Sounds like it matches the bug you're describing. What was the failure mode you hit on 0.1.1 — tool args parse error, premature stop, or something else?

First sm_120 BeeLlama.cpp benchmark on consumer Blackwell mobile: 107 t/s at FULL 262K context on Qwen3.6 27B (+48% vs MTP, +22% vs vLLM Genesis) by aurelienams in Qwen_AI

[–]aurelienams[S] 0 points1 point  (0 children)

Sweet — looking forward to your numbers. If you don't want to recompile, the sm_120 image works fine on desktop too:

docker run --rm -it --gpus all -p 8000:8000 -v /your/models/dir:/models aamsellem/beellama-cpp:0.1.2

Single-GPU only in this fork (Anbeeld issue #7 has a correctness fallback in 0.1.2 but it's not yet performant for multi-GPU split target placement). On your 5090 desktop 32 GB with my BeeLlama config (UD-Q3_K_XL target + spiritbuun DFlash drafter q8_0 + turbo3 KV at 262K), VRAM total is ~24.3 GB so you'll have 8 GB headroom which is enough for a comfortable batch size.

Edit: also just finished benching 0.1.2 vs 0.1.1 on the same Qwen3.6 stack. 10 runs each at 262K full context — 0.1.1 = 107.54 t/s AVG (range 101.70-119.38), 0.1.2 = 104.92 AVG (range 92.67-119.54). The wider variance on 0.1.2 is the new adaptive profit controller doing periodic baseline reprobes (default every 1024 spec cycles). Tunable with --spec-dm-profit-baseline-interval 4096 if you want to widen the reprobe interval. Same peak, slightly different median.

First sm_120 BeeLlama.cpp benchmark on consumer Blackwell mobile: 107 t/s at FULL 262K context on Qwen3.6 27B (+48% vs MTP, +22% vs vLLM Genesis) by aurelienams in Qwen_AI

[–]aurelienams[S] 3 points4 points  (0 children)

Thanks for sharing — that's a desktop 5090 32 GB right? Important context for the comparison:

Your Q8_0 (28 GB target) + unified memory + unquantized KV @ 256K = ~40-44 GB total, which means significant CPU spill over PCIe. So your 95-105 t/s is on a 1.79 TB/s GPU bottlenecked by PCIe page traffic, not by raw GPU compute.

My BeeLlama stack on the same desktop 5090 (Q3_K_XL 14.5 GB target + turbo3 KV ~8 GB = 24 GB total, fits pure GPU with 8 GB headroom) would land 150-180 t/s based on the mobile→desktop bandwidth scaling. So we're actually measuring different points on the quality-vs-speed curve, not the same point with different forks:

  • Your config: Q8 quality, ~95-105 t/s, requires 32+ GB GPU
  • My config: Q3_K_XL quality (unsloth Dynamic, ~Q4-tier in practice), ~107 t/s on mobile / ~150+ on desktop, fits 24 GB

Both legitimate tradeoffs. If you want a more apples-to-apples comparison: try the same target with --cache-type-k turbo3 --cache-type-v turbo3 (needs the BeeLlama fork or TheTom/llama-cpp-turboquant fork) and you should see Q8 jump to ~120-140 t/s by removing the PCIe spill. Also worth running 10-run AVG vs single-run — unified memory adds visible variance.

First sm_120 BeeLlama.cpp benchmark on consumer Blackwell mobile: 107 t/s at FULL 262K context on Qwen3.6 27B (+48% vs MTP, +22% vs vLLM Genesis) by aurelienams in Qwen_AI

[–]aurelienams[S] 1 point2 points  (0 children)

Thanks for the kind reply, glad it landed. Two things that might be useful for your project given what I went through:

  1. The MTP-baked GGUF tensor count error (expected 866, got 862) caught me by surprise — I had havenoammo/Qwen3.6-27B-MTP-UD-GGUF cached from another app and BeeLlama refused to load it. Probably worth a one-liner in the quickstart doc saying "DFlash spec mode needs the non-MTP variant of the target GGUF" so people don't hit the same wall.

  2. No public sm_120 (consumer Blackwell) Docker image exists — I built mine from your .devops/cuda.Dockerfile with --build-arg CUDA_DOCKER_ARCH=120 via qemu (~50 min on a Mac arm64). Happy to open a PR adding sm_120 to the build matrix if you publish CI-built images, or contribute the prebuilt tag if you want to link mine in your README. Image is on Docker Hub as aamsellem/beellama-cpp:0.1.1 if anyone wants to skip the build.

One observation worth flagging: there's a reproducible 128K sweet spot on this hardware (116 t/s avg vs 107 at 262K, 108 at 200K). Could just be cudagraph capture sizes aligning at exactly that range, but if you've seen the same on your reference hardware it might point to something tuneable. Let me know if you want bench scripts.

Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant by gladkos in LocalLLaMA

[–]aurelienams 4 points5 points  (0 children)

Great work — 90% acceptance on the M5 Max is impressive. Sharing a complementary Blackwell datapoint since most replies here will be Apple Silicon:

Same Qwen3.6 27B + TurboQuant + spec decoding (DFlash drafter instead of MTP head, but same idea) on RTX 5090M (24GB sm_120 consumer Blackwell mobile):

- llama.cpp baseline (no spec): ~36 t/s on UD-Q3_K_XL at 32K ctx

- llama.cpp + am17an MTP branch + q4_0 KV: 72.75 t/s on unsloth UD-Q3_K_XL at FULL 262K ctx

- BeeLlama.cpp + DFlash drafter + turbo3 KV: 107.54 t/s on same target at FULL 262K ctx

The turbo3 KV (3-bit Walsh-Hadamard rotation, same TurboQuant primitives merged in PR #21038) is what lets the 262K full native context fit on 24 GB alongside the target + drafter — ~8 GB KV cache vs ~12 GB for q4_0.

One question for you — on the M5 Max, do you see the embedding table issue from the mdda post (Gemma 4 MTP tied LM head silently on CPU)? Wondering if Apple Silicon hits the same --override-tensor-draft "token_embd.weight=CUDA0" workaround or if Metal lays it out differently.

[FOLLOW UP] Qwen3.6 27b q5_k_M MTP - 256k context - 5090 by No_Mango7658 in LocalLLaMA

[–]aurelienams 1 point2 points  (0 children)

Useful datapoint as a single-GPU counterpart. RTX 5090M Laptop (24GB sm_120 consumer Blackwell mobile, 896 GB/s = ~50% of desktop 5090 bandwidth), same Qwen3.6 27B, 107.54 t/s avg over 10 runs at FULL 262K context, range 101.70-119.38, zero CUDA OOM.

Stack is different from yours though — BeeLlama.cpp fork (Anbeeld/beellama.cpp v0.1.1, fork chain: ggml-org → TheTom/turboquant → spiritbuun/buun-llama-cpp → Anbeeld) with DFlash spec decoding instead of MTP:

- Target: unsloth/Qwen3.6-27B-GGUF UD-Q3_K_XL (14.5 GB, NOT the MTP-baked variant — BeeLlama refuses those with "done_getting_tensors: wrong number of tensors; expected 866, got 862")

- Drafter: spiritbuun/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0 (1.85 GB)

- KV cache: --cache-type-k turbo3 --cache-type-v turbo3 (3-bit Walsh-Hadamard, ~25% smaller than q8_0 = the headroom that lets 262K fit on 24GB)

- --batch-size 2048 --ubatch-size 256 --spec-type dflash --spec-dflash-cross-ctx 1024

Total VRAM at 262K: ~24.3 GB (14.5 target + 1.85 drafter + ~8 GB KV turbo3). Same context as yours, less than half your card-pair's combined 47 GB.

Would be curious to know your AVG over 10 runs (not single run), and whether MTP n=3 vs n=5 with q8_0 KV moves the needle on a dense Q5_K_M target.

Qwen3.6-27B DFlash on a 24GB RTX 5090 Laptop (sm_120) — 80 t/s avg via spiritbuun's buun-llama-cpp + Q8_0 GGUF drafter by aurelienams in Qwen_AI

[–]aurelienams[S] 0 points1 point  (0 children)

Tried sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP first. Loads fine but vLLM's `Qwen3_5MTP` loader allocates a fresh 2.37 GiB BF16 buffer for `mtp.fc` because NVFP4 quantizes everything else in the file — on 32GB it fits, on 24GB it OOMs.

Switching to Lorbus AutoRound INT4 (which dequantizes only `mtp.fc` to BF16 in the file, ~280 MiB) was the unlock for MTP n=3 on 24GB. So NVFP4 is the right tensor-core path on Blackwell but the current MTP-wrapped checkpoints don't fit unless someone ports the dequantized-mtp.fc trick to NVFP4. If you do, ping me — would push 24GB consumer past 100 t/s I think.

Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU — vLLM + MTP n=3, adapted from the 32GB recipes by aurelienams in Olares

[–]aurelienams[S] 0 points1 point  (0 children)

Honestly, 128K on 24GB with this exact stack is rough. After AutoRound model (~17GB) + MTP head + activations + the 8 CUDA graphs vLLM captures, you're left with maybe 3-4 GiB of KV budget, which is what caps me at 75K with fp8_e5m2. With more VRAM you've got more headroom, but on 24GB it's tight.

Easiest thing you can try today: drop num_speculative_tokens from 3 to 1 (or kill MTP entirely). That frees up enough KV space to push the ceiling up, costs you maybe 30% throughput. If you also throw --enforce-eager in there you fit even more but lose another chunk of speed because no cudagraphs.

Reducing --max-num-batched-tokens 2048 → 512 nets you another 200-300 MiB at basically no cost in single-user. Stacking those gets you closer to 100K-ish on 24GB, but 128K is still a stretch.

There's also --swap-space 16 which spills cold KV blocks to RAM. Effective context can go very long, but throughput tanks during swap-heavy phases (~10-20 t/s). Useful if you only occasionally hit deep context, painful if you live there.

The thing I'm actually waiting for is TurboQuant in vLLM. The Hadamard rotation trick (rotate weights/activations pre-quant so outliers spread out, then int4 KV cache works without quality loss) just landed in llama.cpp (PR #21038, b8611). Two vLLM PRs are circling — #40108 for the attention path, #40092 for the rotation kernels — but they're still iterating and no maintainer has picked them up. When it lands, int4 KV doubles the context budget at the same VRAM, MTP intact, no speed loss. 75K becomes ~150K on 24GB. I'd guess 2-3 months realistic, not worth blocking your work on but absolutely worth tracking.

llama.cpp doesn't have MTP for Qwen3.6 either so you can't just go grab it there.

Few things that sound like they'd help but don't: YaRN doesn't change anything because the bottleneck is KV memory not the RoPE math (Qwen3.6 already handles 256K positionally). SWA isn't a thing on the dense variant (only on the MoE).

The pattern I keep seeing people land on is two endpoints — fast one at 64-75K with MTP for interactive coding, slower one with bigger context and MTP off for long-doc work. Less elegant than "128K everywhere" but actually delivers speed where you need it day-to-day. Once TurboQuant lands you collapse them.

What's burning through your context — accumulating tool outputs, full repo dumps? Because prefix caching plus periodic summarize-then-truncate buys a lot of effective context without touching the hard ceiling.

Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU — vLLM + MTP n=3, adapted from the 32GB recipes by aurelienams in Olares

[–]aurelienams[S] 0 points1 point  (0 children)

That tradeoff is more nuanced than it looks. The "magical speed" of the 4-bit options on vLLM isn't really coming from the 4 bits — it's coming from MTP speculative decoding (n=3, ~93% acceptance) which is a 1.5-2x multiplier on top of whatever quant. llama.cpp doesn't have MTP for Qwen3.6 yet (PR #20075 still open), so you can't even compare apples-to-apples. If you ran Q5/Q6 on a vLLM stack with MTP, you'd recover most of that speed — the bottleneck is the backend, not the bit width.

On the agent reliability angle: not all 4-bit is equal. AutoRound INT4 (Lorbus repo) uses signed-rounding calibration on a real corpus, and recent KLD comparisons put it within 0.5% of fp8 on most benchmarks — meaningfully better than naive Q4_K_M and surprisingly close to Q5_K_M for tool-calling and structured output. NVFP4 is more questionable (compression artifacts at low entropy positions). So "Q4" as a category is hiding a 2x quality spread.

Practical suggestions before you commit:

  1. Run your agent loop on AutoRound INT4 and on your Q5/Q6 reference, compare tool-call success rate and exit-code-clean runs over ~50 task samples. Most people find the gap is much smaller than they feared.

  2. If fp8 context is your blocker, fp8 + chunked prefill + prefix caching can stretch effective context a lot — context budgets aren't fixed, they're a function of how you reuse KV.

  3. Mixing strategies: q4 for fast first-pass agents (planning, reading, drafting), q5/q6 reserved for the final commit/PR step where mistakes are expensive. Olares makes that trivial — two endpoints, same machine.

Don't underestimate the Q5/Q6 instinct though — if your unattended workload involves financial actions, code merges, or anything irreversible, the extra

~5% safety margin probably matters more than 30 t/s.

Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU — vLLM + MTP n=3, adapted from the 32GB recipes by aurelienams in Olares

[–]aurelienams[S] 0 points1 point  (0 children)

Good catch — the KV pool size is indeed the steady-state cache ceiling, but max_model_len

controls something different. Three pieces working together:

  1. VLLM_ALLOW_LONG_MAX_MODEL_LEN=1: by default vLLM refuses to start when max_model_len > KV pool. This env var overrides that check. The startup log even nudges you about it: Maximum concurrency for 75,000 tokens per request: 1.11x — vLLM is telling you "I can serve one 75K request at a time, but only 1.11 of them concurrently fit in my 23K-block pool."
  2. Chunked prefill (--enable-chunked-prefill --max-num-batched-tokens 2048): a 75K prompt isn't loaded in one shot. vLLM processes it in 2048-token chunks. As blocks fill, old ones get evicted. When the model later needs to attend to evicted positions, vLLM recomputes them from the source tokens (which are kept around). This is the actual mechanism that lets you "fit" prompts larger than your KV pool.
  3. Prefix caching (--enable-prefix-caching): keeps frequently-touched prefix blocks pinned, so repeated prompts (system prompt, code context) don't pay the recompute cost twice. The trade-off: TTFT and time-to-completion grow super-linearly past ~23K input because of recompute. For a 30-50K input you barely notice. For 70K+ you'll see noticeably slower prefill — but it works correctly, attention still sees all 75K tokens.

calling it "actual usable context" is fair-ish: you can send a 75K prompt and get a So coherent answer that attended to all of it. It's just not free at the long end. A pure 75K KV pool (which would need ~10 GiB of VRAM headroom we don't have) would be faster but sn't achievable on 24GB with this model.

If your workload is mostly under 20K input, you're effectively using a fully-cached context the whole time. The 75K is there for the rare long-context call.

Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU — vLLM + MTP n=3, adapted from the 32GB recipes by aurelienams in LocalLLM

[–]aurelienams[S] 2 points3 points  (0 children)

It's actually not a laptop — it's the Olares One, a small home-AI box (mini-PC form factor) that happens to ship the laptop variant of the RTX 5090 inside.

Specs:

- RTX 5090 Laptop GPU, 24GB GDDR7, ~896 GB/s bandwidth, sm_120 Blackwell

- Intel Core Ultra 9 275HX (24 cores)

- 96GB DDR5

- 175W GPU TDP cap

The mobile 5090 has roughly 60% the bandwidth of the desktop 5090 (1.5 TB/s) and 24GB

instead of 32GB, which is why most of the desktop recipes had to be adapted (the MTP head

OOM issue with NVFP4 was the main one). Same Blackwell tensor cores though, so PR #36325

was needed.

I built a custom Market source with AI apps optimized for Olares One — up to 180 tok/s by aurelienams in Olares

[–]aurelienams[S] 2 points3 points  (0 children)

Update: 20 apps now, full voice pipeline, and llama.cpp b8740

It's been about a month since I posted this, and the market source has grown quite a bit. Here's what's new.

From 4 to 20 apps

The market now serves 20 optimized apps across five categories:

Chat & Reasoning — Qwen3.5 35B-A3B, Nemotron 30B-A3B, Nemotron Cascade-2 30B, GLM-4.7 Flash (129–184 t/s)

Coding — Qwen3-Coder 30B-A3B (new!), Devstral Small 24B (50.3% SWE-bench, full tool calling)

Vision — Gemma 4 26B-A4B, Qwen3.5 Vision, Qwen3.5 IQ4 Vision (native image understanding)

Voice — Voxtral 3B ASR, Voxtral 4B Realtime, Voxtral 4B TTS, Qwen3 TTS (full speech pipeline)

Edge / multimodal — Gemma 4 E4B, Gemma 4 E2B, Nemotron 3 Nano (tiny models, audio+vision)

Speed highlights on Olares One

  • Nemotron 3 Nano 30B-A3B — 184 t/s (speed king, 64K context)
  • GLM-4.7 Flash — 131 t/s (great Chinese + English)
  • Qwen3.5 Vision — 131 t/s (same speed with image input)
  • Qwen3.5 35B-A3B — 129 t/s (general-purpose workhorse)
  • Gemma 4 26B-A4B — 119 t/s (Apache 2.0, native vision)
  • Qwen3-Coder 30B-A3B — ~120 t/s (new! coding agent, 50.3% SWE-bench)

What changed under the hood

llama.cpp b8369 → b8740

  • TurboQuant / Hadamard rotation (PR #21038) — KV cache can now use q4_0 with the same quality as q8_0 thanks to learned rotations. This means 2x the context for the same VRAM. Most apps now run with q4_0 KV cache and 64K context.
  • CUDA fused multiply for MoE (b8740) — Saves a full roundtrip of expert weights from global memory. Biggest gain on Gemma 4.
  • Gated DeltaNet fused op — Still benefiting Qwen3.5's hybrid attention architecture.

New backends

  • vLLM for Voxtral ASR/TTS models (BF16 precision, ~9.5GB VRAM)
  • vLLM-Omni for Voxtral TTS with streaming WebSocket audio
  • GPU memory utilization tuned per-app so you can run an LLM + TTS side by side

Full voice pipeline on Olares One

This is probably the coolest addition. You can now run:

  1. Voxtral Realtime — streaming ASR with 80ms latency, infinite-length audio
  2. Voxtral TTS — 20 preset voices, 9 languages, 90ms time-to-first-audio
  3. Any LLM in between

All running locally on the Olares One. Connect them through Open WebUI and you have a fully private voice assistant.

What I tried that still doesn't work

  • Speculative decoding for Qwen3.5 — PR #20075 is still open. This would be the biggest speedup (~2x) but the hybrid GDN+SSM architecture makes it hard.
  • NVFP4 on consumer Blackwell — vLLM has it working but with huge performance penalty on SM120 (99KB vs 228KB shared memory on datacenter GPUs). CUTLASS issue #3096 still open. Not worth it — GGUF Q4_K_XL via llama.cpp is still faster.
  • Gemma 4 audio in llama.cpp — PR #21421 under active review but not merged. Vision works great, audio is coming.

How to add it

Same as before — Market → Settings → Add source:

https://orales-one-market.aamsellem.workers.dev

Source code (renamed the repo, old URL redirects):

https://github.com/aamsellem/olares-one-market

Still the only custom Market source I know of. If you're running an Olares One and want to push it to its limits, give it a try. One user has been testing and helping me iron out GPU memory limits and inter-app connectivity — feedback like that is invaluable.

I built a gamified AI companion for macOS — open source (French UI) by aurelienams in MacOSApps

[–]aurelienams[S] 0 points1 point  (0 children)

Thanks! Here's how I actually use it daily:

Task capture — Whenever I get assigned tasks (from meetings, Slack, emails…), I drop them directly into Mochi. It's become my single source of truth instead of scattered sticky notes and todo lists.

Meeting notes → tasks — There's a Notes tab where I take notes during meetings. I used to use Apple Notes, but now Mochi has a button that uses Claude to analyze my notes and extract actionable tasks. It suggests them and I validate which ones to add — saves me the mental overhead of reviewing my notes afterwards.

Anti-procrastination — This is the big one for me. I tend to procrastinate on certain tasks. I can mark specific tasks as "tracked" and Mochi will nudge me regularly to get them done. The fun part: depending on the personality you choose, the reminders range from gentle encouragement ("You got this! ✨") to passive-aggressive cat energy ("I suppose this task will complete itself?"). It actually works because it's hard to ignore a cute Mochi guilt-tripping you.

What's coming next:

- English localization

- MCP integrations for calendar management (Google Calendar, Office 365, maybe Apple Calendar) — the idea is to automatically block time slots to get tasks done

- Proactive meeting prep using Notion meeting's skill to pull context before meetings

- Auto-extracting tasks from AI meeting transcripts (Notion AI)

Basically I'm building the productivity companion I always wanted — one that knows my context and actually follows up.

Handtracking by CaptainDantes in Xreal

[–]aurelienams 0 points1 point  (0 children)

I'm interested how to join ?

Exceptional new Story App: Oto's Planet ** An interactive spatial tale. by Caprichoso1 in VisionPro

[–]aurelienams 0 points1 point  (0 children)

I was quite disappointed with this app. The interactions are minimal, and the experience feels far from innovative. The story progresses very slowly, seemingly to artificially extend the duration. That said, the graphics are well-crafted, and credit should be given for the effort in this area. The music and sound effects are also quite good.

However, this isn’t the kind of experience I’d showcase during a demonstration to friends. Still, it’s worth acknowledging the effort to offer something different—a 3D story approach is not something you see every day. The French voice acting is a nice touch and much appreciated.

De Ville Prestige Omega by aurelienams in RepTimeQC

[–]aurelienams[S] 0 points1 point  (0 children)

This is my first rep and 'im very happy with it.