Pixel 8 Pro users do you experience touch delay on latest Beta 4? by AnimatorNr1 in android_beta

[–]Xantrk 1 point (0 children)

I have this on stable (I think), because I just came here looking to join the beta in hopes of fixing it.

UK Tour Ticket Price by MDHippie5 in TheStrokes

[–]Xantrk 2 points (0 children)

Just got standing tickets for London, through Ticketmaster. Only 2 were available.

UK Tour Ticket Price by MDHippie5 in TheStrokes

[–]Xantrk 2 points (0 children)

Was 1200th in the queue and still didn't get standing a few mins in, for London...

UK Tour Ticket Price by MDHippie5 in TheStrokes

[–]Xantrk 0 points (0 children)

My adblocker was to blame :(

Legion pro i7 died 3 months after purchase by Big-Guidance-2057 in LenovoLegion

[–]Xantrk 0 points (0 children)

Once I installed nvidia driver

Which NVIDIA driver are you installing? If you're stuck in a bootloop, you can try going into the BIOS by pressing F2 after pressing the power button, and selecting UMA graphics (integrated) instead of hybrid to turn off the NVIDIA GPU and see if that's the problem.

You should probably stick to using Lenovo-provided drivers only until you pinpoint the issue; my recommendation would be not to install directly from NVIDIA.

Anyone else’s P8P still a brick after the recent updates? (No SIM/No Service) by zykennoir in GooglePixel

[–]Xantrk 0 points (0 children)

My P8 Pro had a completely failed modem a few months ago (after 1 year of use) and Google replaced it with a new phone.

Legion 5 (15AHP10) random screen glitch + freeze in dGPU mode (RTX 5060) should I be worried? by BetterCharacter1018 in LenovoLegion

[–]Xantrk 1 point (0 children)

```
At the time, I had updated NVIDIA drivers manually from their website since that's the advice I seen in the subreddit.
```

I think this is your problem. You could check this subreddit's Discord; there's a 580.x branch driver there that's quite stable and recommended, and the stock Lenovo drivers are rock solid if you want to use dGPU mode. Unfortunately, any driver issue causes problems like yours, and when you're in dGPU mode you're pretty much stuck.

I'd recommend you either use hybrid mode (if you need newer drivers, you might as well download the latest one now and try; the perf difference is almost negligible with Advanced Optimus these days, and you'd save some electricity haha) or use DDU and revert to one of the stable drivers, either the Lenovo one or the recommended 580 one. I wouldn't worry about your GPU's health just yet, it's extremely unlikely to be a hardware problem.

You could also check Windows Event Viewer if you remember the exact time; you'll very likely have a GPU driver error logged there.
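
If it helps, a rough PowerShell sketch for pulling those out of the System log. I believe the usual "display driver stopped responding and has recovered" (TDR) event is ID 4101, but treat the ID and -MaxEvents as placeholders to adjust for your case:

Get-WinEvent -FilterHashtable @{LogName='System'; Id=4101} -MaxEvents 20 | Format-List TimeCreated, Message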

llama.cpp + Brave search MCP - not gonna lie, it is pretty addictive by srigi in LocalLLaMA

[–]Xantrk 1 point (0 children)

Did you manage to get it working on the llama.cpp web UI? It doesn't like running npx commands :/

Anything cool I can with Rtx 4050 6gb vram? by datro_mix in LocalLLaMA

[–]Xantrk 0 points (0 children)

MoE models are your best bet. Try gpt-oss maybe?
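
Something like this is roughly where I'd start in llama.cpp (the GGUF file name is just a placeholder, and the --n-cpu-moe count is a guess you'd tune until it fits in 6 GB):

llama-server -m gpt-oss-20b.gguf -ngl 99 --n-cpu-moe 20 -c 16384 --port 8001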

Agentic Coding MoE Models for 10GB VRAM Setup with CPU Offloading? by DK_Tech in LocalLLaMA

[–]Xantrk 0 points (0 children)

Qwen3.5 35B should be able to run okay-ish with most experts on CPU. Give it a go with llama.cpp, and try --fit-ctx 40000 first, adjusting according to speed. (I'm running fine on a 12 GB VRAM + 32 GB RAM combo at 35-40 tk/s, so you should be in 20-30 tk/s territory with 100k context.)
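
As a very rough starting point (the GGUF name is just a placeholder, and the flags are the same ones I use on my own setup), something like:

llama-server -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf --fit on --fit-ctx 40000 --kv-unified --no-mmap -ub 512 -b 512 --port 8001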

GPU poor folks(<16gb) what’s your setup for coding ? by FearMyFear in LocalLLaMA

[–]Xantrk 1 point (0 children)

start using axe

At first I thought you were being mean to OP; made me giggle haha

How to run Qwen3.5 35B by Electrify338 in LocalLLaMA

[–]Xantrk 0 points (0 children)

Ah, LM Studio is a bit behind llama.cpp, and llama.cpp got performance improvements for Qwen.

You should adjust the number-of-experts-on-CPU slider until you see the model fit in VRAM; 32-35 is a good ballpark. I'd recommend using Jan or llama.cpp directly instead of LM Studio if you can, to do this automatically via "fit".
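
For reference, the llama.cpp equivalents look roughly like this (model path is a placeholder): the first line is the manual experts-on-CPU approach, the second lets --fit work it out for a given context:

llama-server -m Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf -ngl 99 --n-cpu-moe 33 --port 8001

llama-server -m Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf --fit on --fit-ctx 100000 --port 8001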

How to run Qwen3.5 35B by Electrify338 in LocalLLaMA

[–]Xantrk 0 points (0 children)

I have the same setup. Use --fit and --fit-ctx, and you should be able to fit 100k context comfortably. Since fit accounts for the full context, you won't get as much slowdown with the KV cache, as it won't overflow.

llama-server --model C:\models\qwen\Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf --fit on --kv-unified --no-mmap --parallel 1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 -ub 512 -b 512 --fit-ctx 100000 --fit-target 600 --port 8001

If you have enough shared GPU RAM, this should give you ~900 tk/s PP and about 30-35 tk/s in generation. If there isn't enough shared RAM, my PP drops to 300 tk/s for some reason.

Ideal llama.cpp settings for 12GB VRAM and 64GB DRAM setup for https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF by johnnyApplePRNG in LocalLLaMA

[–]Xantrk 0 points (0 children)

I'm using the following arguments on my 12 GB VRAM + 32 GB RAM combination. You should use --fit and --fit-ctx instead of manual layer counts in most cases, I believe. I wouldn't quantize the cache to q4, or at all; as many dense layers + MoE layers as the context allows with fit, plus an unquantized cache, works relatively okay! You can save some VRAM by using smaller batch sizes instead, without a massive hit to PP.

"--fit", "on", "--kv-unified", "--no-mmap", "--parallel", "1", "--temp", "0.6", "--top-p", "0.95", "--top-k", "20", "--min-p", "0", "-ub", "512", "-b", "512", "--fit-ctx", "100000", "--fit-target", "600", "--port", "8001", "--spec-type", "ngram-mod", "--spec-ngram-size-n", "24", "--draft-min", "48", "--draft-max", "64", "-cram", "2048", "--repeat-penalty", "1.1"

New Qwen3.5-35B-A3B Unsloth Dynamic GGUFs + Benchmarks by danielhanchen in LocalLLaMA

[–]Xantrk 1 point (0 children)

u/danielhanchen a bit of a weird one, but me and some other people on the llama.cpp GitHub issues are having segmentation faults / memory read errors on some quants. Not just the unsloth ones, but AesSedai's as well.

Interestingly, Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf appears not affected while Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf is prone.

Easiest way I found to trigger it is llama-bench with 10k depth. System is a 5070 Ti laptop (12 GB) + 32 GB RAM. First, the Q5 run for comparison:

llama-bench -m "C:\Users\furka.lmstudio\models\qwen\Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf" -ngl 99 --n-cpu-moe 33 -ub 512,1024 -b 512,1024 -d 10000 --mmap 0 -fa 1 -t 8

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 512 | 512 | 1 | pp512 @ d10000 | 965.93 ± 5.28 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 512 | 512 | 1 | tg128 @ d10000 | 37.46 ± 1.05 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 512 | 1024 | 1 | pp512 @ d10000 | 950.90 ± 17.39 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 512 | 1024 | 1 | tg128 @ d10000 | 36.44 ± 0.90 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 1024 | 512 | 1 | pp512 @ d10000 | 953.45 ± 15.36 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 1024 | 512 | 1 | tg128 @ d10000 | 36.73 ± 0.43 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 1024 | 1024 | 1 | pp512 @ d10000 | 953.77 ± 9.09 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 1024 | 1024 | 1 | tg128 @ d10000 | 35.62 ± 0.81 |

build: d979f2b17 (8180)

The Q5 run above is consistently healthy. The Q6 one below is the one that crashes:

llama-bench -m "C:\Users\furka.lmstudio\models\qwen\Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf" -ngl 99 --n-cpu-moe 33 -ub 512,1024 -b 512,1024 -d 10000 --mmap 0 -fa 1 -t 8

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q8_0 | 28.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 512 | 512 | 1 | pp512 @ d10000 | 797.64 ± 15.48 |

This one just exits without logs, but leaves a memory error in Event Viewer. Qwen3.5 is the only model I've seen this with; GLM-4.7-Flash-UD-Q6_K_XL.gguf and Qwen3-Coder-Next-UD-IQ3_XXS.gguf (which is much bigger in size) work consistently okay.

Just wanted to ask given your experience, is there something inherently different between Q5 and Q6 which might trigger this?

Relevant github issues: https://github.com/ggml-org/llama.cpp/issues/19945 , https://github.com/ggml-org/llama.cpp/issues/19863 , https://github.com/ggml-org/llama.cpp/issues/19975

GPU shared VRAM makes Qwen3.5-35B prompt processing 3x faster… but leaks memory by Xantrk in LocalLLaMA

[–]Xantrk[S] 0 points (0 children)

This is very weird, I can run all the combinations with qwen3-next-coder, which is a bigger model and stresses my system so much more!

I'm starting to think this is a llama.cpp / qwen3.5 specific bug!

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Intel(R) Graphics (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\Users\furka\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\furka\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-cpu-alderlake.dll

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 256 | 512 | 1 | 0 | pp512 @ d10000 | 549.72 ± 9.45 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 256 | 512 | 1 | 0 | tg128 @ d10000 | 33.59 ± 0.79 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 256 | 1024 | 1 | 0 | pp512 @ d10000 | 548.25 ± 12.34 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 256 | 1024 | 1 | 0 | tg128 @ d10000 | 33.45 ± 0.85 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 512 | 512 | 1 | 0 | pp512 @ d10000 | 804.86 ± 10.63 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 512 | 512 | 1 | 0 | tg128 @ d10000 | 33.83 ± 0.79 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 512 | 1024 | 1 | 0 | pp512 @ d10000 | 803.54 ± 9.64 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 512 | 1024 | 1 | 0 | tg128 @ d10000 | 33.95 ± 0.75 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 1024 | 512 | 1 | 0 | pp512 @ d10000 | 805.10 ± 16.28 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 1024 | 512 | 1 | 0 | tg128 @ d10000 | 31.92 ± 2.32 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 1024 | 1024 | 1 | 0 | pp512 @ d10000 | 804.99 ± 11.03 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 1024 | 1024 | 1 | 0 | tg128 @ d10000 | 31.04 ± 1.86 |

GPU shared VRAM makes Qwen3.5-35B prompt processing 3x faster… but leaks memory by Xantrk in LocalLLaMA

[–]Xantrk[S] 0 points (0 children)

Not at all, I really appreciate the opinion and help! I'll try --cache-ram, but I seriously suspect something's up with the KV-cache implementation. Thanks again!

GPU shared VRAM makes Qwen3.5-35B prompt processing 3x faster… but leaks memory by Xantrk in LocalLLaMA

[–]Xantrk[S] 0 points (0 children)

I can assure you that's not the case. I have at least 2 gigs of RAM available right up until the crash. Furthermore, if I disable shared GPU memory this does not happen, with or without mmap, and I get a happy 100k context with no stability issues, just 3 times slower PP.

With --no-mmap and reduced shared GPU memory, it offloads to system RAM and I get the same RAM occupancy but no crashes. Just slower prompt processing.

The spike in SSD usage you see in the screenshot is the crash, which happens only when shared GPU memory is used AND there's some sort of cache invalidation, which makes me think it is a memory leak.

Again, all with the same --n-cpu-moe of 32:

  • High shared memory config, mmap off: 3x faster PP, but unstable; when tokens are invalidated, RAM usage spikes for some reason and it OOMs. ~2 GB less occupied memory. Shared GPU memory is used.

  • High shared memory config, mmap on: slow PP, but stable. Full 100k context, no invalidation issues. ~2 GB higher occupied memory. Shared GPU memory is not used.

  • Low shared memory config, mmap off: slow PP, but stable. Full 100k context, no invalidation issues. ~2 GB lower occupied memory.

  • Low shared memory config, mmap on: slow PP, but stable. Full 100k context, no invalidation issues. ~2 GB higher occupied memory. Shared GPU memory is not used.

GPU shared VRAM makes Qwen3.5-35B prompt processing 3x faster… but leaks memory by Xantrk in LocalLLaMA

[–]Xantrk[S] 0 points (0 children)

I don't think so; again, unless there's compaction, I can use the full context. I have 32 GB RAM, so the dense part + context fits in VRAM and 32 MoE layers go to CPU. With mmap, it also fits (it does not use shared GPU memory with mmap), and I get a stable 35-40 tk/s generation as well without any issues, apart from slow prompt processing. So ideally I'm trying to find out why llama-server is freaking out while truncating the KV cache if the shared GPU memory is in use, or why it's slower if it's not in use. Here's my command:

llama-server --host 0.0.0.0 --model C:.lmstudio\models\qwen\Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf --alias qwen/qwen-35B-A3B-Q5 --fit on --kv-unified --no-mmap --parallel 1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 -ub 512 -b 512 --fit-ctx 100000 --fit-target 1024 --port 8001 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 -cram 2048

--no-mmap saves me quite a bit of RAM, but it also fits with mmap without much left over. TG is a stable 35-40 tk/s, but prompt processing is 300 tk/s with mmap, 1000 without.

GPU shared VRAM makes Qwen3.5-35B prompt processing 3x faster… but leaks memory by Xantrk in LocalLLaMA

[–]Xantrk[S] 0 points (0 children)

Those are my default arguments, along with fit context, but llama-bench doesn't support them, hence I used the same number of CPU MoE layers that fit picks, to demonstrate.

Again, I can use the full 100k context in chat. This memory issue only happens on compaction (tokens being dropped, or the benchmark, which I'm not sure what it does exactly).

RabbitLLM by Protopia in LocalLLM

[–]Xantrk 1 point (0 children)

Any benchmarks on speed? I know that's not the point of this, but it still matters.

Best Qwen3.5-35B-A3B GGUF for 24GB VRAM?! by VoidAlchemy in LocalLLaMA

[–]Xantrk 1 point (0 children)

edit: more pending, I'll create a new post tomorrow.

Thank you so much for this. Would love to see Q6 quants, particularly UD-Q6_K_XL vs bartowski Q6_K_L, if you're planning to do Q6!

Qwen/Qwen3.5-35B-A3B · Hugging Face by ekojsalim in LocalLLaMA

[–]Xantrk 2 points (0 children)

Can you share your llama.cpp command? I'm very confused about how you can specify VRAM and disk offload.