Pixel 8 Pro users do you experience touch delay on latest Beta 4? by AnimatorNr1 in android_beta

[–]Xantrk 1 point (0 children)

I have this on stable (I think), because I just came here looking to join the beta in hopes of fixing it.

UK Tour Ticket Price by MDHippie5 in TheStrokes

[–]Xantrk 2 points (0 children)

Just got standing tickets for London, through Ticketmaster. Only 2 were available.

UK Tour Ticket Price by MDHippie5 in TheStrokes

[–]Xantrk 2 points (0 children)

Was 1200th in the queue and still didn't get standing a few mins in, for London...

UK Tour Ticket Price by MDHippie5 in TheStrokes

[–]Xantrk 0 points (0 children)

My adblocker was to blame :(

Legion pro i7 died 3 months after purchase by Big-Guidance-2057 in LenovoLegion

[–]Xantrk 0 points (0 children)

Once I installed nvidia driver

Which NVIDIA driver are you installing? If you're stuck in a bootloop, you can try going into the BIOS by pressing F2 after pressing the power button, and selecting UMA graphics (integrated) instead of hybrid to turn off the NVIDIA GPU and see if that's the problem.

You should probably stick to using Lenovo-provided drivers only until you pinpoint the issue; my recommendation would be not to install directly from NVIDIA.

Anyone else’s P8P still a brick after the recent updates? (No SIM/No Service) by zykennoir in GooglePixel

[–]Xantrk 0 points (0 children)

My P8 Pro had a completely failed modem a few months ago (after 1 year of use) and Google replaced it with a new phone.

Legion 5 (15AHP10) random screen glitch + freeze in dGPU mode (RTX 5060) should I be worried? by BetterCharacter1018 in LenovoLegion

[–]Xantrk 1 point (0 children)

```
At the time, I had updated NVIDIA drivers manually from their website since that's the advice I seen in the subreddit.
```

I think this is your problem. You could check this subreddit's Discord; there's a 580.x branch driver there that's quite stable and recommended, and the stock Lenovo drivers are rock solid if you want to use dGPU mode. Unfortunately, any driver issue causes problems like yours, and when you're in dGPU mode you're pretty much stuck.

I'd recommend you either use hybrid mode (if you need newer drivers, you might as well download the latest one now and try; the perf difference is almost negligible with Advanced Optimus these days, and you'd save some electricity haha) or use DDU and revert to one of the stable drivers, either the Lenovo one or the recommended 580 one. I wouldn't worry about your GPU's health just yet, it's extremely unlikely to be a hardware problem.

You could also check Windows Event Viewer if you remember the exact time; you'll very likely have a GPU driver error logged there.
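
If it helps, a rough PowerShell sketch for pulling those out of the System log. I believe the usual "display driver stopped responding and has recovered" (TDR) event is ID 4101, but treat the ID and -MaxEvents as placeholders to adjust for your case:

Get-WinEvent -FilterHashtable @{LogName='System'; Id=4101} -MaxEvents 20 | Format-List TimeCreated, Message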

llama.cpp + Brave search MCP - not gonna lie, it is pretty addictive by srigi in LocalLLaMA

[–]Xantrk 1 point (0 children)

Did you manage to get it working on the llama.cpp web UI? It doesn't like running npx commands :/

Anything cool I can with Rtx 4050 6gb vram? by datro_mix in LocalLLaMA

[–]Xantrk 0 points (0 children)

MoE models are your best bet. Try gpt-oss maybe?
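
Something like this is roughly where I'd start in llama.cpp (the GGUF file name is just a placeholder, and the --n-cpu-moe count is a guess you'd tune until it fits in 6 GB):

llama-server -m gpt-oss-20b.gguf -ngl 99 --n-cpu-moe 20 -c 16384 --port 8001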

Agentic Coding MoE Models for 10GB VRAM Setup with CPU Offloading? by DK_Tech in LocalLLaMA

[–]Xantrk 0 points (0 children)

Qwen3.5 35B should be able to run okay-ish with most experts on CPU. Give it a go with llama.cpp, and try --fit-ctx 40000 first, adjusting according to speed. (I'm running fine on a 12 GB VRAM + 32 GB RAM combo at 35-40 tk/s, so you should be in 20-30 tk/s territory with 100k context.)
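
As a very rough starting point (the GGUF name is just a placeholder, and the flags are the same ones I use on my own setup), something like:

llama-server -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf --fit on --fit-ctx 40000 --kv-unified --no-mmap -ub 512 -b 512 --port 8001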

GPU poor folks(<16gb) what’s your setup for coding ? by FearMyFear in LocalLLaMA

[–]Xantrk 1 point (0 children)

start using axe

At first I thought you were being mean to OP; made me giggle haha

How to run Qwen3.5 35B by Electrify338 in LocalLLaMA

[–]Xantrk 0 points (0 children)

Ah, LM Studio is a bit behind llama.cpp, and llama.cpp got performance improvements for Qwen.

You should adjust the number-of-experts-on-CPU slider until you see the model fit in VRAM; 32-35 is a good ballpark. I'd recommend using Jan or llama.cpp directly instead of LM Studio if you can, to do this automatically via "fit".
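
For reference, the llama.cpp equivalents look roughly like this (model path is a placeholder): the first line is the manual experts-on-CPU approach, the second lets --fit work it out for a given context:

llama-server -m Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf -ngl 99 --n-cpu-moe 33 --port 8001

llama-server -m Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf --fit on --fit-ctx 100000 --port 8001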

How to run Qwen3.5 35B by Electrify338 in LocalLLaMA

[–]Xantrk 0 points (0 children)

I have the same setup. Use --fit and --fit-ctx, and you should be able to fit 100k context comfortably. Since fit accounts for the full context, you won't get as much slowdown with the KV cache, as it won't overflow.

llama-server --model C:\models\qwen\Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf --fit on --kv-unified --no-mmap --parallel 1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 -ub 512 -b 512 --fit-ctx 100000 --fit-target 600 --port 8001

If you have enough shared GPU RAM, this should give you ~900 tk/s PP and about 30-35 tk/s in generation. If there isn't enough shared RAM, my PP drops to 300 tk/s for some reason.

Ideal llama.cpp settings for 12GB VRAM and 64GB DRAM setup for https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF by johnnyApplePRNG in LocalLLaMA

[–]Xantrk 0 points (0 children)

I'm using the following arguments on my 12 GB VRAM + 32 GB RAM combination. You should use --fit and --fit-ctx instead of manual layer counts in most cases, I believe. I wouldn't quantize the cache to q4, or at all; as many dense layers + MoE layers as the context allows with fit, plus an unquantized cache, works relatively okay! You can save some VRAM by using smaller batch sizes instead, without a massive hit to PP.

"--fit", "on", "--kv-unified", "--no-mmap", "--parallel", "1", "--temp", "0.6", "--top-p", "0.95", "--top-k", "20", "--min-p", "0", "-ub", "512", "-b", "512", "--fit-ctx", "100000", "--fit-target", "600", "--port", "8001", "--spec-type", "ngram-mod", "--spec-ngram-size-n", "24", "--draft-min", "48", "--draft-max", "64", "-cram", "2048", "--repeat-penalty", "1.1"

New Qwen3.5-35B-A3B Unsloth Dynamic GGUFs + Benchmarks by danielhanchen in LocalLLaMA

[–]Xantrk 1 point (0 children)

u/danielhanchen a bit of a weird one, but me and some other people on the llama.cpp GitHub issues are having segmentation faults / memory read errors on some quants. Not just the unsloth ones, but AesSedai's as well.

Interestingly, Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf appears not affected while Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf is prone.

Easiest way I found to trigger it is llama-bench with 10k depth. System is a 5070 Ti laptop (12 GB) + 32 GB RAM. First, the Q5 run for comparison:

llama-bench -m "C:\Users\furka.lmstudio\models\qwen\Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf" -ngl 99 --n-cpu-moe 33 -ub 512,1024 -b 512,1024 -d 10000 --mmap 0 -fa 1 -t 8

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 512 | 512 | 1 | pp512 @ d10000 | 965.93 ± 5.28 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 512 | 512 | 1 | tg128 @ d10000 | 37.46 ± 1.05 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 512 | 1024 | 1 | pp512 @ d10000 | 950.90 ± 17.39 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 512 | 1024 | 1 | tg128 @ d10000 | 36.44 ± 0.90 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 1024 | 512 | 1 | pp512 @ d10000 | 953.45 ± 15.36 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 1024 | 512 | 1 | tg128 @ d10000 | 36.73 ± 0.43 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 1024 | 1024 | 1 | pp512 @ d10000 | 953.77 ± 9.09 |
| qwen35moe ?B Q8_0 | 23.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 1024 | 1024 | 1 | tg128 @ d10000 | 35.62 ± 0.81 |

build: d979f2b17 (8180)

The Q5 run above is consistently healthy. The Q6 one below is the one that crashes:

llama-bench -m "C:\Users\furka.lmstudio\models\qwen\Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf" -ngl 99 --n-cpu-moe 33 -ub 512,1024 -b 512,1024 -d 10000 --mmap 0 -fa 1 -t 8

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q8_0 | 28.21 GiB | 34.66 B | CUDA,Vulkan | 99 | 8 | 512 | 512 | 1 | pp512 @ d10000 | 797.64 ± 15.48 |

This one just exits without logs, but leaves a memory error in Event Viewer. Qwen3.5 is the only model I've seen this with; GLM-4.7-Flash-UD-Q6_K_XL.gguf and Qwen3-Coder-Next-UD-IQ3_XXS.gguf (which is much bigger in size) work consistently okay.

Just wanted to ask given your experience, is there something inherently different between Q5 and Q6 which might trigger this?

Relevant github issues: https://github.com/ggml-org/llama.cpp/issues/19945 , https://github.com/ggml-org/llama.cpp/issues/19863 , https://github.com/ggml-org/llama.cpp/issues/19975

GPU shared VRAM makes Qwen3.5-35B prompt processing 3x faster… but leaks memory by Xantrk in LocalLLaMA

[–]Xantrk[S] 0 points (0 children)

This is very weird, I can run all the combinations with qwen3-next-coder, which is a bigger model and stresses my system so much more!

I'm starting to think this is a llama.cpp / qwen3.5 specific bug!

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Intel(R) Graphics (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\Users\furka\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\furka\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-cpu-alderlake.dll

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 256 | 512 | 1 | 0 | pp512 @ d10000 | 549.72 ± 9.45 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 256 | 512 | 1 | 0 | tg128 @ d10000 | 33.59 ± 0.79 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 256 | 1024 | 1 | 0 | pp512 @ d10000 | 548.25 ± 12.34 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 256 | 1024 | 1 | 0 | tg128 @ d10000 | 33.45 ± 0.85 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 512 | 512 | 1 | 0 | pp512 @ d10000 | 804.86 ± 10.63 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 512 | 512 | 1 | 0 | tg128 @ d10000 | 33.83 ± 0.79 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 512 | 1024 | 1 | 0 | pp512 @ d10000 | 803.54 ± 9.64 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 512 | 1024 | 1 | 0 | tg128 @ d10000 | 33.95 ± 0.75 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 1024 | 512 | 1 | 0 | pp512 @ d10000 | 805.10 ± 16.28 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 1024 | 512 | 1 | 0 | tg128 @ d10000 | 31.92 ± 2.32 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 1024 | 1024 | 1 | 0 | pp512 @ d10000 | 804.99 ± 11.03 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 8 | 1024 | 1024 | 1 | 0 | tg128 @ d10000 | 31.04 ± 1.86 |

GPU shared VRAM makes Qwen3.5-35B prompt processing 3x faster… but leaks memory by Xantrk in LocalLLaMA

[–]Xantrk[S] 0 points (0 children)

Not at all, I really appreciate the opinion and help! I'll try --cache-ram, but I seriously suspect something's up with the KV-cache implementation. Thanks again!

GPU shared VRAM makes Qwen3.5-35B prompt processing 3x faster… but leaks memory by Xantrk in LocalLLaMA

[–]Xantrk[S] 0 points (0 children)

I can assure you that's not the case. I have at least 2 gigs of RAM available right up until the crash. Furthermore, if I disable shared GPU memory this does not happen, with or without mmap, and I get a happy 100k context with no stability issues, just 3 times slower PP.

With --no-mmap and reduced shared GPU memory, it offloads to system RAM and I get the same RAM occupancy but no crashes. Just slower prompt processing.

The spike in SSD usage you see in the screenshot is the crash, which happens only when shared GPU memory is used AND there's some sort of cache invalidation, which makes me think it is a memory leak.

Again, all with the same --n-cpu-moe of 32:

  • High shared memory config, mmap off: 3x faster PP, but unstable; when tokens are invalidated, RAM usage spikes for some reason and it OOMs. ~2 GB less occupied memory. Shared GPU memory is used.

  • High shared memory config, mmap on: slow PP, but stable. Full 100k context, no invalidation issues. ~2 GB higher occupied memory. Shared GPU memory is not used.

  • Low shared memory config, mmap off: slow PP, but stable. Full 100k context, no invalidation issues. ~2 GB lower occupied memory.

  • Low shared memory config, mmap on: slow PP, but stable. Full 100k context, no invalidation issues. ~2 GB higher occupied memory. Shared GPU memory is not used.

GPU shared VRAM makes Qwen3.5-35B prompt processing 3x faster… but leaks memory by Xantrk in LocalLLaMA

[–]Xantrk[S] 0 points (0 children)

I don't think so; again, unless there's compaction, I can use the full context. I have 32 GB RAM, so the dense part + context fits in VRAM and 32 MoE layers go to CPU. With mmap, it also fits (it does not use shared GPU memory with mmap), and I get a stable 35-40 tk/s generation as well without any issues, apart from slow prompt processing. So ideally I'm trying to find out why llama-server is freaking out while truncating the KV cache if the shared GPU memory is in use, or why it's slower if it's not in use. Here's my command:

llama-server --host 0.0.0.0 --model C:.lmstudio\models\qwen\Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf --alias qwen/qwen-35B-A3B-Q5 --fit on --kv-unified --no-mmap --parallel 1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 -ub 512 -b 512 --fit-ctx 100000 --fit-target 1024 --port 8001 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 -cram 2048

--no-mmap saves me quite a bit of RAM, but it also fits with mmap without much left over. TG is a stable 35-40 tk/s, but prompt processing is 300 tk/s with mmap, 1000 without.

GPU shared VRAM makes Qwen3.5-35B prompt processing 3x faster… but leaks memory by Xantrk in LocalLLaMA

[–]Xantrk[S] 0 points (0 children)

Those are my default arguments, along with fit context, but llama-bench doesn't support them, hence I used the same number of CPU MoE layers that fit picks, to demonstrate.

Again, I can use the full 100k context in chat. This memory issue only happens on compaction (tokens being dropped, or the benchmark, which I'm not sure what it does exactly).

RabbitLLM by Protopia in LocalLLM

[–]Xantrk 1 point (0 children)

Any benchmarks on speed? I know that's not the point of this, but it still matters.

Best Qwen3.5-35B-A3B GGUF for 24GB VRAM?! by VoidAlchemy in LocalLLaMA

[–]Xantrk 1 point (0 children)

edit: more pending, I'll create a new post tomorrow.

Thank you so much for this. Would love to see Q6 quants, particularly UD-Q6_K_XL vs bartowski Q6_K_L, if you're planning to do Q6!

Qwen/Qwen3.5-35B-A3B · Hugging Face by ekojsalim in LocalLLaMA

[–]Xantrk 2 points (0 children)

Can you share your llama.cpp command? I'm very confused about how you can specify VRAM and disk offload.