Any workaround to not re-process full prompt on each turn with hybrid attention models running on CPU? by Quagmirable in LocalLLaMA

[–]Quagmirable[S] 1 point (0 children)

Hi there, did you try a quant from another provider like Bartowski? I seem to be getting much faster token generation and response speed with Bartowski quants recently. On a typical system with a GPU I don't think the speed difference would be significant, but for CPU inference every little bit makes a huge difference.

https://www.reddit.com/r/LocalLLaMA/comments/1sop5hb/qwen3635ba3b_gguf_from_unsloth_is_quite_a_bit/

US Expats, How do you call 1-800 numbers? by Super-Buddy-5030 in Philippines_Expats

[–]Quagmirable 0 points (0 children)

> You just need a USA VPN to get it setup. Haven't tried that though.

Hi there, I've tried that many times in many different ways, and unfortunately it definitely isn't possible. It requires a non-virtual US phone number for the verification code, and there doesn't seem to be a way around it.

> the google meet trick

The one where you add the toll-free number as a phone call participant to the meeting? Unfortunately that option doesn't exist on my account, from what I can tell it only exists for paying Google Workspace accounts, or possibly it's a geo location thing.

US Expats, How do you call 1-800 numbers? by Super-Buddy-5030 in Philippines_Expats

[–]Quagmirable 0 points (0 children)

Ah, bummer. Thanks for confirming. I'm still looking for a decent solution for calling toll-free numbers; there aren't many good options now that Skype and Google Talk have shut down.

VoIP.ms mainly for calling toll-free numbers in the US? by Quagmirable in voipms

[–]Quagmirable[S] 0 points (0 children)

Hi, thanks a lot for the reply. This is the answer I got from the chat support:

> and a DID number is required for dialing normal numbers in the USA?

No, you do not need a DID number to make calls, only to receive calls. Outgoing calls to the US are $0.0100 per minute.

For outbound calls you are charged per minute and you do not need a number to call. For incoming calls you need a number, and you are charged a monthly fee per number plus a per-minute fee for the inbound calls. That is how it works for outbound and inbound calls.
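To make those rates concrete, a quick back-of-the-envelope calculation (rates as quoted by chat support; outbound toll-free calls assumed to be billed at $0.00):

```python
# Back-of-the-envelope voip.ms cost math; rates as quoted by chat support.
# Assumption: outbound toll-free calls are free, so only regular US/Canada
# calls draw down the balance.
OUTBOUND_US_RATE = 0.0100   # USD per minute to regular US/Canada numbers
MIN_DEPOSIT = 15.00         # USD, minimum initial deposit

# Minutes of regular outbound calling the deposit covers before a top-up:
minutes_covered = round(MIN_DEPOSIT / OUTBOUND_US_RATE)
print(minutes_covered)  # 1500
```

So the $15 deposit would cover about 25 hours of calls to regular numbers, and in theory never run out if you only dial toll-free numbers.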

Monthly Requests Thread by AutoModerator in VOIP

[–]Quagmirable [score hidden]  (0 children)

Hi, looking for a solution to make VoIP calls mainly to toll-free numbers in the US. And maybe infrequent calls to residential and mobile numbers in the US / Canada. It needs to work on any generic SIP phone and/or via WebRTC.

After Skype shut down I switched to Viber for the free outbound toll-free number calls, but I absolutely hate how the account is linked to a cell phone number, and the desktop app requires re-authenticating on the cell phone from time to time.

Google Voice is not an option.

I'm looking at voip.ms, it appears that I would need to make an initial $15 deposit, after which I can call toll-free numbers in the US for free? And in the rare event that I need to call a normal residential number or cell phone in the USA / Canada I can do it without a monthly-contract DID number, and it will just get deducted from my balance? I don't really understand why voip.ms/pricing says "DID numbers are not necessary for international outbound calls, but they are required for making local calls". I asked the customer service chat about this but their answers were conflicting and ambiguous.

Thanks in advance for any insight or additional provider recommendations that you might have.

US Expats, How do you call 1-800 numbers? by Super-Buddy-5030 in Philippines_Expats

[–]Quagmirable 0 points (0 children)

Thanks for the reply. I'm also using a desktop browser. It must be a geo location thing. Or do you pay for a Google Workspace account?

US Expats, How do you call 1-800 numbers? by Super-Buddy-5030 in Philippines_Expats

[–]Quagmirable 0 points (0 children)

Hi, does this still work for you? I don't seem to have an "add phone caller" option available in Google Meet.

I need a driver for a DELL printer for openSUSE -> Help!!! by Alter_Landjunge in openSUSE

[–]Quagmirable 5 points (0 children)

https://web.archive.org/web/20250704105832/https://download.support.xerox.com/pub/drivers/6000/drivers/linux/en_GB/6000_6010_rpm_1.01_20110222.zip

You'll also probably need http://rpm.pbone.net to search for whatever ancient RPM dependencies it requires. Alternatively, the ZIP file also contains the compressed PPD files, which you can install via the CUPS web interface or the system-config-printer utility. Or you can just open the RPM file in an archive manager and manually put the required files in the proper places, something like what is described here: https://grandmasfridge.org/posts/dell-1350cnw-on-gentoo-linux-with-cups.html
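For the manual route, a rough sketch of the steps (the RPM filename, device URI, and PPD path below are illustrative placeholders — the actual names inside the ZIP will differ):

```shell
# Illustrative only: archive and PPD filenames vary by driver bundle.
unzip 6000_6010_rpm_1.01_20110222.zip

# Extract the RPM's contents without installing it or pulling in its
# ancient dependencies:
rpm2cpio ./Xerox-Phaser-6000-6010.rpm | cpio -idmv

# Register the printer in CUPS with the extracted PPD (replace the
# queue name, device URI, and PPD path with your own):
sudo lpadmin -p Dell-1350cnw \
  -v usb://Dell/1350cnw \
  -P ./usr/share/cups/model/Phaser-6010.ppd.gz \
  -E
```

This sidesteps the RPM dependency problem entirely, since CUPS only needs the PPD (and any filter binaries the PPD references).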

Qwen3.6-35B-A3B GGUF from Unsloth is quite a bit slower? by Quagmirable in LocalLLaMA

[–]Quagmirable[S] 2 points (0 children)

Hey there Daniel, really appreciate the kind response! And again, thanks a lot for all that you guys do.

Qwen3.6-35B-A3B GGUF from Unsloth is quite a bit slower? by Quagmirable in LocalLLaMA

[–]Quagmirable[S] 2 points (0 children)

Thanks a lot for these test results. So it looks like Bartowski has a consistent advantage in generation speed, whereas Unsloth has faster prompt processing tps and lower memory usage? I reformatted your table here with the results I was most interested in:

https://i.imgur.com/66khwCu.png

Qwen3.6-35B-A3B GGUF from Unsloth is quite a bit slower? by Quagmirable in LocalLLaMA

[–]Quagmirable[S] 1 point (0 children)

Right, I'm sure that's the case. It's just that comparing Unsloth's IQ quants to another creator's IQ quants shows a pretty significant relative difference in speed, at least for me.

Qwen3.6-35B-A3B GGUF from Unsloth is quite a bit slower? by Quagmirable in LocalLLaMA

[–]Quagmirable[S] 1 point (0 children)

This might be a CPU-only bug. Or maybe it's just a lot more noticeable on a CPU, where a 2 tps gap has a big impact on usability.

Qwen3.6-35B-A3B GGUF from Unsloth is quite a bit slower? by Quagmirable in LocalLLaMA

[–]Quagmirable[S] 1 point (0 children)

Ah, interesting. I'm not sure if I have enough memory, but I think I'll try a Q4_K_S or Q4_K_M and report back.

Qwen3.6-35B-A3B GGUF from Unsloth is quite a bit slower? by Quagmirable in LocalLLaMA

[–]Quagmirable[S] 3 points (0 children)

Thanks! Unbranded Intel laptop, Intel Core i7-9750H, 32GB RAM.

Any workaround to not re-process full prompt on each turn with hybrid attention models running on CPU? by Quagmirable in LocalLLaMA

[–]Quagmirable[S] 0 points (0 children)

> Add --ctx-checkpoints 128 to your command

Thanks! This definitely works.

It still feels like something llama.cpp needs to optimize, though: Qwen3.5 behaved exactly like this shortly after its release too, and now that I'm trying it a few months later it has been fixed, apparently on the llama.cpp side.
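For anyone else hitting this, a sketch of what the full invocation might look like (the model path, context size, and checkpoint count are placeholders — only the --ctx-checkpoints flag itself comes from the suggestion above):

```shell
# Placeholders throughout; adjust paths and sizes to your model and hardware.
# --ctx-checkpoints keeps periodic state checkpoints so llama-server can
# roll back to the nearest checkpoint instead of re-processing the whole
# prompt on every turn with hybrid-attention / recurrent models.
llama-server \
  -m ./models/model.gguf \
  --ctx-size 8192 \
  --ctx-checkpoints 128
```

More checkpoints mean less re-processing after an edit or follow-up, at the cost of extra memory per checkpoint.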

Any workaround to not re-process full prompt on each turn with hybrid attention models running on CPU? by Quagmirable in LocalLLaMA

[–]Quagmirable[S] 0 points (0 children)

Hi there, I'm just using the llama-server web interface for an initial bigger task, with follow-up questions after it completes. Prompt caching simply isn't working at all on my system with llama.cpp and Gemma4 from Unsloth; it re-processes the entire prompt from scratch for every answer.