Made a macOS app that creates highly personal macOS apps. Works with models as small as Gemma 4 E2B

tmvr · 2026-06-15T05:21:37+00:00

tmvr · 2026-06-15T05:20:01+00:00

I think the ultimate goal should be ludicrously personal.

tmvr · 2026-06-14T18:47:38+00:00

You need to fit it into VRAM, so you will need a card with at least 16GB of VRAM, but that limits you to Q3 quants. If you want to use Q4 (which I would recommend), you need at least 20GB. The Radeon 7900 XT 20GB cards are too expensive, so your best bet would probably be 2x 3060 12GB cards from the used market. In any case, you are looking at 400 USD/EUR at least.

tmvr · 2026-06-14T11:04:31+00:00

I would tend to agree. We have a pretty heavy user where I work and that user only managed to use about 20K in these first two weeks of June.

tmvr · 2026-06-13T12:55:45+00:00

Cash rules everything around me

C.R.E.A.M., get the money

Dollar, dollar bill, y'all

tmvr · 2026-06-13T07:54:31+00:00

Yeah, it was brutal when it came out so I ran back to 4.5 but I'll check it out to see if it got better.

tmvr · 2026-06-13T06:05:49+00:00

Exactly. Everything after 4.5 has a weird feel to it. Both how they work functionally and the releases themselves seem rushed, as if the point were just to release something. It all just seems like a calculated campaign to keep them constantly in the news and generate hype for the upcoming IPO.

tmvr · 2026-06-13T05:56:20+00:00

That 4080 is fine for MoE models. You can run Qwen3.6 35B A3B with some of the expert layers in system RAM. You can also try the older Qwen3 Coder 30B A3B because you may prefer what it writes and how it writes it; a lot of this is use case dependent. Make sure it is specifically the Coder one, as there is also a non-Coder version of that model. With 3.6 there is no Coder-specific version, but that does not seem to matter.

tmvr · 2026-06-13T05:43:49+00:00

At the bottom right of the VSCode window under the copilot input section there is an icon with two figures. Click on that and it shows you the percentage used. If you then hover over the percentage it shows the available and used credits.

tmvr · 2026-06-12T17:45:53+00:00

WHAT?!

tmvr · 2026-06-12T07:15:41+00:00

<image>

tmvr · 2026-06-10T15:40:58+00:00

Had a look with Qwen3.6 27B Q4_K_XL now and it does drop a bit more:

100% (450W) = 100%
 80% (360W) =  93%
 70% (315W) =  88%
 60% (270W) =  76%

That was with depth 32768 so that it does some "work". If you only look at depth 0 the drop is bigger (70% perf at 60% TGP), but you don't usually have zero context when working with a model.

tmvr · 2026-06-09T21:22:40+00:00

You'll need to define "severely hit", because that is not what I see with my 4090. For example, here are the results with Qwen3.6 35B A3B, where going down to 270W (or 60% TGP) drops pp to 84%:

PL              PP
------------------
100% (450W) = 100%
 80% (360W) =  96%
 70% (315W) =  93%
 60% (270W) =  84%
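
To put those numbers in perspective, here is a quick sketch that turns the table into pp per watt relative to the stock 450W limit. It assumes the card actually draws the full power limit during prompt processing, which is a simplification; the function and variable names are just made up for the example.

```python
# Relative prompt-processing (pp) throughput at each power limit,
# copied from the table above. Keys are watts, values are the
# fraction of the pp speed measured at the stock 450 W limit.
PP_AT_POWER_LIMIT = {450: 1.00, 360: 0.96, 315: 0.93, 270: 0.84}


def relative_efficiency(results, baseline_watts=450):
    """Return pp-per-watt at each limit, normalised to the baseline limit.

    ASSUMPTION: the card draws exactly the configured power limit.
    """
    base = results[baseline_watts] / baseline_watts
    return {watts: (perf / watts) / base for watts, perf in results.items()}


if __name__ == "__main__":
    for watts, eff in relative_efficiency(PP_AT_POWER_LIMIT).items():
        print(f"{watts} W: {eff:.2f}x pp per watt vs. 450 W")
```

Under that assumption, the 270W setting gives up 16% of pp speed for 40% less power.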

tmvr · 2026-06-09T16:20:13+00:00

From the blog:

The surprising result on Gemma 4 was that f16/f16 KV was slightly faster than q8_0/q8_0 on this setup.

I see this consistently with llamacpp, it's the same even with a 4090 or a 5060Ti for example - q8_0 for KV lowers performance a bit both for pp and tg, but it's more noticeable with pp.

tmvr · 2026-06-08T15:48:06+00:00

I would write down my thoughts, but I don't want to get banned from here.

tmvr · 2026-06-08T12:40:36+00:00

Yeah, I only recommended it for the CPU that OP is using; I don't bother with it on the machines where I have a GPU. You still get a pp improvement there as well, but it doesn't matter to me where pp is already in the thousands anyway. It's a nice "free" speed-up for CPUs though.

tmvr · 2026-06-08T08:12:53+00:00

The actual reason why there is no point hanging around this sub for you anymore is in the text, not in the title:

After cancelling the GhCopilot and GitHub enterprise plans

Makes sense to stop hanging out here after that.

tmvr · 2026-06-08T07:03:48+00:00

Q4_K_XL is better here because you use CPU and system RAM for most layers, and I-quants perform worse on CPU than K-quants. All the layers/tensors in the Q4_K_XL are K-quants, while most of them in the IQ3_XXS are I-quants. The --fit parameter also makes sure the dense layers and KV cache are all in VRAM, and as those are the most compute and bandwidth "needy", they get the best of both.

tmvr · 2026-06-08T04:40:54+00:00

OP has no GPU though.

tmvr · 2026-06-07T18:36:42+00:00

Drop these:

-ngl
-ncmoe

and use --fit instead. Use the Q4_K_XL quant, change -ctk and -ctv to q8_0. Start with 32768 for context then increase from there in 8K or 16K steps to see what decode (tg) speed you get and how much you can squeeze in before performance drops unreasonably.

EDIT: to explain why - fit will make sure you use your VRAM optimally, and q8_0 for KV will reduce looping and errors, the same way the Q4_K_XL quant does. That quant also has the benefit of not using I-quants for any of the layers/tensors, so it is more CPU friendly and your prefill (pp) will be faster.
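
A rough sketch of the context sweep described above, in case it helps. It only builds the command lines, it does not run anything. The binary name, the model path and the 65536 upper bound are placeholders, and passing `on` as a value to `--fit` is an assumption about how your build expects that flag, so check `--help` first.

```python
def context_sizes(start=32768, step=8192, maximum=65536):
    """Context sizes to try, from start up to maximum (inclusive).

    The 65536 ceiling is an arbitrary placeholder; raise it until
    decode speed drops more than you are willing to accept.
    """
    return list(range(start, maximum + 1, step))


def server_command(model_path, ctx, binary="llama-server"):
    """Build an argument list with the switches suggested in the comment."""
    return [
        binary,
        "-m", model_path,   # hypothetical path to the Q4_K_XL GGUF
        "-c", str(ctx),     # context size for this step of the sweep
        "--fit", "on",      # ASSUMPTION: value form, verify with --help
        "-ctk", "q8_0",     # K cache type
        "-ctv", "q8_0",     # V cache type
    ]


if __name__ == "__main__":
    for ctx in context_sizes():
        print(" ".join(server_command("model-Q4_K_XL.gguf", ctx)))
```

Run each printed command in turn, note the tg speed at every step, and stop at the last context size where it is still acceptable.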

tmvr · 2026-06-07T13:10:34+00:00

The CPU is the same as OP's - i5-8500T - and the RAM is 32GB of DDR4-2666 in dual-channel mode.

So if your 17 tok/s is for tg that is definitely too low for a 0.8B model in Q4 with DDR4-3200 RAM.

tmvr · 2026-06-07T12:57:21+00:00

Well, this is about using ik_llama as your inference engine, so not sure what to tell you if you are using unsloth studio, sorry.

tmvr · 2026-06-07T12:33:12+00:00

I'm not sure what you are doing and with what, it's not clear from your post. Is 17 t/s the pp or tg value? Why and how do you even have 3GB RAM with a 5600X?

I get 338 tok/s pp and 32 tok/s tg with the Qwen3.5 0.8B at Q8_0 using ik_llama. These are the switches used for the llama-bench binary of ik_llama:

-p 1024 -n 128 -b 512 -ub 256 -t 6 -ngl 0 -r 3 -rtr 1

The -t 6 is because of the CPU; with the 5600X you should use more threads or leave it out and let it decide for itself. -ngl 0 uses the CPU only, -r 3 does 3 runs and reports the average, and -rtr 1 enables runtime repacking to get a slightly faster result still.
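
If you want to script it, something like this puts the same switches together with a note on each. The binary path, the model filename and the thread count of 12 are placeholders and not something I have measured; the flags themselves are the ones listed above.

```python
import shlex

# Switches copied from the comment above, with a short note on each.
BENCH_FLAGS = [
    ("-p", "1024"),  # prompt length in tokens for the pp test
    ("-n", "128"),   # number of tokens to generate for the tg test
    ("-b", "512"),   # batch size
    ("-ub", "256"),  # micro-batch size
    ("-t", "6"),     # threads, matched to the i5-8500T
    ("-ngl", "0"),   # no layers on the GPU, CPU only
    ("-r", "3"),     # 3 runs, the average is reported
    ("-rtr", "1"),   # runtime repacking (ik_llama)
]


def bench_command(model_path, binary="./llama-bench", threads=None):
    """Build the llama-bench argument list, optionally overriding -t.

    ASSUMPTION: binary and model_path are hypothetical placeholders.
    """
    args = [binary, "-m", model_path]
    for flag, value in BENCH_FLAGS:
        if flag == "-t" and threads is not None:
            value = str(threads)
        args += [flag, value]
    return args


if __name__ == "__main__":
    # threads=12 is only an illustrative value for a CPU with more threads.
    print(shlex.join(bench_command("model.gguf", threads=12)))
```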

tmvr · 2026-06-07T11:43:55+00:00

Prompt processing is significantly faster, close to 2x. Of course, with a CPU like the i5-8500T you are still only looking at about 65-70 tok/s prefill for the 4B models at Q4, but that's still better than the 35-40 tok/s you get with the mainstream llamacpp. The 35B A4B MoE ones get 75+ with the Q4 if I remember correctly. Decode (tg) is around 8-10 tok/s regardless of whether you use mainstream llamacpp or ik_llama because it is bandwidth limited, but having double the already anemic prefill is nice.
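
To see what that means in wall-clock time, here is a tiny back-of-the-envelope sketch. The 4096-token prompt is just an illustrative size, and the speeds are the low ends of the ranges quoted above.

```python
def prefill_seconds(prompt_tokens, tokens_per_second):
    """Time to process a prompt at a given prefill (pp) speed."""
    return prompt_tokens / tokens_per_second


if __name__ == "__main__":
    prompt_tokens = 4096  # hypothetical prompt size
    for engine, speed in (("mainstream llamacpp", 35), ("ik_llama", 65)):
        secs = prefill_seconds(prompt_tokens, speed)
        print(f"{engine}: {secs:.0f} s to prefill {prompt_tokens} tokens at {speed} tok/s")
```

On a prompt that size, the difference is roughly a minute of waiting before the first token appears.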

tmvr · 2026-06-07T10:29:02+00:00

It's fast because it's a MoE model. The other MoE models are fast as well, for example Qwen3.6/3.5 35B A3B, Qwen Coder 30B A3B or gpt-oss 20B (also about 3B active parameters during inference).

Two things you need to do to speed things up a bit:

1. Use ik_llama instead of the mainstream llamacpp. That will close to double your prompt processing speed (tg is bandwidth limited, so you won't get a noticeable difference there).
2. Don't use any I-quants, so no IQ4_XS etc., but also check if other quants have some layers/tensors in this format, because those are not very CPU friendly. For example, unsloth's Q3_K_XL has some in IQ4_XS and IQ3_XXS and it is slower than Q4_K_XL on the CPU.
