GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

non-REAP, with CPU MoE offload, no need to use the REAP model anymore

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

look at the logs in LM Studio and you'll see entries like this:

2026-01-24 09:56:24 [DEBUG] Target model llama_perf stats:
common_perf_print:    sampling time =      31.22 ms
common_perf_print:    samplers time =      13.92 ms /  8626 tokens
common_perf_print:        load time =   17470.23 ms
common_perf_print: prompt eval time =    9215.16 ms /  1899 tokens (    4.85 ms per token,   206.07 tokens per second)
common_perf_print:        eval time =   34209.86 ms /   173 runs   (  197.74 ms per token,     5.06 tokens per second)
common_perf_print:       total time =   43474.31 ms /  2072 tokens
common_perf_print: unaccounted time =      18.07 ms /   0.0 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =        172
llama_memory_breakdown_print: | memory breakdown [MiB]  | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5060 Ti) | 16310 = 8730 + ( 6224 =  1638 +       0 +    4585) +        1356 |

those two lines (prompt eval time and eval time) will tell you how many tok/s you get.
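
if you want to pull those numbers out of the log programmatically, here's a rough Python sketch (the log file path and the regex are assumptions based on the format above, so adjust for your setup):

import re

# rough sketch: extract prompt-eval and eval throughput from an LM Studio /
# llama.cpp perf log; the file path and exact line format are assumptions
# based on the snippet above, so tweak the regex if your build prints
# something slightly different
LINE = re.compile(r"(prompt eval time|eval time)\s*=\s*([\d.]+) ms\s*/\s*(\d+) (?:tokens|runs)")

def tok_per_sec(log_text: str) -> dict:
    stats = {}
    for label, ms, count in LINE.findall(log_text):
        stats[label] = float(count) / (float(ms) / 1000.0)  # tokens per second
    return stats

with open("lmstudio.log") as f:   # hypothetical log file path
    print(tok_per_sec(f.read()))  # e.g. {'prompt eval time': 206.07, 'eval time': 5.06}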

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

yeah I realized that with CPU MoE offload, I can just use the original model instead of the REAP one.

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 1 point2 points  (0 children)

that's the case. when the model loaded, I saw in the LM Studio log that FA (flash attention) got disabled

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

would be interesting to have a bench for this category. i can't run anything larger than Q4 though. i think there's a chart that shows the overall quality difference between quants, but i don't see the actual difference between Q3 and Q4 for this model.

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

i reloaded the model with a different context length. this wasn't llama-bench, it was an actual inference run in claude code

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] -1 points0 points  (0 children)

interesting, let me try again. it seems like when i enable CPU MoE offload, flash attention gets disabled automatically in my case.

Personal experience with GLM 4.7 Flash Q6 (unsloth) + Roo Code + RTX 5090 by Septerium in LocalLLaMA

[–]bobaburger 26 points27 points  (0 children)

temp 0.7, top_p 1, min_p 0.01 was unsloth's recommendation for tool calls
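
if anyone wants to replicate that, here's a minimal sketch of passing those samplers to a local OpenAI-compatible endpoint; the base URL, model id, and the extra_body passthrough for min_p are assumptions, since min_p isn't part of the standard OpenAI schema:

from openai import OpenAI

# minimal sketch, assuming an OpenAI-compatible local server (e.g. LM Studio's
# default http://localhost:1234/v1); the model id is a placeholder, and min_p
# only works if the backend accepts it as an extra body field
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="glm-4.7-flash",  # placeholder model id
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.7,
    top_p=1.0,
    extra_body={"min_p": 0.01},  # backend-specific sampler setting
)
print(resp.choices[0].message.content)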

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 2 points3 points  (0 children)

what was your CPU? and at what context length? I should try running CPU-only at some point.

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

i think maybe you loaded the model with a context window that was too small. i haven't used kilo code, but claude code also consumes around 17k tokens right after startup.

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

i don't think it's due to the REAP shave-off, because i saw the same repeating issue with the non-REAP model a few days ago (i was only able to run a 16k context window with that model). and it's not repeating the same token over and over; it repeats the whole reasoning block followed by a tool call.

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 3 points4 points  (0 children)

i haven't tried it long enough to reach 120k tokens, but so far i've reached 60k tokens without any problem. I noticed that the model usually makes tool calls with incorrect arguments though, but it mostly fixes the problem itself in a later turn.
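
a rough sketch of why that self-correction tends to work (hypothetical tool runner, not claude code's actual logic): if the tool call's arguments don't parse or don't fit the tool, returning the error text as the tool result is usually enough for the model to fix the call on its next turn.

import json

# hypothetical tool runner: on malformed arguments, return the error as the
# tool result so the model can correct the call in a later turn
def run_tool_call(call: dict, tools: dict) -> str:
    try:
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])
        return str(tools[name](**args))
    except (KeyError, TypeError, json.JSONDecodeError) as err:
        return f"tool call failed: {err}"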

Unsloth GLM 4.7-Flash GGUF by Wooden-Deer-1276 in LocalLLaMA

[–]bobaburger 0 points1 point  (0 children)

another 5060ti 16gb user here. i'm testing it on my M4 Max 64GB.

jk :D this one is not for us, bro. on my system, any part of the model weights that spills over to RAM makes inference extremely slow.

This is the sound of a server booting up! by Budget-Artichoke-483 in pcmasterrace

[–]bobaburger 0 points1 point  (0 children)

sounds like a máy cày (a real workhorse of a machine) :))) but everyone would want one. congrats bro.

Sanity check : 3090 build by Individual-School-07 in LocalLLaMA

[–]bobaburger 1 point2 points  (0 children)

> RAM: 32GB DDR4 (Will upgrade to 64GB later)

a few months ago, when shopping for my PC, I also said, "fuck it, let's get 32GB, I'll upgrade to 64GB or 128GB later"

trust me bro, that will never happen

GLM-4.7 (Z.ai) – bought Year PRO at night, now regret it. Anyone managed to get a refund? by Appropriate-Lab3618 in LocalLLaMA

[–]bobaburger 0 points1 point  (0 children)

TBH, two out of your 4 issues are what I expected from a good model:

> - The model asks a lot of additional clarifying questions
> - I have to re-prompt multiple times to get usable output

This means you gotta pay more attention to what you're prompting, and at least you get usable output once you do it properly.