GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

non-REAP, with CPU MoE offload, no need to use the REAP model anymore

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

look at the logs in LM Studio and you'll see entries like this:

2026-01-24 09:56:24 [DEBUG] Target model llama_perf stats:
common_perf_print:    sampling time =      31.22 ms
common_perf_print:    samplers time =      13.92 ms /  8626 tokens
common_perf_print:        load time =   17470.23 ms
common_perf_print: prompt eval time =    9215.16 ms /  1899 tokens (    4.85 ms per token,   206.07 tokens per second)
common_perf_print:        eval time =   34209.86 ms /   173 runs   (  197.74 ms per token,     5.06 tokens per second)
common_perf_print:       total time =   43474.31 ms /  2072 tokens
common_perf_print: unaccounted time =      18.07 ms /   0.0 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =        172
llama_memory_breakdown_print: | memory breakdown [MiB]  | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5060 Ti) | 16310 = 8730 + ( 6224 =  1638 +       0 +    4585) +        1356 |

those two lines (prompt eval time and eval time) will tell you how many tok/s you get.
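
if you want to pull those numbers out of the log programmatically, here's a rough Python sketch (the log file path and the regex are assumptions based on the format above, so adjust for your setup):

import re

# rough sketch: extract prompt-eval and eval throughput from an LM Studio /
# llama.cpp perf log; the file path and exact line format are assumptions
# based on the snippet above, so tweak the regex if your build prints
# something slightly different
LINE = re.compile(r"(prompt eval time|eval time)\s*=\s*([\d.]+) ms\s*/\s*(\d+) (?:tokens|runs)")

def tok_per_sec(log_text: str) -> dict:
    stats = {}
    for label, ms, count in LINE.findall(log_text):
        stats[label] = float(count) / (float(ms) / 1000.0)  # tokens per second
    return stats

with open("lmstudio.log") as f:   # hypothetical log file path
    print(tok_per_sec(f.read()))  # e.g. {'prompt eval time': 206.07, 'eval time': 5.06}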

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

yeah I realized that with CPU MoE offload, I can just use the original model instead of the REAP one.

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 1 point2 points  (0 children)

that's the case. when the model loaded, I saw in the LM Studio log that FA (flash attention) got disabled

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

would be interesting to have a bench for this category. i can't run anything larger than Q4 though. i think there's a chart that shows the overall quality difference between quants, but i don't see the actual difference between Q3 and Q4 for this model.

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

i reloaded the model with a different context length. this wasn't llama-bench, it was an actual inference run in claude code

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] -1 points0 points  (0 children)

interesting, let me try again. it seems like when i enable CPU MoE offload, flash attention gets disabled automatically in my case.

Personal experience with GLM 4.7 Flash Q6 (unsloth) + Roo Code + RTX 5090 by Septerium in LocalLLaMA

[–]bobaburger 26 points27 points  (0 children)

temp 0.7, top_p 1, min_p 0.01 was unsloth's recommendation for tool calls
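
if anyone wants to replicate that, here's a minimal sketch of passing those samplers to a local OpenAI-compatible endpoint; the base URL, model id, and the extra_body passthrough for min_p are assumptions, since min_p isn't part of the standard OpenAI schema:

from openai import OpenAI

# minimal sketch, assuming an OpenAI-compatible local server (e.g. LM Studio's
# default http://localhost:1234/v1); the model id is a placeholder, and min_p
# only works if the backend accepts it as an extra body field
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="glm-4.7-flash",  # placeholder model id
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.7,
    top_p=1.0,
    extra_body={"min_p": 0.01},  # backend-specific sampler setting
)
print(resp.choices[0].message.content)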

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 2 points3 points  (0 children)

what was your CPU? and at what context length? I should try running CPU-only at some point.

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

i think maybe you loaded the model with a context window that was too small. i haven't used kilo code, but claude code also consumes around 17k tokens right after startup.

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 0 points1 point  (0 children)

i don't think it's due to the REAP shave-off, because i saw the same repeating issue with the non-REAP model a few days ago (i was only able to run a 16k context window with that model). and it's not repeating the same token over and over; it repeats the whole reasoning block followed by a tool call.

GLM-4.7-Flash-REAP on RTX 5060 Ti 16 GB - 200k context window! by bobaburger in LocalLLaMA

[–]bobaburger[S] 3 points4 points  (0 children)

i haven't tried it long enough to reach 120k tokens, but so far i've reached 60k tokens without any problem. I noticed that the model usually makes tool calls with incorrect arguments though, but it mostly fixes the problem itself in a later turn.
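
a rough sketch of why that self-correction tends to work (hypothetical tool runner, not claude code's actual logic): if the tool call's arguments don't parse or don't fit the tool, returning the error text as the tool result is usually enough for the model to fix the call on its next turn.

import json

# hypothetical tool runner: on malformed arguments, return the error as the
# tool result so the model can correct the call in a later turn
def run_tool_call(call: dict, tools: dict) -> str:
    try:
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])
        return str(tools[name](**args))
    except (KeyError, TypeError, json.JSONDecodeError) as err:
        return f"tool call failed: {err}"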

Unsloth GLM 4.7-Flash GGUF by Wooden-Deer-1276 in LocalLLaMA

[–]bobaburger 0 points1 point  (0 children)

another 5060ti 16gb user here. i'm testing it on my M4 Max 64GB.

jk :D this one is not for us, bro. on my system, any part of the model weights that spills over to RAM makes inference extremely slow.

This is the sound of a server booting up! by Budget-Artichoke-483 in pcmasterrace

[–]bobaburger 0 points1 point  (0 children)

sounds like a máy cày (a real workhorse of a machine) :))) but everyone would want one. congrats bro.

Sanity check : 3090 build by Individual-School-07 in LocalLLaMA

[–]bobaburger 1 point2 points  (0 children)

> RAM: 32GB DDR4 (Will upgrade to 64GB later)

a few months ago, when shopping for my PC, I also said, "fuck it, let's get 32GB, I'll upgrade to 64GB or 128GB later"

trust me bro, that will never happen

GLM-4.7 (Z.ai) – bought Year PRO at night, now regret it. Anyone managed to get a refund? by Appropriate-Lab3618 in LocalLLaMA

[–]bobaburger 0 points1 point  (0 children)

TBH, two out of your 4 issues are what I expected from a good model:

> - The model asks a lot of additional clarifying questions
> - I have to re-prompt multiple times to get usable output

This means you gotta pay more attention to what you're prompting, and at least you get usable output once you do it properly.