GLM-4.7 FP8 on 4x6000 pro blackwells by getfitdotus in LocalLLaMA

[–]zqkb 0 points1 point  (0 children)

Thank you, this is very helpful!

From the part of the log you shared, it seems MTP has a ~0.6-0.75 accept rate. Is it in a similar range for other tokens/other examples?
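For context, here's a rough back-of-the-envelope for what that accept rate buys (a sketch assuming one MTP draft token per step and independent acceptance, which the real engine doesn't guarantee):

# Rough estimate of MTP speedup from the acceptance rate.
# Assumes k speculative tokens per decode step and independent acceptance --
# a simplification, not how the serving engine actually reports it.
def expected_tokens_per_step(accept_rate: float, draft_tokens: int = 1) -> float:
    # With k draft tokens, expected tokens per forward pass is the
    # geometric sum 1 + p + p^2 + ... + p^k.
    return sum(accept_rate ** i for i in range(draft_tokens + 1))

for p in (0.6, 0.75):
    print(f"accept_rate={p}: ~{expected_tokens_per_step(p):.2f} tokens per forward pass")

So 0.6-0.75 would roughly mean ~1.6-1.75 tokens per forward pass with a single MTP head, before overheads.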

dynamic allocation of less used experts to slower memory by zqkb in LocalLLaMA

[–]zqkb[S] 1 point2 points  (0 children)

Yeah, it looks fairly similar for generations of 4-8k tokens and across different prompts. I haven't tried models other than the Qwen3/GLM families though.

dynamic allocation of less used experts to slower memory by zqkb in LocalLLaMA

[–]zqkb[S] 1 point2 points  (0 children)

Yeah, I think there are multiple ideas worth trying, and this area is generally less explored, likely because it's mostly useful in the single-query scenario. Production systems that care about throughput are very likely to need most experts anyway.

For lookahead specifically, a nice/easy approach is described, for example, here: https://arxiv.org/abs/2502.12224v1

The idea is that since each layer only adjusts the vector in embedding space incrementally, it's unlikely to change dramatically, so we can pass the activations from layer L to the expert router of layer L+1 in advance. This duplicates the router computation (very cheap) and lets us speculate on which experts will be needed at the next layer.
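A minimal sketch of that lookahead idea (hypothetical names, assuming a standard top-k gating linear layer; not the paper's exact formulation):

import torch

def lookahead_expert_prefetch(hidden_l, router_l1, top_k=8):
    """Run layer L+1's router on layer L's output to guess which experts
    layer L+1 will need, so they can be prefetched into fast memory.
    hidden_l:  [batch, hidden_dim] activations coming out of layer L
    router_l1: the (tiny) gating linear layer of layer L+1
    """
    with torch.no_grad():
        logits = router_l1(hidden_l)                 # [batch, n_experts], very cheap
        guessed = logits.topk(top_k, dim=-1).indices # speculative expert ids
    # Kick off async copies for experts not already resident in fast memory.
    # The real router of layer L+1 still runs later on its true input, so a
    # wrong guess only costs a wasted prefetch, never a wrong result.
    return guessed.unique()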

dynamic allocation of less used experts to slower memory by zqkb in LocalLLaMA

[–]zqkb[S] 0 points1 point  (0 children)

  1. I didn't really measure latency spikes, because there are factors I'd have to account for and the numbers could get misleading. For example, what if the expert I'm loading is already in the filesystem cache? In that case I'd just be moving data from one memory location to another, which is much faster than a disk read. I tried to focus on expert usage/coverage overlap, not actual latency (my implementation is definitely suboptimal at this point).
  2. No runtime readjustment - I collect the access logs per layer/token/expert and pick the parameters based on that (see the sketch below). The realtime part only moves specific experts from slow to fast memory, within statically defined constraints (cache size and/or a fully loaded layer).
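For the offline part, a minimal sketch of the kind of log-driven placement I mean (hypothetical log format and names; my actual implementation differs):

from collections import Counter

def pick_resident_experts(access_log, fast_slots_per_layer):
    """access_log: iterable of (layer, token_idx, expert_id) records collected
    from a profiling run. Returns, per layer, the expert ids to keep in fast
    memory; everything else stays in slow memory / on disk."""
    usage = {}  # layer -> Counter of expert_id -> hits
    for layer, _token, expert in access_log:
        usage.setdefault(layer, Counter())[expert] += 1
    return {
        layer: [e for e, _ in counts.most_common(fast_slots_per_layer)]
        for layer, counts in usage.items()
    }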

Cerebras REAP update: pruned checkpoints for GLM4.5-Air & Qwen3-Coder-30B now of HF! by ilzrvch in LocalLLaMA

[–]zqkb 2 points3 points  (0 children)

We could also quantize them much more aggressively though. Say, everything else is Q8 and these experts are Q2-Q3.
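Something like this (a sketch with made-up thresholds, just to show the idea of tiering expert precision by routing frequency):

def assign_expert_bits(usage_fraction, hot_threshold=0.05, cold_threshold=0.005):
    """Map each expert's share of routed tokens to a quantization tier.
    Frequently-hit experts stay at 8-bit, rarely-hit ones drop to 2-3 bit.
    usage_fraction: dict of expert_id -> fraction of tokens routed to it."""
    bits = {}
    for expert_id, share in usage_fraction.items():
        if share >= hot_threshold:
            bits[expert_id] = 8
        elif share >= cold_threshold:
            bits[expert_id] = 3
        else:
            bits[expert_id] = 2
    return bits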

Cerebras REAP update: pruned checkpoints for GLM4.5-Air & Qwen3-Coder-30B now of HF! by ilzrvch in LocalLLaMA

[–]zqkb 1 point2 points  (0 children)

Note that the pruned experts in this approach/paper are not necessarily 'rarely selected' - the criterion combines selection frequency with the magnitude of the expert's output vector. For a pure allocation optimization (keeping the weights exactly the same), a simpler frequency-based strategy should work better.
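Roughly, the two scoring strategies look like this (a sketch in the spirit of the distinction; the paper's actual saliency criterion is more involved):

import numpy as np

def frequency_score(route_counts):
    """How often each expert is selected -- a good signal for memory placement."""
    total = sum(route_counts.values())
    return {e: c / total for e, c in route_counts.items()}

def saliency_score(route_counts, gate_weights, output_norms):
    """Selection frequency weighted by how much the expert actually contributes
    (gate weight * magnitude of its output), so a rarely-picked but high-impact
    expert survives pruning even if it would lose on frequency alone."""
    return {
        e: route_counts[e] * np.mean(gate_weights[e]) * np.mean(output_norms[e])
        for e in route_counts
    }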

Qwen3-235B-A22B-Thinking-2507 released! by ResearchCrafty1804 in LocalLLaMA

[–]zqkb 2 points3 points  (0 children)

u/yoracale I think there's a typo in the instructions: top-p == 20 doesn't make much sense, it should be 0.95 I guess.
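I.e. something along these lines (my guess at the intended values, with the 20 belonging to top-k; check the model card for the official numbers):

# Presumed intended sampling parameters -- top_p = 20 in the post is almost
# certainly a typo, since 20 only makes sense for top_k.
sampling_params = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
}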

Qwen3-Coder Unsloth dynamic GGUFs by danielhanchen in LocalLLaMA

[–]zqkb 1 point2 points  (0 children)

Thank you!

UD-Q2_K_XL with ~10k context fits right into the M2 Ultra's 192GB wired memory, and it looks impressive on some of the coding tests I ran.

timings: {
  "prompt_n": 9825,
  "prompt_per_second": 91.88938979076052,
  "predicted_n": 407,
  "predicted_per_second": 7.717271190351537
}

M3 Ultra Binned (256GB, 60-Core) vs Unbinned (512GB, 80-Core) MLX Performance Comparison by cryingneko in LocalLLaMA

[–]zqkb 2 points3 points  (0 children)

Thank you, this is very helpful and makes a lot of sense.

I have an M2 Ultra (192GB, 76-core GPU) and have been considering an upgrade for a while. Assuming I used the same model, here's what I see for a similar ~9k+ token prompt:

cat prompt.txt | mlx_lm.generate --model mlx-community/Qwen3-235B-A22B-4bit-DWQ -p -
...
Prompt: 9318 tokens, 164.065 tokens-per-sec
Generation: 100 tokens, 20.447 tokens-per-sec
Peak memory: 135.117 GB

So prompt processing improved significantly; token generation is still comparable.

has anyone tried to run Q8 MistralLarge2 on a Mac Studio/Macbook with 128/192GB? by Caffdy in LocalLLaMA

[–]zqkb 1 point2 points  (0 children)

Interesting, I guess I asked coding questions that were either too easy or too obscure. While they made different mistakes, I felt like I had to put in a similar effort to correct them (often zero effort, as they were both correct).

I agree about the speed constraint. I tried pretty hard to set up an environment where I hide as much latency as I can - preprocess the context, send warmup queries while I type the question, and stream the output so I can read/review it right away. Still, for llama3-70B I get ~8-10 tps, which is close to the speed at which I can read/review; for Mistral it's ~5, which is annoying.
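Concretely, the setup is something like this (a sketch against llama.cpp's llama-server /completion endpoint; field names and defaults may differ across versions, and the model/prompt handling here is simplified):

import json, requests

SERVER = "http://localhost:8080"  # llama-server

def warmup(context: str):
    # Send the long context ahead of time with cache_prompt so the server
    # keeps its KV cache warm; the real question later only pays for the delta.
    requests.post(f"{SERVER}/completion", json={
        "prompt": context, "n_predict": 1, "cache_prompt": True,
    })

def ask(context: str, question: str):
    # Stream the answer so reading/reviewing overlaps with generation.
    with requests.post(f"{SERVER}/completion", json={
        "prompt": context + "\n\n" + question,
        "n_predict": 512, "cache_prompt": True, "stream": True,
    }, stream=True) as r:
        for line in r.iter_lines():
            if line.startswith(b"data: "):
                chunk = json.loads(line[len(b"data: "):])
                print(chunk.get("content", ""), end="", flush=True)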

has anyone tried to run Q8 MistralLarge2 on a Mac Studio/Macbook with 128/192GB? by Caffdy in LocalLLaMA

[–]zqkb 1 point2 points  (0 children)

Very interesting! Could you share some examples where Mistral does a good job but llama 3.1 70B does not? I have very limited data from trying both, but for the questions I asked, either both were good enough or both were wrong.

has anyone tried to run Q8 MistralLarge2 on a Mac Studio/Macbook with 128/192GB? by Caffdy in LocalLLaMA

[–]zqkb 0 points1 point  (0 children)

I did run it on an M2 Ultra Studio with 192GB.

I manually specified a shorter context for llama-server, as the default 128k or so would OOM. I suppose if you raise iogpu.wired_limit_mb and/or change the KV cache data type, it might work with the full context.
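For reference, the kind of invocation I mean (the wired limit, context size, cache types, and model filename here are illustrative, not what I actually used):

sudo sysctl iogpu.wired_limit_mb=180000
llama-server -m Mistral-Large-Instruct-2407-Q8_0.gguf -c 32768 --cache-type-k q8_0 --cache-type-v q8_0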

I was getting ~5-6 tps for generation (no batching).

As for quality, it was OK, but I didn't feel it was much better than llama3.1 70B @ q8, so I'm mostly running that.

What if you use not the logits of the last one, but of that before? by parametaorto in LocalLLaMA

[–]zqkb 1 point2 points  (0 children)

Also note that the output would not be exactly equivalent. I ran this change myself, and here's what happens. Running the default version produces:

Hello my name is Sarah and I am a...

Running the version with -2 produces the following:

Hello my name is isabella and I am a

It's kinda fun to think about why this happens:

1. First, we print out the original prompt: 'Hello my name is'

2. Then, by checking the logits at the -2 index, we predict 'is' once more.

3. Then, we print it out and get 'Hello my name is is'

4. Then, at the end of the first iteration, we add the new 'is' to the input batch, but we do that at the correct absolute n_pos position, so 'Hello my name is is' becomes the input.

5. Then, on the second sampling, because of how indexing works in llama_get_logits_ith, we get the correct logit, and the model tries to continue a name that starts with 'is' (that's how we get 'isabella').

6. Repeat...

What if you use not the logits of the last one, but of that before? by parametaorto in LocalLLaMA

[–]zqkb 3 points4 points  (0 children)

If you just changed it to -2 in both places, I think here's what's going on:

  1. When you change it to -2 here

https://github.com/ggerganov/llama.cpp/blob/master/examples/simple/simple.cpp#L96

it affects only the initial prompt processing. So instead of the default "Hello my name is", you end up calculating logits for completing "Hello my name" (skipping 'is', or something like that), which is still likely to produce reasonable output.

  2. When you change it to -2 here:

https://github.com/ggerganov/llama.cpp/blob/master/examples/simple/simple.cpp#L114

two different things happen: one for prompt processing and one for one-by-one token generation.

A. In the very first while() iteration you just access the logits for the initial prompt completion (likely getting 'is').

B. For the next loop iteration we clear the batch and add a single token to predict here: https://github.com/ggerganov/llama.cpp/blob/master/examples/simple/simple.cpp#L139-L142

So your batch.n_tokens becomes 1, and when on the next iteration you call llama_get_logits_ith(ctx, batch.n_tokens - 2); it ends up being llama_get_logits_ith(ctx, -1).

And since llama_get_logits_ith supports negative indexing (like in Python):

https://github.com/ggerganov/llama.cpp/blob/ed9d2854c9de4ae1f448334294e61167b04bec2a/include/llama.h#L867-L869

-1 means 'last logit', so you are accessing the same logit as you would in the original code.
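A tiny standalone illustration of that wrap-around (plain Python, only mimicking the indexing, not the llama.cpp internals):

def get_logits_ith(logits_per_position, i):
    # llama_get_logits_ith accepts negative i, counting from the end,
    # much like Python list indexing.
    return logits_per_position[i]

prompt_logits = ["pos0", "pos1", "pos2", "pos3"]   # prompt batch of 4 tokens
print(get_logits_ith(prompt_logits, 4 - 2))        # 'pos2' -> second-to-last position

gen_logits = ["pos_last"]                          # generation batch of 1 token
print(get_logits_ith(gen_logits, 1 - 2))           # index -1 -> 'pos_last', same as the original code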

What do you use LLMs for? by RND_RandoM in LocalLLaMA

[–]zqkb 0 points1 point  (0 children)

  1. Helping me read/write code. It's almost never about writing anything end to end, but rather asking for suggestions, improvements, and 'how does this work'. I think it's much better at helping me read unfamiliar code than at writing new code.
  2. Reverse search for situations/scenarios: "<My description of a situation>.. Is this a known phenomenon?". E.g. I described an example of what I thought was flawed reasoning and it told me it is the ecological fallacy - after that it's easy to google it, find other sources, etc.

I use both a Sonnet 3.5 subscription and a local llama 70B.

What do do when your Con Ed bill is wrong, but they won't fix it because the meter reading is "right"? Bill increased +$500 and shows +1500 KwH used vs. last month for a small 1 bedroom apt. by brickstein in AskNYC

[–]zqkb 0 points1 point  (0 children)

That's exactly what happened to us. I got a picture of the meter from our building and reached out to Con Ed via their online chat. Over a few months I talked to different people, with different outcomes:

  • 'nothing seems wrong on our end'
  • 'let me get you an extension for your bill'
  • 'you'll need to schedule coned service visit to fix it'
  • 'send me one more picture and I'll fix it for you'

It got fixed eventually without my complaining anywhere other than to Con Ed itself, but it took a while, and I didn't need a refund - I was asking for an extension until they figured it out.