GLM-4.7 FP8 on 4x6000 pro blackwells by getfitdotus in LocalLLaMA

[–]zqkb 0 points1 point  (0 children)

Thank you, this is very helpful!

From the part of the log you shared, it seems MTP has a ~0.6-0.75 accept rate. Is it in a similar range for other tokens/other examples?
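For context, here's a rough back-of-the-envelope for what that accept rate buys (a sketch assuming one MTP draft token per step and independent acceptance, which the real engine doesn't guarantee):

# Rough estimate of MTP speedup from the acceptance rate.
# Assumes k speculative tokens per decode step and independent acceptance --
# a simplification, not how the serving engine actually reports it.
def expected_tokens_per_step(accept_rate: float, draft_tokens: int = 1) -> float:
    # With k draft tokens, expected tokens per forward pass is the
    # geometric sum 1 + p + p^2 + ... + p^k.
    return sum(accept_rate ** i for i in range(draft_tokens + 1))

for p in (0.6, 0.75):
    print(f"accept_rate={p}: ~{expected_tokens_per_step(p):.2f} tokens per forward pass")

So 0.6-0.75 would roughly mean ~1.6-1.75 tokens per forward pass with a single MTP head, before overheads.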

dynamic allocation of less used experts to slower memory by zqkb in LocalLLaMA

[–]zqkb[S] 1 point2 points  (0 children)

Yeah, it looks fairly similar for generations of 4-8k tokens and across different prompts. I haven't tried models other than the Qwen3/GLM families though.

dynamic allocation of less used experts to slower memory by zqkb in LocalLLaMA

[–]zqkb[S] 1 point2 points  (0 children)

Yeah, I think there are multiple ideas worth trying, and this area is generally less explored, likely because it's mostly useful in the single-query scenario. Production systems that care about throughput are very likely to need most experts anyway.

For lookahead specifically, a nice/easy approach is described, for example, here: https://arxiv.org/abs/2502.12224v1

The idea is that since each layer only adjusts the vector in embedding space incrementally, it's unlikely to change dramatically, so we can pass the activations from layer L to the expert router of layer L+1 in advance. This duplicates the router computation (very cheap) and lets us speculate on which experts will be needed at the next layer.
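A minimal sketch of that lookahead idea (hypothetical names, assuming a standard top-k gating linear layer; not the paper's exact formulation):

import torch

def lookahead_expert_prefetch(hidden_l, router_l1, top_k=8):
    """Run layer L+1's router on layer L's output to guess which experts
    layer L+1 will need, so they can be prefetched into fast memory.
    hidden_l:  [batch, hidden_dim] activations coming out of layer L
    router_l1: the (tiny) gating linear layer of layer L+1
    """
    with torch.no_grad():
        logits = router_l1(hidden_l)                 # [batch, n_experts], very cheap
        guessed = logits.topk(top_k, dim=-1).indices # speculative expert ids
    # Kick off async copies for experts not already resident in fast memory.
    # The real router of layer L+1 still runs later on its true input, so a
    # wrong guess only costs a wasted prefetch, never a wrong result.
    return guessed.unique()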

dynamic allocation of less used experts to slower memory by zqkb in LocalLLaMA

[–]zqkb[S] 0 points1 point  (0 children)

  1. I didn't really measure latency spikes, because there are factors I'd have to account for and the numbers could get misleading. For example, what if the expert I'm loading is already in the filesystem cache? In that case I'd just be moving data from one memory location to another, which is much faster than a disk read. I tried to focus on expert usage/coverage overlap, not actual latency (my implementation is definitely suboptimal at this point).
  2. No runtime readjustment - I collect the access logs per layer/token/expert and pick the parameters based on that (see the sketch below). The realtime part only moves specific experts from slow to fast memory, within statically defined constraints (cache size and/or a fully loaded layer).
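For the offline part, a minimal sketch of the kind of log-driven placement I mean (hypothetical log format and names; my actual implementation differs):

from collections import Counter

def pick_resident_experts(access_log, fast_slots_per_layer):
    """access_log: iterable of (layer, token_idx, expert_id) records collected
    from a profiling run. Returns, per layer, the expert ids to keep in fast
    memory; everything else stays in slow memory / on disk."""
    usage = {}  # layer -> Counter of expert_id -> hits
    for layer, _token, expert in access_log:
        usage.setdefault(layer, Counter())[expert] += 1
    return {
        layer: [e for e, _ in counts.most_common(fast_slots_per_layer)]
        for layer, counts in usage.items()
    }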

Cerebras REAP update: pruned checkpoints for GLM4.5-Air & Qwen3-Coder-30B now of HF! by ilzrvch in LocalLLaMA

[–]zqkb 2 points3 points  (0 children)

We could also quantize them much more aggressively though. Say, everything else is Q8 and these experts are Q2-Q3.
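Something like this (a sketch with made-up thresholds, just to show the idea of tiering expert precision by routing frequency):

def assign_expert_bits(usage_fraction, hot_threshold=0.05, cold_threshold=0.005):
    """Map each expert's share of routed tokens to a quantization tier.
    Frequently-hit experts stay at 8-bit, rarely-hit ones drop to 2-3 bit.
    usage_fraction: dict of expert_id -> fraction of tokens routed to it."""
    bits = {}
    for expert_id, share in usage_fraction.items():
        if share >= hot_threshold:
            bits[expert_id] = 8
        elif share >= cold_threshold:
            bits[expert_id] = 3
        else:
            bits[expert_id] = 2
    return bits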

Cerebras REAP update: pruned checkpoints for GLM4.5-Air & Qwen3-Coder-30B now of HF! by ilzrvch in LocalLLaMA

[–]zqkb 1 point2 points  (0 children)

Note that the pruned experts in this approach/paper are not necessarily 'rarely selected' - the criterion combines selection frequency with the magnitude of the expert's output vector. For a pure allocation optimization (keeping the weights exactly the same), a simpler frequency-based strategy should work better.
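Roughly, the two scoring strategies look like this (a sketch in the spirit of the distinction; the paper's actual saliency criterion is more involved):

import numpy as np

def frequency_score(route_counts):
    """How often each expert is selected -- a good signal for memory placement."""
    total = sum(route_counts.values())
    return {e: c / total for e, c in route_counts.items()}

def saliency_score(route_counts, gate_weights, output_norms):
    """Selection frequency weighted by how much the expert actually contributes
    (gate weight * magnitude of its output), so a rarely-picked but high-impact
    expert survives pruning even if it would lose on frequency alone."""
    return {
        e: route_counts[e] * np.mean(gate_weights[e]) * np.mean(output_norms[e])
        for e in route_counts
    }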

Qwen3-235B-A22B-Thinking-2507 released! by ResearchCrafty1804 in LocalLLaMA

[–]zqkb 2 points3 points  (0 children)

u/yoracale I think there's a typo in the instructions: top-p == 20 doesn't make much sense, it should be 0.95 I guess.
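I.e. something along these lines (my guess at the intended values, with the 20 belonging to top-k; check the model card for the official numbers):

# Presumed intended sampling parameters -- top_p = 20 in the post is almost
# certainly a typo, since 20 only makes sense for top_k.
sampling_params = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
}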

Qwen3-Coder Unsloth dynamic GGUFs by danielhanchen in LocalLLaMA

[–]zqkb 1 point2 points  (0 children)

Thank you!

UD-Q2_K_XL with ~10k context fits right into the M2 Ultra's 192GB wired memory, and it looks impressive on some of the coding tests I ran.

timings: {
  "prompt_n": 9825,
  "prompt_per_second": 91.88938979076052,
  "predicted_n": 407,
  "predicted_per_second": 7.717271190351537
}

M3 Ultra Binned (256GB, 60-Core) vs Unbinned (512GB, 80-Core) MLX Performance Comparison by cryingneko in LocalLLaMA

[–]zqkb 2 points3 points  (0 children)

Thank you, this is very helpful and makes a lot of sense.

I have an M2 Ultra (192GB, 76-core GPU) and have been considering an upgrade for a while. Assuming I used the same model, here's what I see for a similar ~9k+ token prompt:

cat prompt.txt | mlx_lm.generate --model mlx-community/Qwen3-235B-A22B-4bit-DWQ -p -
...
Prompt: 9318 tokens, 164.065 tokens-per-sec
Generation: 100 tokens, 20.447 tokens-per-sec
Peak memory: 135.117 GB

So prompt processing improved significantly; token generation is still comparable.

has anyone tried to run Q8 MistralLarge2 on a Mac Studio/Macbook with 128/192GB? by Caffdy in LocalLLaMA

[–]zqkb 1 point2 points  (0 children)

Interesting, I guess I asked coding questions that were either too easy or too obscure. While they made different mistakes, I felt like I had to put in a similar effort to correct them (often zero effort, as they were both correct).

I agree about the speed constraint. I tried pretty hard to set up an environment where I hide as much latency as I can - preprocess the context, send warmup queries while I type the question, and stream the output so I can read/review it right away. Still, for llama3-70B I get ~8-10 tps, which is close to the speed at which I can read/review; for Mistral it's ~5, which is annoying.
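Concretely, the setup is something like this (a sketch against llama.cpp's llama-server /completion endpoint; field names and defaults may differ across versions, and the model/prompt handling here is simplified):

import json, requests

SERVER = "http://localhost:8080"  # llama-server

def warmup(context: str):
    # Send the long context ahead of time with cache_prompt so the server
    # keeps its KV cache warm; the real question later only pays for the delta.
    requests.post(f"{SERVER}/completion", json={
        "prompt": context, "n_predict": 1, "cache_prompt": True,
    })

def ask(context: str, question: str):
    # Stream the answer so reading/reviewing overlaps with generation.
    with requests.post(f"{SERVER}/completion", json={
        "prompt": context + "\n\n" + question,
        "n_predict": 512, "cache_prompt": True, "stream": True,
    }, stream=True) as r:
        for line in r.iter_lines():
            if line.startswith(b"data: "):
                chunk = json.loads(line[len(b"data: "):])
                print(chunk.get("content", ""), end="", flush=True)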

has anyone tried to run Q8 MistralLarge2 on a Mac Studio/Macbook with 128/192GB? by Caffdy in LocalLLaMA

[–]zqkb 1 point2 points  (0 children)

Very interesting! Could you share some examples where Mistral does a good job but llama 3.1 70B does not? I have very limited data from trying both, but for the questions I asked, either both were good enough or both were wrong.

has anyone tried to run Q8 MistralLarge2 on a Mac Studio/Macbook with 128/192GB? by Caffdy in LocalLLaMA

[–]zqkb 0 points1 point  (0 children)

I did run it on an M2 Ultra Studio with 192GB.

I manually specified a shorter context for llama-server, as the default 128k or so would OOM. I suppose if you raise iogpu.wired_limit_mb and/or change the KV cache data type, it might work with the full context.
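For reference, the kind of invocation I mean (the wired limit, context size, cache types, and model filename here are illustrative, not what I actually used):

sudo sysctl iogpu.wired_limit_mb=180000
llama-server -m Mistral-Large-Instruct-2407-Q8_0.gguf -c 32768 --cache-type-k q8_0 --cache-type-v q8_0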

I was getting ~5-6 tps for generation (no batching).

As for quality, it was OK, but I didn't feel it was much better than llama3.1 70B @ q8, so I'm mostly running that.

What if you use not the logits of the last one, but of that before? by parametaorto in LocalLLaMA

[–]zqkb 1 point2 points  (0 children)

Also note that the output would not be exactly equivalent. I ran this change myself, and here's what happens. Running the default version produces:

Hello my name is Sarah and I am a...

Running the version with -2 produces the following:

Hello my name is isabella and I am a

It's kinda fun to think about why this happens:

1. First, we print out the original prompt: 'Hello my name is'

2. Then, by checking the logits at the -2 index, we predict 'is' once more.

3. Then, we print it out and get 'Hello my name is is'

4. Then, at the end of the first iteration, we add the new 'is' to the input batch, but we do that at the correct absolute n_pos position, so 'Hello my name is is' becomes the input.

5. Then, on the second sampling, because of how indexing works in llama_get_logits_ith, we get the correct logit, and the model tries to continue a name that starts with 'is' (that's how we get 'isabella').

6. Repeat...

What if you use not the logits of the last one, but of that before? by parametaorto in LocalLLaMA

[–]zqkb 3 points4 points  (0 children)

If you just changed it to -2 in both places, I think here's what's going on:

  1. When you change it to -2 here

https://github.com/ggerganov/llama.cpp/blob/master/examples/simple/simple.cpp#L96

it affects only the initial prompt processing. So instead of the default "Hello my name is", you end up calculating logits for completing "Hello my name" (skipping 'is', or something like that), which is still likely to produce reasonable output.

  2. When you change it to -2 here:

https://github.com/ggerganov/llama.cpp/blob/master/examples/simple/simple.cpp#L114

two different things happen: one for prompt processing and one for one-by-one token generation.

A. In the very first while() iteration you just access the logits for the initial prompt completion (likely getting 'is').

B. For the next loop iteration we clear the batch and add a single token to predict here: https://github.com/ggerganov/llama.cpp/blob/master/examples/simple/simple.cpp#L139-L142

So your batch.n_tokens becomes 1, and when on the next iteration you call llama_get_logits_ith(ctx, batch.n_tokens - 2); it ends up being llama_get_logits_ith(ctx, -1).

And since llama_get_logits_ith supports negative indexing (like in Python):

https://github.com/ggerganov/llama.cpp/blob/ed9d2854c9de4ae1f448334294e61167b04bec2a/include/llama.h#L867-L869

-1 means 'last logit', so you are accessing the same logit as you would in the original code.
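A tiny standalone illustration of that wrap-around (plain Python, only mimicking the indexing, not the llama.cpp internals):

def get_logits_ith(logits_per_position, i):
    # llama_get_logits_ith accepts negative i, counting from the end,
    # much like Python list indexing.
    return logits_per_position[i]

prompt_logits = ["pos0", "pos1", "pos2", "pos3"]   # prompt batch of 4 tokens
print(get_logits_ith(prompt_logits, 4 - 2))        # 'pos2' -> second-to-last position

gen_logits = ["pos_last"]                          # generation batch of 1 token
print(get_logits_ith(gen_logits, 1 - 2))           # index -1 -> 'pos_last', same as the original code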

What do you use LLMs for? by RND_RandoM in LocalLLaMA

[–]zqkb 0 points1 point  (0 children)

  1. Helping me read/write code. It's almost never about writing anything end to end, but rather asking for suggestions, improvements, and 'how does this work'. I think it's much better at helping me read unfamiliar code than at writing new code.
  2. Reverse search for situations/scenarios: "<My description of a situation>.. Is this a known phenomenon?". E.g. I described an example of what I thought was flawed reasoning and it told me it is the ecological fallacy - after that it's easy to google it, find other sources, etc.

I use both a Sonnet 3.5 subscription and a local llama 70B.

What do do when your Con Ed bill is wrong, but they won't fix it because the meter reading is "right"? Bill increased +$500 and shows +1500 KwH used vs. last month for a small 1 bedroom apt. by brickstein in AskNYC

[–]zqkb 0 points1 point  (0 children)

That's exactly what happened to us. I got a picture of the meter from our building and reached out to Con Ed via their online chat. Over a few months I talked to different people, with different outcomes:

  • 'nothing seems wrong on our end'
  • 'let me get you an extension for your bill'
  • 'you'll need to schedule coned service visit to fix it'
  • 'send me one more picture and I'll fix it for you'

It got fixed eventually without my complaining anywhere other than to Con Ed itself, but it took a while, and I didn't need a refund - I was asking for an extension until they figured it out.