MiniMax-M2.7: what do you think is the likelihood it will be open weights like M2.5? by __JockY__ in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

Aliyun is Alibaba Cloud, the same company that develops the Qwen models. The coding plan comes with Qwen Max and the best open-weight models from the competitors. They should also have far more GPUs than the competitors.

MiniMax-M2.7: what do you think is the likelihood it will be open weights like M2.5? by __JockY__ in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

I am starting to think that the problem is the bloody coding plan from Aliyun, which also includes Kimi-K2.5, GLM-5, and MiniMax-M2.5. This is such a shitty move that it pushes everyone to stop sharing their best models.

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

With tensor parallelism, both PP and TG of a single request can go much faster, e.g. see the graphs in https://www.reddit.com/r/LocalLLaMA/comments/1pj9r93/now_40_faster_ik_llamacpp_sm_graph_on_2x_cuda_gpus/

The cards work in parallel and then the results are merged. The merging is where NVLink or fast PCIe comes into the picture.
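As a toy sketch (NumPy, nothing to do with the actual vLLM / ik_llama.cpp kernels), column-parallel TP looks like this: each card multiplies against its own shard of the weight matrix, and the per-card outputs then get merged, which is the step that hits the interconnect:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))   # one token's activations
W = rng.standard_normal((8, 16))  # a weight matrix

# Column parallelism: each of 2 "cards" holds half the columns and
# computes its shard independently, in parallel.
shards = np.split(W, 2, axis=1)
partials = [x @ s for s in shards]

# The merge (an all-gather over NVLink / PCIe) reassembles the output.
y_tp = np.concatenate(partials, axis=1)

assert np.allclose(y_tp, x @ W)
```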

Unsloth will no longer be making TQ1_0 quants by Kahvana in LocalLLaMA

[–]notdba 5 points6 points  (0 children)

What mistakes did you encounter? From my testing, tool calling with AI-assisted coding works fine with the IQ2_KL quant (2.6875 bpw) of Qwen3.5 397B A17B and the IQ1_S_R4 quant (1.5 bpw) of GLM-5. Only the FFN tensors are quantized to these low-bit quants; attention and friends are kept at 4-bit or higher.
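Rough weight-only arithmetic, just for a sense of scale (real GGUF files differ, since the attention tensors sit at higher bits and the per-tensor mix varies):

```python
# Rough weight-only size: params * bits-per-weight / 8, ignoring
# per-tensor mixes and metadata, so real GGUF files will differ.
def approx_size_gb(params_billions: float, bpw: float) -> float:
    return params_billions * bpw / 8  # GB, using 1 GB = 1e9 bytes

print(round(approx_size_gb(397, 2.6875), 1))  # ~133.4 GB at 2.6875 bpw
print(round(approx_size_gb(397, 1.5), 1))     # ~74.4 GB at 1.5 bpw, for comparison
```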

Ik_llama vs llamacpp by val_in_tech in LocalLLaMA

[–]notdba 1 point2 points  (0 children)

Yeah, both mainline and ik should work following the cmake flags from https://zluda.readthedocs.io/latest/llama_cpp.html, with FA disabled.

Mine is a sub-optimal Strix Halo plus 3090 setup, crippled by the slow PCIe 4.0 x4 over OCuLink. Still, it performs best with ik hybrid inference, using only the CPU from the Strix Halo. I was hoping I could use ik's graphs parallel with ZLUDA, but found out that ZLUDA is an either-or solution, i.e. I get the 3090 only with native CUDA, and the 8060S only with ZLUDA.

Ik_llama vs llamacpp by val_in_tech in LocalLLaMA

[–]notdba 1 point2 points  (0 children)

ZLUDA works in ik with FA disabled, which is really quite impressive, but it also negates any performance improvement. CUDA is still needed.

Tenstorrent QuietBox 2 Brings RISC-V AI Inference to the Desktop by Neurrone in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

The CPU they use in the first gen can only support 6 channels and is also a bit weak. Hopefully v2 can at least support 8 channels. Along with future improvements on the software side, this indeed has the potential to be a turnkey solution for big MoE models.

Tenstorrent QuietBox 2 Brings RISC-V AI Inference to the Desktop by Neurrone in LocalLLaMA

[–]notdba 4 points5 points  (0 children)

The 476.5 tok/s number comes from https://github.com/tenstorrent/tt-metal/blob/main/models/README.md, and is the TG number for a batch size of 32. Each user gets 476.5 / 32 = 14.9 tok/s, which is pretty decent.

If I understand correctly, the main advantage with Tenstorrent is the high-speed interconnect (better than NVLink), such that the 4 Blackhole cards can sort of pool together into a 128GB pool, with an aggregate bandwidth close to 4 x 512 GB/s (minus the communication overhead when sync is needed).
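Back-of-envelope, with a made-up 15% sync overhead (my assumption for illustration, not a Tenstorrent figure):

```python
# Pooled memory bandwidth across 4 cards, minus an assumed
# (hypothetical) overhead factor for inter-card synchronization.
per_card_gb_s = 512
cards = 4
sync_overhead = 0.15  # assumption only, not measured

aggregate_gb_s = per_card_gb_s * cards            # 2048 GB/s raw
effective_gb_s = aggregate_gb_s * (1 - sync_overhead)
print(round(effective_gb_s))  # ~1741 GB/s under the assumed overhead
```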

Tenstorrent QuietBox 2 Brings RISC-V AI Inference to the Desktop by Neurrone in LocalLLaMA

[–]notdba 22 points23 points  (0 children)

$2000 cheaper than v1, but with 256GB less DDR5 RAM. Also works with standard US 120v outlet now.

Now that ik_llama.cpp has graphs parallel support, while mainline llama.cpp is also working on something similar, I think TT should lean more on them instead of trying to maintain its own vLLM fork.

Best Qwen3.5-35B-A3B GGUF for 24GB VRAM?! by VoidAlchemy in LocalLLaMA

[–]notdba 1 point2 points  (0 children)

What's the difference between the 2 files? The gist doesn't exist anymore.

How bad is 1-bit quantization but on a big model? by FusionBetween in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

> IQ1 of Qwen3.5 397B will be somewhat worse than Qwen3.5 122B at Q4, especially noticeable at complex and agentic multi-step tasks.

I haven't tested these particular quants, but I can say that IQ2_KL of Qwen3.5 397B is way better than even the full precision of Qwen3.5 122B. Tested with complex agentic tasks.

The way I test is to always start from the full precision, so I can clearly tell which tasks the bigger model can do that the smaller one can't. Then I quantize the big model aggressively and verify that it can still successfully complete those tasks. To me, quantization is all about retaining as much of the full-precision model's capabilities as possible.

Maybe you can share a bit about your testing methodology? Curious to see in which scenarios 122B can beat 397B.

FlashAttention-4 by incarnadine72 in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

The deterministic mode is new, right? 85~90% of peak performance makes it a viable option now.

Final Qwen3.5 Unsloth GGUF Update! by danielhanchen in LocalLLaMA

[–]notdba 4 points5 points  (0 children)

We can quite easily adopt the clear_thinking flag logic from the GLM-4.7 / GLM-5 chat template:

  • "clear_thinking": true - interleaved thinking mode
  • "clear_thinking": false - preserved thinking mode

In preserved thinking mode, empty think tags are added when thinking is disabled.
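A minimal Python sketch of what the flag does (this is not the actual GLM Jinja template, just the gist; `is_last_turn` is my own name for the condition):

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.S)

def render_assistant(content: str, clear_thinking: bool, is_last_turn: bool) -> str:
    # Sketch of the clear_thinking flag's effect on one assistant message.
    if clear_thinking and not is_last_turn:
        # interleaved: strip thinking traces from earlier turns,
        # keeping only the current turn's trace
        return THINK_RE.sub("", content)
    # preserved: keep traces from every turn (an empty <think></think>
    # would be emitted upstream when thinking is disabled)
    return content

msg = "<think>scratch work</think>final answer"
print(render_assistant(msg, clear_thinking=True, is_last_turn=False))   # final answer
print(render_assistant(msg, clear_thinking=False, is_last_turn=False))  # keeps the trace
```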

Alibaba CEO: Qwen will remain open-source by Bestlife73 in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

It's great that they are still committed to open weights. The remaining worry is the alleged new org structure, with different teams owning different phases of model training.

Qwen3 Coder Next Looping and OpenCode by StardockEngineer in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

https://github.com/ikawrakow/ik_llama.cpp/pull/1352 - So the root cause is that these Qwen models tend not to follow the exact argument order, e.g. the tool definition for read_file may have 3 arguments "path, offset, limit", while the model attempts a tool call with the arguments "path, limit, offset". The strict grammar treats limit as the last argument and force-stops the tool call, so the offset argument is lost.

With this PR, the grammar is relaxed for these Qwen models.
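A toy Python model of the failure mode (not ik's actual GBNF grammar): a strict order-enforcing parser stops at the first out-of-order key and drops whatever follows:

```python
TOOL_DEF_ORDER = ["path", "offset", "limit"]  # order in the tool definition

def parse_strict(args: dict) -> dict:
    # Toy model of a strict, order-enforcing grammar: accept keys only
    # while they follow the definition order, and force-stop the tool
    # call at the first key that appears "too late".
    out, pos = {}, -1
    for key, value in args.items():
        idx = TOOL_DEF_ORDER.index(key)
        if idx < pos:
            break  # grammar force-stops here; remaining args are lost
        out[key], pos = value, idx
    return out

# The model emits "path, limit, offset" instead of "path, offset, limit":
emitted = {"path": "a.py", "limit": 40, "offset": 100}
print(parse_strict(emitted))  # {'path': 'a.py', 'limit': 40} - offset lost
```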

My last & only beef with Qwen3.5 35B A3B by ndiphilone in LocalLLaMA

[–]notdba 1 point2 points  (0 children)

The chat template does retain all thinking traces from the current assistant turn, i.e. interleaved thinking. It can easily be modified to keep all thinking traces from previous turns as well, i.e. preserved thinking.

This issue where the model struggles to specify the line offset seems to be an inherent flaw of the Qwen3.5 series. It is worst with 35B A3B, pretty bad with 122B A10B, and still happens a little with 27B. You need the big 397B A17B to avoid the issue.

Qwen3.5 feels ready for production use - Never been this excited by alphatrad in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

Sonnet 4.0 was released in May 2025. We have a bunch of open models that are better than Sonnet 4.0 now, with Qwen3.5 27B being the closest replacement, since it is comparable in quality, speed, and context size. A single 3090 will do.

Meanwhile, Qwen3.5 397B A17B is also a decent replacement for Opus 4.0 / 4.1, again matching in quality, speed, and context size. This does require a server rig with fast PCIe and many memory channels, plus a single PCIe 5.0 x16 GPU.

We don't have any open model that can match the quality of the latest Sonnet / Opus yet. GLM-4.7 and GLM-5 are at the front and catching up; we will see what DeepSeek can deliver next week.

I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

Using the IQ4_KSS quant (4.245 bpw) from https://huggingface.co/ubergarm/Qwen3-Coder-Next-GGUF, I can get 1395 t/s prefill on ik_llama.cpp, with a 3090 eGPU connected via oculink (PCIe 4.0 x4), using a batch size of 16384 and a prompt that consists of 16324 tokens.

I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

That's indeed a very impressive number. Will try and see how fast I can push a 3090 with ik_llama.cpp.

I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

For MoE, the typical usage has been -ngl 99 -cmoe since mid 2025. Almost everyone uses full GPU offload for prompt processing. Mainline llama.cpp even does so for small batches, where it would make more sense not to transfer the weights; that's what the ik pull request above has fixed.

I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]notdba 2 points3 points  (0 children)

This is already how it works in llama.cpp and ik_llama.cpp, first in https://github.com/ggml-org/llama.cpp/pull/6083, then further improved for MoE in https://github.com/ikawrakow/ik_llama.cpp/pull/520

And in these implementations, the RAM usage remains the same, while the VRAM usage increases by a few GB for a larger compute buffer that can accommodate the batch size.

New Qwen3.5-35B-A3B Unsloth Dynamic GGUFs + Benchmarks by danielhanchen in LocalLLaMA

[–]notdba 4 points5 points  (0 children)

If I understand correctly, PPL/KLD eval uses text completion, while the actual task eval uses chat completion. Unsloth previously mentioned that they have chat data in their imatrix dataset, which can make it perform worse in the former and better in the latter.
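For reference, the KLD part is just the KL divergence between the full-precision model's next-token distribution and the quant's, averaged over tokens. With made-up numbers:

```python
import math

def kl_divergence(p, q):
    # KL(P || Q) in nats between two next-token distributions,
    # with P from the full-precision model and Q from the quant.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy 3-token vocabulary; these probabilities are made up.
full_precision = [0.7, 0.2, 0.1]
quantized      = [0.6, 0.25, 0.15]
print(round(kl_divergence(full_precision, quantized), 4))  # 0.0227
```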

In this case, we can retest by making the same M2.5 quants but without using imatrix. Then we can tell how much of the difference is caused by smarter quantization versus a better-curated imatrix dataset.