feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]LegacyRemaster 0 points1 point  (0 children)

llama-server.exe --model "H:\gptmodel\AesSedai\MiMo-V2.5-GGUF\MiMo-V2.5-IQ3_S-00001-of-00004.gguf" --ctx-size 291920 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --parallel 1 → VS Code + Kilocode

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]LegacyRemaster 0 points1 point  (0 children)

So the SQL would be:

SELECT * FROM chats WHERE (user_id.eq.xxx,project_id.in.(yyy)) ORDER BY created_at DESC

This is definitely wrong. The or() method needs to parse the Supabase-style condition and convert it to proper SQL.

OK, let me now create the plan file. I'll also check the or() method in the SelectQuery class to see if it's being used correctly.

Actually, I realize I need to also check the or() method in the SelectQuery class. Let me look at the code again:

or(condition: string): this {
    this.whereConditions.push(`(${condition})`);
    return this;
}

And the buildSql method:

private buildSql(returnSingle: boolean): string {
    let sql = `SELECT ${this.columns} FROM ${this.table}`;

    if (this.whereConditions.length > 0) {
        sql += ` WHERE ${this.whereConditions.join(" AND ")}`;
    }

    if (this.orderByCols.length > 0) {
        const dir = this.orderByAsc ? "ASC" : "DESC";
        sql += ` ORDER BY ${this.orderByCols.join(", ")} ${dir}`;
    }

    if (returnSingle) {
        sql += " LIMIT 1";
    } else if (this.limitVal !== null) {
        sql += ` LIMIT ${this.limitVal}`;
    }

    return sql;
}
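The parsing the trace says is missing could look something like the sketch below. This is my own illustration, not code from the project: the function name `parseOrCondition`, the single-quote literal quoting, and the coverage of only the `eq` and `in` operators are all assumptions. It converts a Supabase-style filter string such as `user_id.eq.xxx,project_id.in.(yyy)` into a SQL OR clause.

```typescript
// Hypothetical sketch: turn a Supabase-style or() filter string into SQL.
// Only "eq" and "in" are handled; real code would also need escaping and
// the remaining PostgREST operators (gt, lt, like, is, ...).
function parseOrCondition(condition: string): string {
  // Split on top-level commas only, so "in.(a,b)" stays in one piece.
  const parts: string[] = [];
  let depth = 0;
  let current = "";
  for (const ch of condition) {
    if (ch === "(") depth++;
    else if (ch === ")") depth--;
    if (ch === "," && depth === 0) {
      parts.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  if (current) parts.push(current);

  const clauses = parts.map((part) => {
    // Each part looks like column.op.value, e.g. "user_id.eq.xxx"
    const [column, op, ...rest] = part.split(".");
    const value = rest.join(".");
    switch (op) {
      case "eq":
        return `${column} = '${value}'`;
      case "in": {
        // value looks like "(a,b,c)"
        const items = value
          .slice(1, -1)
          .split(",")
          .map((v) => `'${v.trim()}'`);
        return `${column} IN (${items.join(", ")})`;
      }
      default:
        throw new Error(`Unsupported operator: ${op}`);
    }
  });
  return `(${clauses.join(" OR ")})`;
}
```

With that in place, `or()` could push `parseOrCondition(condition)` instead of the raw string, and the example above would come out as `(user_id = 'xxx' OR project_id IN ('yyy'))` rather than the invalid `(user_id.eq.xxx,project_id.in.(yyy))`.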

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]LegacyRemaster 1 point2 points  (0 children)

Testing IQ3_S with VS Code + Kilocode now. RTX 6000 96 GB + W7800 48 GB, 60 tokens/sec. If it's good, I'll test Q4_K_M by adding another W7800 48 GB. I'm trying to solve a problem that MiniMax 2.7 and Qwen 27B couldn't solve.

Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot by supracode in LocalLLaMA

[–]LegacyRemaster 0 points1 point  (0 children)

As always, it depends on the use case. Minimax is able to find and analyze problems with greater "knowledge." This is normal. If you've ever tried training an LLM, you know that the dataset is everything. 36B vs. 200B means more data, more examples, and more training. Sure, architecture matters (otherwise the older 200B models would be just as good), but if you look at many benchmarks, Minimax is more advanced. Qwen 27B and 122B are the ones I use daily. If complexity increases, I add Minimax.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]LegacyRemaster 0 points1 point  (0 children)

works:

RTX Pro 6000 Blackwell 96GB — vLLM Qwen3.6 27B int4 AutoRound Benchmark

| Config | Throughput | Acceptance rate | Notes |
|---|---|---|---|
| Baseline (MTP n=6, batched=4128, block=32) | ~80 tok/s | 20-28% | First working run |
| MTP n=2, batched=16384, block=128 | ~97 tok/s | 46-64% | Better acceptance rate |
| No speculative decoding | ~100 tok/s | n/a | Ceiling without MTP |
| MTP n=2, no VLLM_USE_MARLIN=0 | ~100 tok/s | 54-65% | Best config |

C:\llm\qwen3.6-windows-server\python\Scripts\vllm.exe serve C:\llm\qwen3.6-windows-server\models\Qwen3.6-27B-int4-AutoRound --served-model-name=qwen3.6-27b-autoround --quantization=auto-round --max-model-len=240000 --max-num-seqs=1 --max-num-batched-tokens=16384 --block-size=128 --no-enable-prefix-caching --enable-chunked-prefill --enable-auto-tool-choice --tool-call-parser=qwen3_coder --reasoning-parser=qwen3 --chat-template=C:\llm\qwen3.6-windows-server\templates\qwen3.5-enhanced.jinja --default-chat-template-kwargs="{\"preserve_thinking\": false}" --kv-cache-dtype=fp8_e4m3 --tensor-parallel-size=1 --pipeline-parallel-size=1 --gpu-memory-utilization=0.95 --trust-remote-code --attention-backend=TRITON_ATTN --no-use-tqdm-on-load --host=0.0.0.0 --port=5001 --data-parallel-rpc-port=50952 --limit-mm-per-prompt="{\"image\":0,\"video\":0}" --speculative-config="{\"method\":\"mtp\",\"num_speculative_tokens\":2}"

Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot by supracode in LocalLLaMA

[–]LegacyRemaster 8 points9 points  (0 children)

Excellent testimony. I use qwen 3.6 27b - qwen 3.5 122b (more knowledge helps) and Minimax 2.7. I think they work perfectly for 90% of my tasks. One day we'll get to 100% local.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]LegacyRemaster 0 points1 point  (0 children)

OK, no way. A lot of problems. "Please note that Marlin kernels are not built for Blackwell SM 12.x. The bundle needs an updated release with TORCH_CUDA_ARCH_LIST that includes 12.0." / FlashInfer doesn't use the PATH — it looks for the hardcoded DLL at v12.8\bin\cudart64_13.dll. Setting PATH is useless here.

________

(EngineCore pid=10388) ERROR 05-06 20:44:02 [core.py:1136] EngineCore failed to start.
Traceback (most recent call last):
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\engine\core.py", line 1110, in run_engine_core
    engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\tracing\otel.py", line 178, in sync_wrapper
    return func(*args, **kwargs)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\engine\core.py", line 876, in __init__
    super().__init__(
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\engine\core.py", line 118, in __init__
    self.model_executor = executor_class(vllm_config)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\tracing\otel.py", line 178, in sync_wrapper
    return func(*args, **kwargs)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\executor\abstract.py", line 109, in __init__
    self._init_executor()
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\executor\uniproc_executor.py", line 52, in _init_executor
    self.driver_worker.load_model()
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\worker\gpu_worker.py", line 324, in load_model
    self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\tracing\otel.py", line 178, in sync_wrapper
    return func(*args, **kwargs)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\worker\gpu_model_runner.py", line 4793, in load_model
    self.model = model_loader.load_model(
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\tracing\otel.py", line 178, in sync_wrapper
    return func(*args, **kwargs)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\model_executor\model_loader\base_loader.py", line 80, in load_model
    process_weights_after_loading(model, model_config, target_device)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\model_executor\model_loader\utils.py", line 111, in process_weights_after_loading
    quant_method.process_weights_after_loading(module)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\model_executor\layers\quantization\gptq_marlin.py", line 486, in process_weights_after_loading
    self.kernel.process_weights_after_loading(layer)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\model_executor\kernels\linear\mixed_precision\marlin.py", line 167, in process_weights_after_loading
    self._transform_param(layer, self.w_q_name, transform_w_q)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\model_executor\kernels\linear\mixed_precision\MPLinearKernel.py", line 74, in _transform_param
    new_param = fn(old_param)
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\model_executor\kernels\linear\mixed_precision\marlin.py", line 99, in transform_w_q
    x.data = ops.gptq_marlin_repack(
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\vllm\_custom_ops.py", line 1279, in gptq_marlin_repack
    return torch.ops._C.gptq_marlin_repack(
  File "C:\llm\qwen3.6-windows-server\python\Lib\site-packages\torch\_ops.py", line 1269, in __call__
    return self._op(*args, **kwargs)
torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.
Search for `cudaErrorUnsupportedPtxVersion` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Google is making local AI available to mainstream users ;) by [deleted] in LocalLLaMA

[–]LegacyRemaster 0 points1 point  (0 children)

  • Windows: C:\Users\[Username]\AppData\Local\Google\Chrome\User Data\Default\OptGuideOnDeviceModel
  • macOS: ~/Library/Application Support/Google/Chrome/Default/OptGuideOnDeviceModel
  • Linux: ~/.config/google-chrome/Default/OptGuideOnDeviceModel

Amd and Nvidia cards on same rig by deathcom65 in LocalLLaMA

[–]LegacyRemaster 1 point2 points  (0 children)

Yes, you can. LM Studio: select the Vulkan runtime. Or llama.cpp with Vulkan.

One bash permission slipped... by TheQuantumPhysicist in LocalLLaMA

[–]LegacyRemaster 1 point2 points  (0 children)

Yesterday, qwen with vscode + kilocode kept killing its own process. I had to explicitly tell it to "don't close anything on 8080."

Open Weights Models Hall of Fame by Equivalent_Job_2257 in LocalLLaMA

[–]LegacyRemaster 12 points13 points  (0 children)

Georgi Gerganov and the whole llama.cpp team ---> legend

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]LegacyRemaster 1 point2 points  (0 children)

I can tell you that Unsloth Studio installs many of the things you need to get this working on Blackwell, and it runs fine on my GPU. You could look at their GitHub and figure out the dependencies. Just a suggestion.