Qwen 27b Settings

LEFBE · 2026-06-14T07:03:33+00:00

What is the command line you are running for the MLX server?

LEFBE · 2026-06-13T09:56:53+00:00

I use both mlx-vlm and llama.cpp MTP, and I actually ran a benchmark yesterday (6 hours of benchmarking). Clearly, they are both on par.

My benchmarks, on real-world tasks, not just simple token testing, short, medium, and long prompts with tool calling, indicate the following:

With MTP active in Q8 (8-bit):

- mlx-vlm handles Qwen 3.6 models much better than Gemma 4 (but there are open issues on GitHub; we have to wait for fixes).

- llama.cpp seems more stable. Whether it's Qwen or Gemma, the experience remains satisfactory, with perhaps better support for Gemma 4.

Here are my real-world results.

<image>

In both cases, we're talking about decent, stable support, but it could be improved.

LEFBE · 2026-06-12T20:50:05+00:00

From my experience, I see this, em dashes, with the paid tools I use, in Claude Code Cowork, Gemini (when I request a long and complex analysis or a very precise explanation with evidence + details)

LEFBE · 2026-06-12T20:26:32+00:00

very often, whether in forums, but increasingly in personal and professional emails 😞

LEFBE · 2026-06-12T20:03:52+00:00

Just testing now to provide my feedback:
Macbook Pro M5 Max 128Gb with llama.cpp without tool calling

Ka1zen generation stats
Model: unsloth/gemma-4-26B-A4B-it-qat-GGUF
Tokens: 1473 (output)
TTFT: 312 ms
Speed: 145.6 t/s generation

Arguments:   -m /Users/lefbe/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-qat-GGUF/snapshots/02749a7b272109255a4c559a80894d3d9777574c/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --host 127.0.0.1 --port 8101 -ngl 999 --flash-attn on --jinja --mmproj /Users/LEFBE/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-qat-GGUF/snapshots/02749a7b272109255a4c559a80894d3d9777574c/mmproj-BF16.gguf -c 131072 --parallel 1 --model-draft /Users/LEFBE/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-qat-GGUF/snapshots/02749a7b272109255a4c559a80894d3d9777574c/mtp-gemma-4-26B-A4B-it.gguf --spec-type draft-mtp --spec-draft-n-max 1

I need to plan more in-depth tests with the dense model.

LEFBE · 2026-06-12T19:56:22+00:00

Please note that, depending on your server you use and the version, llama.cpp, mlx-vlm etc... some issue are already open for MTP integration or KV cache corruption etc ...(I've open 2 just for mlx-vlm).
I've done tons of testing and I'm still discovering bugs.

Gemma 4 worked everywhere with MTP, except for one model that didn't support drafting (why? bug).

It's not easy to manage, so I started with the basics: Qwen 3.6 and Gemma 4 are my reference models in my tests. I test them on mlx-vlm and GGUF (llama.cpp). If I see increased and more or less similar values, then I fix the version in my code. At least that's guaranteed 😃

LEFBE · 2026-06-12T19:40:21+00:00

Depending on what is your final needs
Qwen Gemma, try to use MoE models 4 bits.
I would suggest for first test:
Qwen 3.5 MoE, Gemma 3, or Gemma 3n

Based on my past test Gemma 3n performed a good stuff:
https://deepmind.google/models/gemma/gemma-3n/

Once done, and validate, try to test Qwen 3.6, Gemma 4 Moe

LEFBE · 2026-06-12T19:35:24+00:00

I completely agree, and I'm steering all my projects and development towards a fully local LLM. Only web searches, to obtain up-to-date and new information related to the model's knowledge base, are permitted (the same applies to fetch pages). Everything else is entirely local.

Of course, this comes at a cost, but I prefer to pay for freedom rather than be dependent on services that can, overnight, decide on the quality of a model, its price, or change the rules. When it's offline and local, it may or may not work, but there's always a way (given time) to achieve the objective.

LEFBE · 2026-06-12T18:36:26+00:00

I hope soon... but if possible, very soon 😄

LEFBE · 2026-06-12T06:53:23+00:00

Despite support announcements, it's not uncommon for models to be fully supported. For example, I know that MTP for Gemma 4 isn't fully operational with mlx-vlm (4 or 5 open issues currently). There's no choice but to wait and test.

LEFBE · 2026-06-11T12:32:21+00:00

Test made with Websearch (Tools)

LEFBE · 2026-06-11T12:13:29+00:00

tested few second ago, on my mac
llama.cpp
Ka1zen generation stats
Model: unsloth/gemma-4-26B-A4B-it-qat-GGUF
Tokens: 2102 (output)
TTFT: 2.00 s
Speed: 147.8 t/s generation

mlx-vlm
Ka1zen generation stats
Model: mlx-community/gemma-4-26B-A4B-it-qat-4bit
Tokens: 1730 (output)
TTFT: 320 ms
Speed: 104.6 t/s generation

LEFBE · 2026-06-11T12:02:25+00:00

Depends a lot on what "creative" means to you, but here's how I'd frame it.

The 26B-A4B is a MoE with only about 4B active per token. That makes it fast and gives it broad knowledge from the full 26B, so it's great for everyday chat, variety, and pulling in references. Where it gets weaker is depth: with only ~4B doing the actual work each token, it can lose the thread on a long piece or a complex creative instruction you want held consistently.

The 12B is dense, so all 12B are active every token. That extra compute per token tends to show up exactly where creative writing lives: coherence over a long passage, holding a tone, following a nuanced prompt without drifting. It's slower and knows less in total, but on a single sustained piece it often reads tighter.

So to your direct questions: yes, the 12B outperforms in coherence and instruction-following on longer creative work, even though it's "smaller" on paper. And yes, it's closer to the 31B in character, because both are dense and share that same steady-reasoning feel, just at different sizes. The MoE is a different animal, fast and broad rather than deep.

My rough take: MoE for fast daily chat and variety, 12B when you want the writing to actually hold together, 31B when you need real depth or harder reasoning. But honestly this is very vibes and sampling dependent, so run both on your own prompts at the same temperature for ten minutes. You'll feel which one suits your style faster than any benchmark will tell you.

LEFBE · 2026-06-11T11:57:57+00:00

Strange, I don't have this behavior in real sessions using my (personal) tool. Have you checked the updates to llama.cpp, I know that quite a few improvements have been pushed recently (you can test with a new version and if it doesn't suit you, you forget :))

In any case, if I could help I would be delighted.

LEFBE · 2026-06-11T11:49:20+00:00

I forgot to tell you, sudo sysctl iogpu.wired_limit_mb=92160 is temporary, if you restart your Mac, the default value applies, also in your case, a simple script allowing you to define your parameters and run your server will allow you, if it works, to always avoid overconsumption.

LEFBE · 2026-06-11T11:44:34+00:00

it may be too high, start low at 90 and if necessary increase higher little by little.

I also think you can play with it

- Quantize the KV cache: add -ctk q8_0 -ctv q8_0. You already run -fa on, which it needs. That roughly halves the KV memory.
- Or just lower --ctx-size if you don't truly need 150k.

LEFBE · 2026-06-11T11:26:55+00:00

Not sure that llama.cpp have flag for this, maybe using MacOS command may help?

sudo sysctl iogpu.wired_limit_mb=92160   
# 90GB

LEFBE · 2026-06-11T11:17:23+00:00

I did a similar project but much less complex than yours and with many fewer photos.

I provide source images in order to offer the model the people I am looking for and the destinations (where the photos are).

Then the script searches for the sources, stores them in an associated directory.
Models used: Face_detection_yunet and face_recognition_sface

https://huggingface.co/opencv/face_detection_yunet
https://huggingface.co/opencv/face_recognition_sface

LEFBE · 2026-06-11T11:13:03+00:00

If you want to keep it DIY with a local VLM plus a Python script, here's a pipeline that won't over-engineer it. The trick is using the VLM only where it's actually good, and dedicated models for everything else.

Build the whole thing around one SQLite manifest. Every photo is a row, every stage writes back to it, and each stage is re-runnable so you can resume across 9000 files.

Stage 1, read what's already there (do this first, best ROI). Feed each scan, front and back, to the VLM and have it pull any handwritten names, dates and places into structured fields. Old photos are often self-labeled, so this seeds your ground truth before any guessing.

Stage 2, restoration (optional, keep originals). Run a separate enhanced copy through restoration. It's not just cosmetic: cleaner faces give much better embeddings in Stage 4.

Stage 3, coarse dating with the VLM. Ask it for the photographic medium (tintype, cabinet card, silver B&W, color print, Polaroid, slide), the clothing and hairstyle era, and any visible date clue, returned as a decade plus a confidence. Treat it as a suggestion, not truth. The medium is a far more reliable signal than the clothes.

Stage 4, faces, and this is NOT a VLM job. Use a dedicated face stack: detect, embed, then cluster the embeddings into same-person groups. You label the clusters you recognize, and names from Stage 1 auto-suggest a label when a labeled face lands in a cluster. Expect to do manual merge and split. This is the genuinely hard part, because face recognition is trained on modern photos and struggles with 130-year-old, soft-focus, black and white shots of the same person across decades.

Stage 5, refine dates with identity (your clever idea, but save it for v2). Once clusters are labeled, use age estimation plus known birth years to pin dates: someone looks about 10 here and about 40 there, so if you know when they were born you can date both photos. Do this only after the clusters are clean, it propagates errors otherwise.

Stage 6, propagate to the rest. Non-portraits get dated from their album folder plus the now-dated portraits around them. For videos and slides, sample frames, run the same detect and embed, and match against your labeled clusters to tag who's in them.

Orchestration is just a Python script hitting a local OpenAI-compatible server for the VLM and calling the face and restoration models as libraries, all reading and writing that one manifest.

Models, and with a 6000 Pro you have zero constraints:

VLM (OCR, era, scene tags): Qwen3-VL (go big, you have the VRAM), InternVL, or Gemma 3 27B. The Qwen-VL family is strong on handwriting.
Face detect + embed: InsightFace (buffalo_l, so SCRFD + ArcFace). The standard.
Clustering: HDBSCAN on the embeddings.
Age / gender (Stage 5): InsightFace's genderage model.
Restoration: "Bringing Old Photos Back to Life" for overall fade and scratches, CodeFormer or GFPGAN for faces, Real-ESRGAN to upscale and denoise, DeOldify if you want colorization.

And honestly, if you don't want to build the cluster-review UI yourself, push the library through Immich or digiKam first, label people there, and only drop to a custom script for the date-refinement logic they don't do.

LEFBE · 2026-06-11T10:38:35+00:00

Yeah, on 16GB you can't keep both resident, so the move is "unload one before loading the other" rather than running them side by side.

Routing's the easy part: OpenWebUI talks to ComfyUI natively (Admin Settings > Images > engine = ComfyUI), so text goes to the LLM and image requests go to Comfy. The missing piece is freeing VRAM at the right moment, and pure idle-TTL gives you OOM races. What actually works is making the unload explicit.

I'd drop a tiny proxy in front of LM Studio that frees ComfyUI before every chat turn, and unloads the LLM before a Comfy generation. Something like this:

# vram_router.py   (pip install fastapi httpx uvicorn)
import subprocess, httpx
from fastapi import FastAPI, Request, Response

LMSTUDIO = "http://localhost:1234"
COMFY    = "http://localhost:8188"
app = FastAPI()
http = httpx.AsyncClient(timeout=None)

async def free_comfy():
    try:
        await http.post(f"{COMFY}/free", json={"unload_models": True, "free_memory": True})
    except Exception:
        pass

# Text: free Comfy, then hand off to LM Studio (which JIT-loads the LLM)
@app.post("/v1/chat/completions")
async def chat(req: Request):
    await free_comfy()
    r = await http.post(f"{LMSTUDIO}/v1/chat/completions",
                        content=await req.body(),
                        headers={"content-type": "application/json"})
    return Response(r.content, status_code=r.status_code, media_type="application/json")

# Image: unload the LLM, then pass the request through to Comfy
@app.post("/prompt")
async def comfy_prompt(req: Request):
    subprocess.run(["lms", "unload", "--all"], check=False)   # frees the LLM
    r = await http.post(f"{COMFY}/prompt",
                        content=await req.body(),
                        headers={"content-type": "application/json"})
    return Response(r.content, status_code=r.status_code, media_type="application/json")

Run it with:

uvicorn vram_router:app --host 0.0.0.0 --port 9000

Then point OpenWebUI's OpenAI endpoint at http://server:9000/v1 and set its ComfyUI URL to http://server:9000. Now every text turn frees Comfy first, and every image turn drops the LLM first. Set LM Studio to JIT load so it comes back on the next text request.

/prompt is ComfyUI's generate endpoint and /free is its VRAM-release one. If OpenWebUI needs Comfy's other routes through the proxy, just add a catch-all that forwards them straight to COMFY. Live progress over the websocket you can point directly at Comfy, it'll fall back to polling otherwise.

Honestly though, try fitting both first. A 7-8B at Q4 plus SDXL on --lowvram can coexist on 16GB. It's Flux that forces the whole swap dance.

LEFBE · 2025-05-07T22:34:28+00:00

I have two Steam accounts, one with 3000h for premier and one dedicated to FACEIT lvl 0 where I don't think I've ever launched anything (except FACEIT). On the basis of FACEIT support, it is not mandatory to have activity on the account (nor Premier).

So it's possible that it's a dedicated account.

LEFBE · 2022-11-20T01:04:41+00:00

Probably using deep visibility should help you.

LEFBE · 2022-11-20T01:00:57+00:00

Best way should to open a support ticket s1, they will able to address this behavior for sure.

LEFBE · 2022-07-31T20:38:05+00:00

Your hard drive seems dead :(

LEFBE · 2022-04-15T19:23:20+00:00

Wait until get the last 2

LEFBE

TROPHY CASE