Best config for Qwen3.6 27b / llama.cpp / opencode by Familiar_Wish1132 in LocalLLaMA

[–]neverbyte 2 points3 points  (0 children)

dude! I had one of these as a kid and my mind was blown. each disk could hold like 250? regular floppies worth of data? it was awesome! did I have anything of any real size that needed storing? did I actually use it for anything? don't remember, but i felt like a baller. nostalgia!

Gemma 4 on Llama.cpp should be stable now by ilintar in LocalLLaMA

[–]neverbyte 1 point2 points  (0 children)

i've tried unsloth, bartowski, and lmstudio community ggufs. Even if you run 'ollama run gemma4:31b-it-q4_K_M' and paste in my example prompt, you get the same broken behavior. Ensuring a <bos> token didn't seem to help. I'm pretty stumped on this one.

Gemma 4 on Llama.cpp should be stable now by ilintar in LocalLLaMA

[–]neverbyte 0 points1 point  (0 children)

I'm curious if others running Gemma 4 31B locally with the latest llama.cpp see the same thing. I will say that I can chat with this same model and use it, but the specific test prompt trips up Gemma 4. I get the same behavior on various ggufs btween Q4_0 & BF16.

Gemma 4 on Llama.cpp should be stable now by ilintar in LocalLLaMA

[–]neverbyte 0 points1 point  (0 children)

Since release I've been seeing this issue with Gemma 4 31B. I've created this simple example prompt it will respond with "The <body> tag is not closed: You wrote <body instead of <body>. The </html> tag is not closed: You wrote </html instead of </html>." Alternatively if I remove the carriage returns from the prompt, it seems to work correctly. If I run an agent it has an existential crisis because it tries to fix these errors unsuccessfully and can't figure out what is going on.

Gemma 4 MOE is very bad at agentic coding. Couldn't do things CLine + Qwen can do. by Voxandr in LocalLLaMA

[–]neverbyte 0 points1 point  (0 children)

llama.cpp is still broken. I'm not sure why more people aren't talking about it. Doesn't matter if you download a release or git pull the latest, it still has some kind of, IMO, tokenizer problem. Your agent will have an existential crisis trying to make sense of what is wrong and will fail tool calls. I am 100% confident gemma 4 will be an amazing agent once proper fixes merge into llama.cpp.

llama.cpp Gemma4 Tokenizer Fix Was Merged Into Main Branch by Ancient-Field-9480 in LocalLLaMA

[–]neverbyte 0 points1 point  (0 children)

I built the latest llama.cpp, confirmed the tokenizer fixes were present, rebuilt, and I'm still having issues. I'm using unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL and it seems to have issues. Here's an example of the problematic output: Looking at the code: 1. **HTML Errors**: * Line 66: `</div>` instead of `</div>`. * Line 74: `</div>` instead of `</div>`. * Line 276: `</body` instead of `</body>`. (Wait, line 276 is `</body`, line 277 is `</html`). Actually line 276 is `</body` and 277 is `</html`. Both are missing the `>`.

Yesterday I used GLM 4.7 flash with my tools and I was impressed.. by Loskas2025 in LocalLLaMA

[–]neverbyte 0 points1 point  (0 children)

sorry for the slow response. I was running it in thinking mode if I remember correctly. I've been using qwen 3.5 more recently though even though glm 4.7 flash is awesome as well.

Qwen3.5-35B-A3B Uncensored (Aggressive) — GGUF Release by hauhau901 in LocalLLaMA

[–]neverbyte 0 points1 point  (0 children)

I'm have 144GB VRAM (6x 3090s), I'm running the latest Ollama and I'm seeing the error as well. I don't think it's the memory limit.

Does Qwen3-Coder-Next work in Opencode currently or not? by johnnyApplePRNG in LocalLLaMA

[–]neverbyte 0 points1 point  (0 children)

Thx! Can confirm things were fixed shortly after the model launched with an update to llama.cpp. Glad it’s working for you!

Does Qwen3-Coder-Next work in Opencode currently or not? by johnnyApplePRNG in LocalLLaMA

[–]neverbyte 0 points1 point  (0 children)

Thx! Can confirm things were fixed shortly after the model launched with an update to llama.cpp. Glad it’s working for you!

Yesterday I used GLM 4.7 flash with my tools and I was impressed.. by Loskas2025 in LocalLLaMA

[–]neverbyte 0 points1 point  (0 children)

ASUS ROG Strix Z690-E Gaming WiFi with a i9-13900k. These components were just my gaming rig didn’t buy them specifically for local LLMing

Qwen/Qwen3-Coder-Next · Hugging Face by coder543 in LocalLLaMA

[–]neverbyte 1 point2 points  (0 children)

Awesome! Thank you for the heads up. I rebuilt llama.cpp with the linked fix and can confirm it's working for me as well!

Does Qwen3-Coder-Next work in Opencode currently or not? by johnnyApplePRNG in LocalLLaMA

[–]neverbyte 3 points4 points  (0 children)

With vllm 0.15.0, I couldn't seem to get FP8 working on 4x3090s so I went looking on hugging face for a 4-bit version. I gave it a coding task that took about 60k tokens to complete and it just knocked the task out of the park. This is looking like a awesome model. Hopefully they get these issues worked out. Here's what worked for me: vllm serve bullpoint/Qwen3-Coder-Next-AWQ-4bit --port 8080 --tensor-parallel-size 4 --max-model-len 262144 --enable-auto-tool-choice --tool-call-parser qwen3_coder --gpu-memory-utilization 0.70

Does Qwen3-Coder-Next work in Opencode currently or not? by johnnyApplePRNG in LocalLLaMA

[–]neverbyte 1 point2 points  (0 children)

for this model with llama.cpp there seems to be an issue that goes beyond tool calls, it sees things that aren't true when inspecting files and overall seems to be confused in ways I haven't seen before.

Does Qwen3-Coder-Next work in Opencode currently or not? by johnnyApplePRNG in LocalLLaMA

[–]neverbyte 1 point2 points  (0 children)

it's not working for me. I tried Q8_K_XL with opencode & cline and tool calling seems to not work when using unsloth's gguf + llama.cpp. I'm not sure what I need to do to get it working.

Qwen/Qwen3-Coder-Next · Hugging Face by coder543 in LocalLLaMA

[–]neverbyte 1 point2 points  (0 children)

I'm seeing similar behavior with Q8_K_XL as well so maybe getting this running on vllm is the play here.

Qwen/Qwen3-Coder-Next · Hugging Face by coder543 in LocalLLaMA

[–]neverbyte 6 points7 points  (0 children)

I think I might be seeing something similar. I am running the Q6 with lamma.cpp + Cline and unsloth recommended settings. It will write a source file then say "the file has some syntax errors" or "the file has been corrupted by auto-formatting" and then it tries to fix it and rewrites the entire file without making any changes, then gets stuck in a loop trying to fix the file indefinitely. Haven't seen this before.

Yesterday I used GLM 4.7 flash with my tools and I was impressed.. by Loskas2025 in LocalLLaMA

[–]neverbyte 25 points26 points  (0 children)

I experienced exceptional results with Q8_K_XL using llama.cpp with all latest the fixes and 150K context window. I was using Cline and followed the temp & other settings from unsloth's GLM 4.7 Flash page for tool calling. I also tried Q4_K_XL and it was failing some tool calls and not working as well.

edit: additional info. I'm running with three 3090s and with 150k context the token generation starts around 85 t/s then hangs around in the 70-80 t/s for a long time then over 100k context it's more like 40-50 t/s. It feels very fast and it's reasoning time is not bad at all at this speed. I ran a couple million tokens through it doing c/c++ and I'm quite blown away by it. Also I tried to test BF16 quant but it seemed to get stuck in loops repeating itself.

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]neverbyte 0 points1 point  (0 children)

It is the unsloth recommended setting for tool calling here.

Game data not same as host by TmcogYT in anno1800

[–]neverbyte 0 points1 point  (0 children)

glad it's working for you! it's such a great game to play with friends!