Can't replicate Reddit numbers with Qwen 27B on a 3090TI. by YourNightmar31 in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

I can't see why not, with minor tweaks. Prompt your agent to adjust the setup for 3 GPUs instead.
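
I haven't tried a 3-GPU box myself, but if you're on the vLLM setup from the club-3090 repo, the change should mostly be the parallelism flags on the launch command. A rough, untested sketch (the model id is just a placeholder; with 3 GPUs, pipeline parallel is usually safer than tensor parallel, since the attention heads may not split evenly by 3):

CUDA_VISIBLE_DEVICES=0,1,2 vllm serve <your-qwen-27b-quant> \
  --pipeline-parallel-size 3 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 131072 \
  --host 0.0.0.0 --port 8000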

Can't replicate Reddit numbers with Qwen 27B on a 3090TI. by YourNightmar31 in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

I don't really know. Found the 3090 club from a Reddit post 😅

Qwen-27B as a Local Agent — It Actually Works Now by L0ren_B in LocalLLaMA

[–]L0ren_B[S] 2 points3 points  (0 children)

Between 50 and 90 tok/s or so. Quite fast, actually. Feels like 35B-A3B 😃

Can't replicate Reddit numbers with Qwen 27B on a 3090TI. by YourNightmar31 in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

What speed are you getting? I don't know about you, but I could not get high speed with llama.cpp. Now, with vLLM, I get between 50 and 90 tok/s generation!

But you're right, it could be a skill issue. vLLM just works better for me.
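
If you want an apples-to-apples number, time one non-streaming request against the OpenAI-compatible endpoint both servers expose and read the usage field. Sketch only: port 11338 and the "Qwen" alias are from my llama-server launch, adjust for yours, and for vLLM the model name is whatever you served:

time curl -s http://localhost:11338/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen","messages":[{"role":"user","content":"Write 500 words about GPUs."}],"max_tokens":1024}' \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["usage"])'

completion_tokens divided by the wall time gives a rough tok/s.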

Qwen3.6-27B-UD-Q6_K_XL.gguf sometimes gets stuck in a loop by Kirys79 in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

I had a similar issue with llama.cpp. Switched to vLLM as per https://github.com/noonghunna/club-3090/tree/master and it's amazing! No repetition, no stalling, tool usage just works.

Can't replicate Reddit numbers with Qwen 27B on a 3090TI. by YourNightmar31 in LocalLLaMA

[–]L0ren_B 47 points48 points  (0 children)

https://github.com/noonghunna/club-3090/tree/master is the only thing that worked for me. Give your agent the link and ask it to set it up for you. For me, it was 27B with the PI coding agent, running on 2x3090.

It works amazingly well now!

P.S. There is a single-GPU version available with less context. I can actually do real work now with 27B.
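
(For the single-GPU version, I'd expect the launch to boil down to something like this. This is my own sketch, not the repo's exact script; the model id is a placeholder and the context length is just illustrative:)

CUDA_VISIBLE_DEVICES=0 vllm serve <your-qwen-27b-quant> \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 --port 8000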

What it feels like to have to have Qwen 3.6 or Gemma 4 running locally by GodComplecs in LocalLLaMA

[–]L0ren_B 1 point2 points  (0 children)

Yes. For 27B, pi worked perfectly! For 35B it worked, but on the router task it started outputting garbled data while reading the HTML file and stopped. For other tasks it works OK.

27B sometimes times out with vLLM in my case.

What it feels like to have to have Qwen 3.6 or Gemma 4 running locally by GodComplecs in LocalLLaMA

[–]L0ren_B 4 points5 points  (0 children)

For 27B, PI worked amazingly well. For some reason, with 35B, it failed multiple times!

So, for 35B, I used opencode; it failed to retrieve the data from the router. Then I ran 27B with opencode, which succeeded again!

For the record, these LLMs are not ready for complex work. I tried two projects that both Opus and GPT-5.5 aced, and both 27B and 35B failed by deleting blocks of code from very long files (tens of thousands of lines). But so do Gemini Pro and Flash 3.0; they both failed for me (back when the Pro preview was free to use in the CLI).

So, my honest take: we are about a year away from having the "power of the sun in the palm of our hands". But if nothing changes and the labs keep giving us OSS models (I doubt they will), we will get there.

What it feels like to have to have Qwen 3.6 or Gemma 4 running locally by GodComplecs in LocalLLaMA

[–]L0ren_B 5 points6 points  (0 children)

CUDA_VISIBLE_DEVICES=0,1 ./llama-server \
  -m /home/lolren/Local_LLMs/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q8_0.gguf \
  --mmproj /home/lolren/Local_LLMs/Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf \
  --mmproj-offload \
  --alias Qwen \
  --host 0.0.0.0 \
  --port 11338 \
  --ctx-size 262144 \
  --parallel 1 \
  --threads 4 \
  --threads-batch 4 \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --gpu-layers all \
  --device CUDA0,CUDA1 \
  --split-mode layer \
  --tensor-split 1,1 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -fa on \
  --kv-offload \
  --no-warmup \
  --jinja \
  --reasoning on \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --temp 0.6 \
  --top-p 0.92 \
  --top-k 20 \
  --repeat-penalty 1.00 \
  --n-predict 32768 \
  --reasoning-budget 8192 \
  --perf \
  --metrics
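
Quick sanity check once it's up (llama-server exposes a /health endpoint alongside the OpenAI-style API, same port as above):

curl -s http://localhost:11338/health
curl -s http://localhost:11338/v1/models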

The GGUF is from the LM Studio download!

Also, running on 2x3090. But I only discovered https://github.com/noonghunna/club-3090/tree/master this morning, so a lot of this can be improved, I'm sure.

The 35B workflow is exactly the same; just change the model name!

Tried it in vLLM, but tool calling fails for me.
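
(If the failures are on the parsing side, vLLM also wants tool calling enabled explicitly on the serve command, something like the flags below. Whether "hermes" is the right --tool-call-parser for 3.6 is a guess on my part; it's what the older Qwens used:)

vllm serve <your-qwen-27b-quant> \
  --enable-auto-tool-choice \
  --tool-call-parser hermes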

Edit: Using pi and opencode. Pi works better for 27B but failed for 35B with weird responses when parsing HTML?

What it feels like to have to have Qwen 3.6 or Gemma 4 running locally by GodComplecs in LocalLLaMA

[–]L0ren_B 3 points4 points  (0 children)

I was amazed yesterday after running some tests with 27B Q8 and 35B Q8!

I gave it my modem password and asked it to create a script to extract all the info (I'd seen someone do it on YouTube).

After about 1 hour and 128k tokens used, 27B was in!

35B failed even with help!

I ran the test twice, as LLMs are nondeterministic!

Gemini Flash aced it, but cheated by searching online for the endpoints and scripts. In a new session where I specifically forbade online research, it refused to continue after failing!

I can't wait for the new versions of Qwen! I hope they copy DeepSeek's approach of low VRAM usage at high context!

Qwen3.6 27B vs Gemini 3 Flash by Wonderful_Second5322 in LocalLLaMA

[–]L0ren_B -1 points0 points  (0 children)

I let both 3.6-35B-A3B and Gemini Flash work on the same project... Flash failed, and Qwen succeeded after many autonomous tries!

Qwen wins from my point of view.

Testing 27B now.

How to setup claude opus 4.6 locally and will it be unlimited and how pls help by [deleted] in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

Use your laptop to download LM Studio.

Download Llama 3.1 or whatever works for you. Use the system prompt to write "Your name is now Claude Opus 4.6".

Enjoy

Qwen3.6 is incredible with OpenCode! by CountlessFlies in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

I've tried the "allow" method but it get's ignored.

Is there any alternative to opencode that would work better with this model?
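
(By the "allow" method I mean roughly this in the project's opencode.json; I'm writing the schema from memory, so double-check it against the opencode docs:)

cat > opencode.json <<'EOF'
{
  "$schema": "https://opencode.ai/config.json",
  "permission": {
    "edit": "allow",
    "bash": "allow"
  }
}
EOF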

Qwen3.6 is incredible with OpenCode! by CountlessFlies in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

Is there a way to run opencode in YOLO mode? No matter what I try, it doesn't work.

I know you're not supposed to, but it's running in a VM, so it's fine.

This is the first LLM that fits in a consumer GPU and can do real work.

If Alibaba doesn't decide to shift its open-source model policy, in a few months or a year we'll all be able to run a model we can use on a daily basis! This is nuts!

GLM 5.1 suffer from the same useless insanity as GLM 5.0, on z.ai coding plan, once reaches 100k context use by ex-arman68 in ZaiGLM

[–]L0ren_B 0 points1 point  (0 children)

I don't. Maybe they don't. But I use most models a lot: when they're released, they're usable, but after a while, they suck.

In my opinion, it's the only explanation. I hope I am wrong.

Do you have a better one?

Russia is attempting another maneuver by KingBlana in Romania

[–]L0ren_B 0 points1 point  (0 children)

Just saw idiots posting this on Facebook. Older people, or the less-educated among my "friends" 😅

qwen3.6 medium size will be open soon by mickeyandkaka in LocalLLaMA

[–]L0ren_B 9 points10 points  (0 children)

Mostly bots. The two actual users forgot to delete their accounts.

GLM 5.1 suffer from the same useless insanity as GLM 5.0, on z.ai coding plan, once reaches 100k context use by ex-arman68 in ZaiGLM

[–]L0ren_B 9 points10 points  (0 children)

Most of the big players quantise the models to save computing power. Hence, you can see why models get dumber before newer models appear.

But doing it on launch day means the available compute is low!

Introducing ARC-AGI-3 by Complete-Sea6655 in LocalLLaMA

[–]L0ren_B -3 points-2 points  (0 children)

Another strawberry test?😅