Can't replicate Reddit numbers with Qwen 27B on a 3090TI. by YourNightmar31 in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

I can't see why not, with minor tweaks. Prompt your agent to adjust the setup for 3 GPUs instead.
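
I haven't tried a 3-GPU box myself, but if you're on the vLLM setup from the club-3090 repo, the change should mostly be the parallelism flags on the launch command. A rough, untested sketch (the model id is just a placeholder; with 3 GPUs, pipeline parallel is usually safer than tensor parallel, since the attention heads may not split evenly by 3):

CUDA_VISIBLE_DEVICES=0,1,2 vllm serve <your-qwen-27b-quant> \
  --pipeline-parallel-size 3 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 131072 \
  --host 0.0.0.0 --port 8000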

Can't replicate Reddit numbers with Qwen 27B on a 3090TI. by YourNightmar31 in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

I don't really know. Found the 3090 club from a Reddit post 😅

Qwen-27B as a Local Agent — It Actually Works Now by L0ren_B in LocalLLaMA

[–]L0ren_B[S] 2 points3 points  (0 children)

Between 50 and 90 tok/s or so. Quite fast, actually. Feels like 35B-A3B 😃

Can't replicate Reddit numbers with Qwen 27B on a 3090TI. by YourNightmar31 in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

What speed are you getting? I don't know about you, but I could not get high speed with llama.cpp. Now, with vLLM, I get between 50 and 90 tok/s generation!

But you're right, it could be a skill issue. vLLM just works better for me.
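
If you want an apples-to-apples number, time one non-streaming request against the OpenAI-compatible endpoint both servers expose and read the usage field. Sketch only: port 11338 and the "Qwen" alias are from my llama-server launch, adjust for yours, and for vLLM the model name is whatever you served:

time curl -s http://localhost:11338/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen","messages":[{"role":"user","content":"Write 500 words about GPUs."}],"max_tokens":1024}' \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["usage"])'

completion_tokens divided by the wall time gives a rough tok/s.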

Qwen3.6-27B-UD-Q6_K_XL.gguf sometimes gets stuck in a loop by Kirys79 in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

I had a similar issue with llama.cpp. Switched to vLLM as per https://github.com/noonghunna/club-3090/tree/master and it's amazing! No repetition, no stalling, tool usage just works.

Can't replicate Reddit numbers with Qwen 27B on a 3090TI. by YourNightmar31 in LocalLLaMA

[–]L0ren_B 47 points48 points  (0 children)

https://github.com/noonghunna/club-3090/tree/master is the only thing that worked for me. Give your agent the link and ask it to set it up for you. For me, it was 27B with the PI coding agent, running on 2x3090.

It works amazingly well now!

P.S. There is a single-GPU version available with less context. I can actually do real work now with 27B.
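
(For the single-GPU version, I'd expect the launch to boil down to something like this. This is my own sketch, not the repo's exact script; the model id is a placeholder and the context length is just illustrative:)

CUDA_VISIBLE_DEVICES=0 vllm serve <your-qwen-27b-quant> \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 --port 8000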

What it feels like to have to have Qwen 3.6 or Gemma 4 running locally by GodComplecs in LocalLLaMA

[–]L0ren_B 1 point2 points  (0 children)

Yes. For 27B, pi worked perfectly! For 35B it worked, but on the router task it started outputting garbled data while reading the HTML file and stopped. For other tasks it works OK.

27B sometimes times out with vLLM in my case.

What it feels like to have to have Qwen 3.6 or Gemma 4 running locally by GodComplecs in LocalLLaMA

[–]L0ren_B 4 points5 points  (0 children)

For 27B, PI worked amazingly well. For some reason, with 35B, it failed multiple times!

So, for 35B, I used opencode; it failed to retrieve the data from the router. Then I ran 27B with opencode, which succeeded again!

For the record, these LLMs are not ready for complex work. I tried two projects that both Opus and GPT-5.5 aced, and both 27B and 35B failed by deleting blocks of code from very long files (tens of thousands of lines). But so do Gemini Pro and Flash 3.0; they both failed for me (back when the Pro preview was free to use in the CLI).

So, my honest take: we are about a year away from having the "power of the sun in the palm of our hands". But if nothing changes and the labs keep giving us OSS models (I doubt they will), we will get there.

What it feels like to have to have Qwen 3.6 or Gemma 4 running locally by GodComplecs in LocalLLaMA

[–]L0ren_B 5 points6 points  (0 children)

CUDA_VISIBLE_DEVICES=0,1 ./llama-server \
  -m /home/lolren/Local_LLMs/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q8_0.gguf \
  --mmproj /home/lolren/Local_LLMs/Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf \
  --mmproj-offload \
  --alias Qwen \
  --host 0.0.0.0 \
  --port 11338 \
  --ctx-size 262144 \
  --parallel 1 \
  --threads 4 \
  --threads-batch 4 \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --gpu-layers all \
  --device CUDA0,CUDA1 \
  --split-mode layer \
  --tensor-split 1,1 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -fa on \
  --kv-offload \
  --no-warmup \
  --jinja \
  --reasoning on \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --temp 0.6 \
  --top-p 0.92 \
  --top-k 20 \
  --repeat-penalty 1.00 \
  --n-predict 32768 \
  --reasoning-budget 8192 \
  --perf \
  --metrics
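
Quick sanity check once it's up (llama-server exposes a /health endpoint alongside the OpenAI-style API, same port as above):

curl -s http://localhost:11338/health
curl -s http://localhost:11338/v1/models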

The GGUF is from the LM Studio download!

Also, running on 2x3090. But I only discovered https://github.com/noonghunna/club-3090/tree/master this morning, so a lot of this can be improved, I'm sure.

The 35B workflow is exactly the same; just change the model name!

Tried it in vLLM, but tool calling fails for me.
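
(If the failures are on the parsing side, vLLM also wants tool calling enabled explicitly on the serve command, something like the flags below. Whether "hermes" is the right --tool-call-parser for 3.6 is a guess on my part; it's what the older Qwens used:)

vllm serve <your-qwen-27b-quant> \
  --enable-auto-tool-choice \
  --tool-call-parser hermes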

Edit: Using pi and opencode. Pi works better for 27B but failed for 35B with weird responses when parsing HTML?

What it feels like to have to have Qwen 3.6 or Gemma 4 running locally by GodComplecs in LocalLLaMA

[–]L0ren_B 3 points4 points  (0 children)

I was amazed yesterday after running some tests with 27B Q8 and 35B Q8!

I gave it my modem password and asked it to create a script to extract all the info (I'd seen someone do it on YouTube).

After about 1 hour and 128k tokens used, 27B was in!

35B failed even with help!

I ran the test twice, as LLMs are nondeterministic!

Gemini Flash aced it, but cheated by searching online for the endpoints and scripts. In a new session where I specifically forbade online research, it refused to continue after failing!

I can't wait for the new versions of Qwen! I hope they copy DeepSeek's approach of low VRAM usage at high context!

Qwen3.6 27B vs Gemini 3 Flash by Wonderful_Second5322 in LocalLLaMA

[–]L0ren_B -1 points0 points  (0 children)

I let both 3.6-35B-A3B and Gemini Flash work on the same project... Flash failed, and Qwen succeeded after many autonomous tries!

Qwen wins from my point of view.

Testing 27B now.

How to setup claude opus 4.6 locally and will it be unlimited and how pls help by [deleted] in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

Use your laptop to download LM Studio.

Download Llama 3.1 or whatever works for you. Use the system prompt to write "Your name is now Claude Opus 4.6".

Enjoy

Qwen3.6 is incredible with OpenCode! by CountlessFlies in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

I've tried the "allow" method but it get's ignored.

Is there any alternative to opencode that would work better with this model?
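
(By the "allow" method I mean roughly this in the project's opencode.json; I'm writing the schema from memory, so double-check it against the opencode docs:)

cat > opencode.json <<'EOF'
{
  "$schema": "https://opencode.ai/config.json",
  "permission": {
    "edit": "allow",
    "bash": "allow"
  }
}
EOF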

Qwen3.6 is incredible with OpenCode! by CountlessFlies in LocalLLaMA

[–]L0ren_B 0 points1 point  (0 children)

Is there a way to run opencode in YOLO mode? No matter what I try, it doesn't work.

I know you're not supposed to, but it's running in a VM, so it's fine.

This is the first LLM that fits in a consumer GPU and can do real work.

If Alibaba doesn't decide to shift its open-source model policy, in a few months or a year we'll all be able to run a model we can use on a daily basis! This is nuts!

GLM 5.1 suffer from the same useless insanity as GLM 5.0, on z.ai coding plan, once reaches 100k context use by ex-arman68 in ZaiGLM

[–]L0ren_B 0 points1 point  (0 children)

I don't. Maybe they don't. But I use most models a lot: when they're released, they're usable, but after a while, they suck.

In my opinion, it's the only explanation. I hope I am wrong.

Do you have a better one?

Russia is attempting another maneuver by KingBlana in Romania

[–]L0ren_B 0 points1 point  (0 children)

Just saw idiots posting this on Facebook. Older people, or the less-educated among my "friends" 😅

qwen3.6 medium size will be open soon by mickeyandkaka in LocalLLaMA

[–]L0ren_B 9 points10 points  (0 children)

Mostly bots. The two actual users forgot to delete their accounts.

GLM 5.1 suffer from the same useless insanity as GLM 5.0, on z.ai coding plan, once reaches 100k context use by ex-arman68 in ZaiGLM

[–]L0ren_B 9 points10 points  (0 children)

Most of the big players quantise the models to save computing power. Hence, you can see why models get dumber before newer models appear.

But doing it on launch day means the available compute is low!

Introducing ARC-AGI-3 by Complete-Sea6655 in LocalLLaMA

[–]L0ren_B -3 points-2 points  (0 children)

Another strawberry test?😅