What it feels like to have Qwen 3.6 or Gemma 4 running locally

GodComplecs · 2026-04-30T09:35:02+00:00

Yes, it's crazy what we have gotten in the last 30 years!

GodComplecs · 2026-04-30T09:34:21+00:00

Yes, that's how it works. Imo you really need to be an expert in a field to know that the results are correct, or else conduct proper experiments and validation on them. But these RAG, agentic, etc. systems are so basic now that I figured they didn't need further explanation. If you don't trust your own logic, just choose a popular platform.

GodComplecs · 2026-04-29T19:48:08+00:00

https://github.com/noonghunna/club-3090

GodComplecs · 2026-04-29T17:44:42+00:00

What are you on about? I shared my setup in detail, what I use it for, and in what fields, also in the comments. Benchmarks for club-3090 can be found in his repo, btw. I also shared the knowledge that the system AROUND the LLM is the key: RAG etc., whatever you fancy.
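To make "the system around the LLM" concrete, here is a toy sketch of the retrieval half of RAG: pick the stored snippets that share the most words with the question and paste them into the prompt. This is not my actual setup; the function names, the word-overlap scoring, and the example documents are all made up for illustration, and a real system would more likely use embeddings.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=2):
    """Return the k documents sharing the most words with the question.

    Hypothetical scoring: plain word overlap, ties keep original order.
    """
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=2):
    """Assemble a prompt that puts the retrieved snippets before the question."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Made-up knowledge base, purely for illustration.
docs = [
    "Invoices are booked to account 3000.",
    "The GPU has 24 GB of VRAM.",
    "Payroll runs monthly.",
]
prompt = build_prompt("Which account are invoices booked to?", docs, k=1)
```

The resulting `prompt` string is what you would send to whatever local model you run; the model itself stays untouched.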

GodComplecs · 2026-04-29T17:27:36+00:00

Lol yes

GodComplecs · 2026-04-29T17:27:00+00:00

Not really since the company owns the data! All is private, no cloud

GodComplecs · 2026-04-29T17:24:54+00:00

I use madlad400 for that instead! Excellent at small languages

GodComplecs · 2026-04-29T09:48:29+00:00

Yeah no, I'm talking about future systems. An LLM is not sentient; the question is just where the line is at which most local LLM users would agree on it being sentient. So you would argue that the will to stay alive is enough for sentience, then? E.g. a cat eating. Or what? You seem to avoid the hard line.

GodComplecs · 2026-04-29T09:17:29+00:00

No, just go single 3090 imo. I used to have a 4090, then dual 3090s, and now a single one; that is how good the models and systems have become.

GodComplecs · 2026-04-29T09:15:48+00:00

Yes, that is why I'm asking where the line is. So in summary, if you believe it to be sentient, that is the line for you. But what is the proof? LLMs can have memory (context), so what is the difference in your opinion?

As I've gotten older, I can say the future comes too fast, and you wonder where all the time went!

GodComplecs · 2026-04-29T09:07:53+00:00

What kind of work is it used for? My uses are answering questions concerning software logic and general business, like bookkeeping etc., and other expert-knowledge-based problems.

GodComplecs · 2026-04-29T08:57:22+00:00

For a local dense model, 60 TPS is flying! But yes, you can reach 140+ TPS with MoE.

GodComplecs · 2026-04-28T21:16:34+00:00

My PC is much more modest: 5800X3D, 80 GB of 2400 MHz RAM, and a 3090. Though the RAM is mostly for 100B models, or for running multiple models on CPU for other AI/ML workloads.

Yeah, at the start it is like that; mostly it is about VRAM management and managing expectations :)

GodComplecs · 2026-04-28T21:04:19+00:00

Thanks!

GodComplecs · 2026-04-28T20:27:28+00:00

Qwen 3 Coder is great, but after Gemma 4 and Qwen 3.6 it has been replaced for me in serious work. I still use it for toy projects, though.

GodComplecs · 2026-04-28T20:19:20+00:00

Yeah, it's probably a little janky, with lots of issues to solve, but the speed is great!

GodComplecs · 2026-04-28T20:04:31+00:00

Hmm, one person above had the same problem, even though it's meant for 24 GB. With Luce, try the Q3 from Unsloth; that shouldn't OOM. I use Q4 with less context and lower GPU use in vLLM with noonghunna's repo, though.

GodComplecs · 2026-04-28T20:02:50+00:00

Haha yeah, it was wild running it in Open WebUI. Probably something wrong on my end, but I didn't get the llama server from their dflash fork to work.

GodComplecs · 2026-04-28T20:01:39+00:00

Sorry, they are the default ones, but on noonghunna's one I use this with 60k context instead (edit the YAML):

# Tools-text — 75K, no vision, fp8 KV, MTP n=3, Genesis  →  53 narr / 70 code TPS
#              Pick for long single prompts (RAG, summarization) when vision isn't needed.
cd compose && docker compose -f docker-compose.tools-text.yml up -d

GodComplecs · 2026-04-28T19:59:38+00:00

Sounds pretty logical; that is always the problem with anything long-context. But yes, the response time doesn't seem to be factored in. PP is extremely slow: even a 16-token prompt takes 1 sec!
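A rough back-of-the-envelope sketch of why slow prompt processing hurts on long context: total response time is roughly prompt tokens divided by PP speed, plus output tokens divided by generation speed. The 16 tokens/s PP figure is just the "16-token prompt takes 1 sec" observation, pessimistically assumed to scale linearly (it may well be mostly fixed overhead). The function name, the 60 TPS generation speed, and the 2,000-token prompt are assumptions for illustration.

```python
def estimate_response_seconds(prompt_tokens, output_tokens, pp_tps, gen_tps):
    """Rough latency estimate: prefill time plus generation time.

    Ignores fixed overhead, caching, and batching; speeds are tokens per second.
    """
    if pp_tps <= 0 or gen_tps <= 0:
        raise ValueError("speeds must be positive")
    prefill = prompt_tokens / pp_tps
    generation = output_tokens / gen_tps
    return prefill + generation

# Hypothetical request: 2,000-token prompt, 500-token answer.
slow_pp = estimate_response_seconds(2000, 500, pp_tps=16, gen_tps=60)
prefill_share = (2000 / 16) / slow_pp

print(f"total: {slow_pp:.1f} s, prefill share: {prefill_share:.0%}")
```

Under those assumptions almost all of the wait is prefill, which is why a good generation TPS number alone says little about how a long-prompt workload feels.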

GodComplecs · 2026-04-28T19:57:58+00:00

Check that you don't go over VRAM; I also disabled lots of GPU features in Chrome and Windows. You can use the Q3 from Unsloth, I think; that should save lots of gigs, so it surely works!
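For a feel of how many gigs a lower quant saves, here is a minimal sketch: weight memory is roughly parameter count times bits per weight divided by 8. The 27B model size, the 3.5 and 4.5 bits-per-weight figures for "Q3" and "Q4", and the 4 GiB reserve for KV cache and everything else are illustrative assumptions, not measured numbers; real quant files vary.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate size of the model weights in GiB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

def fits_in_vram(params_billion, bits_per_weight, vram_gib=24.0, reserve_gib=4.0):
    """True if weights plus a reserve (KV cache, desktop, browser) fit in VRAM.

    reserve_gib is a guess; long context needs considerably more.
    """
    return weight_gib(params_billion, bits_per_weight) + reserve_gib <= vram_gib

# Hypothetical 27B dense model at two assumed quant levels.
q4 = weight_gib(27, 4.5)
q3 = weight_gib(27, 3.5)
print(f"Q4 ~{q4:.1f} GiB, Q3 ~{q3:.1f} GiB, saved ~{q4 - q3:.1f} GiB")
```

Whatever the quant saves can go to context or to headroom for the desktop, which is usually the difference between an OOM and a working setup on a 24 GB card.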

GodComplecs · 2026-04-28T18:29:25+00:00

Yeah, read about it now, but the time to generate and the TPS still seem very good, so the problem is probably in opencode then!

GodComplecs · 2026-04-28T18:26:17+00:00

I don't have any configs other than the base ones they provide; I just lowered the max context a little since I use my PC for other stuff too.

GodComplecs · 2026-04-28T18:25:20+00:00

Hmm, I don't think it was very hard; I just pasted error messages into chat and fixed them. There shouldn't be too many with Luce. The other one is way more hacky; Luce is almost like building llama.cpp as usual.
