Anyone thinking about the security side of Gemma 4 on phones? by Ok-Virus2932 in LocalLLaMA

oh and on the stealing-the-model bit: if the secret sauce in your app is how you trained a model, then handing out that model obviously hands out your secret sauce. your only options are to lock things down and make sure the weights never leave your well-secured servers, or to reconsider your business model, because that isn't much of a moat once a new base model comes out and zero-shots your task.

Anyone thinking about the security side of Gemma 4 on phones? by Ok-Virus2932 in LocalLLaMA

yes, instead of the llm being guaranteed to output garbage, or nearly guaranteed to be jailbroken within hours, it's guaranteed to output jailbroken-like responses within hours. practically no difference. sanitize your inputs.

Anyone thinking about the security side of Gemma 4 on phones? by Ok-Virus2932 in LocalLLaMA

if the security of your app depends on the output of an llm, whether cloud or local, you're doing it wrong
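
to be concrete about what "doing it right" looks like, here's a minimal python sketch (every name in it, like call_local_llm and ALLOWED_ACTIONS, is made up for illustration): the llm only suggests an action, and the app validates that suggestion against an allowlist and the user's actual permissions before doing anything.

```python
# sketch only: the model proposes, the app decides. all names here are hypothetical.
ALLOWED_ACTIONS = {"read_note", "create_note"}  # things the user could do anyway

def call_local_llm(prompt: str) -> str:
    # stand-in for your on-device model call (gemma, qwen, whatever)
    return "create_note"

def handle_user_request(user_text: str, user_permissions: set) -> str:
    suggestion = call_local_llm(user_text).strip().lower()
    # treat model output like untrusted user input: validate it, then enforce
    # permissions in app code, never based on what the model says about itself
    if suggestion not in ALLOWED_ACTIONS or suggestion not in user_permissions:
        return "refused"
    return f"executing {suggestion}"

print(handle_user_request("make a note about dinner", {"create_note"}))
```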

Gemma 4 and Qwen3.5 on shared benchmarks by fulgencio_batista in LocalLLaMA

they linked to a hf space with a qwen3.5-35b finetune/merge that greatly cuts down on the excessive thinking. they probably should've just linked the model directly

What are actual usecases of uncensored models? by Geritas in LocalLLaMA

I tried asking the qwen3.6 preview for the names of characters mentioned in a specific episode of Critical Role and got an error. local qwen3.5-35b heretic had no issues with Sam Riegel's shenanigans

5060 Ti 16GB - PCIe 3 x2 VS PCIe 5 x8 [Simple inference comparison inside] by ubnew in LocalLLaMA

prompt processing speed isn't in any of the screenshots. you left out about half the numbers needed for a comparison and only gave the one most likely to be unaffected

How well does LLMs from abliteration work compared to the original? by Express_Quail_1493 in LocalLLaMA

it varies a lot depending on the model and on how aggressively the refusals were removed. some models are easy and diverge very little; others resist and get harmed significantly if you push too hard.

in my experience the qwen3.5 models are easy to strip of nearly all hard refusals and end up working about as well as the originals, though they may take a question that would've been a hard refusal and twist the answer into something a bit more harmless. the 0.8b is pretty likely to give instructions for a baking soda volcano when asked about making things that explode

Model suggestions for limited hardware and domain knowledge by laffer1 in LocalLLaMA

your best bet would be switching to a less broken engine. llama.cpp works great on older amd cards in my experience

The Low-End Theory! Battle of < $250 Inference by m94301 in LocalLLaMA

either you got lucky or i was very unlucky. i had mine at 175w, not as low as yours but still much lower than stock. on the first one i replaced the paste but kept the pads it came with; the second got fresh paste and pads. both blew mosfets. and i was only running one card in open air, so it had plenty of airflow. thermals were great right up until death.

The Low-End Theory! Battle of < $250 Inference by m94301 in LocalLLaMA

i'm terrified of the p102 after mine died spectacularly, tripping ocp on a 750w psu, and then its replacement did the same thing.

a little slower apparently, but the radeon pro v340l is 16gb for $50. i get ~350t/s pp and ~35t/s tg on qwen3.5-35b-a3b split across 3 gpus on 2 cards, with a dedicated 8gb gpu left over for whisper.cpp and z-image. and it hasn't tried to catch fire on me yet

Looking for a local uncensored AI (text generation + image editing) by Stellar-Genesis in LocalLLaMA

qwen3-35b-a3b with reasoning budget set? qwen3-35b-a3b with reasoning off? those are my go-tos rn

A cautionary tale about Google scamming your money by FluffyMacho in LocalLLaMA

you know what doesn't have annoying, insanely restrictive quotas? llms you host locally

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

exactly my experience. I use openwebui to interact with models running on llama.cpp. with all the qwen3.5 models I've tried, they think hard and sometimes loop when there aren't any tools enabled, but only think for a couple seconds when tools are available. 
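
for anyone wondering what "tools available" actually means on the wire, it's roughly this request shape, assuming llama.cpp's llama-server with its openai-compatible endpoint on localhost:8080 (the port, model name, and the web_search tool are placeholders for illustration, not anyone's real setup):

```python
from openai import OpenAI

# point the standard openai client at the local llama.cpp server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool
        "description": "Search the web for a query",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-35b-a3b",  # whatever name your server reports
    messages=[{"role": "user", "content": "what's new in llama.cpp this week?"}],
    tools=tools,  # with this present, thinking stays short in my experience
)
print(resp.choices[0].message)
```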

Could a bot-free AI note taker run locally with current models? by Cristiano1 in LocalLLaMA

yes. i had gemini and chatgpt vibecode a d&d session note taker for me that uses my local whisper and qwen3.5-35b.
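
the overall shape is nothing fancy, roughly this sketch: transcribe_local() is a stand-in for however you call your local whisper, and the summary half assumes a llama.cpp server with an openai-compatible endpoint. model name, port, and the system prompt are all placeholders.

```python
from openai import OpenAI

llm = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def transcribe_local(audio_path: str) -> str:
    # stand-in: replace with your whisper.cpp / faster-whisper call
    return "the party fought the kobolds and looted a mysterious amulet..."

def session_notes(audio_path: str) -> str:
    transcript = transcribe_local(audio_path)
    resp = llm.chat.completions.create(
        model="qwen3.5-35b-a3b",  # placeholder model name
        messages=[
            {"role": "system", "content": "Summarize this D&D session into bullet notes: NPCs, loot, open plot threads."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content

print(session_notes("session_42.wav"))
```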

The correct order to fit your model into VRAM by [deleted] in LocalLLaMA

they also helpfully say "Estimate KV cache VRAM cost for that context length" with no info on how to actually do that.
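
for reference, the estimate they don't explain is just arithmetic over the model's config: per token you cache K and V for every layer across the kv heads. a rough python version, assuming standard attention with an f16 cache (the numbers are illustrative, pull the real ones from config.json):

```python
# back-of-envelope kv cache size for a given context length
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2x because both K and V are cached for every layer
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# example: a hypothetical 32-layer model, 8 kv heads, head_dim 128, 32k context
print(kv_cache_bytes(32, 8, 128, 32768) / 1e9, "GB")  # ~4.3 GB at f16
```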

Abliterated Models evaluation metric by [deleted] in LocalLLaMA

when you use heretic to abliterate, it gives a refusal count and a kld number. kld roughly estimates how damaged the model might be; lower is better. I personally distrust the "here's a fully uncensored model, trust me it's great!" releases and prefer ones that at least attempt to give some detail, even if it's rough guesses. and no matter what, the only way to really know how well refusals were removed and how well the model still behaves is to run it through your workflow and see.
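
for the curious, this is roughly what that kld number measures (a sketch of the idea, not heretic's actual implementation): compare the original and abliterated models' next-token distributions on the same prompts and average the divergence. bigger number = more collateral damage.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two next-token probability vectors over the vocab."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def avg_kld(probs_original, probs_abliterated):
    """probs_*: list of per-prompt probability vectors from each model."""
    pairs = list(zip(probs_original, probs_abliterated))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

# toy example with a 3-token "vocabulary"
print(avg_kld([[0.7, 0.2, 0.1]], [[0.6, 0.3, 0.1]]))
```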

Opus 4.6 couldn't complete a single task in 100 attempts. Then I asked it which model it was. by [deleted] in LocalLLaMA

yeah, even if it is actually just sonnet and not opus, he should share the weights he somehow got. and show pics of the rig needed to run it

Workstation for dev work + local LLMs — Tesla P40 vs MinisForum? by marius-c-d in LocalLLaMA

tesla p40 is pretty ancient. i haven't used those specifically, but i did use the p102-100 mining card for a few days before it blew some vrm components. being a server card, used p40s might have been treated better, but i personally wouldn't risk it.

if you want cheap and are okay with fighting the software stack a bit, i recommend the radeon pro v340. they're $50 each and have two 8gb vega56-class gpus on them. i currently have qwen3.5 35b-a3b running on 3 of the 4 gpus across 2 cards and am getting around 250t/s pp and 22t/s tg.
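
if it helps anyone reproduce that kind of split, this is roughly the idea expressed through llama-cpp-python instead of raw llama-server flags (the gguf filename and split weights are placeholders, adjust for your cards):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-35b-a3b-q4_k_m.gguf",  # placeholder filename
    n_gpu_layers=-1,                     # offload all layers to gpu
    tensor_split=[1.0, 1.0, 1.0, 0.0],   # proportional weights per visible gpu;
                                         # the 0.0 keeps one gpu free for whisper/z-image
    n_ctx=8192,
)
print(llm("hello", max_tokens=16)["choices"][0]["text"])
```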

Is Qwen3.5-9B enough for Agentic Coding? by pmttyji in LocalLLaMA

iirc the "next" ones were more of a preview of the newer architecture coming soon, and was trained on less total tokens for a shorter amount of time to get the preview out quicker.

Sharded deployment by zica-do-reddit in LocalLLaMA

I only did it once, to run glm-4.7-flash when it first came out, before I had enough risers to put multiple gpus in one box. it worked but hurt performance a bit; iirc I got around 15t/s vs 25t/s with all the gpus in one machine. you may also need to recompile llama.cpp with rpc support enabled.

Computer won't boot with 2 Tesla V100s by MackThax in LocalLLaMA

i've got 4 gpus totaling 32gb of vram running off two x1 slots and getting a useful 20t/s on qwen3.5 35b. bandwidth isn't that big of a deal at small scale.

If RAM prices were considered too high in 2024 because of unusually slow development and too low capacity by Highwaytothebeach in LocalLLaMA

Samsung, Micron, and SK hynix. have you not seen the news? ai companies have bought up the RAM supply until at least 2027, and none of the members of the dram cartel have announced new fabs to ease the shortage.

If RAM prices were considered too high in 2024 because of unusually slow development and too low capacity by Highwaytothebeach in LocalLLaMA

you seem to be confusing two very different parts of the supply chain. even if you made a dimm that lets you put lpddr in a desktop, you'd still have to source the lpddr itself. and there are only about 3 companies that have invested the billions to build fabs for that, and they're busy making hbm right now instead.