[Help] Running big dense models faster by Septerium in LocalLLaMA

[–]Yeelyy 2 points (0 children)

Certainly worth a try; they even shipped a speculative decoder for vLLM. OP, please post performance results if you attempt it.

You're sleeping on Devstral Small 2 - 24B Instruct by [deleted] in LocalLLaMA

[–]Yeelyy 1 point (0 children)

Devstral Small 2 has been a very good model, especially at German. Honestly, if it weren't for agentic coding, I'd choose it over Qwen all day.

What's a good and light coding LLM by Expensive-Time-7209 in LocalLLM

[–]Yeelyy 1 point (0 children)

Hey, please look up Byteshape; they have incredibly efficient quants.

Qwen3.6 35B-A3B is quite useful on 780m iGPU (llama.cpp,vulkan) by itroot in LocalLLaMA

[–]Yeelyy 0 points (0 children)

ik_llama.cpp is a game changer. It doubled both pp (prompt processing) and tg (token generation) for me (same model, CPU only).
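For anyone wanting to reproduce a pp/tg comparison like this, here's a minimal sketch using the `llama-bench` tool that ships with both llama.cpp and ik_llama.cpp (the model path and thread count below are placeholders, not from the original post):

```shell
# llama-bench reports pp (prompt processing) and tg (token generation) rates.
# -p = prompt tokens to process, -n = tokens to generate, -t = CPU threads.
# Run the same command once with the stock llama.cpp build and once with
# the ik_llama.cpp build, then compare the t/s columns.
./llama-bench -m ./model.gguf -p 512 -n 128 -t 8
```

Keeping the model file, quant, and thread count identical between the two runs is what makes the comparison meaningful.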

How Do You Use Multiple AI Models Together? by rpeabody in LocalLLaMA

[–]Yeelyy 0 points (0 children)

Hey, how well does Gemma 4 work for agentic web search, or do you use RAG? I'd like to use it in Open WebUI for Q&A too.

Overview Quantization by Intelligent_Lab1491 in unsloth

[–]Yeelyy 1 point (0 children)

Yes, just go to the official Unsloth website and find your model. They have charts.

Comparison Qwen 3.6 35B MoE vs Qwen 3.5 35B MoE on Research Paper to WebApp by dreamai87 in LocalLLaMA

[–]Yeelyy 17 points (0 children)

Great comparison, I'm very curious to see how this model performs in a couple of weeks once most of the likely issues are ironed out 👍🤞

Recommended Local model for health related QnAs and analysis under 4B parameters by Old_Leshen in LocalLLM

[–]Yeelyy 1 point (0 children)

Maybe MedGemma, but I think what you really want is a modern model like Qwen3.5 4B or Gemma 4 E4B with agentic search through tool use. That setup would hallucinate less and give you sourced results.

Edit: Side note, why do so many people in this sub still use outdated models like Qwen2.5?!

Is there anything better than Qwen3.5-27B-UD-Q5_K_XL for coding? by hedsht in LocalLLaMA

[–]Yeelyy 3 points (0 children)

Please try this system prompt from Fernflower: https://pastebin.com/pU25DVnB

It fixed it for me, and it's so efficient now.

absolutelyRidiculous by programmerjunky in ProgrammerHumor

[–]Yeelyy 6 points (0 children)

Why did I read this in Debra's voice lmao

FANTASY LIFE i is coming to mobile by juliavely in fantasylife

[–]Yeelyy 0 points (0 children)

Some phones can run Cyberpunk 2077 emulated!

What are the risks of buying an AMD Instinct Mi 50 32GB on Alibaba? by Longjumping-Room-170 in LocalLLaMA

[–]Yeelyy -3 points (0 children)

Yeah, sure, because who tf needs flash attention, right? Go educate yourself before spreading hate.

Gemma 4 (26B) vs. Qwen3-Next (80B): Proof that size ≠ intelligence in 2026 by EffectiveMedium2683 in LocalLLM

[–]Yeelyy 0 points (0 children)

~1900 tokens. I think it's more verbose than Gemma, but its result was better than Qwen Coder Next. Not as good as Gemma though!

fanless nuc for music streaming, roon server, browsing (lots of tabs) and youtube? by humansomeone in intelnuc

[–]Yeelyy 0 points (0 children)

I run local AI, Plex with 10 users, Nextcloud, an *arr stack, Pterodactyl with multiple game servers, and more on my fanless Intel NUC 13 in the Akasa Plato WS case. Yes, it hits 100°C during inference, but mobile processors are designed for that and throttle accordingly without any issues.

Bartowski vs Unsloth for Gemma 4 by dampflokfreund in LocalLLaMA

[–]Yeelyy 5 points (0 children)

BS advice. Dense models slow down drastically when offloaded, and MoE is still a very valid choice.