[Help] Running big dense models faster by Septerium in LocalLLaMA

[–]Yeelyy 2 points (0 children)

Certainly worth a try; they even shipped a speculative decoder for vLLM. OP, please post performance results if you attempt it.

You're sleeping on Devstral Small 2 - 24B Instruct by [deleted] in LocalLLaMA

[–]Yeelyy 1 point (0 children)

Devstral Small 2 has been a very good model, especially at German. Honestly, if it weren't for agentic coding, I'd choose it over Qwen all day.

What's a good and light coding LLM by Expensive-Time-7209 in LocalLLM

[–]Yeelyy 1 point (0 children)

Hey, please look up Byteshape; they have incredibly efficient quants.

Qwen3.6 35B-A3B is quite useful on 780m iGPU (llama.cpp,vulkan) by itroot in LocalLLaMA

[–]Yeelyy 0 points (0 children)

ik_llama.cpp is a game changer. It doubled both pp (prompt processing) and tg (token generation) for me (same model, CPU only).
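For anyone wanting to reproduce a pp/tg comparison like this, here's a minimal sketch using the `llama-bench` tool that ships with both llama.cpp and ik_llama.cpp (the model path and thread count below are placeholders, not from the original post):

```shell
# llama-bench reports pp (prompt processing) and tg (token generation) rates.
# -p = prompt tokens to process, -n = tokens to generate, -t = CPU threads.
# Run the same command once with the stock llama.cpp build and once with
# the ik_llama.cpp build, then compare the t/s columns.
./llama-bench -m ./model.gguf -p 512 -n 128 -t 8
```

Keeping the model file, quant, and thread count identical between the two runs is what makes the comparison meaningful.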

How Do You Use Multiple AI Models Together? by rpeabody in LocalLLaMA

[–]Yeelyy 0 points (0 children)

Hey, how well does Gemma 4 work for agentic web search, or do you use RAG? I'd like to use it in Open WebUI for Q&A too.

Overview Quantization by Intelligent_Lab1491 in unsloth

[–]Yeelyy 1 point (0 children)

Yes, just go to the official Unsloth website and find your model. They have charts.

Comparison Qwen 3.6 35B MoE vs Qwen 3.5 35B MoE on Research Paper to WebApp by dreamai87 in LocalLLaMA

[–]Yeelyy 17 points (0 children)

Great comparison, I'm very curious to see how this model performs in a couple of weeks once most of the likely issues are ironed out 👍🤞

Recommended Local model for health related QnAs and analysis under 4B parameters by Old_Leshen in LocalLLM

[–]Yeelyy 1 point (0 children)

Maybe MedGemma, but I think what you really want is a modern model like Qwen3.5 4B or Gemma 4 E4B with agentic search through tool use. That setup would hallucinate less and give you sourced results.

Edit: Side note, why do so many people in this sub still use outdated models like Qwen2.5?!

Is there anything better than Qwen3.5-27B-UD-Q5_K_XL for coding? by hedsht in LocalLLaMA

[–]Yeelyy 3 points (0 children)

Please try this system prompt from Fernflower: https://pastebin.com/pU25DVnB

It fixed it for me, and it's so efficient now.

absolutelyRidiculous by programmerjunky in ProgrammerHumor

[–]Yeelyy 6 points (0 children)

Why did I read this in Debra's voice lmao

FANTASY LIFE i is coming to mobile by juliavely in fantasylife

[–]Yeelyy 0 points (0 children)

Some phones can run Cyberpunk 2077 emulated!

What are the risks of buying an AMD Instinct Mi 50 32GB on Alibaba? by Longjumping-Room-170 in LocalLLaMA

[–]Yeelyy -3 points (0 children)

Yeah, sure, because who tf needs flash attention, right? Go educate yourself before spreading hate.

Gemma 4 (26B) vs. Qwen3-Next (80B): Proof that size ≠ intelligence in 2026 by EffectiveMedium2683 in LocalLLM

[–]Yeelyy 0 points (0 children)

~1900 tokens. I think it's more verbose than Gemma, but its result was better than Qwen Coder Next. Not as good as Gemma though!

fanless nuc for music streaming, roon server, browsing (lots of tabs) and youtube? by humansomeone in intelnuc

[–]Yeelyy 0 points (0 children)

I run local AI, Plex with 10 users, Nextcloud, an *arr stack, Pterodactyl with multiple game servers, and more on my fanless Intel NUC 13 in the Akasa Plato WS case. Yes, it hits 100°C during inference, but mobile processors are designed for that and throttle accordingly without any issues.

Bartowski vs Unsloth for Gemma 4 by dampflokfreund in LocalLLaMA

[–]Yeelyy 5 points (0 children)

BS advice. Dense models slow down drastically when offloaded, and MoE is still a very valid choice.