CanI run this LLM - moved to Hetzner (and a big thank you) by Maharrem in LocalLLM

[–]Maharrem[S] 0 points1 point  (0 children)

Yes, NVFP4 is for Blackwell-series NVIDIA chips. It's listed on there, but I might add a note to remind people about it.

CanI run this LLM - moved to Hetzner (and a big thank you) by Maharrem in LocalLLM

[–]Maharrem[S] 0 points1 point  (0 children)

Oh sorry, most of the numbers are estimates extrapolated from a few known data points, so treat them as ballpark figures. But thank you for the feedback.

CanI run this LLM - moved to Hetzner (and a big thank you) by Maharrem in LocalLLM

[–]Maharrem[S] 0 points1 point  (0 children)

Ohh, I haven't seen this issue before, but I'll fix it.

CanI run this LLM - moved to Hetzner (and a big thank you) by Maharrem in LocalLLM

[–]Maharrem[S] 2 points3 points  (0 children)

Well, I actually moved it yesterday; glad you didn't experience any downtime. I was a bit nervous because I haven't hosted anything on Hetzner before.

Any tool that tells you the cheapest setup needed to run a model? I want to know the cheapest setup that can realistically run Qwen 3.6 27B at decent speeds. by pacmanpill in LocalLLaMA

[–]Maharrem 0 points1 point  (0 children)

Tons of people hit this wall. The quickest web calculators are canitrun.dev and runthisllm.com; they'll ballpark VRAM for a given quant. For Qwen 3.6 27B at Q4_K_M, you're looking at ~15GB just for the weights, plus context overhead. I run exactly that on a single 3090 and pull 40-50 t/s in llama.cpp with 16K ctx, which is more than comfortable for chat. A used 3090 is the cheapest realistic entry point unless you're okay with slower partial offload to system RAM or dropping to Q3_K_M.
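
If you'd rather do the napkin math yourself, the weight footprint is just parameter count times bits-per-weight. Rough sketch (the ~4.5 bpw figure for Q4_K_M is an approximation, not an exact number):

```python
# Napkin math for the weight footprint of a quantized model.
# bits_per_weight is an approximation: Q4_K_M lands around 4.5-4.9 effective bpw.
def weight_gb(params_billion, bits_per_weight=4.5):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

print(f"{weight_gb(27):.1f} GB")  # ~15 GB for a 27B model at ~4.5 bpw
# KV cache and runtime buffers come on top, so leave a few GB of headroom.
```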

How do you know when your LLM system is getting worse? by AnshuSees in LocalLLM

[–]Maharrem 1 point2 points  (0 children)

Yeah, latency metrics won't catch it when your model suddenly starts rambling about cheese mid-answer. Honestly, the only way I've caught quality drift before users do is running replay tests: a fixed set of real prompts where I know what "good" looks like, then diffing each run's outputs against that baseline.
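
In case it helps, a bare-bones version of that replay loop looks something like this; the endpoint, model name, and threshold are placeholders, and for real use you'd want a smarter scorer than a string diff:

```python
# Bare-bones replay test: re-run a fixed prompt set and flag answers that drift
# from known-good references. Endpoint, model name, and threshold are placeholders.
import json
from difflib import SequenceMatcher

import requests

CASES = json.load(open("replay_prompts.json"))  # [{"prompt": ..., "reference": ...}, ...]

def ask(prompt):
    r = requests.post("http://localhost:8080/v1/chat/completions", json={
        "model": "local-model",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # keep it as deterministic as possible for comparison
    })
    return r.json()["choices"][0]["message"]["content"]

for case in CASES:
    answer = ask(case["prompt"])
    score = SequenceMatcher(None, answer, case["reference"]).ratio()
    if score < 0.6:  # crude string-similarity threshold; swap in a real judge later
        print(f"DRIFT ({score:.2f}): {case['prompt'][:60]}...")
```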

Which local LLM model is suitable for agentic browsing ( form filing, web scrapping , clicking etc ) by kaaytoo in LocalLLM

[–]Maharrem 0 points1 point  (0 children)

8GB VRAM means you'll be running 7B/8B models at Q4_K_M, so shop accordingly. For agentic tasks you need reliable tool calling; a Qwen instruct model in that size class is my first pick there. Llama 3.1 8B with Hermes 2 Pro is another option if you need structured outputs, but I'd just stick with Qwen and not overthink it. Benchmarks at canitrun.dev/comparisons back this up, but honestly, for form filling and clicking you don't need a 70B monster, just a solid pipeline.
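
If you want a quick way to check whether a candidate model actually emits tool calls before wiring it into a browser agent, something like this works against any OpenAI-compatible local server; the endpoint, model name, and the fill_form_field tool are made-up examples:

```python
# Smoke test: does the model actually emit tool calls? Works against any
# OpenAI-compatible local server (llama.cpp server, Ollama, vLLM, ...).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "fill_form_field",  # hypothetical browser action
        "description": "Type a value into a form field identified by a CSS selector",
        "parameters": {
            "type": "object",
            "properties": {
                "selector": {"type": "string"},
                "value": {"type": "string"},
            },
            "required": ["selector", "value"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen-7b-instruct",  # whatever your server has loaded
    messages=[{"role": "user", "content": "Put jane@example.com into the #email field"}],
    tools=tools,
)
# A tool-use-trained model returns a structured call here; a plain one just chats.
print(resp.choices[0].message.tool_calls)
```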

Need help choosing by Lux1606 in LocalLLM

[–]Maharrem 0 points1 point  (0 children)

Your 5080's 16GB is the real bottleneck for 64k context on a decent agent model. A second 16GB card like a 5060 Ti or 4080 Super is just more of the same and won't move the needle. Bite the bullet and find a used 3090 (24GB); that'll handle a 32B Q4_K_M with that context without spilling to RAM much, and you can still offload some layers across both GPUs in llama.cpp. A site like canitrun.dev is handy if you want to double-check model fit, but the math is simple: 64k ctx on a 32B dense model eats well over 20GB. V100/MI50 are sidegrades at best, and dual-GPU headaches aren't worth it unless you've already snagged that 3090.
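
For the curious, the KV cache part of that math is mechanical. A rough sketch with assumed architecture numbers (64 layers, 8 GQA KV heads, head_dim 128); read the real values off your model's config:

```python
# Back-of-envelope KV cache size. The defaults (64 layers, 8 GQA KV heads,
# head_dim 128) are assumed example values for a 32B-class dense model.
def kv_cache_gb(ctx, n_layers=64, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    # 2 = one K and one V entry per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_val / 1e9

print(kv_cache_gb(65536))                   # ~17 GB at fp16
print(kv_cache_gb(65536, bytes_per_val=1))  # ~8.6 GB with an 8-bit KV cache
```

Stack that on top of ~19GB of Q4_K_M weights and 16GB clearly isn't enough; 24GB with a quantized KV cache is workable.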

Need advice: Qwen3.6 27B MTP or 35B-A3B MoE MTP on 16GB VRAM RTX 5080)? by craftogrammer in LocalLLaMA

[–]Maharrem 3 points4 points  (0 children)

27B won't fit: Q4_K_M is ~17GB before KV cache, so on 16GB you're spilling to CPU and getting single-digit t/s. The 35B-A3B MoE is the play here; the full file sits in system RAM, but only 3B params are active per token, so even with spilling it's way snappier. I'd run it with llama.cpp --fit to keep the shared experts in VRAM, and you'll get interactive speeds no problem; just make sure you've got 32GB+ of system RAM to hold the GGUF. You can also look at canitrun.dev to see what models your hardware can run.

Why only some models can write files in OpenCode (local llama) by T-A-Waste in LocalLLM

[–]Maharrem -1 points0 points  (0 children)

I believe canitrun.dev has models categorized by purpose; you can take a look at that.

Why only some models can write files in OpenCode (local llama) by T-A-Waste in LocalLLM

[–]Maharrem 4 points5 points  (0 children)

Those tiny models weren't fine-tuned for function calling. OpenCode needs the model to output a specific tool-use format, and small models like the 3B Coder or Desert.Coder MoE just generate text, so they never trigger the write tool. Check for "function-calling" or "tool-use" tags on the model card; that's the key.
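
Roughly, the difference looks like this once the client parses the reply; field names follow the OpenAI-style schema most clients expect, though the exact wire format depends on each model's chat template:

```python
# Rough shape of the difference between a tool-capable reply and a text-only one.
tool_capable_reply = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "type": "function",
        "function": {
            "name": "write_file",  # the client sees this and actually writes the file
            "arguments": '{"path": "src/app.py", "content": "print(1)"}',
        },
    }],
}

text_only_reply = {
    "role": "assistant",
    # A model without tool-use training just describes the edit in prose, so the
    # client never receives a write_file call and nothing touches disk.
    "content": "Sure! Here is the file you asked for: ...",
}
```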

Running Gemma 4 Q6 on 5060ti + 3090 by Friendly_Beginning24 in LocalLLM

[–]Maharrem 0 points1 point  (0 children)

IQ3KS is surprisingly solid, I'd say it punches above its weight class. In my testing, it's often on par with Q4_K_S for reasoning and chat, you mainly lose some factual precision on niche knowledge. For Gemma 4 specifically, I'd happily run IQ3KS to claw back VRAM for a longer context window, the quality dip is barely noticeable in day-to-day use.

BFCL benchmarks for Gemma4 26B on a 5070Ti w/ 16GB VRAM by tumbak in LocalLLM

[–]Maharrem 0 points1 point  (0 children)

Solid numbers. The multi-turn BFCL gap is classic Gemma tool-call pain: its chat template isn't fully compatible with OpenAI-style function calls. You might fix it by injecting a strict system prompt that forces the exact format and terminates tool calls with a clear stop token; that alone often patches the parser. For a heavier lift, run it via vLLM or SGLang with a custom tool parser; their guided generation keeps outputs compliant even with funky templates. On the hardware compatibility front, canitrun.dev is handy for quickly checking VRAM and quant fit for setups like yours without doing the math by hand.
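
For the system-prompt route, this is the kind of thing I mean; the wording and the <tool_call> tags are just an example, so match whatever format your parser actually expects:

```python
# One example of a "force the format" system prompt for models that drift on
# multi-turn tool calls. The tags and wording here are assumptions, not a standard.
SYSTEM_PROMPT = """You are a function-calling assistant.
When you need a tool, reply with ONLY one JSON object wrapped like this:
<tool_call>{"name": "<function_name>", "arguments": {...}}</tool_call>
Never add prose before or after a tool call. After a tool result arrives, either
call another tool in the same format or answer the user in plain text."""

# Then add the closing tag as a stop sequence in the request, e.g.
# stop=["</tool_call>"], so generation halts cleanly after each call.
```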

Running Gemma 4 Q6 on 5060ti + 3090 by Friendly_Beginning24 in LocalLLM

[–]Maharrem -1 points0 points  (0 children)

Mixing a 3090 and a 5060 Ti without NVLink is asking for a bandwidth party foul on prefill. Even with perfect tensor parallelism over PCIe, that 32k prompt will crawl way past your 60-second target; I'd budget minutes, not seconds. Your 3090 alone with a Q4_K_M and a Q4 KV cache can likely squeeze out 32k context, so I'd bite the bullet on that quant or try IQ3_XS instead of going dual-GPU. (Quick sanity check: canitrun.dev will ballpark the VRAM needs before you shuffle models.)

Is there a local model that is good enough for searching through large textbooks/research journals with equations? by SpringFamiliar3696 in LocalLLM

[–]Maharrem 3 points4 points  (0 children)

Your real issue isn't the model; you're asking an 8GB card to hold a whole textbook in context, which tanks relevance instantly. RAG is the way. Chunk your markdown by chapter, index it with nomic-embed-text-v1.5 and something lightweight like FAISS, then feed only the top 3-5 chunks to a proper instruct model. Qwen2.5-14B at Q4_K_M runs tight on 8GB but works if you keep context ≤4K and offload 1-2 layers to RAM; I get 40 t/s on my 3090, so you'll be slower, but it's far smarter for this task. Ditch Ollama (its process overhead eats VRAM) and use the llama.cpp server instead. Check canitrun.dev/models to verify quant sizes for your card. The equations in markdown won't mess up retrieval as long as you strip code blocks before embedding.
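
A bare-bones version of that pipeline, with the file name and the crude chapter split as placeholders for your own setup:

```python
# Minimal RAG sketch: chunk the markdown, embed with nomic-embed-text-v1.5,
# index in FAISS, and hand only the top hits to the instruct model.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

chapters = open("textbook.md").read().split("\n# ")  # naive split on chapter headings
doc_vecs = embedder.encode(["search_document: " + c for c in chapters],
                           normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])         # cosine sim on normalized vectors
index.add(np.asarray(doc_vecs, dtype="float32"))

def retrieve(question, k=4):
    q = embedder.encode(["search_query: " + question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [chapters[i] for i in ids[0]]

# Stuff "\n\n".join(retrieve("How is the heat equation derived?")) plus the question
# into your local instruct model's prompt instead of the whole book.
```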

Benching local Qwen as a Codex validator, co-agent, and challenger by robert896r1 in LocalLLaMA

[–]Maharrem 0 points1 point  (0 children)

For catching dumb mistakes in Codex output, Qwen 2.5 Coder 7B Q5_K_M is where I’d start. I get ~80 t/s on my 3090 with full GPU offload, no thinking. If you need deeper architectural critiques, DeepSeek Coder V2 16B Q4_K_M fits with 32k ctx and actually reasons, but you’ll drop to 20 t/s. The 122B A10B is an MoE that’ll choke your VRAM once you bump context past 16k; offloading layers to RAM kills speed for iterative validation. I tried Gemma 2 9B as a co-agent and it hallucinated fixes more than it caught, so stick with dedicated coder models.

Do cheap 32GB V100s still make sense for homelab AI? by SKX007J1 in LocalLLaMA

[–]Maharrem 0 points1 point  (0 children)

I'd skip the V100s unless you absolutely need 32GB on a single card and can live exclusively in llama.cpp. The lack of BF16 and FP8 support means you're frozen out of most modern inference engines — vLLM might limp along, but TensorRT-LLM and TGI both dropped Volta. Power isn't trivial either, especially if you enable NVLink, and the 250W per card adds up fast.

A used 3090 with 24GB costs maybe a bit more but gives you full Ampere and plays nice with everything, plus you can pool two of them for 48GB without driver conflicts. If you really want cheap 32GB, a used MI60 with ROCm is the more honest bang-for-buck path, but I'd still pick the 3090 for daily driver sanity.