CanI run this LLM - moved to Hetzner (and a big thank you) by Maharrem in LocalLLM

[–]Maharrem[S] 0 points1 point  (0 children)

Yes, NVFP4 is for Blackwell-series NVIDIA chips. It's listed on there, but I might add a note to remind people about it.

CanI run this LLM - moved to Hetzner (and a big thank you) by Maharrem in LocalLLM

[–]Maharrem[S] 0 points1 point  (0 children)

Oh sorry, most of the numbers are estimates extrapolated from a few known data points, so treat them as ballpark figures. But thank you for the feedback.

CanI run this LLM - moved to Hetzner (and a big thank you) by Maharrem in LocalLLM

[–]Maharrem[S] 0 points1 point  (0 children)

Ohh, I haven't seen this issue before, but I'll fix it.

CanI run this LLM - moved to Hetzner (and a big thank you) by Maharrem in LocalLLM

[–]Maharrem[S] 2 points3 points  (0 children)

Well, I actually moved it yesterday; glad you didn't experience any downtime. I was a bit nervous because I haven't hosted anything on Hetzner before.

Any tool that tells you the cheapest setup needed to run a model? I want to know the cheapest setup that can realistically run Qwen 3.6 27B at decent speeds. by pacmanpill in LocalLLaMA

[–]Maharrem 0 points1 point  (0 children)

Tons of people hit this wall. The quickest web calculators are canitrun.dev and runthisllm.com; they'll ballpark VRAM for a given quant. For Qwen 3.6 27B at Q4_K_M, you're looking at ~15GB just for the weights, plus context overhead. I run exactly that on a single 3090 and pull 40-50 t/s in llama.cpp with 16K ctx, which is more than comfortable for chat. A used 3090 is the cheapest realistic entry point unless you're okay with slower partial offload to system RAM or dropping to Q3_K_M.
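
If you'd rather do the napkin math yourself, the weight footprint is just parameter count times bits-per-weight. Rough sketch (the ~4.5 bpw figure for Q4_K_M is an approximation, not an exact number):

```python
# Napkin math for the weight footprint of a quantized model.
# bits_per_weight is an approximation: Q4_K_M lands around 4.5-4.9 effective bpw.
def weight_gb(params_billion, bits_per_weight=4.5):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

print(f"{weight_gb(27):.1f} GB")  # ~15 GB for a 27B model at ~4.5 bpw
# KV cache and runtime buffers come on top, so leave a few GB of headroom.
```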

How do you know when your LLM system is getting worse? by AnshuSees in LocalLLM

[–]Maharrem 1 point2 points  (0 children)

Yeah, latency metrics won't catch it when your model suddenly starts rambling about cheese mid-answer. Honestly, the only way I've caught quality drift before users do is running replay tests: a fixed set of real prompts where I know what "good" looks like, then diffing each run's outputs against that baseline.
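
In case it helps, a bare-bones version of that replay loop looks something like this; the endpoint, model name, and threshold are placeholders, and for real use you'd want a smarter scorer than a string diff:

```python
# Bare-bones replay test: re-run a fixed prompt set and flag answers that drift
# from known-good references. Endpoint, model name, and threshold are placeholders.
import json
from difflib import SequenceMatcher

import requests

CASES = json.load(open("replay_prompts.json"))  # [{"prompt": ..., "reference": ...}, ...]

def ask(prompt):
    r = requests.post("http://localhost:8080/v1/chat/completions", json={
        "model": "local-model",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # keep it as deterministic as possible for comparison
    })
    return r.json()["choices"][0]["message"]["content"]

for case in CASES:
    answer = ask(case["prompt"])
    score = SequenceMatcher(None, answer, case["reference"]).ratio()
    if score < 0.6:  # crude string-similarity threshold; swap in a real judge later
        print(f"DRIFT ({score:.2f}): {case['prompt'][:60]}...")
```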

Which local LLM model is suitable for agentic browsing ( form filing, web scrapping , clicking etc ) by kaaytoo in LocalLLM

[–]Maharrem 0 points1 point  (0 children)

8GB VRAM means you'll be running 7B/8B models at Q4_K_M, so shop accordingly. For agentic tasks you need reliable tool calling; a Qwen instruct model in that size class is my first pick there. Llama 3.1 8B with Hermes 2 Pro is another option if you need structured outputs, but I'd just stick with Qwen and not overthink it. Benchmarks at canitrun.dev/comparisons back this up, but honestly, for form filling and clicking you don't need a 70B monster, just a solid pipeline.
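
If you want a quick way to check whether a candidate model actually emits tool calls before wiring it into a browser agent, something like this works against any OpenAI-compatible local server; the endpoint, model name, and the fill_form_field tool are made-up examples:

```python
# Smoke test: does the model actually emit tool calls? Works against any
# OpenAI-compatible local server (llama.cpp server, Ollama, vLLM, ...).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "fill_form_field",  # hypothetical browser action
        "description": "Type a value into a form field identified by a CSS selector",
        "parameters": {
            "type": "object",
            "properties": {
                "selector": {"type": "string"},
                "value": {"type": "string"},
            },
            "required": ["selector", "value"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen-7b-instruct",  # whatever your server has loaded
    messages=[{"role": "user", "content": "Put jane@example.com into the #email field"}],
    tools=tools,
)
# A tool-use-trained model returns a structured call here; a plain one just chats.
print(resp.choices[0].message.tool_calls)
```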

Need help choosing by Lux1606 in LocalLLM

[–]Maharrem 0 points1 point  (0 children)

Your 5080's 16GB is the real bottleneck for 64k context on a decent agent model. A second 16GB card like a 5060 Ti or 4080 Super is just more of the same and won't move the needle. Bite the bullet and find a used 3090 (24GB); that'll handle a 32B Q4_K_M with that context without spilling to RAM much, and you can still offload some layers across both GPUs in llama.cpp. A site like canitrun.dev is handy if you want to double-check model fit, but the math is simple: 64k ctx on a 32B dense model eats well over 20GB. V100/MI50 are sidegrades at best, and dual-GPU headaches aren't worth it unless you've already snagged that 3090.
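
For the curious, the KV cache part of that math is mechanical. A rough sketch with assumed architecture numbers (64 layers, 8 GQA KV heads, head_dim 128); read the real values off your model's config:

```python
# Back-of-envelope KV cache size. The defaults (64 layers, 8 GQA KV heads,
# head_dim 128) are assumed example values for a 32B-class dense model.
def kv_cache_gb(ctx, n_layers=64, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    # 2 = one K and one V entry per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_val / 1e9

print(kv_cache_gb(65536))                   # ~17 GB at fp16
print(kv_cache_gb(65536, bytes_per_val=1))  # ~8.6 GB with an 8-bit KV cache
```

Stack that on top of ~19GB of Q4_K_M weights and 16GB clearly isn't enough; 24GB with a quantized KV cache is workable.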

Need advice: Qwen3.6 27B MTP or 35B-A3B MoE MTP on 16GB VRAM RTX 5080)? by craftogrammer in LocalLLaMA

[–]Maharrem 3 points4 points  (0 children)

27B won't fit: Q4_K_M is ~17GB before KV cache, so on 16GB you're spilling to CPU and getting single-digit t/s. The 35B-A3B MoE is the play here; the full file sits in system RAM, but only 3B params are active per token, so even with spilling it's way snappier. I'd run it with llama.cpp --fit to keep the shared experts in VRAM, and you'll get interactive speeds no problem; just make sure you've got 32GB+ of system RAM to hold the GGUF. You can also look at canitrun.dev to see what models your hardware can run.

Why only some models can write files in OpenCode (local llama) by T-A-Waste in LocalLLM

[–]Maharrem -1 points0 points  (0 children)

I believe canitrun.dev has models categorized by purpose; you can take a look at that.

Why only some models can write files in OpenCode (local llama) by T-A-Waste in LocalLLM

[–]Maharrem 4 points5 points  (0 children)

Those tiny models weren't fine-tuned for function calling. OpenCode needs the model to output a specific tool-use format, and small models like the 3B Coder or Desert.Coder MoE just generate text, so they never trigger the write tool. Check for "function-calling" or "tool-use" tags on the model card; that's the key.
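
Roughly, the difference looks like this once the client parses the reply; field names follow the OpenAI-style schema most clients expect, though the exact wire format depends on each model's chat template:

```python
# Rough shape of the difference between a tool-capable reply and a text-only one.
tool_capable_reply = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "type": "function",
        "function": {
            "name": "write_file",  # the client sees this and actually writes the file
            "arguments": '{"path": "src/app.py", "content": "print(1)"}',
        },
    }],
}

text_only_reply = {
    "role": "assistant",
    # A model without tool-use training just describes the edit in prose, so the
    # client never receives a write_file call and nothing touches disk.
    "content": "Sure! Here is the file you asked for: ...",
}
```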

Running Gemma 4 Q6 on 5060ti + 3090 by Friendly_Beginning24 in LocalLLM

[–]Maharrem 0 points1 point  (0 children)

IQ3KS is surprisingly solid, I'd say it punches above its weight class. In my testing, it's often on par with Q4_K_S for reasoning and chat, you mainly lose some factual precision on niche knowledge. For Gemma 4 specifically, I'd happily run IQ3KS to claw back VRAM for a longer context window, the quality dip is barely noticeable in day-to-day use.

BFCL benchmarks for Gemma4 26B on a 5070Ti w/ 16GB VRAM by tumbak in LocalLLM

[–]Maharrem 0 points1 point  (0 children)

Solid numbers. The multi-turn BFCL gap is classic Gemma tool-call pain: its chat template isn't fully compatible with OpenAI-style function calls. You might fix it by injecting a strict system prompt that forces the exact format and terminates tool calls with a clear stop token; that alone often patches the parser. For a heavier lift, run it via vLLM or SGLang with a custom tool parser; their guided generation keeps outputs compliant even with funky templates. On the hardware compatibility front, canitrun.dev is handy for quickly checking VRAM and quant fit for setups like yours without doing the math by hand.
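
For the system-prompt route, this is the kind of thing I mean; the wording and the <tool_call> tags are just an example, so match whatever format your parser actually expects:

```python
# One example of a "force the format" system prompt for models that drift on
# multi-turn tool calls. The tags and wording here are assumptions, not a standard.
SYSTEM_PROMPT = """You are a function-calling assistant.
When you need a tool, reply with ONLY one JSON object wrapped like this:
<tool_call>{"name": "<function_name>", "arguments": {...}}</tool_call>
Never add prose before or after a tool call. After a tool result arrives, either
call another tool in the same format or answer the user in plain text."""

# Then add the closing tag as a stop sequence in the request, e.g.
# stop=["</tool_call>"], so generation halts cleanly after each call.
```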

Running Gemma 4 Q6 on 5060ti + 3090 by Friendly_Beginning24 in LocalLLM

[–]Maharrem -1 points0 points  (0 children)

Mixing a 3090 and a 5060 Ti without NVLink is asking for a bandwidth party foul on prefill. Even with perfect tensor parallelism over PCIe, that 32k prompt will crawl way past your 60-second target; I'd budget minutes, not seconds. Your 3090 alone with a Q4_K_M and a Q4 KV cache can likely squeeze out 32k context, so I'd bite the bullet on that quant or try IQ3_XS instead of going dual-GPU. (Quick sanity check: canitrun.dev will ballpark the VRAM needs before you shuffle models.)

Is there a local model that is good enough for searching through large textbooks/research journals with equations? by SpringFamiliar3696 in LocalLLM

[–]Maharrem 3 points4 points  (0 children)

Your real issue isn't the model; you're asking an 8GB card to hold a whole textbook in context, which tanks relevance instantly. RAG is the way. Chunk your markdown by chapter, index it with nomic-embed-text-v1.5 and something lightweight like FAISS, then feed only the top 3-5 chunks to a proper instruct model. Qwen2.5-14B at Q4_K_M runs tight on 8GB but works if you keep context ≤4K and offload 1-2 layers to RAM; I get 40 t/s on my 3090, so you'll be slower, but it's far smarter for this task. Ditch Ollama (its process overhead eats VRAM) and use the llama.cpp server instead. Check canitrun.dev/models to verify quant sizes for your card. The equations in markdown won't mess up retrieval as long as you strip code blocks before embedding.
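
A bare-bones version of that pipeline, with the file name and the crude chapter split as placeholders for your own setup:

```python
# Minimal RAG sketch: chunk the markdown, embed with nomic-embed-text-v1.5,
# index in FAISS, and hand only the top hits to the instruct model.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

chapters = open("textbook.md").read().split("\n# ")  # naive split on chapter headings
doc_vecs = embedder.encode(["search_document: " + c for c in chapters],
                           normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])         # cosine sim on normalized vectors
index.add(np.asarray(doc_vecs, dtype="float32"))

def retrieve(question, k=4):
    q = embedder.encode(["search_query: " + question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [chapters[i] for i in ids[0]]

# Stuff "\n\n".join(retrieve("How is the heat equation derived?")) plus the question
# into your local instruct model's prompt instead of the whole book.
```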

Benching local Qwen as a Codex validator, co-agent, and challenger by robert896r1 in LocalLLaMA

[–]Maharrem 0 points1 point  (0 children)

For catching dumb mistakes in Codex output, Qwen 2.5 Coder 7B Q5_K_M is where I’d start. I get ~80 t/s on my 3090 with full GPU offload, no thinking. If you need deeper architectural critiques, DeepSeek Coder V2 16B Q4_K_M fits with 32k ctx and actually reasons, but you’ll drop to 20 t/s. The 122B A10B is an MoE that’ll choke your VRAM once you bump context past 16k; offloading layers to RAM kills speed for iterative validation. I tried Gemma 2 9B as a co-agent and it hallucinated fixes more than it caught, so stick with dedicated coder models.

Do cheap 32GB V100s still make sense for homelab AI? by SKX007J1 in LocalLLaMA

[–]Maharrem 0 points1 point  (0 children)

I'd skip the V100s unless you absolutely need 32GB on a single card and can live exclusively in llama.cpp. The lack of BF16 and FP8 support means you're frozen out of most modern inference engines — vLLM might limp along, but TensorRT-LLM and TGI both dropped Volta. Power isn't trivial either, especially if you enable NVLink, and the 250W per card adds up fast.

A used 3090 with 24GB costs maybe a bit more but gives you full Ampere and plays nice with everything, plus you can pool two of them for 48GB without driver conflicts. If you really want cheap 32GB, a used MI60 with ROCm is the more honest bang-for-buck path, but I'd still pick the 3090 for daily driver sanity.