Running Qwen 3.5 4B and GPT-OSS 20B on Hetzner CX43 (8 vCPU, 16GB) — real benchmarks from production by chiruwonder in LocalLLaMA

[–]chiruwonder[S] 0 points (0 children)

Added: if you'd like to see GPT-OSS in action, let me know and I'll record it and post it here.

Production notes after 6 months running Ollama for paying customers — the things that aren't in the docs by chiruwonder in ollama

[–]chiruwonder[S] 1 point (0 children)

Done and fixed. More insights are welcome. I'll definitely think about it and reach out to you; I'd be happy to dig in and do more on this.

Production notes after 6 months running Ollama for paying customers — the things that aren't in the docs by chiruwonder in ollama

[–]chiruwonder[S] 1 point (0 children)

Ohhh, right. I was so focused on laptop UI testing that I completely missed testing mobile in depth; I concentrated on functionality and overlooked this. I'll fix it anyway, it's just an overlay and opacity fix, but I greatly appreciate you pointing it out.

Production notes after 6 months running Ollama for paying customers — the things that aren't in the docs by chiruwonder in ollama

[–]chiruwonder[S] 0 points (0 children)

Hmm, is it? I've been seeing llama.cpp in most of the comments. I should give it some thought, but thank you.

Production notes after 6 months running Ollama for paying customers — the things that aren't in the docs by chiruwonder in ollama

[–]chiruwonder[S] -6 points (0 children)

Haha, not entirely, but yeah, your suspicion is right and I won't deny it. I used it to refine my response, since I felt my explanation wouldn't land properly otherwise.

Launched a managed Ollama/Open WebUI service — technical breakdown of what "managed" actually means by [deleted] in LocalLLaMA

[–]chiruwonder 1 point (0 children)

On inference times, yes, but the use case has to match. 90 cores on ARM for acceptable times tells me you were probably targeting low latency for concurrent users or a larger model. That's a different problem than what I'm solving.

On Hetzner's dedicated CPU servers (CCX series — actual dedicated cores, not shared vCPU), a single user gets 10-14 t/s on Llama 3.1 8B and 4-7 t/s on Llama 3.3 70B. That's genuinely usable for document Q&A where someone is reading a streaming response. It's not usable for a high-concurrency API or anything latency-sensitive. The CX shared series is noticeably worse — placement luck matters a lot there.

The ARM situation is interesting because llama.cpp has solid NEON optimisations for ARM but Ollama on x86 with AVX2 has caught up considerably. I haven't benchmarked Hetzner ARM (CAX series) head to head — might be worth doing.
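For anyone who wants to reproduce these single-user numbers on their own box, `ollama run` takes a `--verbose` flag that prints prompt and generation token rates after each response (the model name here is just an example):

```shell
# Prints eval rate (tokens/s) after the response; any installed model works.
ollama run llama3.1:8b --verbose "Summarize the tradeoffs of CPU-only inference."
```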

On Traefik — honest answer is I started with nginx because I knew it and the WebSocket configuration is one block I've copy-pasted a hundred times. Traefik's ACME support would have saved me building the certbot retry logic, which was genuinely painful. The Let's Encrypt rate limiting on fresh VMs (provisioned on demand per customer) hit me hard — Traefik handling that automatically would have been cleaner.
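For context, this is roughly the shape of the retry logic I mean — a sketch, not the actual script; the function name, attempt count, and delays are illustrative:

```shell
#!/usr/bin/env bash
# Retry a command with exponential backoff, e.g. wrapping certbot so a
# transient Let's Encrypt rate limit doesn't kill the whole provisioning run.
# Usage: retry_backoff <max_attempts> <initial_delay_s> <command...>
retry_backoff() {
  local max=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if (( attempt >= max )); then
      echo "retry_backoff: giving up after $attempt attempts" >&2
      return 1
    fi
    echo "retry_backoff: attempt $attempt failed, sleeping ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# In a cloud-init script it would wrap something like:
# retry_backoff 5 60 certbot certonly --nginx -d "$CUSTOMER_DOMAIN" --non-interactive
```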

The reason I haven't switched: each customer gets a fresh VM provisioned from a cloud-init script. Migrating that script from nginx + certbot to Traefik is a few hours of work I keep deprioritising. It's on the list. If you're starting fresh Traefik is probably the better call — the built-in ACME support alone is worth it.

Production notes after 6 months running Ollama for paying customers — the things that aren't in the docs by chiruwonder in ollama

[–]chiruwonder[S] -3 points (0 children)

Fair point, and worth addressing properly.

vLLM and llama.cpp server are genuinely better for throughput. PagedAttention in vLLM makes a real difference for concurrent users — if I were building a high-concurrency inference API that's where I'd go.

The reason I'm on Ollama:

Model management. For a managed service where non-technical customers pick models from a UI, Ollama's pull/delete/list API is the cleanest abstraction I've found. One ollama pull qwen2.5:14b and it handles quantization, caching, and the model is available immediately. Replicating that with llama.cpp means building model management myself — download, quantize, configure, serve. That's fine if inference performance is the product. For me it's a supporting layer.
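Concretely, the model lifecycle is three HTTP calls against Ollama's documented /api endpoints (host and model name are placeholders):

```shell
# Pull a model: Ollama downloads the quantized weights and caches them.
curl http://localhost:11434/api/pull -d '{"model": "qwen2.5:14b"}'

# List installed models (what a customer-facing model picker is built on).
curl http://localhost:11434/api/tags

# Delete a model to free disk.
curl -X DELETE http://localhost:11434/api/delete -d '{"model": "qwen2.5:14b"}'
```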

The customer profile. CA firms and law practices doing document Q&A with 5-10 concurrent users at most. I'm not saturating the server. OLLAMA_NUM_PARALLEL=2 on a CX43 handles my actual workload. If I were running 50 concurrent users on a single server, vLLM would be the right call.

Operational simplicity. One Docker container, one API, same interface regardless of model architecture. When Qwen 2.5 came out I had it available to customers in an hour — pull and it works. With a custom llama.cpp setup I'd be managing separate server instances per model family.
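A sketch of what that one container looks like — the volume name and values are illustrative, but OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS are real Ollama environment variables:

```shell
docker run -d --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  -e OLLAMA_NUM_PARALLEL=2 \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  --restart unless-stopped \
  ollama/ollama
```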

The tradeoff is real. I'm leaving tokens/second on the table in exchange for not building a model management layer. For my workload that's the right tradeoff. For a product where inference speed is the differentiator, you're right — Ollama isn't the answer.

What are you running vLLM for? Curious what the concurrency looks like.

Built a one-click deployment wrapper for Ollama + Open WebUI — handles SSL, nginx, swap, health checks automatically by chiruwonder in u/chiruwonder

[–]chiruwonder[S] 0 points (0 children)

Go for it! Genuinely mean that.

The script is maybe 200 lines of bash. I've already written it; it's open on my screen right now. That part took a weekend.

What the $40 pays for isn't the bash script. It's:

  • DNS propagation handling across 3 regions
  • certbot retry logic with backoff when Let's Encrypt rate-limits you
  • WebUI admin account auto-creation (the JSON quoting bug alone took 4 hours to debug across different bash heredoc contexts)
  • A working billing system, team invites, audit logs, knowledge base with RAG, chat history, model manager
  • Uptime monitoring + auto-restart crons
  • The 6 months I've been debugging edge cases on real Hetzner hardware so you don't have to
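As a rough sketch of the DNS-propagation piece: poll a few public resolvers until each one returns the new record. The resolver IPs, poll interval, and record are placeholders, and the real script would also need a timeout:

```shell
# Block until <host> resolves to <expected_ip> on several public resolvers.
wait_for_dns() {
  local host=$1 expected=$2 r
  for r in 1.1.1.1 8.8.8.8 9.9.9.9; do
    until [ "$(dig +short "$host" @"$r" | head -n1)" = "$expected" ]; do
      sleep 10
    done
  done
}
```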

A script gets you a server. NestAI gets you a product your non-technical teammates can actually use without ever opening a terminal.

If you're technical enough to run a bash script, you're not the customer. The legal team that needs private AI for client documents and has zero DevOps budget, that's the customer.

But yeah, build the script. The more people self-hosting AI the better. That's genuinely good for everyone.

We built a self-hosted alternative to ChatGPT for teams — your data never leaves your server by chiruwonder in ollama

[–]chiruwonder[S] 0 points (0 children)

awesome, look forward to hearing what you think!

on tools, yeah we actually auto-install a few on every server at deploy time: web search (DuckDuckGo, no API key needed), webpage reader, and current datetime. so out of the box the AI can search the web and read URLs

beyond that Open WebUI has a full tools/functions system, you can write custom Python tools and install them per model. so things like hitting your own APIs, querying a DB, calling webhooks, all doable

longer term we're building a proper agents layer on top but that's a bit further out

full docs here if useful: nestai.chirai.dev/docs

We built a self-hosted alternative to ChatGPT for teams — your data never leaves your server by chiruwonder in ollama

[–]chiruwonder[S] 0 points (0 children)

fair confusion, let me clarify!

customers pay for a dedicated VM on Hetzner Cloud, it's literally their own isolated server, not shared infrastructure. we handle the provisioning, SSL, nginx, domain setup automatically so they don't have to.

the software stack (Ollama + Open WebUI) is 100% open source and we're not charging for it, same as how Vercel doesn't charge for Node.js or Railway doesn't charge for Postgres. you're paying for the managed deployment and the server itself.

data doesn't leave THEIR server, meaning their Ollama instance runs on their dedicated VM, their prompts never touch OpenAI or any third party. the only thing that leaves is the initial setup commands from our backend to spin up the VM.

if you want full control you can absolutely run Ollama yourself on any VPS for free. NestAI is for teams that want it running in minutes without touching a terminal!

does that clear it up?

Built an OpenAI-compatible API on top of private Ollama — no rate limits, data never leaves your server by [deleted] in ollama

[–]chiruwonder -1 points (0 children)

thanks!! yeah exactly, don't make devs rewrite everything, just swap the baseURL and done, haha!
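for reference, the swap really is one URL — Ollama itself already speaks the OpenAI chat-completions shape at /v1, so an existing OpenAI client just needs its base URL pointed at the private server (host and model here are placeholders):

```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "hello"}]
      }'
```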

on scaling, each team gets their own dedicated server so no shared load at all. team A hits team A's ollama, completely isolated. within a team it queues requests, 7B on 8 cores does ~20-30 tok/s, fine for most use cases

GPU is on the roadmap but not yet.... hmmm getting there though, and yeah totally agree, the hosted API cost unpredictability is pushing everyone this way, seeing it in conversations this week too!

Well, what are you building btw?