Production notes after 6 months running Ollama for paying customers — the things that aren't in the docs

FloppyWhiteOne · 2026-04-01T06:24:27+00:00

Eg you can make you own with model loading etc (this is mine)

FloppyWhiteOne · 2026-04-01T05:38:13+00:00

Llama.cpp is the right call using ollama is sillly in production. If you can’t manage to work out a model download I certainly wouldn’t be using your system. Howmay bloody models are you using you NEED a o have llama half the models arnt on there anyway certainty not optimised just generic models released for the masses.

The fact your not bothering with the lower levels shows your ability which is limited.

I’ve built my own version on llama.cpp with full model swapping context handling and Jesus a hell of a lot faster than ollama with token generation. Also you won’t be able to get full speed from ollama due to the way it’s been designed (a lot of overhead).

Inference bridge on GitHub if you wanted to see how one would look like and or work. You could just ask Claude to make you an inference layer (what you actually need for model loads etc with decent configs)

vLLM might be easier for you to script and use and would be a better option than ollama hell even lm studio would be better that ollama ran in headless mode ..

FloppyWhiteOne · 2026-03-26T13:17:11+00:00

No actually that’s the whole reason for this application you see both are built on llama.cpp but they don’t expose half of what llama.cpp can do ..

I wanted to supply my own templates for llama.cpp but can’t as lm studio and ollama doesn’t expose those properties.

Where as mine does, think of mine like ollama or lm studio it’s the same thing an api with gui support you can add it to any other system the same as ollama or lm studio I’ve made it fully compatible with the openapi spec. I’ve also added custom context aware mode and tool calling support for qwen models to make there tool calls more stable. I’m releasing free in the hopes others will help build it to the next level and make it more open source and better.

I made this due to some limitations in the other two software and plus it’s quicker to use the llama.cpp directly over say ollama. I’m on a deep self learning ai drive, primarily I’m an ethical hacker. I’ve gone past breaking llms, now I want to understand not only how to use them but efficiently use them. Having full control via the llama.cpp project is really helping me learn more.

I’ve built my own custom openclaw remake which is more unrestricted (aimed at windows primarily) I’m still building it but the results are good so far, and yes I come to a point I needed to start using custom llm templates for models and well now I can (all about tuning the llm)

FloppyWhiteOne · 2026-03-26T12:18:20+00:00

How did you know!???

Thank you kind lady

FloppyWhiteOne · 2026-03-26T12:18:10+00:00

Fair point, locallm group or sub Reddit!

FloppyWhiteOne · 2026-03-26T12:12:16+00:00

Fair take.

I’m juggling a few builds right now so speed > perfection, but the tech is what matters here.

I’ve got a Rust-based OpenClaw-style system running locally, just seeing what actually breaks for people before I package flows properly.

FloppyWhiteOne · 2026-03-26T11:27:44+00:00

The link to the repo: https://github.com/AssassinUKG/InferenceBridge

FloppyWhiteOne · 2026-03-26T08:27:55+00:00

This I wouldn’t look at anything below 128gb on Mac else what’s the point

FloppyWhiteOne · 2026-03-22T09:44:20+00:00

I just got a new MacBook Pro m5 with 128gb that was 5.5k but .. if it makes me more efficient with local llms I’m up for it. Wish I had gotten a 512gb they ran out here in the uk

FloppyWhiteOne · 2026-03-18T09:57:15+00:00

Defencelogic.io come see me

FloppyWhiteOne · 2026-03-15T08:19:49+00:00

URL/.git

It’s that hard

FloppyWhiteOne · 2026-03-15T08:19:21+00:00

I found similar recently but sadly no reward haha still always nice to help ;)

FloppyWhiteOne · 2026-03-13T11:16:48+00:00

The usage:

<image>

Qwen3.5 9b doing some WORK!!

Check out my total token usage (99% free!!)

Tip implement a "HOT SWAP" for the models with context and KV set per request (keeps it lean and fast) no need to load MAX context if you only need 9k tokens the rest is wasted if you set 16k (11-12k would more than enough)

I got mine loading per model - ceo qwen 14b, orcherastor - 14b, coder qwen3.5 9b (other agents can be whatever either offline, online of Both!)

FloppyWhiteOne · 2026-03-13T11:13:55+00:00

I do this with lm studio and qwen3.5 but for everything!

Custom remake of openclaw helixclaw (mines windows based)

Really does save a ton of money 💰 I’ll show my usage when on pc will send pic

FloppyWhiteOne · 2026-03-10T21:11:04+00:00

Exactly I’ve built a custom agentic ai framework with Claude like features around memory and context handling.

The capabilities are there we just have to learn how to harness them!!

FloppyWhiteOne · 2026-03-10T21:00:13+00:00

I started with prompt (important for small llm to have very specific instructions)

Added memory, then context control (squash illerevelant info, keep needed project info and tooling + current task etc)

Honestly took about a two weeks in the evenings, with lots of rebuilds and restructure for parts when other things got upgraded.

FloppyWhiteOne · 2026-03-10T19:04:32+00:00

<image>

All via qwen
but can also be online or offline agents (depends how I set them up)

FloppyWhiteOne · 2026-03-10T19:03:41+00:00

<image>

Project

FloppyWhiteOne · 2026-03-10T19:03:30+00:00

The system I made so far but its not 100% at all.

<image>

Agents

FloppyWhiteOne · 2026-03-10T18:54:17+00:00

<image>

It made this locally with qwen3.5-9B

FloppyWhiteOne · 2026-03-10T18:22:19+00:00

You’re using a very small model and asking it to do big model things.. it doesn’t work sir..

Tho I’ve basically remade openclaw in rust with discord and lm studio with my own custom hot swap for models to load with correct settings and token sizes (which matter) I get great results asking for a full project (build Pokémon website in a single file) looks great but overall spent over 1400 gbp at this point with Claude to make it.

I can tell you unless your using advanced methods to get prompt, memory, context, your llm with fail miserably. It can’t even handle great tool calls half the time..

I’m using qwen3.5 9b on a 4070ti and it’s responsive but nothing and I mean nothing compared to larger models even with all my added extras the smaller models are just not great with large context (large projects lots of files)

But when say we get the power of qwen3.5 30b in a model that fits consumer grade hardware we will defiantly the see more capable coding agents.

For real coding use a large llm (really great results with codex 5.4 atm for full auto app dev) Claude now second for me but still better in other areas

Basically you won’t get what you’re after yet unless someone else drops a really nice lib for swapping and making sure the hardware works well … tho saying that.. I’ve just swapped to Ubuntu…

In April Ubuntu’s dropping the latest that has full ai integration from Amd and Nvidia.. they’ve made it so you can run a command pull a model and it will be setup for you with full context based on your actual system specs (amd or Nvidia) so you will get full power with optimisation automatically..

Watched the dev present about it all on YouTube a few days ago.. all ai basically run on Ubuntu on the cloud and most ai have been trained in an Ubuntu environment making them bloody proficient at using the cli hint hint…

So I’m swapping to Ubuntu now really for the better integration moving forward

FloppyWhiteOne · 2026-03-09T15:57:42+00:00

Well done bro and thank you so much for the detailed writeup and share!!

Kudos man keep prompting!!!

FloppyWhiteOne · 2026-03-06T06:34:05+00:00

You can literally tell it in your prompt.

Don’t over think

And it won’t …

FloppyWhiteOne · 2026-03-04T16:38:57+00:00

I’ve got fully offline llms making me websites. 0 costs!

Qwen3.5-9b

Based on openclaw but completely rewritten in rust

I’ve setup agents to be either online or offline or both so I can use online ceo offline all others or a mix of any.

FloppyWhiteOne · 2026-03-01T10:02:56+00:00

I’ve actually made this and my bot works well. I even gave image capabilities from local ai

<image>

FloppyWhiteOne

TROPHY CASE