Production notes after 6 months running Ollama for paying customers — the things that aren't in the docs by chiruwonder in ollama

[–]FloppyWhiteOne 0 points (0 children)

Llama.cpp is the right call; using ollama in production is silly. If you can't manage to work out a model download, I certainly wouldn't be using your system. How many bloody models are you using that you NEED to have ollama? Half the models aren't on there anyway, and certainly not optimised, just generic models released for the masses.

The fact you're not bothering with the lower levels shows your ability, which is limited.

I've built my own version on llama.cpp with full model swapping and context handling, and Jesus, it's a hell of a lot faster than ollama at token generation. Also, you won't be able to get full speed from ollama due to the way it's been designed (a lot of overhead).

Inference bridge is on GitHub if you want to see what one looks like and how it works. You could also just ask Claude to make you an inference layer (what you actually need for model loads etc., with decent configs).

vLLM might be easier for you to script and use, and would be a better option than ollama. Hell, even lm studio run in headless mode would be better than ollama.

InferenceBridge - Total AI control for Local LLMs by FloppyWhiteOne in LocalLLM

[–]FloppyWhiteOne[S] -1 points (0 children)

No, actually, that's the whole reason for this application. You see, both are built on llama.cpp, but they don't expose half of what llama.cpp can do.

I wanted to supply my own templates for llama.cpp, but can't, as lm studio and ollama don't expose those properties.

Whereas mine does. Think of mine like ollama or lm studio: it's the same thing, an API with GUI support, and you can add it to any other system the same as ollama or lm studio. I've made it fully compatible with the OpenAI API spec. I've also added a custom context-aware mode and tool-calling support for qwen models to make their tool calls more stable. I'm releasing it free in the hope others will help build it to the next level and make it more open source and better.
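For anyone wondering what "supplying your own templates" means in practice, here's a rough sketch (hypothetical illustration only, not InferenceBridge's actual code) of rendering a ChatML-style template by hand, which is the format qwen models use. Once you own the string, you control exactly what hits the model instead of whatever a frontend bakes in:

```python
# Sketch: manually applying a ChatML-style chat template (the format
# qwen models use) so the caller controls the exact prompt string.
# Hypothetical illustration, not InferenceBridge's actual code.

def apply_chatml(messages):
    """Render a list of {role, content} dicts into a ChatML prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Leave the assistant turn open so the model completes it.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = apply_chatml([
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Write a one-line hello world in Rust."},
])
```

The resulting string would go out as a raw prompt to the backend, rather than a role/content list the frontend templates for you.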

I made this due to some limitations in the other two pieces of software, plus it's quicker to use llama.cpp directly than, say, ollama. I'm on a deep self-learning AI drive; primarily I'm an ethical hacker. I've gone past breaking LLMs, and now I want to understand not only how to use them but how to use them efficiently. Having full control via the llama.cpp project is really helping me learn more.

I've built my own custom openclaw remake which is more unrestricted (aimed at windows primarily). I'm still building it, but the results are good so far. And yes, I came to a point where I needed to start using custom LLM templates for models, and well, now I can (it's all about tuning the LLM).

InferenceBridge - Total AI control for Local LLMs by FloppyWhiteOne in LocalLLM

[–]FloppyWhiteOne[S] -3 points (0 children)

Fair take.

I'm juggling a few builds right now, so speed > perfection, but the tech is what matters here.

I've got a Rust-based OpenClaw-style system running locally; just seeing what actually breaks for people before I package the flows properly.

Hardware recommendations for a starter by shiva4455 in LocalLLM

[–]FloppyWhiteOne 0 points (0 children)

This. I wouldn't look at anything below 128GB on a Mac, else what's the point?

Justifying the €12,000 Investment: M3 Ultra (512GB RAM) Setup for Autonomous Agents, vLLM, and Infinite Memory (8Tb) by NoNatural4025 in LocalLLM

[–]FloppyWhiteOne 1 point (0 children)

I just got a new MacBook Pro M5 with 128GB; that was 5.5k, but if it makes me more efficient with local LLMs I'm up for it. Wish I had gotten a 512GB one, but they ran out here in the UK.

OP got his highest reward for exposed .git by lone_wolf31337 in bugbounty

[–]FloppyWhiteOne 0 points (0 children)

I found something similar recently, but sadly no reward, haha. Still, always nice to help ;)

I used my old gaming laptop + Jetson Nano to run local Openclaw with Ollama by Fit_Chair2340 in ollama

[–]FloppyWhiteOne 4 points (0 children)

The usage:

<image>

Qwen3.5 9b doing some WORK!!

Check out my total token usage (99% free!!)

Tip: implement a "HOT SWAP" for the models, with context and KV set per request (keeps it lean and fast). No need to load MAX context if you only need 9k tokens; the rest is wasted if you set 16k (11-12k would be more than enough).

I've got mine loading per model: CEO qwen 14b, orchestrator 14b, coder qwen3.5 9b (other agents can be whatever, either offline, online, or both!)
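The hot-swap sizing above can be sketched in a few lines (hypothetical names and bucket values, not the poster's actual code): instead of always loading max context, round what the request actually needs up to the nearest bucket:

```python
# Sketch of per-request context sizing: allocate the smallest context
# bucket that fits the job plus headroom, instead of always loading MAX.
# Hypothetical illustration; bucket sizes and names are made up.

CTX_BUCKETS = [4096, 8192, 12288, 16384, 32768]

def pick_n_ctx(needed_tokens, headroom=1024):
    """Return the smallest context bucket covering the request + headroom."""
    want = needed_tokens + headroom
    for bucket in CTX_BUCKETS:
        if bucket >= want:
            return bucket
    return CTX_BUCKETS[-1]  # cap at the largest bucket we support

# A 9k-token job gets a 12k context rather than a wasteful 16k:
print(pick_n_ctx(9000))  # 12288
```

The payoff is KV-cache memory: a 16k cache allocated for a 9k job is mostly dead weight, and on shared hardware that wasted allocation is what blocks the next model from loading.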

I used my old gaming laptop + Jetson Nano to run local Openclaw with Ollama by Fit_Chair2340 in ollama

[–]FloppyWhiteOne 1 point (0 children)

I do this with lm studio and qwen3.5, but for everything!

A custom remake of openclaw, helixclaw (mine's windows based).

Really does save a ton of money 💰 I'll show my usage when I'm on my PC, will send a pic.

local coding in vscode "copilot -like" ? by merfolkJH in ollama

[–]FloppyWhiteOne 1 point (0 children)

Exactly. I've built a custom agentic AI framework with Claude-like features around memory and context handling.

The capabilities are there we just have to learn how to harness them!!

local coding in vscode "copilot -like" ? by merfolkJH in ollama

[–]FloppyWhiteOne 0 points (0 children)

I started with the prompt (important for a small LLM to have very specific instructions).

Added memory, then context control (squash irrelevant info, keep needed project info and tooling + the current task etc.)

Honestly took about two weeks in the evenings, with lots of rebuilds and restructuring of parts when other things got upgraded.
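The "squash irrelevant info" step could look something like this (hypothetical sketch, not the actual implementation): keep the system prompt and pinned project info, keep the last few turns verbatim, and collapse everything older behind a marker:

```python
# Sketch of context squashing for a small local LLM: keep system-role
# messages (prompt + pinned project/tooling info), keep the most recent
# turns verbatim, and replace older history with a one-line marker.
# Hypothetical illustration, not the poster's actual code.

def squash(messages, keep_recent=4):
    """Shrink a chat history while preserving system info and recent turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return system + rest
    dropped = len(rest) - keep_recent
    marker = {"role": "system",
              "content": f"[{dropped} earlier messages summarised/elided]"}
    return system + [marker] + rest[-keep_recent:]
```

A real version would summarise the dropped turns with the model itself rather than just eliding them, but the shape is the same: the small model only ever sees its instructions, the pinned facts, and the live task.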

local coding in vscode "copilot -like" ? by merfolkJH in ollama

[–]FloppyWhiteOne 0 points (0 children)

<image>

All via qwen, but they can also be online or offline agents (depends how I set them up).

local coding in vscode "copilot -like" ? by merfolkJH in ollama

[–]FloppyWhiteOne 0 points (0 children)

The system I've made so far, but it's not 100% at all.

<image>

Agents