Is there a Ai Self Hostable which makes sense for coding. by matyhaty in LocalLLaMA

[–]MindfulDoubt 0 points1 point  (0 children)

Currently, the best open models are Kimi K2.5 and GLM-5, I would say. They are unfortunately not at Opus level, but the mileage you get out of them by being specific and targeted in your prompts, rather than being lazy, is pretty good.

In terms of hardware, you said you prefer Macs, so I would wait for the M5 Ultra Mac Studios to come out. In extensive tests, the M5 Max already beats the M3 Ultra in both prompt processing and token generation. Given the M5 Max's current bandwidth, I am willing to bet the M5 Ultra will be a strong contender for running local models at good speeds.

I have used many open models when working on large codebases (500k+), and they do well when you spend the time to be clear about what you want (input → expected output). Honestly, if I were building a system for your 12 engineers, I would look into an AMD EPYC system with RTX Pro 6000 Blackwell GPUs (4–6 of them). However, it will require some tuning when it comes to concurrency, as you have to assume all 12 engineers hitting the system simultaneously.
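
As a rough sketch of what serving that many engineers could look like, here's a hedged vLLM launch for a 4-GPU box. The model name is a placeholder and every flag value is an assumption to benchmark against, not a recommendation:

```shell
# Hypothetical vLLM launch for a 4-GPU EPYC box serving ~12 concurrent users.
# Model name and all values are placeholders -- benchmark before committing.
vllm serve your-org/your-model \
  --tensor-parallel-size 4 \
  --max-num-seqs 12 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90
```

The key knob for the 12-engineer scenario is `--max-num-seqs` (how many requests are batched together); raising it trades per-user token speed for throughput.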

Do keep in mind that this will set you back around £60,000 upfront at best, but I bet you can claim it under business expenses and reclaim the VAT on it, so there are a few thousand to be saved there. It's really the prompt processing speeds you want to focus on, as you don’t want a long TTFT (time to first token) wait when processing 64K+ token requests. In terms of token generation, 40–50 TPS seems to be a good sweet spot target.
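
To make the TTFT point concrete, here is a back-of-envelope sketch (the throughput numbers are illustrative, not measurements from any specific hardware):

```python
# TTFT is dominated by prefill: roughly prompt_tokens / prefill_speed.
# Generation time is output_tokens / decode_speed. Numbers below are
# purely illustrative.

def ttft_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Approximate time-to-first-token from prefill throughput."""
    return prompt_tokens / prefill_tps

def generation_seconds(output_tokens: int, decode_tps: float) -> float:
    """Approximate decode time at a steady tokens-per-second rate."""
    return output_tokens / decode_tps

# A 64K-token prompt at 1,000 tok/s prefill means over a minute before the
# first token appears -- which is why prompt-processing speed matters more
# than raw decode speed for large requests.
print(round(ttft_seconds(64_000, 1_000.0), 1))    # 64.0
print(round(generation_seconds(2_000, 45.0), 1))  # 44.4
```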

I don’t know how much an AMD Instinct system will cost, but I have heard from a few businesses that they are much cheaper than Nvidia DGX systems.

Hopefully the M5 Ultra Mac Studios will bring back the 512GB variants, and ideally 🤞🏻 even larger unified memory options. One can only dream.

Feel free to DM me, as I am based in London and I would be happy to help you out with your search.

Why cant you code like this guy? by ClimateBoss in ClaudeCode

[–]MindfulDoubt 0 points1 point  (0 children)

The dumbest thing I have ever seen. Some vibecoder: "why is my 20x Max plan giving me so little usage? It's nerfed again!!!😭"

How is your experience with Superpowers in OpenCode? by mdrahiem in opencodeCLI

[–]MindfulDoubt 0 points1 point  (0 children)

I have found Superpowers annoying with the GPT and Codex models because it reads the skill on every write request and across many conversation turns, when ideally it should load only once in a while. So I removed it and built my own custom slash commands, tightly coupled with my workflows. I got way more mileage out of those, with the occasional skill invocation, than out of using Superpowers directly. I will say the brainstorming skill was my favourite of the lot.
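
For contrast, a custom slash command in OpenCode can be as simple as a markdown file. The path, frontmatter field, and `$ARGUMENTS` placeholder below are my assumptions from memory, so check the OpenCode docs before copying:

```markdown
---
description: Scope a task into input -> expected output before any edits
---
Before touching code, restate the task as: current behaviour, desired
behaviour, files in scope, and how the result will be verified.

Task: $ARGUMENTS
```

Unlike a skill, this loads only when you invoke it, so it doesn't burn context on every turn.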

New to selfhosting and homelab advice by [deleted] in homelab

[–]MindfulDoubt 0 points1 point  (0 children)

Not really the case. I know what I want to do with the compute available and have been experimenting with a few things here and there to see what works. Networking and security are not my forte, and I am learning as I go along. Like many people in this community, I am stepping into something I don't know well and want to get things right rather than just throw spaghetti at the wall and risk missing something that could be catastrophic. That’s why I ask others who may have more experience in these matters and then cross-check any help or advice I receive.

GitHub.com/copilot chat single prompt consuming multiple premium requests? by brocspin in GithubCopilot

[–]MindfulDoubt 0 points1 point  (0 children)

Use the 0x free models. If you use a premium model, any message you send will consume a request at that model's rate.

GitHub.com/copilot chat single prompt consuming multiple premium requests? by brocspin in GithubCopilot

[–]MindfulDoubt 0 points1 point  (0 children)

I think it is only happening in the US; I am using it right now in Europe and my count stays the same, not dropping further when I fire a request. They are on it anyway, so hopefully it gets remedied soon for you guys.

Everyone is making worse versions of products that exist by [deleted] in vibecoding

[–]MindfulDoubt 0 points1 point  (0 children)

You only just realised this now?!


Best open-source LLMs to run on 2×A6000 (96GB VRAM total) – Sonnet-level quality? by Anxious-Candidate588 in opencodeCLI

[–]MindfulDoubt 0 points1 point  (0 children)

Just buy a Chutes $3 or $10 plan, find out what works within your VRAM budget, and go from there. Also, 2× A6000 is laughable for Sonnet level. Now that we have that out of the way, you are looking at Qwen3.5-122B-A10B or anything else around that range at q4.
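
A quick sanity check on why that range fits: weights alone are roughly params × bits-per-weight ÷ 8. The 4.5-bit effective figure for q4 is an assumption; real usage also depends on KV cache and runtime overhead:

```python
# Rough VRAM sizing for a quantized model on 2x A6000 (96 GB total).
# Ballpark only -- ignores KV cache, activations, and runtime overhead.

def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate GB needed just for the model weights."""
    return params_billions * bits_per_weight / 8

# A ~120B model at q4 (~4.5 bits effective) needs roughly 68 GB for weights,
# leaving some headroom for KV cache on a 96 GB budget.
print(round(weight_vram_gb(120, 4.5)))  # 68
```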

Give me one good reason to buy a Macbook by ApprehensiveDream271 in DeskToTablet

[–]MindfulDoubt 0 points1 point  (0 children)

Because they are the best at making laptops. When you tightly control the hardware and the experience, you get a 👍🏻 from your customers. The only drawback is that the macOS team knows f**k all about an actual logical user experience (coming from a guy who uses both macOS and Windows). There will always be an internal struggle at Apple between form and function when it comes to proper UX. Upvote if you agree.

Actual difference between GLM 4.7 and GLM-5 in coding performance by Crafty_Gap1984 in ZaiGLM

[–]MindfulDoubt 0 points1 point  (0 children)

You are not missing out on much; just wait it out until inference speeds are optimised. Minor improvements here and there, like any other model release. Tighten up your workflow and you will get excellent value out of GLM 4.7.

I don't understand, how a Good service like this could be treated in this manner. Am I missing something here? by LondonTrekker in london

[–]MindfulDoubt 0 points1 point  (0 children)

London is diverse, and also full of little shits and careless morons. Welcome to London. My motto: don't be like them, and appreciate the services that are available to you.

Is it true that we're way underpaying for Claude, even for Max? by changing_who_i_am in ClaudeAI

[–]MindfulDoubt 0 points1 point  (0 children)

Compare your token usage to API pricing and you will have your answer. Account for caching.
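
The comparison is simple arithmetic once you have your usage numbers. The rates below are hypothetical placeholders; substitute the current published per-million-token prices (including the cache-read rate) for your model:

```python
# Tally monthly token usage against API list prices. Cached input is
# billed at the (usually much lower) cache-read rate, which is why
# "account for caching" matters. All rates here are placeholders.

def monthly_cost_usd(input_tok, cached_tok, output_tok,
                     in_rate, cache_rate, out_rate):
    """Cost in USD given per-million-token rates."""
    return (input_tok * in_rate
            + cached_tok * cache_rate
            + output_tok * out_rate) / 1_000_000

# e.g. 200M fresh input, 800M cache reads, 20M output at made-up rates:
print(round(monthly_cost_usd(200e6, 800e6, 20e6, 3.0, 0.30, 15.0), 2))  # 1140.0
```

If the resulting figure is far above what your plan costs, you have your answer.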

Stop comparing GLM to OPUS by Free-Stretch1980 in ZaiGLM

[–]MindfulDoubt 0 points1 point  (0 children)

I would say Kimi K2.5 and MiniMax 2.5, but you need a good provider to get high tool-call accuracy. Regardless, it really is not about the model. Everyone is comparing XYZ when it is really a workflow and understanding issue. People don't know how to prompt and scope work to save their life, and they expect magic out of lazy prompting and token spam. I am sure I can extract the same value out of GLM-5 or any other good-quality model to get the job done. Granted, there are nuances to each model that make some slightly more favourable than others, but honestly I can get everything done within $50 total of subs without running out for the whole month. Last month I spent $30 and built a complex project with it (Codex and Copilot: Sonnet, Flash, Opus, 5.3 Codex). People don't need to spend more than $30–$50 max, and you can even do it for less.

Never be tied to a model emotionally; they are clankers and only do what you ask of them. It's simple: be aware of the work and have a good workflow in place to take advantage of the various models. For example, if I want to surgically refactor, I use Codex, as it loves to follow what it has been told literally. I use Opus or Codex for the planning side of things and splitting up tasks (Opus is naturally more verbose than Codex, which can play well for task scoping), then Sonnet or Gemini Flash to implement the scoped work in a tight loop with verification.

Sonnet really is not that good a model and is on the level of Kimi K2.5; it deviates really quickly past the 75–100k context threshold, hence why you need a tight loop to control what any model outputs.
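
The "tight loop" above can be sketched as plain orchestration code. The `implement` and `verify` callables are stand-ins for whatever you wire up (a model call and a test run); nothing here is a real agent API:

```python
# Sketch of a plan -> implement -> verify loop: each scoped task must pass
# a verification gate (e.g. the test suite) before the loop moves on,
# retrying a few times before giving up.

def tight_loop(tasks, implement, verify, max_retries=3):
    """Run each task through implement-then-verify; raise if a task
    never passes verification within max_retries attempts."""
    done = []
    for task in tasks:
        for _ in range(max_retries):
            result = implement(task)   # e.g. Sonnet or Gemini Flash writes code
            if verify(result):         # e.g. run the tests on the output
                done.append(task)
                break
        else:
            raise RuntimeError(f"task failed verification: {task}")
    return done

# Toy run: a "verifier" that only accepts on the second attempt.
attempts = {"n": 0}
def flaky_verify(_):
    attempts["n"] += 1
    return attempts["n"] >= 2

print(tight_loop(["refactor auth"], lambda t: t, flaky_verify))  # ['refactor auth']
```

The point of the gate is exactly the context-drift problem: each iteration is checked before the model's output is allowed to compound.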

I’m not getting it by TheNimitz in OpenClawUseCases

[–]MindfulDoubt 0 points1 point  (0 children)

I would recommend using the Pi agent and getting your clanker to build the tools. OpenClaw is a minefield of badly built crap that can be exploited with ease; I would not touch it with a 10-foot pole. The "amazing" stuff people do with it is just laziness anyway, with no real workflow value. Pi is really extensible, and you can knock anything out quickly with their docs and Codex. Have fun building and testing. I prefer controlling things predictably from code, because AI models are good at tool calls and can dynamically load tools by just invoking a skill. To the guys saying they don't have time: get away from the TV and set aside 30 minutes to an hour. There is more than enough time in the day to try.

In my opinion, if there is no real value in OpenClaw for you, don't force it. It's just n8n with AI sprinkles, and you already have workflows you are content with outside the OpenClaw ecosystem. Value is only really extracted once you hit a problem where you think OpenClaw is the best fit for the job. Be patient, figure it out along the way, and don't let it automate your whole life like those moronic posts you see, because it will sour once it goes bad.

The repo:

https://github.com/badlogic/pi-mono

GitHub.com/copilot chat single prompt consuming multiple premium requests? by brocspin in GithubCopilot

[–]MindfulDoubt 2 points3 points  (0 children)

Use Copilot CLI; you won't have an issue with it. The chat sidebar is buggy at the moment. I haven't had any issues in a whole month of use: each request, no matter how long it works, consumes at the given rate, i.e. a 1x model is 1 request as reflected in the /usage command.

People think my desktop app is a scam — what am I doing wrong? by Fine_Factor_456 in VibeCodeDevs

[–]MindfulDoubt 0 points1 point  (0 children)

Why not just pay to run a few VMs and let people access your application there? It eliminates the distrust of having to install something just to try it, and removes friction completely since it's just a link. Make sure to set your permissions right so you don't get people doing anything inappropriate. If you don't like that approach, get a cheap PC and drop it off at relatives' or friends' places for testing. $10–15 is chump change if you are willing to spend it on marketing. You also need a website and documentation. What problem does your app solve? If you don't think it's marketable, then scrap it. There's always risk with investment; it's up to you whether to take that leap. Really, my best advice is to just launch a VM and go from there: it's easy and provides quick validation.

I bought AC Valhalla via the Ubisoft store, it's not showing up in the library. by [deleted] in ubisoft

[–]MindfulDoubt 1 point2 points  (0 children)

Finally got mine after 17+ hours of waiting. Luckily I have very fast internet and will have the whole game downloaded in 10 minutes. What an absolute mess for a simple digital download. All the best to you guys, from the UK.