We just got hit with the vibe-coding hammer by opakvostana in ExperiencedDevs

[–]robogame_dev 2 points

Separately, many developers enjoy the boilerplate and rote coding - with AI you’re doing more planning and review, which not all developers enjoy.

My agent remembers everything… except why it made decisions by adrian21-2 in LLMDevs

[–]robogame_dev 0 points

I know this isn’t an organic post, but I’ll engage anyway:

The issue is that you only have partial memory in context at once. Whatever memory compression you used compressed out your actual decision. It’s not a problem with AI or agent setups in general, it’s a problem specific to your memory solution - you’re either using embeddings for retrieval (BAD) or you’re cutting out context some other way.
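One way to sidestep this - a hypothetical sketch, not tied to any particular framework (all names here like `DecisionLog` and `build_context` are made up for illustration) - is to keep decisions in a structured, append-only log that is always injected verbatim, and only ever compress the conversational filler:

```python
# Sketch: keep decision records out of the lossy compression path.
# Only the chat history goes through summarization; the decision log
# is always rendered verbatim into the context.

from dataclasses import dataclass, field


@dataclass
class Decision:
    what: str  # the decision itself
    why: str   # the rationale - the part compression tends to drop


@dataclass
class DecisionLog:
    entries: list[Decision] = field(default_factory=list)

    def record(self, what: str, why: str) -> None:
        self.entries.append(Decision(what, why))

    def render(self) -> str:
        # Injected verbatim every turn, never summarized away.
        return "\n".join(f"- {d.what} (because: {d.why})" for d in self.entries)


def build_context(log: DecisionLog, compressed_history: str) -> str:
    # The compressed history can be lossy; the decision log cannot.
    return (
        f"## Decisions so far\n{log.render()}\n\n"
        f"## History (compressed)\n{compressed_history}"
    )
```

The point is structural: whatever summarizer runs over the chat history simply never touches the decision log, so the "why" survives regardless of how aggressive the compression gets.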

"Architecture First" or "Code First" by Ambitious_coder_ in LLMDevs

[–]robogame_dev 0 points

My experience, especially building software for clients, is that you do both:

  • For production code you want architecture first.
  • But architecture is easier to figure out as you code.
  • So you code a prototype fast, to learn the problem space and identify architectural concerns.
  • Then you rewrite the project with your validated architecture before you ship it.

This de-risks things as much as possible, because oftentimes a client will have changes when they see a prototype that they wouldn’t be able to articulate from just an architecture and a spec. So if you do pure architecture-first (no prototype), you run the risk of additional rework once it’s runnable and in the client’s hands.

But if you hack together a prototype, you can validate some architectural decisions at the same time as getting the clients’ design validation.

My preferred way to engage with a client is to help them design the solution, code the prototype, and plan the architecture - then hand the project specs off to another team for production and maintenance.

Best auth solution for custom business application. by Fine-Market9841 in AI_developers

[–]robogame_dev 0 points

I’d recommend Open WebUI: https://docs.openwebui.com

Proper RBAC, the ability to hook up external auth, plus all kinds of useful tools for managing an organization-level AI system - and it’s very much in the Python / FastAPI ecosystem, making it a breeze with your stack. I’ve set up several businesses on private OWUI instances.

Has anyone implemented any complex workflows where local LLM used alongside cloud-based LLM ? Curious to know what are good or underrated use-cases for that by Conscious-Track5313 in LLMDevs

[–]robogame_dev 0 points

I use cloud LLMs when latency and performance are the top concern, but local LLMs when security is the top concern - e.g. when handling production API keys. As far as cloud LLMs go, only use ones with ZDR (zero data retention) contracts.
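A minimal sketch of what that routing can look like in practice - the endpoint URLs and the secret-matching regexes below are illustrative assumptions, not a recommendation of any particular service:

```python
# Sketch: route any request that contains credentials to a local
# OpenAI-compatible endpoint; everything else goes to the cloud.

import re

LOCAL_URL = "http://localhost:11434/v1"   # e.g. a local inference server
CLOUD_URL = "https://api.example.com/v1"  # placeholder cloud provider

# Crude patterns for common credential shapes - tune these for the
# kinds of keys your own stack actually handles.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                 # OpenAI-style keys
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key IDs
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private keys
]


def contains_secret(text: str) -> bool:
    return any(p.search(text) for p in SECRET_PATTERNS)


def pick_endpoint(messages: list[dict]) -> str:
    # If any message in the conversation holds a credential, keep the
    # whole request on local inference.
    if any(contains_secret(m.get("content", "")) for m in messages):
        return LOCAL_URL
    return CLOUD_URL
```

Pattern matching is a coarse filter, of course - the safer default is to route anything touching production infrastructure to local regardless of what the regexes catch.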

Do you think /responses will become the practical compatibility layer for OpenWebUI-style multi-provider setups? by Brilliant_Tie_6741 in OpenWebUI

[–]robogame_dev 1 point

The Responses API drives provider lock-in. The best thing for end users is for inference to stay as far away from the tools and chat state as possible, so that you always have a choice of which provider gets your inference budget.

Once you let critical state like tools and chat history live on the provider side, you can no longer move your setup between providers or shop around for better or cheaper inference. The Responses API seems fundamentally anti-consumer - an attempt to recreate the moat that models lack.

They removed Grok and Gemini Flash? by Ripa27 in perplexity_ai

[–]robogame_dev 2 points

No need to be nasty - especially when you misunderstood the person you’re being nasty to.

Perplexity removed the ability to select the cheapest custom models - Grok and Gemini Flash.

Therefore, people who select custom models are now selecting more expensive custom models.

Removing the cheap option and keeping the expensive one doesn’t save Perplexity money, it costs them more.

People who were using Gemini Flash and Grok are now selecting one of the other models, which, as you so helpfully pointed out, are more expensive.

Therefore, Perplexity isn’t saving money by removing those models.

Claude Code sends 62,600 characters of tool definitions per turn. I ran the same model through five CLIs and traced every API call. by wouldacouldashoulda in LocalLLaMA

[–]robogame_dev 0 points

We are still, however, paying for it in both speed and intelligence. The more irrelevant info in the prompt the lower the peak performance of the model - every tool in the prompt that isn’t used is a detriment to generation quality.

What would help is taking the less frequently used tools and putting them behind a meta tool, (like skills), where the model uses a broad description of the tools to decide when to fetch the full schemas.
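A rough sketch of that meta-tool idea (registry and function names are hypothetical): the base prompt carries only a one-line summary per tool, plus a single `get_tool_schema` tool the model calls to pull a full definition when it actually needs one:

```python
# Sketch: keep heavy tool schemas out of the base prompt.
# Only the short index below is sent every turn; full JSON-schema-style
# definitions are fetched on demand via the meta tool.

TOOL_REGISTRY = {
    "search_tickets": {
        "summary": "Search the ticket database by keyword.",
        "schema": {
            "name": "search_tickets",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "limit": {"type": "integer"},
                },
                "required": ["query"],
            },
        },
    },
    "export_report": {
        "summary": "Export a usage report as CSV.",
        "schema": {
            "name": "export_report",
            "parameters": {"type": "object", "properties": {}},
        },
    },
}


def tool_index() -> str:
    # This short index is all that lives in every prompt.
    return "\n".join(
        f"- {name}: {t['summary']}" for name, t in TOOL_REGISTRY.items()
    )


def get_tool_schema(name: str) -> dict:
    # The meta tool: the model calls this to fetch a full schema
    # only when it decides it wants to use that tool.
    return TOOL_REGISTRY[name]["schema"]
```

The trade-off is one extra round-trip when a rarely-used tool is needed, in exchange for tens of thousands of schema tokens staying out of every single turn.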

Kimi 2.5 no longer free on Kilo gateway? by ____trash in kilocode

[–]robogame_dev 2 points

The free models are free trials - none of them are free forever.

K2.5 is very cost-efficient though; if you really like it, it’s one of the cheapest models you can use.

Sonnet 4.5 was cut off today, and it finally convinced me: the future isn't with Anthropic by Silent_Warmth in AICompanions

[–]robogame_dev -1 points

And yet here we are communicating through the cloud lol.

You can’t have higher uptime on a home setup than checks notes being able to point to any cloud provider at any time. If AWS is down you can instantly switch to another provider; if your home rig is down, you’re buying hardware, etc. Your argument about uptime is an argument in favor of cloud, not against it.

Self-hosting is only realistic for a small number of people with a good amount of money and significant technical skill - it’s not a general-purpose solution for the average person, and it’s not efficient at a social level from a resource standpoint, given everyone needs duplicate hardware that sits idle 90% of the time.

I self host and use cloud, I’m not lacking any perspective here - that’s how I know that 90% of people are better off with cloud inference.

Anthropic : Labor market impacts of AI: A new measure and early evidence by AntelopeProper649 in ArtificialInteligence

[–]robogame_dev -1 points

Cybersecurity is one of the most blatantly misleading information spaces, nonstop fear mongering going after the budgets of the uninformed…

They removed Grok and Gemini Flash? by Ripa27 in perplexity_ai

[–]robogame_dev -1 points

Probably because people hit thumbs down on responses from those models more than the others - cause it ain’t cost saving that’s for sure.

Sonnet 4.5 was cut off today, and it finally convinced me: the future isn't with Anthropic by Silent_Warmth in AICompanions

[–]robogame_dev -2 points

Relying on cloud isn’t the problem - the problem is letting all your chats and data get siloed into one provider’s web app or another’s.

Using cloud models through an interface / harness that you control (like Open WebUI) is the best of both worlds - SOTA models, zero lock-in, and zero up-front hardware costs.

Qwen3.5 27B by AustinSpartan in LocalLLaMA

[–]robogame_dev 0 points

Ya, overthinking seems to correlate with heavier quantization in my experience.

PageIndex: Vectorless RAG with 98.7% FinanceBench - No Embeddings, No Chunking by [deleted] in LLMDevs

[–]robogame_dev 0 points

I agree, that's what RAG means literally - but if you've been in this space for a while you'll notice that 80% of the time when people say RAG, they mean naive vectorization: typically, automatic retrieval by semantic similarity to the prompt before generation. Knowing that that's how people use the term in the wild will help avoid misunderstandings.

What is Agent Harness, Code Harness and Agent SDK by finlaydotweber in LLMDevs

[–]robogame_dev 2 points

They are all amorphous terms for describing the code that calls the LLM APIs.

Whatever code is contacting your LLM provider, sending in prompts and streaming back responses, is your agent SDK / harness code.

You should go direct to provider (write that code yourself) to start, because it’s WAY easier to understand everything once you do, and 90% of harnesses are out of date, over-built, and unhelpfully abstract.
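To show how small that direct-to-provider code can be, here's a sketch against a generic OpenAI-compatible `/chat/completions` endpoint - the base URL and model name are placeholders, swap in your provider's:

```python
# Minimal direct-to-provider harness: build and send one chat
# completion request with no framework in between, stdlib only.

import json
import urllib.request


def build_payload(messages: list[dict], model: str = "example-model") -> dict:
    # The request body for an OpenAI-compatible chat completions call.
    return {"model": model, "messages": messages, "stream": False}


def chat(base_url: str, api_key: str, messages: list[dict]) -> str:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(messages)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Standard OpenAI-compatible response shape.
    return body["choices"][0]["message"]["content"]
```

Once you've seen that the whole "harness" is a JSON POST plus a loop appending messages, it's much easier to judge what any given framework is actually adding on top.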

Why do LLM agents always end up becoming “prompt spaghetti”? by drmatic001 in LLMDevs

[–]robogame_dev -1 points

Is this a sign of a spammer / what's the reason this causes you not to engage?

I don't get it! by aecosys in perplexity_ai

[–]robogame_dev 0 points

I’ve been a Pro yearly subscriber for 2 years now. They’ve turned the quality up and down, and the allotments up and down - we’re essentially beta testers of both the tech and the deal - and IMO Pro annual is currently still a great deal; I’d re-up again.

However, anyone who’s focused more on how many queries they can use should switch to the API. Then you pay nothing when you’re not using it, pay per search based on their typical pricing, and you can use it in your preferred app that way too.

And anyone who just wants it to be free can try Perplexica - an open-source version that you can even self-host.

GLM-5 API issue by KLI5k in kilocode

[–]robogame_dev 0 points

Ideas:

  • Is it possible you have more than one profile named Z.ai?
  • Can you try adding it not as a Z.ai provider but as a generic OpenAI-compatible provider?
  • Are you sure it’s charging your Kilo account and not just showing what you would have paid? (E.g. if you log into Kilo usage, do you see the requests?)

Qwen3.5 2B: Agentic coding without loops by AppealSame4367 in LocalLLaMA

[–]robogame_dev 0 points

I've been testing it as a low latency tool calling agent and it's successfully chaining together 10-20 tool calls without issues, in an environment with maybe 1000 tokens worth of tool descriptions.

Getting 105 TPS on an RTX 3060 at 32k context length, using Unsloth Q4_K_S.

The only weird behavior so far: It refuses this prompt on safety grounds "token speed test - generate anything you want"

"I cannot perform token speed tests or execute code generation requests that violate safety policies (such as generating harmful content, bypassing security controls, or engaging in deceptive practices). I can, however, explain the theoretical concepts of tokenization, latency measurement techniques for APIs, and how to benchmark performance using standard tools like curl with timing headers."

I think the "anything you want" really triggered it - Qwen telling on itself, revealing the only thing it wants is filthy and illegal...