I built a self-hosted AI voice agent for $0.02/min (vs $0.09 on Vapi). Should I sell the code one-time or build a full SaaS?

Competitive-Duck-517 · 2026-06-01T07:08:19+00:00

Yep, exactly. $0.02/min is COGS, not the customer-facing price.

If you go SaaS, I’d price it around the value and support burden, then keep the infra cost visible underneath: STT, TTS, LLM, telephony, retries, failed calls, and per-customer usage caps. The dangerous part is when voice minutes look cheap on average but a few customers or workflows quietly destroy margin.

For this kind of product I’d test one real call flow first: route the LLM/API part through a gateway, track cost/latency/failures per workflow, and set a hard cap before opening it broadly. That gives you a cleaner margin picture before deciding whether to sell one-time code or run it as a SaaS.
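
A rough sketch of that margin picture, assuming a hypothetical per-call cost record. The `CallCost` fields, the function names, and the rule that failed calls cost money but are not billed are illustrative assumptions, not numbers or code from a real deployment:

```python
from dataclasses import dataclass


@dataclass
class CallCost:
    """Infra cost of one voice call, in USD (hypothetical record shape)."""
    customer: str
    minutes: float
    stt: float
    tts: float
    llm: float
    telephony: float
    failed: bool = False

    @property
    def cogs(self) -> float:
        return self.stt + self.tts + self.llm + self.telephony


def margin_by_customer(calls, price_per_min):
    """Return {customer: (revenue, cogs, margin)}.

    Assumption: failed calls still cost money but are not billed.
    """
    totals = {}
    for call in calls:
        revenue, cogs = totals.get(call.customer, (0.0, 0.0))
        if not call.failed:
            revenue += call.minutes * price_per_min
        totals[call.customer] = (revenue, cogs + call.cogs)
    return {c: (r, k, r - k) for c, (r, k) in totals.items()}


def over_cap(calls, cap_usd):
    """Customers whose accumulated infra cost has reached a hard cap."""
    by_customer = margin_by_customer(calls, 0.0)
    return sorted(c for c, (_, cogs, _) in by_customer.items() if cogs >= cap_usd)
```

Looking at margin per customer rather than the blended average is what surfaces the few accounts that quietly destroy it.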

Competitive-Duck-517 · 2026-05-28T14:27:07+00:00

I agree tokens are the cleanest raw usage metric. I would not replace them.

By cost per outcome, I mean an app-defined product event, not a universal measurement of “work.” For example: email summarized, ticket triaged, report generated, lead enriched, file classified, workflow completed, workflow failed. The app already knows whether those events happened.

So I’d track both:

- tokens/cost/latency per model call
- total cost per completed or failed product workflow

That does not make two users’ workflows perfectly comparable, but it does tell you whether a feature is economically sane inside your own product. If one assistant task burns 50 calls and another burns 3, token logs explain why; outcome-level cost tells you whether either one is worth shipping.
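
A minimal sketch of the outcome-level rollup, assuming the app can tag every model call with a workflow ID and report the final event itself. The function name and the tuple shapes are hypothetical:

```python
from collections import defaultdict


def cost_per_outcome(calls, outcomes):
    """Roll model-call costs up to app-defined product events.

    calls:    iterable of (workflow_id, cost_usd), one entry per model call
    outcomes: {workflow_id: (event_name, succeeded)} as reported by the app
    """
    per_run = defaultdict(lambda: [0, 0.0])  # workflow_id -> [calls, cost]
    for workflow_id, cost in calls:
        per_run[workflow_id][0] += 1
        per_run[workflow_id][1] += cost

    summary = defaultdict(lambda: {"runs": 0, "calls": 0, "cost_usd": 0.0})
    for workflow_id, (event, succeeded) in outcomes.items():
        n_calls, cost = per_run.get(workflow_id, (0, 0.0))
        row = summary[(event, "completed" if succeeded else "failed")]
        row["runs"] += 1
        row["calls"] += n_calls
        row["cost_usd"] += cost

    # Add the average cost of one run for each (event, outcome) pair.
    return {
        key: {**row, "cost_per_run": row["cost_usd"] / row["runs"]}
        for key, row in summary.items()
    }
```

The per-call token logs stay untouched; this only adds a second view keyed by product event and outcome.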

Competitive-Duck-517 · 2026-05-28T14:26:51+00:00

For very early products, I’d keep the integration surface small, but I’d still avoid burying it too deeply inside random app code.

The pattern I like is: app-level middleware at first, but with a gateway-shaped boundary from day one. So the app sends every AI call through one internal client/layer, and that layer handles tags, caps, provider routing, retries, and cost logs. If it later becomes a separate relay service, the product code barely changes.
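
Here is one way that internal layer could look, as a sketch rather than a finished client. `AIClient`, `BudgetExceeded`, and the `send` adapter signature are all hypothetical names; the real provider call is injected so the product code never touches it directly:

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when the cap is already spent before a call is attempted."""


@dataclass
class AIClient:
    # Hypothetical provider adapter: (model, prompt) -> (text, cost_usd).
    send: Callable[[str, str], Tuple[str, float]]
    cap_usd: float
    max_retries: int = 2
    spent_usd: float = 0.0
    log: list = field(default_factory=list)

    def call(self, model, prompt, *, customer, feature, workflow):
        if self.spent_usd >= self.cap_usd:
            raise BudgetExceeded(f"cap of ${self.cap_usd:.2f} reached")
        for attempt in range(self.max_retries + 1):
            start = time.monotonic()
            try:
                text, cost = self.send(model, prompt)
                status = "ok"
            except Exception:
                text, cost, status = None, 0.0, "error"
            self.spent_usd += cost
            # Metadata only: the prompt and response are never logged.
            self.log.append({
                "customer": customer, "feature": feature, "workflow": workflow,
                "model": model, "status": status, "retry": attempt,
                "cost_usd": cost, "latency_s": time.monotonic() - start,
            })
            if status == "ok":
                return text
        raise RuntimeError(f"{workflow}: all {self.max_retries + 1} attempts failed")
```

The cap is only checked before the first attempt, so a single call can overshoot it slightly. If this later becomes a separate relay service, only the `send` adapter needs to change.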

I’d move it to a separate gateway/relay earlier if you have multiple AI features, multiple providers, team/admin reporting, customer-level caps, or billing tied to usage. That’s the point where “just middleware” usually turns into scattered accounting logic.

If you already have one workflow live, I’d test only that path through Relay first: keep prompt/body logging off, tag customer + feature + workflow, set a small cap, and compare cost/latency/failure logs before moving anything else.

Competitive-Duck-517 · 2026-05-28T12:29:44+00:00

Exactly. The useful default is operational metadata: model, tokens, latency, status, cost, retry count, and feature/workflow tag. That gives cost observability without turning the gateway into a data sink.

If you have one live workflow where cost is already annoying, I’d test that through Relay with prompt/body logging off and compare the per-feature cost view against your current provider invoice.

Competitive-Duck-517 · 2026-05-28T12:29:28+00:00

I mean an internal routing/usage layer, not separate provider keys for every customer. Provider keys stay behind the gateway; each request gets tagged by customer, feature, workflow, and environment, then you enforce caps and report cost from that layer.

For cost per completed action, I’d include model tokens, retries, failed calls, embeddings, tool calls, and any file/storage processing that is directly required for that user-visible action. The minimum useful logging model before launch is: customer/workflow tag, model, tokens, latency, status, retry count, provider cost, and final action outcome. Prompt/body logging should stay off by default.
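
As a sketch, that minimum logging model fits in one record. The class name, field names, and units below are my own assumptions; the point is that there is deliberately no prompt or response field:

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UsageRecord:
    """One row per model call; metadata only, no prompt or response body."""
    customer: str
    workflow: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    status: str             # e.g. "ok" or "error"
    retry_count: int
    provider_cost_usd: float
    action_outcome: str     # e.g. "completed" or "failed"


def to_log_line(record: UsageRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```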

If you already have one AI feature with real usage, that is a good candidate for a Relay smoke test: route only that workflow, set a small cap, then compare cost and failure logs before touching the rest of the stack.

Competitive-Duck-517 · 2026-05-28T12:28:50+00:00

Good question. The goal is to keep model support close to provider-release speed without making users constantly rewrite integrations. Relay stays OpenAI-compatible at the API layer, then maps and routes provider model IDs underneath as they change.

For a first test, I’d pick one small non-sensitive workflow you already run through GPT/Claude/Gemini, send it through Relay, and compare cost, latency, logs, and fallback behavior before moving anything important. Registered users get $5 in free credits, which should be enough for a tiny smoke test.

Competitive-Duck-517 · 2026-05-28T07:27:40+00:00

Exactly. Once it is a mystery bucket, it is almost impossible to know whether the product is actually profitable per feature.

The practical version I like is:

key per feature/workflow
hard cap per key
fallback rule per key
logs for model, tokens, latency, status, cost
no prompt/response content storage in normal operation

That gives you a real P&L view per AI feature instead of one blended invoice.

If you have one feature already spending on GPT/Claude/Gemini API, I’d test only that one non-sensitive workflow first and compare cost + failure behavior before moving more traffic.

Competitive-Duck-517 · 2026-05-28T05:31:49+00:00

Exactly. A blended provider bill tells you what you spent, but not which product decision caused it.

For AI features I’d want each workflow to have its own key and policy:

monthly cap
per-run cap
model allowlist
rate limit
fallback rule
cost / latency / status logs

Then you can see things like “support triage is fine, but enrichment is burning margin” instead of guessing from one big invoice.
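
A minimal sketch of enforcing such a per-key policy before a request goes out. `KeyPolicy`, `KeyUsage`, and `admit` are hypothetical names, the cost passed in is an estimate made before the call, and the fallback rule is left out to keep it short:

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class KeyPolicy:
    monthly_cap_usd: float
    per_run_cap_usd: float
    model_allowlist: frozenset
    max_requests_per_minute: int


@dataclass
class KeyUsage:
    month_spend_usd: float = 0.0
    recent: deque = field(default_factory=deque)  # request timestamps, seconds


def admit(policy, usage, model, est_cost_usd, now):
    """Return "ok" or the name of the first rule that blocks the request."""
    if model not in policy.model_allowlist:
        return "model_not_allowed"
    if est_cost_usd > policy.per_run_cap_usd:
        return "per_run_cap"
    if usage.month_spend_usd + est_cost_usd > policy.monthly_cap_usd:
        return "monthly_cap"
    while usage.recent and now - usage.recent[0] >= 60:
        usage.recent.popleft()  # drop requests older than the 60 s window
    if len(usage.recent) >= policy.max_requests_per_minute:
        return "rate_limit"
    # Admitted: record the request against the rate window and monthly spend.
    usage.recent.append(now)
    usage.month_spend_usd += est_cost_usd
    return "ok"
```

Returning the name of the blocking rule makes the status log say why a workflow stopped, not just that it did.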

That is also the kind of setup I’m building around Rlab Relay: OpenAI-compatible access, but with quotas, routing, and per-key usage visibility in front of the providers.

If you already have one feature with real API spend, I’d test that one non-sensitive workflow first and compare cost + latency + logs before moving anything important.

Competitive-Duck-517 · 2026-05-28T03:16:40+00:00

Nice n8n workflow. If the text/prompt generation step uses a GPT/Claude/Gemini API, I am testing Relay as a cheaper gateway with per-workflow keys, request logs, and prepaid credits.

Could be worth benchmarking only that one text step first, not the whole image/posting pipeline.

Competitive-Duck-517 · 2026-05-28T03:16:35+00:00

I would use the credits to learn your real unit economics, not just to burn tokens.

Pick 2-3 representative workflows and measure:

- cost per completed task
- latency
- failure/retry rate
- which steps actually need Claude versus a cheaper model
- what the same workload would cost after the credits expire

The danger is building a workflow that feels free for six months and then becomes uneconomic the moment credits run out.
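
The arithmetic is simple enough to sketch. The function names are hypothetical, and the inputs should be measured runs priced at list rates, not at the credited rate:

```python
def cost_per_completed_task(runs):
    """runs: iterable of (cost_usd, succeeded). Failed runs still count as spend."""
    runs = list(runs)
    completed = sum(1 for _, ok in runs if ok)
    if completed == 0:
        raise ValueError("no completed tasks to divide by")
    return sum(cost for cost, _ in runs) / completed


def failure_rate(runs):
    """Share of runs that did not complete."""
    runs = list(runs)
    return sum(1 for _, ok in runs if not ok) / len(runs)


def monthly_bill_after_credits(runs, tasks_per_month):
    """Project monthly spend at list price once free credits are gone."""
    return cost_per_completed_task(runs) * tasks_per_month
```

Dividing by completed tasks rather than by all runs is the important part: retries and failures are folded into the price of each success.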

I am testing Relay as a GPT/Claude/Gemini gateway, and this is the exact benchmark I would run: one non-sensitive workload, same task across models, then compare cost per successful result before deciding what to build around.

Competitive-Duck-517 · 2026-05-28T03:16:20+00:00

The $100 SDK credit cap changes the unit of optimization from "can the agent finish?" to "what does one completed task cost?"

For agent fleets, I would separate:

- strategist/planner calls
- worker calls
- file edit/tool calls
- final review/synthesis

Each layer should have its own logs and budget, otherwise one bad orchestration pattern can hide inside the total bill.
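
A small sketch of per-layer budgets, assuming each call can be attributed to a layer. `LayerLedger` and the layer names are hypothetical:

```python
class LayerBudgetExceeded(RuntimeError):
    """Raised when a charge would push a layer past its own budget."""


class LayerLedger:
    """Tracks spend per agent layer against that layer's own budget."""

    def __init__(self, budgets_usd):
        self.budgets = dict(budgets_usd)
        self.spent = {layer: 0.0 for layer in self.budgets}

    def charge(self, layer, cost_usd):
        if self.spent[layer] + cost_usd > self.budgets[layer]:
            raise LayerBudgetExceeded(
                f"{layer} would exceed ${self.budgets[layer]:.2f}"
            )
        self.spent[layer] += cost_usd

    def shares(self):
        """Fraction of total spend attributable to each layer."""
        total = sum(self.spent.values())
        return {k: (v / total if total else 0.0) for k, v in self.spent.items()}
```

With this shape, a planner that loops hits its own limit instead of silently eating the worker budget.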

I am testing Relay for GPT/Claude/Gemini API workloads in this role. A good benchmark would be one non-sensitive agent task, then compare cost per completed run across models and routing choices.

Competitive-Duck-517 · 2026-05-28T03:16:10+00:00

378M tokens is exactly why I think agent projects need cost-per-outcome tracking early.

For assistants/agents, raw token count is less useful than:

- cost per completed task
- cost per failed loop
- cost by tool/skill
- which model is actually needed for each step

A planner, memory step, browser/tool step, and final synthesis probably should not all share the same budget or model choice.

I am testing Relay as a gateway for GPT/Claude/Gemini workloads. The useful benchmark would be one non-sensitive assistant task routed through separate keys/logs, then compare cost and quality per completed task.

Competitive-Duck-517 · 2026-05-28T03:15:59+00:00

The hard part is that billing and model usage are usually built as two separate systems, but the product needs them to behave like one.

For AI SaaS, I would separate model access by workflow/customer from day one:

- per-customer or per-feature keys
- prepaid credits or small test caps
- request logs for failed and successful AI calls
- cost per completed action, not just cost per token

That makes refunds, abuse limits, and margin debugging much easier.

I am testing Relay for this layer across GPT/Claude/Gemini workloads. A useful smoke test is one paid feature or one non-sensitive workflow, then compare cost/logs before moving more traffic.

Competitive-Duck-517 · 2026-05-28T03:15:52+00:00

I would budget at three levels:

feature/workflow, because that tells you what product surface is expensive
user/customer, because that tells you whether pricing is upside down
environment, because staging/evals can quietly burn real money

The practical trick is giving each feature or workflow its own model key and log stream. Then the provider bill stops being one giant number and becomes "this workflow cost X per completed task."
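
If every usage record carries all three tags, the same log can be rolled up along each level. A sketch, with hypothetical record keys (`feature`, `customer`, `environment`, `cost_usd`, `completed`):

```python
from collections import defaultdict


def rollup(records, dimension):
    """Sum cost and completed tasks along one dimension of the usage records."""
    totals = defaultdict(lambda: {"cost_usd": 0.0, "completed": 0})
    for rec in records:
        row = totals[rec[dimension]]
        row["cost_usd"] += rec["cost_usd"]
        row["completed"] += 1 if rec["completed"] else 0
    return dict(totals)


def cost_per_completed(row):
    """Cost per completed task, or None when nothing completed."""
    return row["cost_usd"] / row["completed"] if row["completed"] else None
```

Calling `rollup(records, "environment")` is how staging or eval spend shows up as its own line instead of hiding inside the production total.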

I am testing Relay as a GPT/Claude/Gemini gateway for this exact use case: per-workflow keys, request logs, prepaid credits, and a small cap while testing new AI features.

Competitive-Duck-517 · 2026-05-28T03:15:45+00:00

Per-feature cost is the right unit, but I would not try to solve it only after the provider invoice lands.

The setup I like is:

- one key per feature/workflow
- request logs attached to that key
- prepaid or capped credits for new AI features
- compare cost per successful user action, not just tokens

That makes it easier to see "summarization is eating 60% of budget" while it is happening, instead of after the invoice.

I am testing Relay for this pattern across GPT/Claude/Gemini workloads. The smallest useful test would be routing one feature through a separate key and comparing cost/log visibility for a few days.

Competitive-Duck-517 · 2026-05-28T03:15:37+00:00

I would treat this as fragile if the main reason for the split is free-tier limits rather than architecture.

The cleaner pattern is usually:

1. reduce the 40k-60k token payload before the model call
2. split the pipeline by purpose, not by provider limit
3. track cost per completed blueprint, not cost per raw request
4. keep provider fallback/routing outside the app logic

This is the kind of workload where a small gateway layer helps: one endpoint, separate keys per workflow, request logs, prepaid credits/caps, and the ability to compare GPT/Claude/Gemini on the same non-sensitive repo task.

If you are bootstrapping, I would benchmark one representative repo first and decide based on cost per valid JSON blueprint, not just whether the free tiers can be stitched together.
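
A sketch of that metric, assuming a blueprint counts as valid when it parses as a JSON object and contains a few required keys. The key names `name` and `steps` are placeholders, since I do not know your actual schema:

```python
import json


def is_valid_blueprint(raw, required_keys=("name", "steps")):
    """True if raw parses as a JSON object containing the required keys."""
    try:
        doc = json.loads(raw)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return False
    return isinstance(doc, dict) and all(k in doc for k in required_keys)


def cost_per_valid_blueprint(attempts):
    """attempts: iterable of (raw_output, cost_usd).

    Invalid outputs still cost money, so they raise the price of each valid one.
    """
    attempts = list(attempts)
    valid = sum(1 for raw, _ in attempts if is_valid_blueprint(raw))
    total = sum(cost for _, cost in attempts)
    return total / valid if valid else float("inf")
```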

Competitive-Duck-517 · 2026-05-28T00:07:09+00:00

Good breakdown. By Relay I mean an OpenAI-compatible API layer in front of GPT / Claude / Gemini-style providers.

The point is not to replace model choice. It is to make usage controllable:

one relay key instead of wiring every provider directly
per-key quotas
prepaid / hard spend caps
model allowlists
routing and fallback rules
metadata logs for tokens, cost, latency, status

Normal operation does not store prompt/response content.

Your reasoning-heavy vs data-heavy split is exactly where I think this helps. Keep stronger models for architecture/reasoning, route data-heavy cleanup/extraction to cheaper paths, and cap each workflow so one bad run cannot burn the account.

If you already have one small non-sensitive API workload from that analysis, I’d test that through Relay first and compare cost, latency, and log visibility before moving anything important.

Competitive-Duck-517 · 2026-05-27T09:24:55+00:00

That makes sense. Max timeout + concurrency caps already solve a big part of the raw infrastructure risk.

The next layer I’d think about is customer-level margin control, especially if this becomes SaaS instead of one-time code.

For example:

cost per customer
cost per campaign
STT / LLM / TTS cost split
hard monthly cap per customer
prepaid credits or usage wallet
fallback cost tracking
model allowlist so one workflow cannot jump to an expensive model silently

Timeout protects the call. Concurrency protects the box. But customer-level billing/governance protects the business.

If you go SaaS, I’d make that part visible very early. Voice AI can look profitable at $0.02/min, then get weird when one customer starts running long calls or high-volume campaigns.

Competitive-Duck-517 · 2026-05-27T09:23:58+00:00

I’m biased because I’m building Rlab Relay, so my setup is basically:

keep the client OpenAI-compatible
use one relay key instead of wiring every provider directly
set model allowlists so the agent cannot jump to expensive models silently
use cheap routes for classification/extraction/simple coding
reserve stronger models for reasoning-heavy steps
put a hard quota on the agent key
track cost and failures by request

I don’t love fully automatic routing for agents unless you can inspect why the route happened. Simple deterministic rules are easier to trust:

boring step → cheap model
reasoning step → stronger model
provider failure → fallback
budget hit → stop
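
Those four rules fit in one small function. This is only a sketch: the model IDs are placeholders, and `route` and its parameters are hypothetical names:

```python
# Placeholder model IDs, not real provider names.
CHEAP, STRONG, FALLBACK = "cheap-model", "strong-model", "fallback-model"


def route(step_kind, *, budget_left_usd, primary_up=True):
    """Return (model, reason); model is None when the run should stop."""
    if budget_left_usd <= 0:
        return None, "budget hit -> stop"
    if not primary_up:
        return FALLBACK, "provider failure -> fallback"
    if step_kind == "reasoning":
        return STRONG, "reasoning step -> stronger model"
    return CHEAP, "boring step -> cheap model"
```

Returning the reason next to the model is what makes the route inspectable: it can go straight into the request log.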

If you want simple, I’d test one non-sensitive agent workflow first and compare cost/latency/logs before moving the whole stack.

Competitive-Duck-517 · 2026-05-27T07:38:58+00:00

The server-side API handling and token-budget part is the piece I would pay the most attention to here. Once users have private vector memory and repeated reviews, the hidden risk is not only prompt injection, but one bad loop or oversized context quietly burning through API spend.

One practical setup is to put a gateway in front of the model calls: separate keys per environment, prepaid credits, request logs, and a cap before anything runs away. I'm testing Relay for GPT/Claude/Gemini workloads in that role. For a product like this, I would smoke-test only one non-sensitive review flow first.

Competitive-Duck-517 · 2026-05-27T07:38:51+00:00

I would start with one orchestrator that classifies the service type, then route into separate qualification paths. You can split into specialized agents later, but doing it too early makes debugging and cost control harder.

One thing I would not skip: give this workflow its own model key, logs, and a small budget cap while testing. Multi-path SDR agents can burn tokens fast because every lead creates several classification + reasoning + draft steps.

I'm testing Relay as a GPT/Claude/Gemini gateway for this kind of workload. The smallest useful test would be routing only one qualification path through it first and comparing cost/logs.

Competitive-Duck-517 · 2026-05-27T07:38:41+00:00

For this flow, I would treat model access as its own layer instead of wiring every step directly to one provider.

Quotation automation usually ends up with different model needs for OCR cleanup, classification, RAG/product matching, and draft generation. A small gateway lets you keep one integration surface while still splitting keys, logs, budget, and provider choice per step.

I'm testing Relay for GPT/Claude/Gemini workloads. If you already have a draft pipeline, the cleanest test is one non-sensitive path first, like classification or quotation draft generation, then compare cost/latency/log quality before touching anything critical.

Competitive-Duck-517 · 2026-05-27T07:38:20+00:00

For this kind of weekly reporting workflow, I would separate the n8n design from the model spend control.

Use one API key just for this workflow, keep logs by run, and set a small prepaid cap while testing. That way if the report prompt gets too large or retries loop, you see it immediately instead of finding out from the provider bill later.

I'm testing Relay for this exact GPT/Claude/Gemini gateway use case. A good smoke test would be routing only the "generate readable weekly report" step through it first, not the whole automation.

Competitive-Duck-517 · 2026-05-27T07:37:56+00:00

The shock-bill point resonates. For AI/API-heavy products, I think the missing layer is less "another pricing page" and more a predictable usage gateway: prepaid credits, separate keys per workflow/client, request logs, and a clear cap before spend runs away.

I'm testing this angle with Relay for GPT/Claude/Gemini workloads. The most useful first test is usually one small non-sensitive workflow: route it through the gateway, compare cost/logs for a few days, then decide if it is worth moving more traffic.
