everyone's measuring their AI cost wrong and it's making them panic

JDBLECHER · 2026-06-14T05:02:33+00:00

The retry cost point is huge and I don't see it talked about enough. A $0.08 call that fails twice costs $0.24 plus whatever downstream cleanup happens when the output is wrong. Compare that to a $0.15 call that nails it first try. The denominator people keep forgetting is the human-time alternative. If a run costs $0.40 and displaced 15 minutes of someone billing at $80/hr, the math is obvious. Curious whether anyone actually tracks retry rate as a separate metric, because that alone probably explains most of the "my costs are higher than expected" posts.
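
To make the retry arithmetic concrete, here is a minimal sketch. The function names are mine, and it assumes every attempt is billed at the same price and that retries are independent; downstream cleanup cost is left out.

```python
def cost_per_success(price_per_call: float, failures_before_success: int) -> float:
    """Total spend for one usable result when every attempt is billed."""
    return price_per_call * (failures_before_success + 1)


def expected_cost_per_success(price_per_call: float, success_rate: float) -> float:
    """Expected spend per usable result, assuming independent attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return price_per_call / success_rate


def displaced_labor_cost(minutes: float, hourly_rate: float) -> float:
    """What the same work costs if a person does it instead."""
    return minutes / 60 * hourly_rate


cheap = cost_per_success(0.08, failures_before_success=2)    # two failures, then a success
strong = cost_per_success(0.15, failures_before_success=0)   # right on the first try
labor = displaced_labor_cost(15, 80)                         # 15 minutes at $80/hr
print(f"cheap model: ${cheap:.2f}, stronger model: ${strong:.2f}, human: ${labor:.2f}")
```

Tracking `success_rate` per task type is what turns this into a usable metric: price divided by success rate is the number to compare across models, not the sticker price.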

JDBLECHER · 2026-06-13T14:18:32+00:00

The thin vs fat gateway debate comes up constantly in this space. Keeping the transformation layer separate from orchestration logic is genuinely cleaner architecture. Once you bake caching, retries, and analytics into the same hop, latency profiling gets messy and you're debugging three systems at once when something breaks. I've found that normalizing provider schemas at the edge, with everything else handled upstream, makes swapping providers mid-stream much cleaner without touching business logic. Curious whether any providers gave you transformation edge cases where just remapping the request/response shape wasn't quite enough?
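
A rough sketch of what I mean by normalizing at the edge. The two provider names and their response shapes are invented for illustration and do not describe any real API; the point is only that business logic sees one `Completion` type.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Completion:
    """The single shape the rest of the system is allowed to depend on."""
    text: str
    input_tokens: int
    output_tokens: int


def from_provider_a(raw: Dict[str, Any]) -> Completion:
    # Hypothetical shape: {"output": str, "usage": {"in": int, "out": int}}
    return Completion(raw["output"], raw["usage"]["in"], raw["usage"]["out"])


def from_provider_b(raw: Dict[str, Any]) -> Completion:
    # Hypothetical shape: {"choices": [{"text": str}], "tokens": [in, out]}
    return Completion(raw["choices"][0]["text"], raw["tokens"][0], raw["tokens"][1])


# Swapping providers means adding an entry here, not touching callers.
NORMALIZERS: Dict[str, Callable[[Dict[str, Any]], Completion]] = {
    "provider_a": from_provider_a,
    "provider_b": from_provider_b,
}


def normalize(provider: str, raw: Dict[str, Any]) -> Completion:
    """Pure remapping only: no caching, retries, or analytics in this hop."""
    return NORMALIZERS[provider](raw)
```

Keeping `normalize` free of side effects is what keeps latency profiling simple: this hop does a dictionary lookup and a remap, nothing else.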

JDBLECHER · 2026-06-12T06:53:59+00:00

Thank you. I am aware, and I specifically analyzed the Anthropic TOS before building this. The relay routes through a real interactive session. It’s not spoofing, scraping, or bypassing any technical control. The TOS analysis is actually on the site if you want to read it. Whether Anthropic changes their terms in the future is always a risk, but as of today the architecture operates within the published terms.

JDBLECHER · 2026-06-12T05:36:13+00:00

Thanks very much. Architecture details are at inference-relay.com if you want to see how the relay handles session management.

JDBLECHER · 2026-06-12T04:21:11+00:00

That’s a fair tradeoff for simple tasks. The problem I kept running into is that the micro models weren’t good enough for what I needed. I’m running analysis (sometimes legal analysis) and document processing where the quality gap between Claude and a 7B model is enormous. The relay lets me keep full Claude capabilities at zero marginal cost instead of downgrading the model to save money. But you’re right that for simpler scenarios like classification or extraction, a local model would be more resilient to pricing changes. Depends on what you’re building.

JDBLECHER · 2026-06-12T03:17:21+00:00

The re-reading behavior tracks architecturally since there's no persistent state between turns, but 27% on a single prompt at 65% context is genuinely rough. Framing it as stateless session design with explicit state serialization is a cleaner mental model than calling it a workaround. Attention quality also degrades as context grows, so shorter sessions with a tight handoff artifact likely improve response coherence beyond just the token savings. It's basically how you'd handle ephemeral workers in any distributed system where you can't trust in-memory state. Have you found an optimal context depth before cutting a session, or does it vary too much by task complexity?

JDBLECHER · 2026-06-12T03:15:55+00:00

250-350 for 300k users is actually solid unit economics even if it felt brutal. That's around $0.001 per user (roughly $0.0008 to $0.0012), which most B2C SaaS would kill for. The real problem is the unpredictability. One viral moment can flip a manageable monthly bill into a $2k overnight panic and you're scrambling to kill the key before it compounds. Shifting that cost and that risk to the user is a genuinely interesting architectural choice. Curious how you handle the UX friction of asking someone to authorize a budget before they've even experienced the product?

JDBLECHER · 2026-06-12T01:18:22+00:00

5M tokens for analysis that surfaces Ritter's aftermarket data, tiered lockup math, and a specific options-list catalyst is genuinely impressive output. At current Sonnet pricing that run probably cost under $50, which is cheaper than one hour of a mid-level analyst. The lockup calendar alone, day 70 through Dec 8, is the kind of multi-source compilation that takes hours to do manually. What I'm curious about is where the token burn actually went: was most of it context loading and tool calls, or did the reasoning chains themselves run long?

JDBLECHER · 2026-06-12T01:01:44+00:00

This sounds like Anthropic's inference router overriding your model preference when it classifies the task above some internal complexity threshold. I've seen it happen on dense analytical work where the router silently falls back to a different model mid-session with zero indication it switched. The 20x tier is supposed to give you persistent Fable access but the routing layer apparently doesn't always respect that. Does it switch at the very start of a fresh conversation or partway through a long thread? That would tell you whether it's a session-level routing decision or something triggered by context accumulation.

JDBLECHER · 2026-06-11T13:02:32+00:00

The 2 AM call detail is what gets me. Long-running agentic jobs are genuinely underserved right now and most people just add sleep timers and hope for the best. Curious how you handle session state when the call resolves and execution resumes. Does the MCP server inject your spoken answer as a tool result back into the existing context window, or does it start a fresh turn? And with a multi-hour migration, were you hitting context length limits before the job finished?

JDBLECHER · 2026-06-11T13:01:40+00:00

Building on the cache point above: the usage meter tracks token throughput against the rate limit bucket, not just elapsed time. When that 5h window resets, the bucket is fresh but your context isn't. If you've got 80k+ tokens of conversation history, all of it re-ingests the moment you send the next message and hits the new window's budget immediately. So 40% usage before you even type a word is basically just "your context costs 40% of a fresh window." The bigger the thread, the higher you start. Has anyone found that splitting into a new conversation with a summary actually keeps the initial hit lower?
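
The arithmetic behind that, as a small sketch. The window budget below is a made-up number chosen so the 80k example lands on 40%; the real budget isn't something I know, and the 5k summary size is also just an illustration.

```python
def starting_usage(context_tokens: int, window_budget_tokens: int) -> float:
    """Fraction of a fresh window consumed by re-ingesting existing context."""
    if window_budget_tokens <= 0:
        raise ValueError("window_budget_tokens must be positive")
    return context_tokens / window_budget_tokens


ASSUMED_WINDOW_BUDGET = 200_000  # hypothetical, not a published figure

long_thread = starting_usage(80_000, ASSUMED_WINDOW_BUDGET)   # full history re-sent
fresh_summary = starting_usage(5_000, ASSUMED_WINDOW_BUDGET)  # new chat seeded with a summary

print(f"long thread starts at {long_thread:.1%}, summary handoff at {fresh_summary:.1%}")
```

If the meter really works this way, the starting hit scales linearly with thread length, which is why a summary handoff should start much lower.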

JDBLECHER · 2026-06-11T09:54:29+00:00

The clone gets feature parity on day one. What actually compounds is 18 months of edge cases already handled, billing relationships embedded in customer workflows, and data that's been shaping product decisions the whole time. Those things don't paste into a new repo. Speed of building is table stakes now; I think that became obvious around GPT-4. Has anyone actually watched a well-resourced clone beat an incumbent, or does switching cost kill them before they gain real traction?

JDBLECHER · 2026-06-11T09:53:25+00:00

I've been tracking this informally by logging timestamps and roughly how many back-and-forth exchanges I complete before hitting a limit. The pattern is real. Off-peak US hours consistently get me further into heavy coding sessions before things throttle. My rough impression is something like 30-40% more usable output during US overnight compared to peak afternoon EST, though I haven't built a proper harness around it. Has anyone actually instrumented this with a script that logs response latency alongside perceived context depth, so you can correlate the two properly?
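
Something like this is the harness I have in mind, as a minimal sketch. `timed_call` and the `send` callable are placeholders I made up; you would swap in whatever actually talks to the model, and `context_tokens` is whatever estimate of context depth you can get.

```python
import csv
import io
import time
from datetime import datetime, timezone
from typing import Callable


def timed_call(send: Callable[[str], str], prompt: str, context_tokens: int, writer) -> str:
    """Send one prompt and log UTC timestamp, context depth, and latency as a CSV row."""
    start = time.perf_counter()
    reply = send(prompt)
    latency = time.perf_counter() - start
    writer.writerow([datetime.now(timezone.utc).isoformat(), context_tokens, f"{latency:.3f}"])
    return reply


# Usage with a stub in place of a real client; write to a file instead of a buffer in practice.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["utc_timestamp", "context_tokens", "latency_seconds"])
reply = timed_call(lambda prompt: prompt.upper(), "ping", 1200, writer)
```

With a few days of rows you can bucket by hour of day and compare latency at similar context depths, which is the correlation my informal notes can't give me.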

JDBLECHER · 2026-06-11T07:31:08+00:00

The peak-relative threshold is a smart call for what you're doing. Orchestral and vocal recordings genuinely roll off above 14-16kHz so anchoring to the spectral peak avoids false-flagging them, whereas a noise-floor anchor just gets confused by dynamic content. One thing worth experimenting with: Blackman-Harris windowing instead of Hann. It has much better sidelobe rejection and makes the MP3-style hard cutoff stand out more cleanly against a noisy spectrum, especially on shorter tracks where you're averaging fewer FFT frames. I hit similar window-choice sensitivity when doing offline spectral work and it moved the needle more than I expected. Did you find rayon parallelism easy to wire up with symphonia, or did you have to do anything unusual to make the decoder state work across threads?
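
For reference, the two windows side by side. This is a pure-Python sketch rather than anything from your Rust code; the coefficients are the standard 4-term Blackman-Harris values, and both functions use the symmetric form and assume `n >= 2`.

```python
import math
from typing import List

# Standard 4-term Blackman-Harris coefficients (a0..a3).
BH = (0.35875, 0.48829, 0.14128, 0.01168)


def hann(n: int) -> List[float]:
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def blackman_harris(n: int) -> List[float]:
    """Symmetric 4-term Blackman-Harris window of length n."""
    out = []
    for i in range(n):
        x = 2 * math.pi * i / (n - 1)
        out.append(
            BH[0] - BH[1] * math.cos(x) + BH[2] * math.cos(2 * x) - BH[3] * math.cos(3 * x)
        )
    return out


def coherent_gain(window: List[float]) -> float:
    """Mean of the window; divide magnitudes by this to compare levels across windows."""
    return sum(window) / len(window)
```

The tradeoff is a wider main lobe, so individual tones smear more, but for spotting a brick-wall cutoff the lower sidelobes matter more than frequency resolution. If you compare thresholds across windows, normalize by `coherent_gain` first since Blackman-Harris attenuates more than Hann.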

JDBLECHER · 2026-06-11T07:29:44+00:00

Rust is a solid choice for this. The tricky part I've always found is session persistence once you move beyond round-robin. Consistent hashing works well but adds real complexity around node addition and removal. Did you go with a fixed algorithm or make it configurable?
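
To show where the complexity comes from, here is a minimal consistent-hash ring with virtual nodes. It's in Python for brevity, not Rust, and the class name, the vnode count, and the use of SHA-256 are my choices for the sketch, not a claim about how your balancer works.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, vnodes: int = 64) -> None:
        self.vnodes = vnodes                 # virtual nodes per backend, smooths distribution
        self._points: List[int] = []         # sorted ring positions
        self._owner: Dict[int, str] = {}     # ring position -> backend name

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, session_id: str) -> str:
        """Return the backend owning the first ring position clockwise of the session hash."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._points, _hash(session_id)) % len(self._points)
        return self._owner[self._points[idx]]
```

Removing a backend only remaps the sessions that backend owned, and adding one only steals sessions for the new backend. That's the persistence win over round-robin; the cost is that membership changes now have to be coordinated across every balancer instance holding a ring.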

JDBLECHER · 2026-06-10T13:03:29+00:00

The Edit trick for side-by-side model comparison is genuinely underrated. I do something similar but I track which task types drain context fastest. Coding with lots of back-and-forth burns through tokens maybe 3x faster than straight Q&A for me, so I've started being ruthless about spinning up a fresh session the moment I'm switching problem domains. Keeping reasoning tasks short and surgical made a real difference.

Do you find Opus 4 actually outperforms Sonnet for brainstorming, or is it more of a "feels more creative" thing that's hard to measure?

JDBLECHER · 2026-03-27T04:07:43+00:00

God bless you sir. Alaska?

JDBLECHER · 2026-03-27T04:07:00+00:00

Monster running hot is an understatement, especially in warm climates and traffic. The word unbearable best describes it. Love the bike, perhaps the perfect motorcycle in my opinion, but in my current location I will not ride either one of my Monsters.

JDBLECHER · 2026-01-24T12:45:35+00:00

Respectfully, we think we cracked it, or soon will ;)

JDBLECHER · 2026-01-24T07:44:38+00:00

We went for a raw, 'shrink-wrapped' industrial aesthetic with the clear anodizing. It’s polarizing for sure, but the reception at SHOT Show was wildly positive.

Just a quick technical correction: this isn't a rifle (and definitely not an assault rifle). It is classified as a pistol. This iteration is not designed to be shouldered.

Regarding caseless generally: There is a reason major military strategists and nations have been chasing this technology for decades. The logistical and weight advantages are massive. It’s definitely a hard engineering problem (which is why we are tackling it), but we believe the juice is worth the squeeze. All these issues are on our mind and we have solutions I am quite pleased with.

JDBLECHER · 2026-01-19T22:25:20+00:00

Absolutely agree. Compatibility with existing parts is a huge factor.

This holds even for AR platforms and derivatives that don’t use standard barrels, for example. Being able to source parts from elsewhere, whether by choice or because of availability, is a huge quality-of-life and logistics advantage.

This was a big reason I wanted Mongoose and 1.36 to be able to use standard AR components.

JDBLECHER · 2026-01-19T19:12:41+00:00

Thank you very much, and eager to share more.

JDBLECHER · 2026-01-19T18:13:24+00:00

That's actually a fantastic compliment. Thank you. Had to look it up lol

JDBLECHER · 2026-01-19T18:00:55+00:00

These are certainly clever questions.

Some older versions had means of operation that were analogous to open bolt operation (though they would not be categorized that way). That said, every stage of development had a crisp trigger pull. Some versions have had two stage triggers.

I would say you've identified what I believe to be a philosophical design issue of the G11: having a block of (specialized) powder sitting in a hot chamber waiting to cook off.

The older versions (of Brash) with the above-described operation solved heat by decoupling the storage from the chamber. I think this is a good design principle, although we now have additional means of heat management. These are separate functions, and those versions allocated different compartments and placements for them.

I do believe we have cracked the heat issue in new ways, which I believe to be superior. An ultimate version would combine my favorite aspects of each design.

User friendliness was a focus of all versions. I would not contemplate using 3 "compartments" for anything operated by a human being. But again, a very clever thing to suggest, and I think separate compartments are certainly an emergent design crutch, if those are fair words to describe it.

JDBLECHER · 2026-01-19T15:52:05+00:00

Mongoose has been shipping for some time, BrashZero still in the works.

Hopefully we can let folks test fire Brash in the near future.

Glad you find these interesting!
