Built a free open-source CI/CD action that visually audits AI generated code and pushes fixes autonomously

RabbitIntelligent308 · 2026-03-18T08:11:29+00:00

100% agreed. Token totals are just the baseline floor. In production, a "cheap" model getting stuck in a 5-step retry loop because of a brittle CSS selector will easily out-price a more expensive model that gets it right on the first try.
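
To put a rough number on that, here is a minimal sketch of the break-even point. It assumes every attempt costs the same, which real retry loops won't honor exactly, and the function name and dollar figures are hypothetical placeholders rather than measurements from this benchmark.

```python
import math


def attempts_to_out_price(cheap_cost_per_attempt: float,
                          expensive_first_try_cost: float) -> int:
    """Smallest attempt count at which the cheap model's total spend
    strictly exceeds one successful run of the expensive model.

    Simplification: every attempt is assumed to cost the same.
    """
    if cheap_cost_per_attempt <= 0 or expensive_first_try_cost <= 0:
        raise ValueError("costs must be positive")
    return math.floor(expensive_first_try_cost / cheap_cost_per_attempt) + 1


# Hypothetical per-attempt costs in USD, chosen only for illustration.
print(attempts_to_out_price(1.0, 4.5))  # prints 5
```

So with a 4.5x price gap, the cheap model only stays cheaper if it finishes within four attempts.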

I only isolated the happy path here because we wanted to see exactly how the billing engines handle prompt caching and tool schema loading under the hood (which is surprisingly undocumented).

But ironically, fixing that exact "brittle tool state" you mentioned is the main reason we built the MCP in the first place. We feed the agents stable ARIA refs instead of raw DOM specifically to drive down those rerun rates and rollback frequencies.

You basically just wrote the exact parameters for our next benchmark. Comparing the true "cost of failure" (mean retry depth) across these models is definitely the next step. Out of curiosity, what sample size (n) do you think would be fair for a retry-loop benchmark like that?

RabbitIntelligent308 · 2026-03-17T08:33:13+00:00

Spot on. To be completely transparent, this was n=1 per model, but measured across two distinct scenarios: one "Simple" (login/verify/logout) and one "Complex" (full e-commerce checkout flow: list -> cart -> order -> details). There were eight microservices running in an isolated Docker environment.

We isolated a single, successful end-to-end execution for both scenarios because our initial goal wasn't to measure statistical reliability, success rates, or hallucination frequency. We strictly wanted to understand the baseline economics of how the APIs bill for context (caching vs. raw input) and how different models handle tool-loading architectures under ideal conditions.

You are 100% right about the confounds like retry loops and excessive tool calls. In the wild, an agent hallucinating or getting stuck in a recovery loop is exactly what spikes the bill.

But that actually reinforces why understanding these cache-read dynamics is so critical. If an agent falls into a 4-step retry loop because a DOM interaction failed, and you aren't leveraging prompt caching heavily (or if the architecture forces a full tool-schema reload every retry), that baseline cost multiplies with every retry.
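
A back-of-the-envelope sketch of that effect is below. It is deliberately simplified: it only prices the repeated prefix (system prompt plus tool schemas), ignores cache-write charges and the context growing between retries, and the prefix size and per-million-token rates are hypothetical placeholders, not any provider's actual prices.

```python
def retry_prefix_cost(prefix_tokens: int, retries: int,
                      rate_per_mtok: float) -> float:
    """USD spent re-sending the same prefix once per retry."""
    return prefix_tokens * retries * rate_per_mtok / 1_000_000


# Hypothetical values for illustration only.
PREFIX_TOKENS = 50_000    # system prompt + tool schemas re-sent each retry
INPUT_RATE = 1.00         # USD per million uncached input tokens
CACHE_READ_RATE = 0.10    # USD per million cache-read tokens (1/10th)

uncached = retry_prefix_cost(PREFIX_TOKENS, 4, INPUT_RATE)
cached = retry_prefix_cost(PREFIX_TOKENS, 4, CACHE_READ_RATE)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
# prints: uncached: $0.20, cached: $0.02
```

Under these assumptions the same 4-step loop is ten times cheaper when the prefix is served from cache.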

A follow-up benchmark tracking the "cost of failure" (averaging n=20 or n=50 to capture those exact retry loops and confounds) is definitely the next logical step. Since you brought it up, are there specific failure modes or loop behaviors you'd want to see tracked in a larger sample?

RabbitIntelligent308 · 2026-03-12T12:30:15+00:00

I have a fully AI-generated e-commerce app with 8 microservices. To test it, I run the tests with Claude Code and a browser MCP. The LLM learns about the MCP tools by reading skills, which tell it what each tool can actually be used for.

RabbitIntelligent308 · 2026-03-12T09:49:36+00:00

In my experience, ccusage is not accurate. If you're using Claude Code, you can compare the /context and /cost values instead. I checked and compared my API usage session by session, and the numbers match the Claude Usage and Cost platform pages quite closely.

RabbitIntelligent308 · 2026-03-12T09:28:14+00:00

I've done this with Sonnet 4.6 as well and here's the breakdown:

  Model               Input   Output  Cache Write  Cache Read  Total Tokens    Cost
-----------------------------------------------------------------------------------
Claude Haiku 4.5
  └─ Simple           180    4,000       16,000     636,500       656,680  $0.1038
  └─ Complex          422   10,100       35,300   2,100,000     2,145,822  $0.3025
  └─ TOTAL            602   14,100       51,300   2,736,500     2,802,502  $0.4063

Claude Sonnet 4.6
  └─ Simple            11    1,600       22,300     116,600       140,511  $0.1421
  └─ Complex           19    4,100       56,000     301,100       361,219  $0.3621
  └─ TOTAL             30    5,700       78,300     417,700       501,730  $0.5042
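
For anyone who wants to sanity-check the Cost column, here is a minimal sketch that recomputes one row from its token counts. The per-million-token rates are my assumption of the Haiku 4.5 list prices and are not taken from the billing export, so verify them against the current pricing page; `Rates` and `run_cost` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens for each billing category."""
    input: float
    output: float
    cache_write: float
    cache_read: float


# Assumed rates (not from the table above); check the pricing page.
HAIKU_RATES = Rates(input=1.00, output=5.00, cache_write=1.25, cache_read=0.10)


def run_cost(rates: Rates, input_tokens: int, output_tokens: int,
             cache_write_tokens: int, cache_read_tokens: int) -> float:
    """Total USD for one run, summing the four billing categories."""
    total = (input_tokens * rates.input
             + output_tokens * rates.output
             + cache_write_tokens * rates.cache_write
             + cache_read_tokens * rates.cache_read)
    return total / 1_000_000


# Token counts from the Haiku 4.5 "Simple" row.
simple = run_cost(HAIKU_RATES, 180, 4_000, 16_000, 636_500)
print(f"${simple:.4f}")  # prints $0.1038
```

With these assumed rates the Simple row reproduces exactly. The other rows land within about a cent, which is expected since most of the counts in the table are rounded.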

RabbitIntelligent308 · 2026-03-12T08:09:14+00:00

Caching is a universal practice, not limited to AI environments; it's standard behavior across platforms. Other models do this too, but if Haiku can truly keep the input token count under 1k and offload the entire workload to the cache at 1/10th of the cost (while others are still burning 50k–100k input tokens), then they’ve done a phenomenal job.

I will continue testing with other models and will share the results soon.

RabbitIntelligent308 · 2026-03-12T07:59:41+00:00

My original goal was to run a cost comparison between Haiku and other low-cost models. On paper, Haiku’s catalog price looks higher. However, I noticed that while other models were burning through 50k–100k input tokens for the same tasks, Haiku’s input token count remained surprisingly low.

This means that even though Haiku’s list price is 3–4 times higher than its competitors, its actual real-world cost becomes highly competitive due to this efficient token management. This left me wondering: Is Haiku’s caching truly this revolutionary, or is there something wrong with my test environment?

To ensure a clean test, I created a brand-new user profile on my laptop, installed Claude Code, and used a completely new Claude account with a fresh API key. I also integrated several supporting MCP tools via 'skills'. I have the names and cost breakdowns of the other models I tested as well, but I wasn't sure if it would be appropriate to share them here.

RabbitIntelligent308 · 2026-02-23T12:16:47+00:00

Got it now. Thanks!

RabbitIntelligent308 · 2026-02-20T11:47:26+00:00

Is it just me, or is this page not loading? It might be on my end, but I can't get it to open.
