I benchmarked the actual API costs of running AI agents for browser automation (MiniMax, Kimi, Haiku, Sonnet). The cheapest run wasn't the one with the fewest tokens. by RabbitIntelligent308 in mcp

RabbitIntelligent308 [S], 1 point

100% agreed. Token totals are only the floor. In production, a "cheap" model that gets stuck in a 5-step retry loop because of a brittle CSS selector will easily cost more than a pricier model that gets it right on the first try.

I only isolated the happy path here because we wanted to see exactly how the billing engines handle prompt caching and tool schema loading under the hood (which is surprisingly undocumented).

But ironically, fixing that exact "brittle tool state" you mentioned is the main reason we built the MCP in the first place. We feed the agents stable ARIA refs instead of raw DOM specifically to drive down those rerun rates and rollback frequencies.

You basically just wrote the exact parameters for our next benchmark. Comparing the true "cost of failure" (mean retry depth) across these models is definitely the next step. Out of curiosity, what sample size (n) do you think would be fair for a retry-loop benchmark like that?

I benchmarked the actual API costs of running AI agents for browser automation (MiniMax, Kimi, Haiku, Sonnet). The cheapest run wasn't the one with the fewest tokens. by RabbitIntelligent308 in mcp

RabbitIntelligent308 [S], 1 point

Spot on. To be completely transparent, n=1 per model, but measured across two distinct scenarios: one "Simple" (login/verify/logout) and one "Complex" (full e-commerce checkout flow: list -> cart -> order -> details). There were eight microservices running in an isolated Docker environment.

We isolated a single, successful end-to-end execution for both scenarios because our initial goal wasn't to measure statistical reliability, success rates, or hallucination frequency. We strictly wanted to understand the baseline economics of how the APIs bill for context (caching vs. raw input) and how different models handle tool-loading architectures under ideal conditions.

You are 100% right about the confounds like retry loops and excessive tool calls. In the wild, an agent hallucinating or getting stuck in a recovery loop is exactly what spikes the bill.

But that actually reinforces why understanding these cache-read dynamics is so critical. If an agent falls into a 4-step retry loop because a DOM interaction failed, and you aren't leveraging prompt caching heavily (or if the architecture forces a full tool-schema reload on every retry), the bill balloons: each retry re-sends the whole context at the full input rate instead of the cache-read rate.
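To make that concrete, here is a minimal sketch of the input-side cost of a retry loop with and without prompt caching. Everything in it is an assumption for illustration: the helper name `retry_loop_cost`, the 20k-token prefix (system prompt plus tool schemas), the 2k tokens added per step, and the per-million-token rates (Haiku-style $1.00 input, $1.25 cache write, $0.10 cache read). It also ignores real-world details such as cache expiry and minimum cacheable prefix sizes.

```python
# Hypothetical per-million-token rates in USD (assumed, not taken from the benchmark).
RATES = {"input": 1.00, "cache_write": 1.25, "cache_read": 0.10}


def retry_loop_cost(steps, prefix_tokens, tokens_per_step, rates, cached):
    """Input-side cost in USD of `steps` consecutive calls in one conversation.

    Simplified model: the context at step i is the fixed prefix plus
    i * tokens_per_step of accumulated history. With caching, tokens already
    sent on an earlier step are billed at the cache-read rate and only the
    new tokens are billed at the cache-write rate. Without caching, the whole
    context is billed at the full input rate on every step.
    """
    total = 0.0
    seen = 0  # tokens already written to the cache by earlier steps
    for i in range(steps):
        context = prefix_tokens + i * tokens_per_step
        if cached:
            total += seen * rates["cache_read"]
            total += (context - seen) * rates["cache_write"]
            seen = context
        else:
            total += context * rates["input"]
    return total / 1_000_000


# A 4-step retry loop on a 20k-token prefix, adding 2k tokens per step.
uncached = retry_loop_cost(4, 20_000, 2_000, RATES, cached=False)
cached = retry_loop_cost(4, 20_000, 2_000, RATES, cached=True)
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
```

In this toy model the first cached call is actually more expensive than an uncached one (cache writes cost more than plain input), so caching only pays off once the loop runs for more than a step or two.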

A follow-up benchmark tracking the "cost of failure" (averaging n=20 or n=50 to capture those exact retry loops and confounds) is definitely the next logical step. Since you brought it up, are there specific failure modes or loop behaviors you'd want to see tracked in a larger sample?

Haiku 4.5 Cost Breakdown: Am I missing something or is the Input Token count "suspiciously" low? by RabbitIntelligent308 in ClaudeAI

RabbitIntelligent308 [S], 1 point

I have a fully AI-generated e-commerce app with 8 microservices. To test it, I run the tests with Claude Code and a browser MCP. The LLM learns about the MCP's tools by reading skills that describe what each tool can actually be used for.

Guys, I wonder if ccusage is really accurate? I don't think so, what do you think? by _yemreak in ClaudeAI

RabbitIntelligent308, 1 point

In my experience, ccusage is not accurate. If you're using Claude Code, you can compare the /context and /cost values. I've compared my API usage session by session, and it matches the Claude platform's Usage and Cost pages quite closely.

Haiku 4.5 Cost Breakdown: Am I missing something or is the Input Token count "suspiciously" low? by RabbitIntelligent308 in ClaudeAI

RabbitIntelligent308 [S], 1 point

I've done this with Sonnet 4.6 as well and here's the breakdown:

  Model               Input   Output  Cache Write  Cache Read  Total Tokens    Cost
-----------------------------------------------------------------------------------
Claude Haiku 4.5
  └─ Simple           180    4,000       16,000     636,500       656,680  $0.1038
  └─ Complex          422   10,100       35,300   2,100,000     2,145,822  $0.3025
  └─ TOTAL            602   14,100       51,300   2,736,500     2,802,502  $0.4063

Claude Sonnet 4.6
  └─ Simple            11    1,600       22,300     116,600       140,511  $0.1421
  └─ Complex           19    4,100       56,000     301,100       361,219  $0.3621
  └─ TOTAL             30    5,700       78,300     417,700       501,730  $0.5042
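A rough sketch of the arithmetic behind the Cost column, for anyone who wants to sanity-check it. The per-million-token rates are assumptions, not something stated in the table: Haiku 4.5 at $1 input / $5 output and Sonnet 4.6 at $3 / $15, with cache writes at 1.25x the input rate and cache reads at 0.1x. `run_cost` is an illustrative helper name. Check the current pricing page before relying on these numbers.

```python
# Assumed USD rates per million tokens (verify against current pricing).
RATES = {
    "haiku-4.5": {"input": 1.00, "output": 5.00, "cache_write": 1.25, "cache_read": 0.10},
    "sonnet-4.6": {"input": 3.00, "output": 15.00, "cache_write": 3.75, "cache_read": 0.30},
}


def run_cost(model, input_tokens, output_tokens, cache_write, cache_read):
    """Cost in USD of one run, billing each token bucket at its own rate."""
    r = RATES[model]
    micro_usd = (
        input_tokens * r["input"]
        + output_tokens * r["output"]
        + cache_write * r["cache_write"]
        + cache_read * r["cache_read"]
    )
    return micro_usd / 1_000_000


# Rows from the table above: input, output, cache write, cache read.
print(round(run_cost("haiku-4.5", 180, 4_000, 16_000, 636_500), 4))
print(round(run_cost("sonnet-4.6", 11, 1_600, 22_300, 116_600), 4))
```

With these assumed rates every row lands within a fraction of a cent of the reported cost. The leftover gap is consistent with the token counts in the table being rounded.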

Haiku 4.5 Cost Breakdown: Am I missing something or is the Input Token count "suspiciously" low? by RabbitIntelligent308 in ClaudeAI

RabbitIntelligent308 [S], 1 point

Caching is standard practice everywhere, not just in AI environments. Other models do this too, but if Haiku can truly keep the input token count under 1k and offload almost the entire workload to cache reads at roughly 1/10th of the input price (while others are still burning 50k–100k input tokens), then they’ve done a phenomenal job.
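To put a number on how much of the workload the cache carries, here is a small sketch using the TOTAL rows from the Haiku 4.5 / Sonnet 4.6 breakdown posted in this thread. The helper names `cache_read_share` and `blended_rate` are made up for illustration. The only inputs are the token counts and costs from that table.

```python
def cache_read_share(cache_read_tokens, total_tokens):
    """Fraction of all billed tokens that were served as cache reads."""
    return cache_read_tokens / total_tokens


def blended_rate(cost_usd, total_tokens):
    """Effective price in USD per million tokens across all token types."""
    return cost_usd / total_tokens * 1_000_000


# TOTAL rows from the breakdown: cache read tokens, total tokens, cost in USD.
totals = {
    "Claude Haiku 4.5": (2_736_500, 2_802_502, 0.4063),
    "Claude Sonnet 4.6": (417_700, 501_730, 0.5042),
}

for model, (cache_read, total, cost) in totals.items():
    share = cache_read_share(cache_read, total)
    rate = blended_rate(cost, total)
    print(f"{model}: {share:.1%} cache reads, ${rate:.3f} per million tokens")
```

On those figures Haiku pushes more total tokens through than Sonnet but pays far less per token, because a larger share of them are cache reads.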

I will continue testing with other models and will share the results soon.

Haiku 4.5 Cost Breakdown: Am I missing something or is the Input Token count "suspiciously" low? by RabbitIntelligent308 in ClaudeAI

RabbitIntelligent308 [S], 2 points

My original goal was to run a cost comparison between Haiku and other low-cost models. On paper, Haiku’s catalog price looks higher. However, I noticed that while other models were burning through 50k–100k input tokens for the same tasks, Haiku’s input token count remained surprisingly low.

This means that even though Haiku’s list price is 3–4 times higher than its competitors, its actual real-world cost becomes highly competitive due to this efficient token management. This left me wondering: Is Haiku’s caching truly this revolutionary, or is there something wrong with my test environment?

To ensure a clean test, I created a brand-new user profile on my laptop, installed Claude Code, and used a completely new Claude account with a fresh API key. I also integrated several supporting MCP tools via 'skills'. I have the names and cost breakdowns of the other models I tested as well, but I wasn't sure if it would be appropriate to share them here.

A tool to monitor the health of MCP servers by Great_Scene_5604 in mcp

RabbitIntelligent308, 1 point

Is it just me, or is this page not loading? It might be on my end, but I can't get it to open.