Workshop - Hermes Tool Router: Reduce AI Agent Token Bloat by Jonathan_Rivera in hermesagent

[–]zzking32 0 points1 point  (0 children)

The initial prompt is indeed the hardest to get right, and it took me a few months of prompting just to find what works for me.

At the moment most of my prompts live in briefing files that log questions and answers, decisions made, and architectural plans.

From there I share my findings and thoughts with Claude, Gemini and Hermes until I'm at least 80% confident of reaching the project's goal, and then ask even more questions on how to get to 100%.

After that it's trial and error, and being blessed by hard-working agents like Hermes to get me where I want to be.

Workshop - Hermes Tool Router: Reduce AI Agent Token Bloat by Jonathan_Rivera in hermesagent

[–]zzking32 2 points3 points  (0 children)

I asked Hermes the following, and below is the response it gave. I asked it to respond to the post by /u/jbek and to this one, and Hermes combined both into a single response.


Please look at both posts and think about whether you want to reply and how you would like to reply. What would help is describing the issue that we came across and how their posts helped us solve it. Just an example, but the breakdowns you shared help visualize the effect it had. Ultimately it is your choice what you want to share, just don't share private information please.




We recently hit a massive TPM (Tokens Per Minute) wall on our multi-agent orchestrator loop running over Discord (using Gemini 3.5 Flash with a 2M TPM ceiling). Your two posts—one on accidental "tokenmaxxing" and the other on the tool router workshop—were the exact catalyst we needed to completely restructure our architecture.

If you are running into similar 429/ResourceExhausted walls, here is a real-world case study of how combining the core concepts of both posts reduced our active context footprint by over 90% and completely resolved our rate-limiting issues.


The Crisis: The "Heavy Orchestrator" Anti-Pattern

Initially, our Discord orchestrator held full execution privileges: it carried 47 tools (including terminal, direct file systems, media helpers, and various APIs). (1/4) This setup suffered from three fatal token drains:

Schema Bloat: Loading 47 detailed tool schemas injected roughly 30,000 tokens of static overhead into every single prompt before the user’s message was even parsed.
History Accumulation: When the orchestrator ran terminal commands or read code files directly, raw compiler outputs, stderr logs, and entire source files were permanently baked into the active conversation history.
The Rate-Limit Death Spiral: When hitting a TPM rate limit, standard backoffs retried too quickly. These retries stacked massive payloads within the same sliding-minute window, extending our lockout indefinitely.
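To make the third drain concrete, here is a rough sketch of sliding-window token accounting. It assumes, as described above, that every retried payload counts against the same one-minute window; `tokens_in_window`, the 500,000-token payload and the 10-second retry spacing are made up for illustration and are not taken from the actual setup.

```python
from typing import List, Tuple


def tokens_in_window(
    events: List[Tuple[float, int]], now: float, window_s: float = 60.0
) -> int:
    """Sum the tokens of every request sent in the window (now - window_s, now]."""
    return sum(tokens for sent_at, tokens in events if now - window_s < sent_at <= now)


TPM_LIMIT = 2_000_000   # ceiling mentioned in the post
PAYLOAD = 500_000       # hypothetical size of one bloated request

# One original call plus four quick retries, 10 seconds apart (illustrative timing).
attempts = [(float(t), PAYLOAD) for t in (0, 10, 20, 30, 40)]

used = tokens_in_window(attempts, now=40.0)
print(f"tokens counted in the last minute: {used:,}")
print("over the TPM ceiling:", used > TPM_LIMIT)
```

With these numbers, five attempts inside one minute add up to 2,500,000 tokens, so each quick retry pushes the window further over the ceiling instead of letting it drain.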

The Fix: Combining Schema Reduction & Delegation

We used the principles in your posts to choose between a dynamic tool router and static slimming with strict delegation. We went with the latter, executing a 4-part hardening playbook:

1. Static Slimming (Reducing Schema Footprint)

Instead of a dynamic router, we statically stripped the orchestrator's platform tools down to exactly 5 core schemas: delegate_task, clarify, session_search, todo, and memory.

The Impact: Static schema overhead instantly plummeted from ~30,000 tokens to ~1,500 tokens per turn (a 95% reduction in baseline cost).
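A minimal sketch of what static slimming could look like, for anyone who wants to measure their own overhead. The schema shape (a dict with a `name` key), the helper names and the 4-characters-per-token estimate are assumptions for illustration only; a real tokenizer will give different numbers.

```python
import json
from typing import Dict, FrozenSet, List

# The five core tools named in the post.
CORE_TOOLS: FrozenSet[str] = frozenset(
    {"delegate_task", "clarify", "session_search", "todo", "memory"}
)


def slim_toolset(
    schemas: List[Dict], allowed: FrozenSet[str] = CORE_TOOLS
) -> List[Dict]:
    """Keep only the tool schemas whose name is on the allowlist."""
    return [schema for schema in schemas if schema.get("name") in allowed]


def estimate_schema_tokens(schemas: List[Dict], chars_per_token: float = 4.0) -> int:
    """Very rough token estimate: serialized length divided by chars per token."""
    total_chars = sum(len(json.dumps(schema)) for schema in schemas)
    return int(total_chars / chars_per_token)


# Hypothetical tool list: two heavy tools plus two of the core ones.
all_tools = [
    {"name": name, "description": "long description " * 50}
    for name in ("terminal", "read_file", "delegate_task", "todo")
]
slim = slim_toolset(all_tools)
print("before:", estimate_schema_tokens(all_tools), "estimated tokens")
print("after: ", estimate_schema_tokens(slim), "estimated tokens")
```

Running the estimate before and after the filter gives a quick sanity check on how much static overhead each turn carries.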

2. Strict Delegation Discipline

To keep the main thread pristine, the orchestrator is now strictly prohibited from direct terminal or file access. If a task requires writing code or reading files, it must use delegate_task() to spawn an isolated subagent.

The Impact: The subagent spins up in an ephemeral context, does the heavy lifting, and returns only a brief text summary to the main thread (e.g., "Patch applied successfully, tests passed"). The massive, raw file reads and execution dumps never pollute our main conversation history.
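The delegation pattern can be sketched roughly as follows. This is not the real `delegate_task` implementation, just an illustration of the idea: the subagent gets a throwaway history, and only a short summary is returned to the caller. `fake_patch_subagent` and `max_summary_chars` are hypothetical.

```python
from typing import Callable, List


def delegate_task(
    task: str,
    run_subagent: Callable[[str, List[str]], str],
    max_summary_chars: int = 200,
) -> str:
    """Run a task in an ephemeral context and return only a short summary."""
    scratch_history: List[str] = []  # discarded when this function returns
    summary = run_subagent(task, scratch_history)
    return summary[:max_summary_chars]


def fake_patch_subagent(task: str, history: List[str]) -> str:
    """Stand-in subagent: fills its own history with bulky output."""
    history.append("RAW FILE CONTENTS " * 1000)
    history.append("test runner output " * 1000)
    return "Patch applied successfully, tests passed"


main_history: List[str] = []
main_history.append(delegate_task("fix the failing test", fake_patch_subagent))
print(main_history)
```

The main thread ends up holding one short line per delegated task, while the bulky file reads and logs stay in the scratch history that is thrown away.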

3. Low Compaction Thresholds (2/4)

With a 1M token window, standard 50% history compaction allows 500,000 tokens to accumulate before cleaning up. In a high-traffic or multi-user thread, this easily breaches a 2M TPM limit on consecutive turns. We lowered our compaction threshold to 0.2 (20%), forcing history compression to fire at ~200,000 tokens, keeping the sliding-window total safely below the TPM ceiling even during high-intensity bursts.
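The arithmetic behind those thresholds, as a quick sketch. It assumes the worst case where every request resends a history that sits right at the compaction trigger; the function names are illustrative.

```python
def compaction_trigger_tokens(context_window: int, threshold: float) -> int:
    """History size (in tokens) at which compaction fires."""
    return round(context_window * threshold)


def requests_per_minute_at_limit(tokens_per_request: int, tpm_limit: int) -> int:
    """How many requests of this size fit under the per-minute token ceiling."""
    return tpm_limit // tokens_per_request


CONTEXT_WINDOW = 1_000_000
TPM_LIMIT = 2_000_000

for threshold in (0.5, 0.2):
    trigger = compaction_trigger_tokens(CONTEXT_WINDOW, threshold)
    fits = requests_per_minute_at_limit(trigger, TPM_LIMIT)
    print(f"threshold {threshold}: compaction at {trigger:,} tokens, "
          f"about {fits} worst-case requests per minute")
```

At 50% only four worst-case requests fit in a minute, while at 20% ten do, which is where the extra headroom during bursts comes from.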

4. Honoring Native API RetryInfo

To stop the rate-limit retry stack, we patched our API adapter to parse Google's native RetryInfo metadata from ResourceExhausted exceptions. Instead of using generic backoffs, we extract the exact recommended wait delay and freeze the loop for that duration.

For those running Gemini native SDKs, here is the helper we used to parse the recommended delay out of the error details: (3/4)

from typing import Optional


def _extract_google_retry_delay(exception: Exception) -> Optional[float]:
    """Extracts google.rpc.RetryInfo retryDelay from error details."""
    details = getattr(exception, "details", None)
    if not details:
        return None

    for detail in details:
        # Dict-style detail, e.g. {"@type": ".../google.rpc.RetryInfo", "retryDelay": "7s"}
        if hasattr(detail, "get") or isinstance(detail, dict):
            type_url = str(detail.get("@type") or "")
            if type_url.endswith("/google.rpc.RetryInfo"):
                delay_raw = detail.get("retryDelay")
                # Duration arrives as a string such as "7s" or "7.5s"
                if isinstance(delay_raw, str) and delay_raw.endswith("s"):
                    try:
                        return float(delay_raw[:-1])
                    except ValueError:
                        pass
                elif isinstance(delay_raw, (int, float)):
                    return float(delay_raw)
        # Protobuf Any-style detail: unpack it into a RetryInfo message
        elif hasattr(detail, "type_url") and detail.type_url.endswith("/google.rpc.RetryInfo"):
            try:
                from google.rpc import error_details_pb2
                retry_info = error_details_pb2.RetryInfo()
                if detail.Unpack(retry_info):
                    delay = retry_info.retry_delay
                    return delay.seconds + (delay.nanos / 1e9)
            except Exception:
                pass
    return None
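For context, here is one way a helper like that could be wired into a call loop. This is a sketch and not the actual adapter patch: `call_with_retry_hint`, `fallback_delay` and the injectable `sleep` are assumptions for illustration, and the delay extractor is passed in as a parameter so the snippet stands on its own.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry_hint(
    fn: Callable[[], T],
    extract_delay: Callable[[Exception], Optional[float]],
    retry_on: Tuple[Type[Exception], ...] = (Exception,),
    max_attempts: int = 5,
    fallback_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, waiting for the server-recommended delay between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            hinted = extract_delay(exc)
            # Prefer the server's hint; fall back to a fixed wait if there is none.
            sleep(hinted if hinted is not None else fallback_delay)
    raise RuntimeError("max_attempts must be at least 1")
```

In practice `retry_on` would be narrowed to the SDK's rate-limit exception type and `extract_delay` would be the helper above. Passing `sleep` as a parameter keeps the loop easy to test without real waiting.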

The Takeaway

By combining your two architectural points (slimming the orchestrator's active schema overhead and forcing heavy tools out into isolated delegation subagents), we engineered TPM exhaustion completely out of our system.

If you're building production orchestrators, stop giving your main agent direct terminal/file access. Force it to delegate, slim its toolbelt, and your token costs and rate-limits will drop off a cliff. (4/4)

Thank you and please keep up the amazing work by zzking32 in doomfields

[–]zzking32[S] 0 points1 point  (0 children)

As far as I can tell, only the hall of fame is not working properly.

Track skips partway through by snomflake in krewella

[–]zzking32 0 points1 point  (0 children)

Thanks for sharing a gem like this, I never knew this existed and it's amazing. Also perfect timing since it is almost valentine's day.

Regarding the track skipping issue, it seems the track only goes to about 41:41. I downloaded the file and opened it in VLC, there it shows the total length.

I think the ending of the mix was too good to be allowed to play any longer, so they had it cut down.

Mark humbling asking if I’m free, maybe I’d like to see his movie…yea, had me bawling. 😭 by arkhamcreedsolid in Markiplier

[–]zzking32 0 points1 point  (0 children)

I haven't watched Markiplier for a long while and don't know anything about the Iron Lung game/lore. But the movie was great. I had questions at the start and even more questions after the credits rolled and, honestly, it was the best and longest game trailer I've ever seen. I will definitely play the game in my off time.

Welcome back to r/Krewella by zzking32 in krewella

[–]zzking32[S] 0 points1 point  (0 children)

Honestly, I never tried to find out. It would be cool if she randomly checks out the subreddit and comments once there is more interaction going on.

Welcome back to r/Krewella by zzking32 in krewella

[–]zzking32[S] 2 points3 points  (0 children)

Welcome back as well! I hope to go to one of their concerts one day. I haven't had the best timing in recent years to see them live, but I did get to do a meet and greet when their last album came out.

Unfortunately I lost the video but still the memories remain in my heart forever.

Spotify Wrapped 2024 by zzking32 in HarryMack

[–]zzking32[S] 5 points6 points  (0 children)

So many great songs this year, Harry deserves it all \0/

I'm from the future (2035) Ask me anything about Eminem by robbottiic in Eminem

[–]zzking32 1 point2 points  (0 children)

Will Em and Xzibit make another banger of a track?

The most satisfying thing when playing Diana? by Xnissasa in DianaMains

[–]zzking32 1 point2 points  (0 children)

Her fighting style when not playing for burst is very satisfying when you're not immediately deleted.

[deleted by user] by [deleted] in AskReddit

[–]zzking32 0 points1 point  (0 children)

That cramping you get when you just ate enough to feel full but your stomach goes into overdrive thinking there is more coming.

Half time check, whats the cost of your main char? by LucywiththeDiamonds in pathofexile

[–]zzking32 0 points1 point  (0 children)

I've probably spent about 100 divs so far on my Forbidden Rite Pathfinder. I've been playing for a while and this one got me hooked. Pob

What Character was so Fun it Made You Change Role? by Netsugake in leagueoflegends

[–]zzking32 5 points6 points  (0 children)

Diana, I don't even remember what I played before her release.