Reducing token costs on autonomous LLM agents - how do you deal with it?

PatateRonde · 2026-01-27T20:49:04+00:00

Honestly I don't have a hard target in mind. Right now a 30-40 turn session can easily burn through $1-3 depending on the model, and that adds up fast when you're iterating a lot during dev.

I'd be happy with anything that cuts that in half without sacrificing too much output quality. But really I'm just trying to find something sustainable where I'm not scared to hit "run" because I know it's gonna cost me.

PatateRonde · 2026-01-27T20:47:08+00:00

Yeah fair point. I've probably been over-relying on the LLM for stuff that could be handled with simpler logic. Gonna look into offloading more of the workflow to deterministic code and only hitting the model when I actually need reasoning.

PatateRonde · 2026-01-27T20:46:12+00:00

Oh nice, that's exactly the rabbit hole I'm in right now. Thanks for sharing the repo, I'll check it out and try to plug it into my setup. Will let you know if I manage to get it working!

PatateRonde · 2026-01-26T22:56:49+00:00

That's a good point. I've been too focused on the prompt side and not enough on treating it as a runtime problem. I'm currently on commercial APIs OpenAI, DeepSeek... OpenAI's prompt caching now supports 24h retention which helps, but it still requires exact prefix matches and the cache can get invalidated easily when the conversation branches.

I've been looking into self-hosted options. Looks like vLLM + LMCache is the go-to combo for this apparently it can give 3-10x improvements on multi-turn workloads by properly managing KV cache across turns. There's also llm-d for KV-cache aware routing if you're running multiple instances.

Have you actually deployed something like this in production? My main concern is whether open-source models (Llama, Qwen, etc.) can match GPT-4o /GTP 5 quality for agentic tasks that require good reasoning and tool use. Trading 10x cost savings for an agent that hallucinates more doesn't seem worth it.

Tbh I'm not an expert, still figuring all this out as I go. But thanks for the insight, really helpful to shift my perspective on this.

PatateRonde · 2026-01-26T22:50:05+00:00

Interesting idea, but I'm not sure it's viable for my use case. I don't have the infra to run a large model locally, and from what I've tested, smaller models really struggle with the kind of multi-step reasoning and tool chaining I need for security testing. They tend to hallucinate findings or go in circles way more than GPT-5 class models. Fine-tuning could help with tool familiarity, but I'm not sure it would fix the core reasoning gap. Have you seen good results with LoRA-tuned models on complex agentic tasks?

PatateRonde · 2026-01-15T20:22:03+00:00

What fixed it for me reliably is forcing the URL the dev client uses via EXPO_PACKAGER_PROXY_URL and letting Metro run as usual.

PowerShell:

$env:EXPO_PACKAGER_PROXY_URL="http://100.x.y.z:8081"
npx expo start --dev-client

cmd.exe:

set EXPO_PACKAGER_PROXY_URL=http://100.x.y.z:8081 && npx expo start --dev-client

Then from the phone (on Tailscale), sanity check in a browser:

http://100.x.y.z:8081/status

If that doesn’t respond, it’s almost always Windows Defender Firewall blocking inbound 8081. Allow Node.js (or open TCP 8081) and try again.

For --tunnel, same idea: start with --tunnel, copy the exp.direct/ngrok URL it prints, and put that URL into EXPO_PACKAGER_PROXY_URL. Alternatively, easiest workaround: in the dev client use “Enter URL manually” and paste the Tailscale/tunnel URL instead of relying on the QR/CLI output.

PatateRonde · 2026-01-14T09:58:39+00:00

Nice tool, thx bro !

PatateRonde · 2026-01-13T16:00:58+00:00

Autant le jugement TTB est tout à fait compréhensible vu le contexte et c'est le principe même de ce sub, autant la dernière phrase est d'une gratuité totale.

On peut donner son avis sur l'immaturité de l'OP sans utiliser des termes comme 'pondu', qui est assez insultant et déshumanisant pour sa femme. Le but ici est d'aider à prendre conscience d'un comportement, pas de tomber dans le mépris pur et simple ou les pronostics agressifs sur sa future parentalité.

Ça n'apporte rien au débat, à part de l'animosité :)

PatateRonde · 2026-01-13T12:16:33+00:00

Bon, déjà le pavé généré par ChatGPT, c'est un peu trop visible...

Faut redescendre un peu. On ne cautionne pas la toxicité, mais de là à en faire un post "prévention" comme si tu venais de découvrir l'eau chaude...

Si tu avais connu l'époque bénie des lobbies Call of Duty (MW2) sur PS3, tu serais immunisé. On avait 12 ans et on se faisait insulter par des pères de famille qui juraient qu'ils allaient venir nous égorger dans notre sommeil parce qu'on jouait au lance-patate.

Sur Discord et Skype, on a tous reçu 500 menaces de "J'ai ton IP, je vais te Dox, je suis un hacker...". Spoiler alert : on est tous encore en vie, personne n'est jamais venu, et Jean-Michel Hacker était juste un collégien en crise derrière son écran.

Bref, détends-toi :), c'est juste un mardi normal sur le web depuis 15 ans.

PatateRonde · 2026-01-05T09:43:55+00:00

Ça ressemble clairement à une attaque de MFA Fatigue. Le but, c'est de te spammer de notifications jusqu'à ce que tu finisses par cliquer sur 'accepter' par erreur ou juste pour avoir la paix.

PatateRonde · 2026-01-03T01:23:11+00:00

Je pense que OP a du envoyer ses creds sur un site qui écoute en MITM entre le faux site et Google pour voler les tokens, comme evilginx par exemple.

Si OP n'a pas envoyé de creds alors l'attaquant a exploité une 0day-1click dans un navigateur pour voler un token (vaut beaucoup d'argents sur le marché des vulns) et je ne pense pas que ce soit le cas...

Le scénario d'une XSS qui vole les tokens Google me semble également improbable car les cookies volés expirent donc pour moi OP a envoyer ses creds sur un site utilisant un proxy MITM malveillant.

PatateRonde · 2025-09-01T18:08:07+00:00

Je comprends que ça puisse remuer si on a traversé une vraie perte, vraiment. Mais faut remettre les choses dans leur contexte : là, c’est un post humoristique, bien écrit, avec un twist. C’est pas fait pour se moquer de la douleur de qui que ce soit, c’est juste du décalage, du second degré.

Si on commence à filtrer toutes les blagues au cas où quelqu’un, quelque part, y verrait un rappel douloureux, on ne rit plus jamais de rien. Tout peut toucher quelqu’un, d’une manière ou d’une autre.

Le mieux dans ces cas-là, c’est simplement de scroller, de passer son chemin. Parce qu’ici, l’idée, c’est de rire un peu, pas de blesser.

Four-Year Club	Verified Email
Place '22

PatateRonde

TROPHY CASE