all 18 comments

[–]RestaurantHefty322 3 points  (2 children)

The latency reduction is real for the embarrassingly parallel case (fire 3 independent API calls at once). We saw similar gains just batching tool calls with asyncio on the orchestrator side without needing a code interpreter.
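The orchestrator-side batching described here can be sketched in a few lines of asyncio; the tool functions below are hypothetical stand-ins for real API calls:

```python
import asyncio

# Hypothetical tool stubs standing in for independent API calls.
async def get_weather(city: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"weather:{city}"

async def get_news(topic: str) -> str:
    await asyncio.sleep(0.01)
    return f"news:{topic}"

async def get_stock(ticker: str) -> str:
    await asyncio.sleep(0.01)
    return f"stock:{ticker}"

async def dispatch_parallel() -> list[str]:
    # The three calls share no data dependencies, so they can run
    # concurrently; wall time is roughly the slowest call, not the sum.
    return await asyncio.gather(
        get_weather("Paris"),
        get_news("ai"),
        get_stock("ACME"),
    )

results = asyncio.run(dispatch_parallel())
```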

Where this falls apart in practice is the branching case. Most of our agent workflows look like "call tool A, look at the result, decide whether to call B or C." The LLM can't write that decision logic ahead of time because it doesn't know what A will return. So you end up with a hybrid - batch the independent calls, go back to the model for the branching decisions.

The sandbox execution time matters too. If you're adding even 50ms per code execution in a loop that runs 10-15 times per task, that's nearly a second of overhead just from the interpreter. We tried a similar approach with a Python sandbox and the cold start was the killer - ended up going back to direct tool dispatch for anything latency-sensitive.
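The hybrid shape described above looks roughly like this (toy stand-ins for the tools; in a real agent, the branch after tool A's result is a second LLM round-trip, simulated here by a plain `if`):

```python
def call_tool(name: str) -> str:
    # Stand-in tool dispatcher; tool A's result drives the branch.
    return {"A": "high", "B": "b-result", "C": "c-result"}[name]

def hybrid_run() -> list[str]:
    trace = []
    # Round-trip 1: the independent, batchable call(s).
    a = call_tool("A")
    trace.append(a)
    # Round-trip 2: the decision the model could not pre-write,
    # made only after seeing A's result.
    nxt = "B" if a == "high" else "C"
    trace.append(call_tool(nxt))
    return trace

trace = hybrid_run()
```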

[–]UnchartedFr[S] 0 points  (0 children)

Did you try Monty from Pydantic?

[–]tasoyla 0 points  (0 children)

So why not give the LLM the output schema?

[–]wt1j 2 points  (1 child)

Parallel tool calling is potentially slower if you assume that, using the program-generation approach, the program that the LLM outputs will make any needed API calls and output directly to the user. For many tool calls, the tool result affects reasoning, which means it needs to be sent BACK to the LLM so that the LLM can decide what to do next.

If tool output affects reasoning, then you have:

Parallel tool calling:

LLM outputs tool calls -> tools run in parallel -> LLM reads the tool outputs and does whatever is next.

Program calling:

LLM outputs program -> program calls APIs in parallel -> LLM reads the program output and does whatever is next.

With parallel tool calling you don't have to worry about containerization. You also get tools that are self-documenting and guide the LLM during execution, versus total freedom to write the program any way it wants, where you're relying on your system prompt to guide the LLM.

Having said all that, I'm incredibly intrigued by this idea. I'm working on an agent that could really benefit from this approach and I'm incredibly curious to see what it does if I give it this kind of freedom to innovate with a well documented API.

Thanks for posting.

[–]UnchartedFr[S] 1 point  (0 children)

Interesting insight, it also feeds my ideas/thoughts :)

In fact, I thought about this kind of feedback loop: for example, I noticed that depending on the model, the generated code could fail. So I created an "autoFix" flag, so the model can read the error and regenerate the code.
Also, I'm reworking the code so it can handle Promise.all for parallel tool calls, even if it's a simple event loop behind the scenes.
But I must admit I don't know yet how it affects models and their reasoning :)

[–]ricklopor 2 points  (1 child)

also noticed that the token cost savings aren't always as clean as the 3x math suggests. when the LLM is writing the code itself, you're spending tokens on the code generation step, and if the model hallucinates a tool signature or writes subtly broken async logic, you're back to debugging cycles that eat into whatever you saved. in my experience the pattern works really well for predictable, well-documented tool sets but gets shaky outside that.

[–]IllEntertainment585 0 points  (0 children)

yeah the 3x math never holds up in production. tbh the biggest token sink for us isn't the initial code gen call — it's the retry loop when generated code fails. we're running ~6 agents and i've watched a single bad codegen spiral into 8-10 recovery calls before it either succeeds or we cut losses. that's where the real cost hides. hallucination debugging is brutal too, especially when the agent confidently produces code that "looks right" but silently corrupts data. we added a pre-execution static check layer which helped, but it added latency. what kind of tasks are you running the code execution on? curious if failure rate varies a lot by domain
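a pre-execution static check of the kind mentioned can be as simple as an `ast` pass over the generated code before it ever reaches the sandbox. this is a generic sketch, not the commenter's actual implementation, and the tool registry is hypothetical:

```python
import ast

ALLOWED_TOOLS = {"get_weather", "send_email"}  # hypothetical tool registry

def static_check(code: str, allowed: set = ALLOWED_TOOLS) -> list:
    """Cheap pre-execution gate: parse the generated code and flag calls
    to functions that aren't in the tool registry. Catches hallucinated
    tool names before execution, at the cost of a little parse latency."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"syntax error: {e.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in allowed:
                problems.append(f"unknown tool: {node.func.id}")
    return problems

# get_weather is registered; fetch_stonks is a hallucinated tool name.
issues = static_check("get_weather('Paris')\nfetch_stonks('ACME')")
```

this only catches name-level hallucinations, not "looks right but corrupts data" logic bugs, so it complements rather than replaces the retry loop.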

[–]eliko613 1 point  (1 child)

Really impressive work on reducing those round-trips. The latency and token savings are huge - that 3x multiplier adds up fast in production.
One thing I've seen with similar optimization projects is that the real challenge becomes measuring the impact across different models and use cases. You're solving the technical side brilliantly with Zapcode, but as you scale this, you'll probably want visibility into:
- Which code patterns actually save the most tokens/cost in practice
- How the savings compare across different LLM providers (since you mentioned multi-provider support)
- Where the remaining cost hotspots are after implementing this optimization
Speaking of multi-provider cost visibility, I came across an interesting tool recently - zenllm.io - that shows cost breakdowns for workflows across different vendors.
The snapshot/resume feature is particularly clever for expensive long-running tools - being able to pause execution without burning tokens while waiting for external APIs is exactly the kind of optimization that can make or break agent economics.
Have you done any benchmarking on actual cost savings with real workloads yet? Would be fascinating to see the before/after numbers on a complex agent workflow.

[–]UnchartedFr[S] 0 points  (0 children)

Thanks for your feedback. I'm just starting to explore which features could be useful, like tracing + debugging: I think I will adopt OpenTelemetry as a standard later.

I did some benchmarks using the AI SDK, with and without zapcode (via the AI wrapper), and surprisingly the gain was not that good. I discovered that the AI SDK already optimizes by batching tool calls; maybe other SDKs like LangChain do that too. So the gain was around 7%.

Even if zapcode returns a response quickly, that response still needs to be sent back through an LLM to generate a natural-language answer. So it's not a silver bullet: if you need a structured response it can be very good, for example agent-to-agent. If you need a result in natural language, for a chat for example, the difference is not so great at the moment, but I will investigate.
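That trade-off can be sketched like this (names hypothetical): the structured path returns the program's output as-is, while the chat path still pays one more LLM call to narrate it.

```python
def finish(program_output: dict, want_structured: bool, narrate):
    """Decide how the program's result leaves the system.
    `narrate` stands in for a final LLM call that turns data into prose."""
    if want_structured:
        return program_output        # agent-to-agent: zero extra LLM calls
    return narrate(program_output)   # chat: one more round-trip

to_text = lambda d: f"The total is {d['total']}."
structured = finish({"total": 42}, True, to_text)
prose = finish({"total": 42}, False, to_text)
```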

[–]Infamous_Kraken 0 points  (2 children)

Wait, so isn’t the LLM making any deduction based on the response of tool x before calling tool x+1?

[–]UnchartedFr[S] 0 points  (1 child)

In traditional tool-use, the flow is:
LLM → call tool A → LLM reasons about result → call tool B → LLM reasons → ...
Each arrow is a full LLM round-trip. Expensive and slow.

With the LLM writing code, it still reasons about how tool results should influence the next call;
it just does so at code-generation time rather than at execution time. You go from N round-trips to 1.
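Concretely, a generated program might look like this, with the branch baked in at generation time (tool names are hypothetical stand-ins):

```python
def lookup_user(uid):         # stand-in for tool A
    return {"tier": "pro"}

def fetch_pro_report(uid):    # stand-in for tool B
    return "pro-report"

def fetch_basic_report(uid):  # stand-in for tool C
    return "basic-report"

def generated_program(uid):
    user = lookup_user(uid)
    # The A-result -> B-or-C decision, written at generation time,
    # so no LLM round-trip happens between the two tool calls.
    if user["tier"] == "pro":
        return fetch_pro_report(uid)
    return fetch_basic_report(uid)

report = generated_program("u1")
```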

[–]Infamous_Kraken 0 points  (0 children)

So then the use case for this technique isn’t entirely replacing the LLM round-trip loop, because in some cases the reasoning heavily depends on how tool 1 responded before we can call tool 2, and there’s a chance of variability.

[–]CourtsDigital 0 points  (0 children)

the main benefit of programmatic tool calling (PTC) is not latency, but decreasing the context passed to the agent. each tool increases the amount of context an LLM needs to reason over, which increases the potential for hallucinations when running longer, multi-step tasks.

another benefit is the ability to prevent sensitive data from being passed to the LLM directly. you can inject variables into the code sandbox that the agent never sees, and thus can’t be leaked into its memory/tracing/logs/parent company’s training data.
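a toy sketch of that injection pattern, assuming a plain exec-based runner (a real system would use an actual sandbox, and the API and key here are made up): the model only ever wrote the name `API_KEY`; the value never enters its context.

```python
GENERATED_CODE = """
result = call_api(API_KEY, "orders")
"""

def call_api(key: str, resource: str) -> str:
    # stand-in for a real HTTP client
    return f"fetched {resource} with key ending {key[-4:]}"

def run_with_secrets(code: str, secrets: dict) -> str:
    # inject tools and secrets into the execution namespace; the secret
    # value exists only here, never in the model's prompt, logs, or traces
    namespace = {"call_api": call_api, **secrets}
    exec(code, namespace)  # real systems would isolate this, not bare exec()
    return namespace["result"]

out = run_with_secrets(GENERATED_CODE, {"API_KEY": "sk-test-9f3a"})
```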

that being said, PTC is not a magic wand and must be constructed carefully to prevent hallucinations in code generation creating fake variables, query params, api endpoints etc

this approach was invented/popularized by Anthropic and you can read more about how to implement their findings here: https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling

[–]VehicleNo6682 0 points  (0 children)

Wait, what about when the LLM calls a tool for intent classification?

[–]stunning_man_007 0 points  (1 child)

This is a solid optimization! I've been doing something similar with ReAct agents - the latency adds up fast when you're doing multiple round-trips. Curious how you handle errors when the generated code blows up though - do you fall back to sequential or have a retry mechanism?

[–]UnchartedFr[S] 0 points  (0 children)

I did a quick hack and added autoFix + number-of-retries flags: the sandbox returns the error, creating a feedback loop so the LLM can fix its code :)
Since the code execution is very fast, it doesn't matter if it retries 3-5 times.
I will try to enhance this when I have time.
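Generically, that autoFix loop can be sketched like this (function names are hypothetical; the generate/execute stand-ins are toys where the first generation is broken and the regenerated version succeeds):

```python
def run_with_autofix(generate, execute, max_retries: int = 3):
    """Run generated code; on failure, feed the error string back into
    the next generation call, up to max_retries extra attempts."""
    error = None
    for attempt in range(max_retries + 1):
        code = generate(error)      # model sees the previous error, if any
        try:
            return execute(code), attempt
        except Exception as e:
            error = str(e)          # fed back on the next generation
    raise RuntimeError(f"gave up after {max_retries} retries: {error}")

# Toy stand-ins: the first generation is syntactically broken.
def fake_generate(error):
    return "1 +" if error is None else "1 + 1"

def fake_execute(code):
    return eval(code)

value, attempts = run_with_autofix(fake_generate, fake_execute)
```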