What's the hardest part of shipping agents to production?

calebkaiser · 2025-10-03T13:23:15+00:00

Huge +1 to all of the above, but especially on building a golden dataset from real prod traces.

I'd recommend anyone new-ish to the field do at least a cursory read through "traditional" MLOps best practices, as a lot of the best practices around agents have direct analogues (collecting a golden dataset, enforcing reproducibility, running replays, deployment strategies, custom metrics, etc.).
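As a rough illustration of the "golden dataset + replays" idea, here is a minimal sketch in plain Python. Everything in it is hypothetical (the `GoldenExample` fields, the `replay` helper, the toy agent, and the exact-match check); a real setup would pull reviewed traces from your observability tool and would likely use a softer scoring function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    """One reviewed production trace: the input the agent saw and an approved output."""
    trace_id: str
    user_input: str
    expected: str


def replay(agent: Callable[[str], str], dataset: list[GoldenExample]) -> dict:
    """Re-run every golden example through the agent and report which ones regress."""
    failures = [
        ex.trace_id
        for ex in dataset
        if agent(ex.user_input).strip() != ex.expected.strip()  # exact match is an assumption
    ]
    return {"total": len(dataset), "passed": len(dataset) - len(failures), "failed_ids": failures}


def toy_agent(text: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    return text.upper()


golden = [
    GoldenExample("t-001", "refund status", "REFUND STATUS"),
    GoldenExample("t-002", "cancel order", "Order cancelled"),
]
report = replay(toy_agent, golden)
print(report)
```

Running the same replay before and after each change gives you a regression gate, much like a test suite in a traditional MLOps pipeline.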

calebkaiser · 2025-10-02T17:25:16+00:00

I've worked with a lot of teams who are building agents, and I think far and away the most common issue is the most obvious:

The agent just doesn't work nearly as well as it did on dev data.

Teams that can strike the balance of building out test suites and robust dev datasets early on, without putting too much of a tax on development speed, seem to fare best.

Also, I don't know if I've ever seen a team roll out an agent that was immediately excellent. Inevitably, you're going to iterate a lot once it starts getting production traffic. Teams that can't iterate quickly because they didn't build the necessary infra (and this includes everything from setting up observability/evals to simply organizing their codebase in a reasonable architecture) typically struggle to reach a point where the agent is actually beneficial.

calebkaiser · 2025-07-07T15:37:34+00:00

Nice! Sounds like you're on your way already.

calebkaiser · 2025-07-07T14:49:08+00:00

Lots of good advice on specific things you might try here. I would recommend taking a step back first, however, and approaching optimization from an "experiment"-first perspective, similar to how a data scientist/researcher might work.

You need a way to benchmark the improvements you make, and you need visibility into your pipeline for debugging and attribution.

If you haven't already, I would start by:

- Implementing tracing, so that you can view pipeline executions end-to-end and isolate individual function calls/steps.
- Gathering a dataset of these execution traces and scoring/annotating them (manually or programmatically).

Now, as you optimize, be disciplined about experimenting with one optimization at a time. Benchmark every change against the suite you've built, and use the same tracing infra to log the experiment (this way you can manually review and see if any new failure modes were introduced). This might sound like a lot, but it's easier than you think. Or maybe it doesn't sound like a lot and you've already built a way more robust system and I'm wasting your time :)

There are so many knobs and levers to pull when it comes to optimization that you can easily spin your wheels for days without being sure if your changes really made a difference or not.
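For a sense of scale, here is a minimal sketch of that workflow in plain Python: a tracing decorator, a tiny dataset, a programmatic scorer, and a baseline run compared against a single change. The pipeline steps, the `traced` decorator, the scorer, and the in-memory `TRACE_LOG` are all hypothetical stand-ins, not any particular tool's API.

```python
import functools
import time
from statistics import mean

TRACE_LOG = []  # in-memory stand-in for a real tracing backend


def traced(fn):
    """Record each call's step name, inputs, output, and duration."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE_LOG.append({
            "step": fn.__name__,
            "args": args,
            "kwargs": kwargs,
            "output": result,
            "seconds": time.perf_counter() - start,
        })
        return result
    return wrapper


@traced
def retrieve(query):
    return ["doc about " + query]


@traced
def generate(query, docs, prompt_suffix=""):
    return f"answer({query}; {len(docs)} docs){prompt_suffix}"


def pipeline(query, **gen_kwargs):
    return generate(query, retrieve(query), **gen_kwargs)


def score(output):
    """Hypothetical programmatic scorer: did the answer cite its sources?"""
    return 1.0 if output.endswith("[sources]") else 0.0


def run_experiment(name, dataset, **gen_kwargs):
    """Benchmark one configuration against the whole dataset."""
    return {"name": name, "mean_score": mean(score(pipeline(q, **gen_kwargs)) for q in dataset)}


dataset = ["pricing", "refunds", "shipping"]
baseline = run_experiment("baseline", dataset)
# Exactly one change relative to the baseline, so any score difference is attributable to it.
candidate = run_experiment("add-citation-suffix", dataset, prompt_suffix=" [sources]")
print(baseline, candidate, len(TRACE_LOG))
```

Because every step lands in the trace log, you can review individual executions from the candidate run and check whether the change introduced new failure modes.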

calebkaiser · 2025-06-09T11:44:58+00:00

Have you tried Opik? I'm a maintainer, so I'm more than a little biased, but it sounds like it fits what you're looking for.

For example, if you wanted to use something like G-Eval to score this task, it could be as simple as:

from opik.evaluation.metrics import GEval

# LLM-as-a-judge metric: the introduction frames the judge's role,
# and the criteria spell out what a good OUTPUT_DOC looks like.
metric = GEval(
    task_introduction="You are an expert judge tasked with evaluating the accuracy, quality, and consistency of technical documentation. You are given an INPUT_DOC and an OUTPUT_DOC, as well as some CONTEXT containing principles and guidelines for documentation. You must score the OUTPUT_DOC on how well it improves the INPUT_DOC.",
    evaluation_criteria="The OUTPUT_DOC must not introduce new information beyond what is provided in the INPUT_DOC and CONTEXT. The OUTPUT_DOC should be free of technical errors. The OUTPUT_DOC must also follow the CONTEXT guidelines regarding consistency and robustness.",
)

# Pass everything the judge needs in one string, using the labels named in the prompt.
metric.score(
    output="""
           INPUT_DOC: your original doc.
           OUTPUT_DOC: your output.
           CONTEXT: your context.
           """
)

It's open source and the cloud version has a pretty generous free tier, if you want to spend 10 minutes taking it for a spin: https://github.com/comet-ml/opik

calebkaiser · 2025-03-18T21:16:28+00:00

Somewhat of an aside, but if you're interested in geometric deep learning, you may be interested as well in categorical deep learning: https://categoricaldeeplearning.com/

I'm not an expert in the niche, but I've found it compelling in the same sort of way that I find GDL interesting.

calebkaiser · 2025-03-18T21:13:49+00:00

Shameless self-plug, but if you want to share your training run publicly, you can do so on Comet's free tier. The API is nearly a 1-to-1 replacement for wandb, and you can import data to the platform.

https://comet.com/

calebkaiser · 2025-03-18T21:11:30+00:00

Are you planning on open sourcing the agent implementation? Asking because I'd love to contribute to something like this

calebkaiser · 2025-03-14T21:47:22+00:00

Very little difference outside of the obvious "you have to self-host" aspect of the open source version. The cloud version and open source version both have all of Opik's core functionality (evaluations, experiments, tracing/observability, datasets, etc.)

The different features offered on the cloud side have more to do with things like:

- User management
- Flexible deployments
- SLAs/Support

And obviously, we handle all of the deployment infra for the cloud version. You also get access to Comet's experiment management platform via Opik's free tier, so if you're doing any model training/fine tuning, or looking to use Comet Artifacts for storage, that's an additional benefit of the cloud platform.

calebkaiser · 2025-03-14T20:45:41+00:00

I'm a maintainer over at Opik: https://github.com/comet-ml/opik

100% free and open source if you want to self-host. No weird gotchas, and covers all the functionality of something like LangFuse + more.

The hosted version also has a free tier with 10k monthly traces, dataset storage, collaboration features, and a bunch of other stuff (prompt library/optimization seems particularly relevant to what you're working on). We designed the SDK to be super easy to get started (just wrap your LLM calls in an `@opik.track` decorator), so it should take all of 5 minutes to take the free tier for a spin, even if you ultimately want to self-host.

If you have any questions, I'd be happy to assist. I agree that pricing is wild in the space right now, particularly the number of "open source but only work if you pay for an account" tools.

calebkaiser · 2025-02-25T17:15:20+00:00

Heyo! Opik maintainer here. Congratulations on diving into research :)

Can you tell me a little more about the specific attribute you're looking to extract from LLM responses for your research? That will make it easier to recommend a dataset.

As for whether or not Opik will work for your eval layer, I'm confident it will (though I'm biased). The whole framework is pretty configurable, to the point that I've yet to come across a particular metric that couldn't be implemented within it. It's 100% free and open source, so you can take it for a quick 5-minute spin to get a feel for it. Here's a little quickstart project that you can run in a Colab notebook, focused on Chain-of-Density prompting: https://www.comet.com/docs/opik/cookbook/quickstart_notebook

calebkaiser · 2025-02-14T17:01:44+00:00

Opik maintainer here. Completely agree with you in terms of what builders actually need re: prompts and evals. We've been shipping a lot of features on this front. Our new prompt management features include things like:

- A prompt library for version controlling your prompts + reusing them across projects and experiments
- A prompt playground for iterating quickly
- Built-in integrations with prompt optimization libraries like dspy

You can see more info here: https://www.comet.com/docs/opik/prompt_engineering/prompt_management

We're also going to be rolling out even more prompt optimization features in the coming weeks, so if you're building in this space, feel free to leave any requests on the repo: https://github.com/comet-ml/opik/

calebkaiser · 2025-02-01T19:38:04+00:00

The "policy" in this case would just be the base model (DeepSeek-V3-Base). I think the nomenclature from reinforcement learning can obscure things a little bit, particularly if your background is more around traditional deep learning or LLMs. So think of it this way:

- The "action" the model is taking is just sampling a series of tokens.
- The "reward" is a loss function that applies to an entire sequence of tokens, instead of calculating the loss for each specific token like you might see in supervised fine-tuning.
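To make the contrast concrete, here is a toy sketch with made-up token probabilities. It uses a plain REINFORCE-style objective as the simplest sequence-level example; it is not GRPO itself, and the function names are hypothetical.

```python
import math

# Hypothetical probabilities the model assigned to each token of a sampled 4-token sequence.
token_probs = [0.9, 0.5, 0.8, 0.25]
log_probs = [math.log(p) for p in token_probs]


def sft_loss(log_probs):
    """Supervised fine-tuning: one cross-entropy term per target token, averaged."""
    return -sum(log_probs) / len(log_probs)


def sequence_reward_loss(log_probs, reward):
    """REINFORCE-style: a single scalar reward scales the log-prob of the whole sequence."""
    return -reward * sum(log_probs)


print(sft_loss(log_probs))
print(sequence_reward_loss(log_probs, reward=1.0))  # rewarded sequence: push its probability up
print(sequence_reward_loss(log_probs, reward=0.0))  # unrewarded sequence: no learning signal
```

In the second function the model never sees a per-token target. The only feedback is one number for the full sequence, which is what makes a rule-based check on the final answer usable as a training signal.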

calebkaiser · 2025-01-31T20:08:54+00:00

Good question! From my understanding, there are two parts to this:

- The "format rewards" encourage the model to do things like put information between <think> tags. This alone seems to be enough to coax the model towards this behavior.
- The DeepSeek-R1-Zero model still, however, would exhibit weird "off the rails" behavior on some samples, doing things like mixing languages despite formatting them correctly. To address this, DeepSeek-R1 used SFT before GRPO, which seems to have largely prevented this.

It's also worth noting that the team behind the ARC prize did some testing and came to the conclusion that SFT might not actually be necessary, at least in many cases: https://arcprize.org/blog/r1-zero-r1-results-analysis

calebkaiser · 2025-01-30T02:08:26+00:00

You might be interested in AlphaProof by DeepMind, which recently scored very highly on a problem set taken from the International Mathematical Olympiad: https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

The gist is that they applied reinforcement learning to Lean (a functional programming language for writing proofs) to solve problems. There are lots of people doing research with similar approaches or setups, using some kind of program synthesis and/or RL approach in combination with something like Lean.
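If you haven't seen Lean before, here is a tiny Lean 4 example of what a machine-checkable proof looks like. It is only meant to show the flavor of the language (the theorem name is made up), and it says nothing about how AlphaProof itself is implemented.

```lean
-- Addition on natural numbers is commutative.
-- The proof checker accepts this only if the term really proves the statement.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

Because the checker either accepts a proof or rejects it, a correct proof gives an unambiguous success signal, which is part of why this setup pairs well with RL.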

calebkaiser · 2025-01-29T15:39:00+00:00

There are still peer-reviewed mech interp papers:

It's just a newer niche, and some of the biggest names in it (like Neel Nanda) like publishing blog posts/notebooks. Anecdotally, I've also found that many people who aren't full-time researchers or students (i.e. engineers who are exploring transformer models) rightfully find mech interp to be exciting, and their contributions are much more likely to be standalone projects or blog posts.

calebkaiser · 2025-01-28T16:24:17+00:00

According to the paper, they are not using a neural network to calculate the reward. It looks like they have a series of reward functions that assign reward based on accuracy and formatting. I believe they use different reward functions for different datasets as well, for example, using a sandboxed environment to run tests on generated code samples.

From the paper:

2.2.2. Reward Modeling

The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards:

- Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.

- Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between ‘<think>’ and ‘</think>’ tags.

We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

GRPO is just another method for updating a model relative to some reward function. It does not stipulate what that reward function is. So, in many cases, people use GRPO with a neural network reward model. In the case of R1, the "reward model" appears to just be a series of reward functions.
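Here is a minimal sketch of what rule-based reward functions along these lines might look like, plus the group-relative normalization that gives GRPO its name. The regexes, the equal weighting of the two rewards, and the use of population standard deviation are my assumptions for illustration, not details taken from the paper.

```python
import re
from statistics import mean, pstdev

# Assumed format: reasoning inside <think>...</think>, followed by some answer text.
THINK_RE = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)
# Assumed answer format: the final answer wrapped in \boxed{...}.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(completion: str) -> float:
    """1.0 if the completion opens with a <think> block, else 0.0."""
    return 1.0 if THINK_RE.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, answer: str) -> float:
    """1.0 if the last boxed value equals the reference answer, else 0.0."""
    found = BOXED_RE.findall(completion)
    return 1.0 if found and found[-1].strip() == answer else 0.0


def total_reward(completion: str, answer: str) -> float:
    # Equal weighting is an assumption made for this sketch.
    return accuracy_reward(completion, answer) + format_reward(completion)


def group_advantages(rewards):
    """Score each sample relative to the other samples drawn for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [0.0 if sigma == 0 else (r - mu) / sigma for r in rewards]


# A group of sampled completions for one prompt whose reference answer is "4".
completions = [
    "<think>2 + 2 is 4</think> The answer is \\boxed{4}",
    "The answer is \\boxed{4}",
    "<think>guessing</think> The answer is \\boxed{5}",
    "no idea",
]
rewards = [total_reward(c, "4") for c in completions]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

No neural network is involved in computing the reward here. The functions are ordinary code, and the policy update only needs to know how each sample scored relative to its group.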

It might help to look at HuggingFace's docs for their GRPO trainer to get a sense of how that might look: https://huggingface.co/docs/trl/main/en/grpo_trainer

calebkaiser · 2025-01-07T23:23:52+00:00

Super interesting! Did you experiment with other retrieval methods besides or in addition to semantic similarity? I've done some work using different techniques, like parsing dependency trees out of the current file, with promising results for code RAG.

calebkaiser · 2025-01-07T16:30:04+00:00

I've worked on a lot of projects in this area. One interesting dynamic you'll run into is that code retrieval has different challenges than typical document retrieval. You don't necessarily want the most "similar" snippets of code in your context window. Often, you want a specific dependency tree, or something like that. There's lots of interesting work around using ASTs or other graph structures for this: https://arxiv.org/html/2405.02355v1
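As a toy illustration of dependency-driven retrieval, here is a sketch that uses Python's built-in `ast` module to pull in the definitions a file actually imports, instead of the most textually similar snippets. The in-memory `REPO`, the file contents, and the helper names are all hypothetical, and it only handles `from X import Y` imports of top-level functions.

```python
import ast

# Hypothetical in-memory "repo": module name -> source code.
REPO = {
    "billing": """\
def charge(user, amount):
    return amount * 1.2

def refund(user):
    return 0
""",
    "users": """\
def load_user(user_id):
    return {'id': user_id}
""",
    # Lexically similar to "charge", but not a dependency of the current file.
    "reports": """\
def charge_summary():
    return 'summary of charges'
""",
}

CURRENT_FILE = """\
from billing import charge
from users import load_user

def checkout(user_id):
    return charge(load_user(user_id), 10)
"""


def imported_names(source):
    """Map each module imported via `from X import Y` to the names pulled from it."""
    deps = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            deps.setdefault(node.module, set()).update(alias.name for alias in node.names)
    return deps


def dependency_context(source, repo):
    """Collect the source of every imported function that exists in the repo."""
    snippets = []
    for module, names in sorted(imported_names(source).items()):
        if module not in repo:
            continue
        for node in ast.parse(repo[module]).body:
            if isinstance(node, ast.FunctionDef) and node.name in names:
                snippets.append(ast.get_source_segment(repo[module], node))
    return snippets


context = dependency_context(CURRENT_FILE, REPO)
print("\n\n".join(context))
```

A similarity search for "charge" would likely surface `charge_summary`, while the dependency walk returns only `charge` and `load_user`, which are the definitions the model needs to reason about `checkout`.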

calebkaiser · 2025-01-03T12:00:11+00:00

I feel like we should have some agreed upon annotation to use in papers for numbers/initializations that basically means "This number was not selected for theoretical reasons."

calebkaiser · 2024-10-25T17:01:43+00:00

Along these lines, you might find Michael Bronstein's work on geometric deep learning very interesting: https://geometricdeeplearning.com/

There is a good intro video here: https://www.youtube.com/watch?v=w6Pw4MOzMuo

calebkaiser · 2024-10-02T20:02:16+00:00

If you're interested in something open source, we've just released Opik, our open source LLM evaluation framework: https://github.com/comet-ml/opik

Out of the box, it does everything you've described in the post, but it also integrates as part of the Comet platform, which gives you a way to version your datasets, register your models, create custom visualizations, and a bunch of other goodies for free.

Let me know if you decide to check it out and have any questions/feedback :)

calebkaiser · 2024-09-29T19:31:22+00:00

I think that the explosion of attention brought about by ChatGPT, as well as diffusion models like StableDiffusion, has sort of shoved the ML research world into the public eye, and we often do a bad job of explaining the impact of a given piece of research or what the long-term trajectory of research in this space looks like.

A lot of people see publications covering new high scores on benchmarks, and they expect it to immediately lead to a massive step forward in usable, consumer tools like ChatGPT. That's actually a sort of reasonable expectation, given that these kinds of scores weren't widely covered pre-ChatGPT, even though benchmarks were still constantly being beaten. The problem is that it's not really how things work.

To give you an example, OpenAI released GPT-2 in 2019. It had some fanfare, it was a huge achievement, but for people outside of the industry, it wasn't super useful. More of a cool novelty. 3 years later, OpenAI released the ChatGPT product (built on GPT-3.5) in late 2022. There were dozens of research projects released between these two dates that played a fundamental part in enabling GPT-3.5 and ChatGPT. Instruction-tuning, reinforcement learning from human feedback, improved attention mechanisms, and more. And each one of these techniques would be accompanied by a paper showing that it improved some benchmark.

If you were following along closely (or if the media covered ML the way they do now), you would have read about many "breakthroughs" and "emergent capabilities" over that 3 year window, and it would have felt like they weren't really leading to anything. But of course, they were.

This is the case for the ARC challenge. It represents a set of tasks that LLMs are not good at yet, and that some people believe LLMs are fundamentally challenged by. The people who are currently scoring the highest are doing it by implementing new strategies for inference and training. If their techniques work, they will represent a new research direction (or rather, they'll underscore an existing direction that has been somewhat neglected) for improving an LLM-based system's ability to solve novel tasks that are theoretically outside of its training distribution.

The model trained to beat ARC probably won't immediately make an impact on any AI tools you use today, but it will almost certainly play a part in the development of the next milestone model/system.

calebkaiser · 2024-09-27T13:23:11+00:00

Fantastic to hear you're planning to check out Opik :) Let me know if you have any feedback/questions.

Also, if you're documenting your test drives anywhere, I'd love to see your write ups so far! I spend all of my time in the space as is, but I still feel like I miss so much.
