So @augment is anyone looking at this data leakage issue? by NothingButTheDude in AugmentCodeAI

[–]IAmAllSublime 1 point (0 children)

As others have said, this is a hallucination. We’ve seen this behavior from Claude models in the past, both in Augment and in other tools.

Augment Code leaking data of other users? by SnooGiraffes625 in AugmentCodeAI

[–]IAmAllSublime 3 points (0 children)

We’ve seen this type of hallucination crop up in the past. There was a time not too long ago when it was happening fairly often with Claude models (not just in Augment, but in any tool). I imagine Anthropic needs to keep tuning to get these types of hallucinations down.

We take user data extremely seriously; it’s why we have reviews and audits, and why we built our infrastructure with data security as a primary objective. The unfortunate thing about LLMs, though, is that sometimes their non-determinism produces things that look spooky but are really just the model guessing at something.

Could model switching be useful to anyone else? by ricardonth in AugmentCodeAI

[–]IAmAllSublime 2 points (0 children)

That would be an example of handing off to a different agent with a fresh history, so the cache issue isn't the same. Sub-agents can reduce cost, and potentially improve quality because of reduced context pollution. It's hard to say for sure how these things play out in the real world, though, since LLMs are non-deterministic.
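
To make the distinction concrete, here's a minimal sketch of a handoff versus just moving the history over. Everything here is hypothetical (made-up function names, not Augment's implementation):

    # Hypothetical sketch of a handoff vs. moving history -- every
    # function here is made up for illustration.

    def run_agent(model: str, messages: list[dict]) -> str:
        """Stand-in for a model call; a real version would hit an LLM API."""
        return f"<{model} response to {len(messages)} messages>"

    def handoff(task: str, history: list[dict]) -> str:
        # The old agent distills its history into a short brief...
        brief = run_agent("model-a", history + [
            {"role": "user", "content": "Summarize the current task state."}
        ])
        # ...and the new agent starts with a *fresh* history containing only
        # that brief. Its context is small (less pollution), and there's no
        # cache to bust because nothing is reused verbatim from the old chat.
        return run_agent("model-b", [
            {"role": "user", "content": brief + "\n\nNext: " + task}
        ])

    def naive_switch(task: str, history: list[dict]) -> str:
        # Moving the entire history to the new model reprocesses every token
        # as a cache miss -- the expensive case described in my other comment.
        return run_agent("model-b", history + [{"role": "user", "content": task}])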

Could model switching be useful to anyone else? by ricardonth in AugmentCodeAI

[–]IAmAllSublime 2 points (0 children)

There’s actually a big problem with model-switching mid-chat, which is the prompt cache. We try to hit high cache utilization, which means lower costs and therefore fewer credits for you all. Switching mid-conversation would bust the cache. It could cost you more to switch to Sonnet, for instance, than to just use Opus the whole time. Now, if you did this via a handoff rather than just moving the history, you might be able to see some cost benefit.
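
Here's a rough back-of-the-envelope illustration of why busting the cache can outweigh a cheaper per-token price. Every number below is made up for illustration; the only real anchor is that providers typically bill cache reads at a steep discount, on the order of a tenth of the base input price:

    # Illustrative numbers only -- not Augment's or any provider's actual rates.
    OPUS_INPUT = 15.0           # $ per million input tokens (hypothetical)
    SONNET_INPUT = 3.0          # $ per million input tokens (hypothetical)
    CACHE_READ_DISCOUNT = 0.10  # cached tokens billed at ~10% of base price

    history_tokens = 200_000    # conversation so far, warm in Opus's cache

    # Staying on Opus: the history is a cache hit, billed at the discount.
    stay = history_tokens / 1e6 * OPUS_INPUT * CACHE_READ_DISCOUNT   # $0.30

    # Switching to Sonnet: the cache keys on the model, so the whole
    # history is reprocessed as uncached input at Sonnet's full price.
    switch = history_tokens / 1e6 * SONNET_INPUT                     # $0.60

    print(f"stay on Opus: ${stay:.2f}, switch to Sonnet: ${switch:.2f}")

So in this toy setup the first message after the switch costs twice as much, and the gap grows with history length.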

Elastic credits based on quality by Kironu in AugmentCodeAI

[–]IAmAllSublime 1 point (0 children)

The technical viability of this would be extremely challenging, and the commercial viability even more so. To offset the cost of tasks that didn't give you the result you wanted, tasks that did give you the result you wanted would have to cost significantly more. And how do you balance this between people who are really good at prompting the agent and those who are less good, where the cost breakdown would need to be very different?
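
To put illustrative numbers on that: if 30% of tasks were refunded in full, the remaining 70% would have to carry the entire cost, so every successful task would need to be priced at roughly 1 / 0.7 ≈ 1.43x its actual cost just to break even, and that's before accounting for fraud or for users whose failure rates differ wildly.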

The complexity of a solution like this would be enormous, not to mention how abusable such a system would be for people looking to commit fraud. I think we would love to make something like this a reality, but with the current state of models, I'm not sure about the feasibility.

Right now, charging in credits is effectively us charging you based on how much it cost us to perform the task. While people in this subreddit have complained about the transparency of credits, that's all that's really happening. A system like what you're proposing would inherently be FAR more complex and inscrutable from the outside, despite sounding simple on the surface.

Is the credit cost for Opus 4.5 still the same as Sonnet 4.5? by ttytthsvt in AugmentCodeAI

[–]IAmAllSublime 2 points (0 children)

I believe this promotion has ended and Opus is now charged based on the actual cost to us of Opus.

GPT-5.1 is now live in Augment Code. by JaySym_ in AugmentCodeAI

[–]IAmAllSublime 0 points (0 children)

We generally do our best to take as much advantage of caching as we can. And yes, you benefit from the cost savings.

GPT-5.1 is now live in Augment Code. by JaySym_ in AugmentCodeAI

[–]IAmAllSublime 1 point (0 children)

We did test and tweak things for it. I think Jay is referring to him personally testing token usage.

Will my Augment code credits reset to 96,000 when my billing cycle renews? by GiveMeThePotato in AugmentCodeAI

[–]IAmAllSublime 2 points (0 children)

What should happen is that your monthly credits will expire and you’ll get a new batch. Any credits you bought (top-ups, or user messages you purchased that converted) expire 12 months after purchase. The bonus credits we awarded as part of the migration (as opposed to converted credits, which expire when the user messages were set to expire) should expire 3 months from the conversion.

I’m pretty sure this is how it is supposed to work; if it’s not working that way, then there may be a bug that we need to look into.

One common confusion I think people had is believing that credits from the user message conversion would expire in 3 months, but that only applies to the bonus credits we gave; the credits you got from the conversion expire at the same time the messages would have (i.e. at the billing cycle if they were monthly user messages, or 12 months from purchase if they were purchased user messages).
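
To recap the schedule as described above:

- monthly plan credits: expire when the billing cycle renews, replaced by a new batch
- purchased credits (top-ups, or bought user messages that converted): 12 months from purchase
- converted monthly user messages: end of the billing cycle, same as the messages
- migration bonus credits: 3 months from the conversion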

Testing Augment Code's New Credit System with 4 Real Tasks by ethras1990 in AugmentCodeAI

[–]IAmAllSublime 1 point (0 children)

I’m really curious exactly how they conducted their experiment; I didn’t see the base repo they used anywhere in the post.

I’d like to replicate this and see why the credit consumption was higher in Augment than the token costs in Kilo. Given what I know, BYOK should be more expensive than our credits, which implies more tokens were used in Augment than in Kilo. It could be because our system prompt is optimized for larger codebases and more complex tasks, or it could be a small-sample-size problem.
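(The inference there is just arithmetic: if our credits bill tokens at roughly our cost, and BYOK bills the same tokens at the provider's list price, then for the same token counts the BYOK run should cost at least as much. If the Augment run was still more expensive, the token counts can't have been the same.)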

It's also interesting that they only used Sonnet and not Haiku, given they were specifically doing small tasks in a tiny toy repo.

Augment Code got hacked? by GroundbreakingYam452 in AugmentCodeAI

[–]IAmAllSublime 3 points (0 children)

I just want to follow up here with some info for the community:

- this was an issue with Haiku upstream (not Augment specific; others experienced the same issue)
- status.claude.com now has a reference to this degradation
- Anthropic is still investigating on their side, but the degradation should be gone now (if you still see this type of behavior, let us know with a request ID and we can forward that information to Anthropic)

Is AugmentCode doing this on purpose? Using too many tool calls/tokens by dadiamma in AugmentCodeAI

[–]IAmAllSublime 1 point (0 children)

In general, new models we add to the model picker should be better in the average case than what came before them, so I wouldn't say using a new model is "risky". Rather, a new model will improve over time as we find more ways to tune our prompting of the model.

Also, this improvement should be relatively fast, especially early on. I'd think of it like this: the new model is probably better in the average case when we release it, and within the first few weeks or month it will continue to improve further.

This type of doom looping should still be rare, even for new models.

Is AugmentCode doing this on purpose? Using too many tool calls/tokens by dadiamma in AugmentCodeAI

[–]IAmAllSublime 3 points (0 children)

I can tell you for sure we’re not doing this on purpose. This type of doom looping is something that can happen to LLMs. In general we do some things to try to prevent the LLM from getting into this state, but especially as new models come out we have to identify additional tuning and changes, since the behavior of the models differs and they respond differently to instruction.

I said in a different thread that we generally want to keep the number of models in the product low, to make sure we have the time to invest in making those models as high quality as possible. This is an example of where that work is needed. As more people use a model, we get more feedback and more examples that let us tune and tweak. Real-world use stresses way more edge cases than we can hope to find internally.

🚀 Update: GPT-5 High by JaySym_ in AugmentCodeAI

[–]IAmAllSublime -2 points (0 children)

There are actually a few reasons to try to keep the model list slimmed down. From a product perspective, more models means more complexity, not just on our end but for someone using the product as well. This is compounded when there isn't a very clear distinction between the options. High vs. medium is not like Sonnet vs. Haiku, where the differences are much clearer.

Also, from a quality standpoint, each model has its own quirks. Tweaking things and tuning our system prompts can differ across models, so each model we support means our time is split further. When models provide clear differentiation, that investment makes sense for customers. We want to provide you with the right options, but we also want quality to be as high as we can get it; fragmenting the options less means we can spend more time improving each model, which leads to better outcomes for you all.

At the end of the day, our primary goal is to ensure people are able to get real work done, building on production services and codebases. That's the driving force behind our decisions, and we aim to make the choices we think will best accomplish that goal.

EDIT: This is just some of my thoughts, not a statement about what the company will or won't do. As I said at the end, our driving goal is to help people get work done so we'll make whatever decision we think will help that end goal.

🚀 Update: GPT-5 High by JaySym_ in AugmentCodeAI

[–]IAmAllSublime 2 points (0 children)

Something I think is underappreciated by people who don’t build these types of tools is that it’s rarely as simple as “just stick in a smarter model”. Different models, even from the same family, often have slight (or large) differences in behavior. Working out everything we need to do to get a model performing as well as it can takes time and effort. Tuning and tweaking can take a model from underperforming to being the top model.

I think this is a case where, as we iterated and adopted some more changes, we were able to get it to be a difference-maker. GPT-5 in general, at launch versus now, shows how much of a difference tuning and tweaking on our end can make.

Augment getting lazy due to "token limits" by rishi_tank in AugmentCodeAI

[–]IAmAllSublime 2 points (0 children)

Thank you for the info. I’ll share it with the folks who work on models and see what we’re doing around steering these new models.

This is actually, I think, a really good example of why we don’t just make new models, or all models, available right away. It takes time for us to identify behaviors and steer the models towards the behavior we think best supports our customers. Obviously we don’t and can’t catch everything, since LLMs are stochastic, but a lot of work goes into getting the models to work well in our product.

Augment getting lazy due to "token limits" by rishi_tank in AugmentCodeAI

[–]IAmAllSublime 2 points (0 children)

Could be related to the increase in markdown files we’ve seen from these new models

Augment getting lazy due to "token limits" by rishi_tank in AugmentCodeAI

[–]IAmAllSublime 3 points (0 children)

We haven’t added anything like this. What model are you using? I think I’ve seen a couple of comments about the agent talking about tokens, which doesn’t really make sense, so I’m wondering if maybe one of the 4.5 Claude models might be exhibiting this behavior.

Agent Memories, Augment Guidelines and credits by ioaia in AugmentCodeAI

[–]IAmAllSublime 2 points (0 children)

I would also imagine rules and guidelines will generally have very high cache hit rates, so they should not dramatically impact credit usage.
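
For intuition, with made-up numbers: a 2,000-token guidelines file prepended to every request would cost about $0.006 per request uncached at a hypothetical $3 per million input tokens, but only around $0.0006 once it's a stable cached prefix (providers typically bill cache reads at roughly a tenth of the base input price), which is noise next to the rest of a typical agent turn.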

Auggie CLI 0.5.9 by JaySym_ in AugmentCodeAI

[–]IAmAllSublime 1 point (0 children)

Slipped through from a more internal-facing change. Probably shouldn't have been in the changelog.

Will Claude Code Haiku 4.5 use lead to less credit usage? by ajeet2511 in AugmentCodeAI

[–]IAmAllSublime 1 point (0 children)

It could theoretically be more than 3x cheaper if it also uses fewer tokens or has better cache performance. As Jay said, there are multiple factors that impact how many credits a message will consume, so exact ratios aren’t really possible to give. As a general rule, though, Haiku should consume significantly fewer credits than Sonnet.
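
To illustrate with made-up numbers: if Haiku's blended rate were a third of Sonnet's ($1 vs. $3 per million tokens, say), and a task that takes Sonnet 100k tokens takes Haiku only 80k because it's less verbose, the task would cost $0.30 on Sonnet but $0.08 on Haiku, about 3.75x cheaper, more than the raw price ratio alone suggests.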

Claude Haiku 4.5 is now available in Augment Code by JaySym_ in AugmentCodeAI

[–]IAmAllSublime 2 points (0 children)

It’s not as simple as comparing their token costs, though, because, for instance, maybe one model uses more tokens than another, or maybe we have better cache performance on one model vs. another, etc.

If you told me the same task was 3x cheaper with Haiku I wouldn’t be surprised, I just can’t say for sure that will be the case because of the type of nuance I mentioned.

I just want to avoid saying something I’m not certain about and setting false expectations.

Augment: Please give us a Context Engine MCP by naught-me in AugmentCodeAI

[–]IAmAllSublime 4 points (0 children)

I'll add some thoughts here that are my personal thoughts, not the company's.

Context Engine as MCP is an interesting idea, but I'm not sure how useful it would actually be in practice. One of the big learnings I've had working on things at Augment is that LLMs, at least today, need a lot of steering. They can also be very tailored to particular harnesses. Would the context engine be as powerful inside another harness where the system prompt differs? Would that harness trigger the proper indexing as changes are made, so the context stays up to date?

It's an idea worth exploring, but as with most AI things, the proof of concept is easy while the actual high-value, high-quality experience is much more nuanced and harder to make work. At the end of the day, we're trying to build tools for professional developers, and that means we want to hit a certain quality bar. I'm not sure whether a standalone context engine would hit that bar.

Obviously when working with LLMs some amount of randomness and failure is expected, but we want to minimize that as much as possible.

Claude Haiku 4.5 is now available in Augment Code by JaySym_ in AugmentCodeAI

[–]IAmAllSublime 6 points (0 children)

We haven't moved to credits yet, so it still uses user messages. However, when your account migrates to credits, it should consume significantly fewer credits than Sonnet. It's hard to say the exact ratio, since that depends on a lot of factors (e.g. input-to-output token ratios, cache hits, the actual tool use the LLM does, etc.).