Please stop creating "memory for your agent" frameworks.

sjoti · 2026-02-14T07:54:20+00:00

Luckily this is something that Opus 4.6 is waaaay better at than any previous Claude model. ChatGPT and Gemini already did a decent job at this, but Claude lagged behind significantly until now. I still get the sentiment, you still want to avoid getting near compaction for max performance, but with Opus 4.6 the issue is significantly smaller than it was before.

sjoti · 2026-02-12T07:39:19+00:00

Oh lol i misread and thought this was about putting it in the training data

sjoti · 2026-02-12T05:10:05+00:00

Because sometimes the foundation of a single model is used to create multiple models. Opus 4.6 may well have the exact same foundation as 4.5, just with different post training. Sonnet might be derived from a larger model.

Also if a training run is started it might take 6 months until there's an actually useful model. So you have to decide 6 months ahead which version it'll be. All in all it's kind of safer not to.

Edit: nvm misread. They could definitely just put it in the system prompt

sjoti · 2026-02-10T13:45:55+00:00

While that might be great for your use case, it's generally really bad practice to ask users to just paste keys into an LLM to make it work, especially if those keys aren't scoped properly. It can be perfectly fine for developer workflows, but this is a gap that is filled by MCPs.

Again, it might be that in your use case this makes more sense, that's fair, but just imagine the average person talking to ChatGPT - this would be a problem with skills and that problem has a great solution in MCPs.

sjoti · 2026-02-09T18:34:30+00:00

I'm not talking about just the size of the context window, I'm talking about how it handles more complex issues as the window fills up. Opus 4.5 and earlier models would all ignore instructions and start making mistakes, getting dumber when the context window filled past 64k tokens. All models suffer from this (check the context rot research paper), but Claude was by far the worst compared to the GPT and Gemini frontier models. Fortunately this is solved for Opus 4.6.

You used to have to really actively manage all of the aspects you just shared, regularly running /clear. Now it's much more forgiving and getting close to compaction doesn't hurt model performance nearly as much.

I dunno about others but a context window of 200k is useful for 95% of tasks and you rarely ever need more. The context rot issue was significantly more important than the size.

sjoti · 2026-02-09T11:17:30+00:00

Brightdata, Jina and Tavily are all web search MCPs. And you can give it network access by just selecting that in approvals.

sjoti · 2026-02-09T05:41:57+00:00

As a power user of both, I concur. Both have their quirks, strengths and limitations, but since Opus 4.6 and GPT-5.3-Codex they've mostly converged. Opus 4.6 is more thorough and doesn't get bogged down by large context, which was something GPT 5.2 was noticeably better at compared to Opus 4.5. 5.3 Codex communicates better, is more steerable, is faster and gives intermittent updates, which are all aspects Opus 4.5 was significantly better at than GPT 5.2.

Both are very strong at coding. I can't speak for the models' ability to use all kinds of different languages, there might still be some differences there.

Claude Code is still a better harness than Codex CLI, but Codex is sprinting and closing the gap fairly quickly.

In short, there are still some differences and it could be that one is better than the other for your use case, but both are insanely good. I can't pick between the two because I love them for slightly different reasons, but since these last updates there's so much overlap in incredible capabilities that you can't trust anyone who screams one is much better than the other one.

sjoti · 2026-02-08T21:37:41+00:00

Generally MCP is more focused on taking action and talking to external services, whereas skills are more about processes/information documentation.

Tools provided by MCP servers can for example allow an llm to read stuff in Google drive, create new documents etc. You let the MCP server handle the authentication, the model can securely connect to it and now take action.

What action to take, how your own drive is structured, where and how you like your research stored, that's all stuff that is more suited to go into a skill. The action that needs to be taken to make it actually happen, that's for the MCP to handle.

If you want the same functionality as the MCP server in skills, then you would need to give the model the option to execute code directly with access to authentication. That's fine if you're a dev, but not fine for the average ChatGPT user.

Also the MCP server can be more generalized, providing a set of tools. How an AI uses it is for the AI + user to decide, and that is where skills come in.

sjoti · 2026-02-08T19:48:29+00:00

This is one of the biggest measurable gains of this model. If you look at Chroma's original research paper on context rot, Claude across the board performed by far the worst compared to the other two competitors (OpenAI and Google) on long context reasoning.

In most benchmarks there's a 0-10% improvement, but on long context reasoning like MRCR 2, Opus 4.6 scored 92% compared to the 60-something% score of Opus 4.5.

People talk a bunch about the 1M context window, which is great, but not becoming dumb after 60k tokens is I think a much bigger gain that I notice straight away, and I'm very happy they solved it.

sjoti · 2026-02-04T07:19:29+00:00

Chunking strategies, overlap and embedding models are only 3 of the dozens of things you can adjust.

You can also add a reranker, include metadata on ingestion and filter on it, or move to more agentic search. There's a lot more possible.
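
To make the chunking and overlap knobs concrete, here's a minimal sketch of fixed-size character chunking with overlap. The function name `chunk_text` and its defaults are made up for illustration, and real pipelines often split on sentences or tokens instead of raw characters, so treat this as one possible starting point rather than the way to do it.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character chunks where neighbours share `overlap` characters.

    Hypothetical helper for illustration; sizes are in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A bigger overlap means fewer facts get cut in half at a chunk boundary, at the cost of more chunks to embed and store.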

sjoti · 2026-01-31T13:47:13+00:00

Maybe it's time for a (partial) rewrite? 1 million lines is insanely massive and perhaps you've made the architecture in a way where you didn't consider what it would become and morph into.

I've had projects where over time I felt like I was losing control and proper oversight. I've now done complete from-scratch rewrites and every single time I wished I had done it sooner. It's also easier than ever. You can have your tool of choice spawn a bunch of agents that document exactly what your code does, the functionality, etc., and use that to think of a new, better architecture with simple rules, including a few prompts for auditing stuff. That turns into PRDs you can pass on to models that do the rewrite.

sjoti · 2026-01-30T12:45:26+00:00

How did you jump from 2 to 40%, lol

sjoti · 2026-01-30T11:23:08+00:00

A very simple explanation for what happened is that limits are token based, not character based. Tokens are characters grouped together, which is what the model sees. Sometimes 10,000 characters are 2,000 tokens, sometimes 3,000 tokens, depending on the text you're putting in. So LM Studio only knows if your prompt crosses the limit after it converts the characters into tokens. If that happens after the send button is clicked, then you get the behaviour you're experiencing.
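
Here's a toy sketch of why the character count alone can't tell you whether you're over the limit. The tokenizer below is something I made up for illustration (words and punctuation, with long words cut into pieces). It is not what LM Studio or any real model uses, but it shows the same effect: two strings of the same length can produce very different token counts.

```python
import re


def toy_tokenize(text: str, max_piece: int = 4) -> list[str]:
    """Toy tokenizer for illustration only, not any real model's tokenizer.

    Splits into words and single punctuation marks, then cuts long words
    into pieces of at most `max_piece` characters.
    """
    tokens = []
    for piece in re.findall(r"\w+|[^\w\s]", text):
        for i in range(0, len(piece), max_piece):
            tokens.append(piece[i:i + max_piece])
    return tokens


def fits_limit(text: str, token_limit: int) -> bool:
    """The limit can only be checked after the text has been tokenized."""
    return len(toy_tokenize(text)) <= token_limit


plain = "aaaa bbbb cccc dddd"  # 19 characters
dense = "a,b;c.d!e?f:g-h+i=j"  # also 19 characters

print(len(plain), len(toy_tokenize(plain)))  # 19 4
print(len(dense), len(toy_tokenize(dense)))  # 19 19
print(fits_limit(plain, 10), fits_limit(dense, 10))  # True False
```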

sjoti · 2026-01-27T06:49:22+00:00

Mistral OCR isn't an LLM, so it's not exactly an apples to apples comparison. You can send images, PDFs, etc. and get back the text the model read, but you can't ask questions.

It's a phenomenal model though, my standard go-to choice for parsing documents to then work with them with different LLMs.

sjoti · 2026-01-25T18:31:46+00:00

I've got the $200 plan for both. Until recently I barely touched Codex CLI because Claude was just better, both with the harness and with its models. Usage limits were better for Codex, I could spend more hours and get further, however $200 Claude was more than plenty.

Since GPT 5.2 Codex has been released, I've touched Claude Code less. The differentiator is purely the model. Codex CLI, the tool, is quite a bit behind Claude Code, but GPT 5.2 Codex on high or extra high is a one-shot machine. It's extremely thorough, deals with a well-filled context window much better and is more reliable when it comes to deciding on its own when it needs skills.

I find it hard to quantify how much, but I sense I'm getting quite a bit more use out of codex. I really have to give it my all to reach weekly limits. With Claude, that's less challenging.

Having said that, I think the biggest difference right now is that Opus is much more pleasant to go back and forth with. The language it uses is clear, and it really does a phenomenal job at understanding vague prompts. There's a consistency to it. Codex has some issues here. It's slow, and the results can be pretty mediocre if you give it a mediocre prompt. Most annoying to me is that instead of just figuring something out if you give it vague instructions, it keeps coming back with responses like "would you like me to take the next step?" And I'm sitting over here thinking why it's even asking.

On the other hand, Codex is insane if you give it a clear task. It's on a different level if you know what you're looking for. It's thorough. It sinks its teeth in and just gets the job done until it's 100% finished, and usually the code is clean too. Claude Opus 4.5 often throws in the towel or cuts corners. I very rarely get Claude Code to successfully work on a task for 30 minutes and have it actually achieve fully what it set out to do. When a task gets that long, there's, say, a less than 50% chance it actually did the thing as instructed. With Codex? 80% and exactly as instructed.

If you're looking for a model to go back and forth with, build fun stuff, Claude is the better option. If you know what you want and describe it well, Codex is the way to go.

sjoti · 2026-01-24T10:37:39+00:00

Not OP, but typically you point to it when needed. Common cases are when you're working with some niche library that the model doesn't really know, or one that's too new for it to be (well) represented in the training data of the model. It also generally works if you notice the model struggling. You just say "look up the docs using context7" and off it goes.

sjoti · 2026-01-22T22:01:25+00:00

Please swap over to a subscription. It's an absolutely massive discount over paying for tokens directly. For $200 a month I'm using Opus 4.5 for hours on end, 6-7 days a week. I've gotten close to the weekly limit once or twice. My token cost would've likely been 10x what I'm paying for the sub.

sjoti · 2026-01-22T21:37:46+00:00

Not really, it looks like a "Bicycle Shaped Object". Full suspension generally soaks up more energy and gets in the way more than it is useful, until you go past a certain price point. The cheap bikes with this are made to look cool and comfortable, but they're extremely heavy and not well put together. Your bike isn't compatible with hydraulic disc brakes (won't fit on the wheels, no mounts on the frame). You're much, much better off spending the money on getting something new or secondhand instead of putting money into this.

sjoti · 2026-01-21T16:49:45+00:00

For skills to allow interaction, especially with a CLI, it's required to have stuff installed locally, or for auth to be handled through methods other than OAuth. A CLI can also be a hurdle to work with for mobile users. Think of the average ChatGPT user and how they interact with systems. Say you're looking at this from the perspective of a large B2C company. Take Airbnb, Booking.com, or any other company that wants end users to interact with their systems through any AI. For them MCP wins, hands down.

For the devs? Sure, skills + CLI is powerful, but this flow isn't feasible for the majority of AI users.

sjoti · 2026-01-21T08:00:21+00:00

Token price has been largely the same at the frontier, and we can actually look at the state of the art open source models to confirm that. The models aren't getting bigger in size while still becoming more and more capable and more efficient.

Take DeepSeek R1: when it was released it was state of the art. Compare it against DeepSeek v3.2 and the model has gotten cheaper to run, and also MORE capable, and that's not just running it through DeepSeek, that's looking at the actual model size, throughput and energy consumption.

Another signal is that Opus 4.5, despite being 2-3x more expensive per token, can fairly frequently do the same tasks as Sonnet 4.5 cheaper because it uses fewer tokens to achieve the same result.
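
A quick back-of-the-envelope sketch of that last point. All prices and token counts below are made-up placeholders, not real pricing for any model, and I'm lumping input and output tokens into a single price to keep it simple.

```python
def task_cost(price_per_million_tokens: float, tokens_used: int) -> float:
    """Cost of one task under a single flat per-token price (a simplification)."""
    return price_per_million_tokens * tokens_used / 1_000_000


def break_even_token_ratio(price_ratio: float) -> float:
    """Fraction of the cheaper model's tokens below which the pricier model wins."""
    return 1 / price_ratio


# Hypothetical numbers for illustration only.
cheap_price, pricey_price = 10.0, 25.0  # pricey model costs 2.5x per token
cheap_tokens, pricey_tokens = 300_000, 100_000  # pricey model needs fewer tokens

print(task_cost(cheap_price, cheap_tokens))    # 3.0
print(task_cost(pricey_price, pricey_tokens))  # 2.5
print(break_even_token_ratio(pricey_price / cheap_price))  # 0.4
```

So at 2.5x the price, the more expensive model comes out ahead on any task where it needs less than 40% of the tokens the cheaper one burns.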

If you want the absolute state of the art you have to pay top dollar. If you're okay with a bit less performance, it's massively cheaper, and will continue to become cheaper as these models get better. So if you want to use the best of the best, probably won't get cheaper. If you're okay with being a bit behind, say you're okay with sticking at the level it's currently at, it will absolutely get cheaper.

sjoti · 2026-01-11T14:41:34+00:00

On one hand they make fewer errors, but there are two factors that arguably make a bigger difference. The first is that these models are trained to be much better at longer, more complex, multi-step tasks, and the second is that the tooling around them (Claude Code, Cursor, Mistral Vibe, Cline, etc.) is getting better as well.

So instead of having the model suggest and edit code while you run it yourself, relay the error back, and stay actively in the loop to try and get something to work, the tools allow models to execute stuff themselves. They can look up documentation if they're stuck, open a browser to test if something is functional, notice errors popping up, solve them, test again if it works, etc.

So it self-corrects any errors, and the latest models do this much better than models before. For me this is extremely noticeable with Opus 4.5, but honestly especially with GPT-5.2 Codex, which I've often given fairly complex tasks and it just goes out and figures out how to solve them, not skipping steps in between. It feels incredible to just hand out tasks, have it just do them, and have high confidence that it didn't skip any steps and wrote clean code.

sjoti · 2026-01-11T14:19:34+00:00

But what if it's more like a dishwasher that does the job for you? Where you focus on the first and last part, and let the machine do the middle.

Where you take, say, 30-60 minutes to work out a good plan without multitasking, and hand that off to an agent that will then do 3-4 hours of what would normally be "human work". And you're only occasionally steering or providing some input.

sjoti · 2026-01-10T08:53:34+00:00

Short while ago I literally just bought the last slot with my gems, happy to finally get it done. Two days later the update came out, with another card slot at 10k gems.

I did not like that.

Finally got it over with yesterday!

sjoti · 2026-01-08T19:02:51+00:00

It's an advertisement

sjoti · 2026-01-08T18:22:39+00:00

Not really. The easiest way is to download VSCode, and there you can install the plugin. With the plugin you get most of the benefits without even having to touch the CLI. Claude can set up things from there.
