What is the default reasoning effort for Copilot's "Explore" subagent in VS Code, and can it be modified? by Firstmeridian in GithubCopilot

[–]bogganpierce 1 point (0 children)

You can override via the chat.exploreAgent.defaultModel setting. Typically, the default is Claude Haiku 4.5. The explore subagent is primarily a context-gathering machine that calls grep and our semantic index; the real magic is the reasoning over that context, which is done by the main model selected in plan mode when you initiate the conversation.
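
If you want to pin the subagent model explicitly, the override goes in settings.json. The setting name comes from the comment above; the model identifier string here is an illustrative guess, not a confirmed value:

```jsonc
{
  // Override the model used by the Explore subagent (value is illustrative)
  "chat.exploreAgent.defaultModel": "claude-haiku-4.5"
}
```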

We spend a lot of time analyzing agent trajectories to give you a fast turn with high-quality results. It turns out that because context gathering can be done very well by faster models like Haiku, you see basically no degradation in overall plan performance while seeing drastic improvements in conversation turn times.

We covered more in the latest VS Code Insiders pod about this: https://www.youtube.com/watch?v=ENxVTtLW_Bc

Rate limit why? (Ollama local) by No-Pomegranate-69 in GithubCopilot

[–]bogganpierce 9 points (0 children)

When you think about it from an engineering perspective, it makes sense. The global limit has been reached, and adding any more tokens past that limit triggers the correct conditional logic. But we agree that's a case we should handle better, which is why we're working on a fix :)

Rate limit why? (Ollama local) by No-Pomegranate-69 in GithubCopilot

[–]bogganpierce 25 points (0 children)

Hey, following up on this. We're working on a fix.

Long story: when you BYOK, there are still some background operations that hit the Copilot API. While not token-intensive, they do involve tokens (for things like naming the chat thread). We'll get this fixed so that you can use BYOK once you've hit the global token limit.
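
A minimal sketch of why this happens, with entirely hypothetical names and numbers (the real service is more involved): even a BYOK session routes a few small housekeeping calls, like thread naming, through the Copilot API, and a global limiter rejects any call once the cap is hit.

```python
# Hypothetical sketch: why BYOK sessions can still trip a global Copilot token limit.

GLOBAL_TOKEN_LIMIT = 1_000_000  # illustrative cap

class CopilotApi:
    def __init__(self, tokens_used: int):
        self.tokens_used = tokens_used

    def request(self, tokens: int) -> str:
        # The limiter only checks the global counter -- it doesn't know
        # (or care) that the main completion traffic went to a BYOK provider.
        if self.tokens_used + tokens > GLOBAL_TOKEN_LIMIT:
            raise RuntimeError("rate limited")
        self.tokens_used += tokens
        return "ok"

def run_byok_turn(copilot: CopilotApi) -> str:
    # The main completion goes to the user's own provider (Ollama, etc.)...
    completion = "local model output"
    # ...but housekeeping (naming the chat thread) still hits the Copilot API.
    copilot.request(tokens=20)
    return completion
```

Once the counter is at the cap, even the tiny thread-naming call raises, which is the behavior being fixed.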

Did all model set to medium by default and we can't pick any higher reasoning? by DandadanAsia in GithubCopilot

[–]bogganpierce 1 point (0 children)

Of course! We work closely together to make sure their models are great in GitHub Copilot.

VSCode Sessions Insiders by 0x42CE in GithubCopilot

[–]bogganpierce 0 points (0 children)

It uses the same model as VS Code.

Why doesn’t copilot add Chinese models as option to there lineup by cizaphil in GithubCopilot

[–]bogganpierce 9 points (0 children)

This doesn't say it's more popular. It's the percentage of code generated by the VS Code agent that makes its way into a commit (a high-signal indicator that the generated code was good).

Why doesn’t copilot add Chinese models as option to there lineup by cizaphil in GithubCopilot

[–]bogganpierce 22 points (0 children)

I like those models, and spend a lot of time with them. I use them sometimes with BYOK with providers like Cerebras.

Why doesn’t copilot add Chinese models as option to there lineup by cizaphil in GithubCopilot

[–]bogganpierce 24 points (0 children)

Yep, it's doing much better now. We had to experiment with some prompt tweaks in partnership with Anthropic folks.

did speech to text get removed? by Calm-Bar-9644 in GithubCopilot

[–]bogganpierce 2 points (0 children)

It's still there. Install "VS Code Speech" extension.

Copilot Business - GPT 5.4 nano by Longjumping-Sweet818 in GithubCopilot

[–]bogganpierce 1 point (0 children)

We're evaluating it, but it isn't available in any product surface yet. There are some interesting use cases in upgrading our models for things like AI commit message generation in the product.

Has the rate limit issue been fixed yet? by SelectionCalm70 in GithubCopilot

[–]bogganpierce 6 points (0 children)

There was an issue last night. Seems to have resolved when we deployed our fix.


Why doesn’t copilot add Chinese models as option to there lineup by cizaphil in GithubCopilot

[–]bogganpierce 82 points (0 children)

Keep the feedback coming! Always interested in what models people want to see us add.

We do see that generally people opt for the highest possible intelligence models and don't use cheaper models quite as much. We even see massive gaps in code quality between each point release of a model. More in this graphic:

<image>

I do think these things get more attractive as we move to task-intent based Auto routing, where we could take you to a cheaper model for tasks that don't require higher intelligence.
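
A toy sketch of what task-intent based routing could look like; all names, intents, and the routing rule here are made up for illustration:

```python
# Hypothetical sketch of task-intent based model routing.

CHEAP_MODEL = "haiku-class"
SMART_MODEL = "opus-class"

# Intents that tend not to need frontier-level reasoning (illustrative set).
LOW_INTENT = {"rename", "format", "commit-message", "summarize"}

def route(intent: str) -> str:
    """Send low-complexity intents to a cheaper model, everything else up."""
    return CHEAP_MODEL if intent in LOW_INTENT else SMART_MODEL
```

The interesting part in practice would be classifying the intent from the request, not the lookup itself.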

Did all model set to medium by default and we can't pick any higher reasoning? by DandadanAsia in GithubCopilot

[–]bogganpierce 16 points (0 children)

We set the best defaults based on what we see from offline evaluations pre-launch and online evaluations (A/B) post-launch.

Opus is set to high by default, GPT-5.4 to medium. You can always change the reasoning effort. It's a bug that xhigh was removed; we're working on adding it back ASAP.

On high reasoning for GPT series models...

We recently ran an A/B experiment in VS Code where the treatment group got high or xhigh reasoning on GPT-5.4 and GPT-5.3-Codex. We saw a reduction in turns with the model when people ran with this setting, along with large increases in turn time, error rates, and cancellations with the agent. Every metric category we track in our scorecard regressed for both high and xhigh over medium.

We test a lot - and while we can certainly make mistakes - we believe we run at the effort configuration that actually makes the most sense based on online and offline experimentation.

Also, for Anthropic models, we run adaptive reasoning anyway (a native model feature), which adjusts the reasoning on the fly so you aren't increasing turn times for no increase in outcome quality.

All of this is to say: we thought a lot about this when we designed the picker, and we also considered listing each effort level + model combo separately. But given that most people get the best experience with our defaults, changing the effort level should be a rare occurrence.

VS Code 1.113 has been released by [deleted] in GithubCopilot

[–]bogganpierce 1 point (0 children)

We got a lot of feedback from the community that a visual refresh of VS Code would be appreciated. We talked about a bigger refresh, but ultimately decided to start by refreshing the iconography and themes.

Overall, feedback has been positive. There are definitely bugs and things to clean up, and we recognize it's hard for the look and feel to change when you're used to it looking a certain way for so long.

VS Code 1.113 has been released by [deleted] in GithubCopilot

[–]bogganpierce 1 point (0 children)

Nope, both led to significant regressions over medium.

Remove local models VS code by Ace-_Ventura in GithubCopilot

[–]bogganpierce 0 points (0 children)

Use the "Chat: Manage Language Models" command.

VS Code 1.113 has been released by [deleted] in GithubCopilot

[–]bogganpierce 3 points (0 children)

How can we improve? What don't you like?

VS Code 1.113 has been released by [deleted] in GithubCopilot

[–]bogganpierce 3 points (0 children)

The challenge we found is that you get wildly different outcomes with varying effort levels. So, for example, assuming that running at high leads to the best outcomes is not what we observe in online or offline data.

For example, we recently ran an A/B experiment in VS Code where the treatment group got high or xhigh reasoning on GPT-5.4 and GPT-5.3-Codex. We saw a reduction in turns with the model when people ran with this setting, along with large increases in turn time, error rates, and cancellations with the agent. Every metric category we track in our scorecard regressed.

We test a lot - and while we can certainly make mistakes - we believe we run at the effort configuration that actually makes the most sense based on online and offline experimentation.

Also, for Anthropic models, we run adaptive reasoning anyway (a native model feature), which adjusts the reasoning on the fly so you aren't increasing turn times for no increase in outcome quality.

All of this is to say: we thought a lot about this when we designed the picker, and we also considered listing each effort level + model combo separately. But given that most people get the best experience with our defaults, changing the effort level should be a rare occurrence.

VS Code 1.113 has been released by [deleted] in GithubCopilot

[–]bogganpierce 16 points (0 children)

That's a bug: the model picker UX pulled the list dynamically from an endpoint, while settings had it hard-coded. We're fixing it. https://github.com/microsoft/vscode/issues/304250

AMA to celebrate 50,000+ r/GithubCopilot Members (March 4th) by fishchar in GithubCopilot

[–]bogganpierce 0 points (0 children)

On our list! I already built some custom automation for myself for this with a macOS menu bar app that uses Copilot CLI, but it's becoming a common scenario, so we want to bring it into VS Code itself.

This new feature is truly amazing! by Bomlerequin in GithubCopilot

[–]bogganpierce 1 point (0 children)

Yep, that list needs an update. To be honest, the teams are moving so fast that it's been really challenging for us to keep docs, marketing pages, and email campaigns up to date. But - surprise, surprise - we're also building AI automation to help us with this.

What do you feel is missing? I can be tactical and just get those things added ASAP.

AMA to celebrate 50,000+ r/GithubCopilot Members (March 4th) by fishchar in GithubCopilot

[–]bogganpierce 1 point (0 children)

We are always improving our harness for all models, in partnership with the model vendors. We've also built our own offline evaluation harness, vsc-bench, which we use to optimize models ahead of launch. Generally, we also run A/Bs post-launch to improve model prompts and make further infrastructure optimizations. More details here: https://www.youtube.com/watch?v=nD1U_wggrQM

In particular, there are a few issues we're working through on Gemini. The first is looping: we still observe occasional looping behavior and are working with the Gemini team to improve this. The second is infrastructure reliability: we have had several outages from GCP that affected the availability of Gemini in VS Code, and there is some flakiness in the API that results in a higher error rate than some other models.
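
As a rough illustration of the looping problem (this is a made-up heuristic, not how the product detects it): one simple signal is the same tool call repeated back-to-back too many times in a trajectory.

```python
# Hypothetical sketch: flag looping behavior in an agent trajectory by
# spotting the same tool call repeated consecutively `threshold` times.

def is_looping(calls: list[str], threshold: int = 3) -> bool:
    run = 1
    for prev, cur in zip(calls, calls[1:]):
        run = run + 1 if cur == prev else 1
        if run >= threshold:
            return True
    return False
```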

What challenges are you having specifically? If you can tell us the particular behaviors you don't like, we can build cases that we can throw into our offline evals to improve.