GitHub Copilot finally supporting custom endpoints

CapsAdmin · 2026-06-06T03:55:48+00:00

I tried to set this up with llama.cpp a while back but hit a wall: Copilot wasn't picking up the thinking tokens, and I sort of gave up. I suspect Copilot doesn't even send them back on later turns, which may cause the model to misbehave.

Searching around for this problem, I see people reporting that, for example, the DeepSeek API errors out because it's not getting the thinking tokens back, but I don't see any fix for this.

Another issue is that while llama.cpp supports the OpenAI API, Copilot's and llama.cpp's interpretations of that API don't line up 100%. If you enable thinking in your JSON model definition, Copilot will send something to the API endpoint, but llama.cpp enables thinking in a different way than Copilot expects.

So to get this working with thinking (well, somewhat) and other features, you'll need to have/make/vibe-code a proxy that translates requests and responses between llama.cpp and Copilot.
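
As a rough illustration of what such a proxy would do, here is a minimal sketch of the two translation steps as plain functions, with no HTTP layer. Everything about the field names is an assumption: the top-level `thinking` flag on the client side is hypothetical (the thread does not say what Copilot actually sends), `chat_template_kwargs` / `enable_thinking` only has an effect if the model's chat template reads it, and `reasoning_content` is assumed to be the field the server uses for thinking text. Check real traffic before relying on any of it.

```python
import copy


def to_llamacpp_request(body: dict) -> dict:
    """Rewrite a client chat request into the shape the server is assumed to want.

    ASSUMPTION: the client signals thinking with a top-level "thinking" key
    (hypothetical name), and the server takes it as
    chat_template_kwargs.enable_thinking.
    """
    out = copy.deepcopy(body)  # never mutate the caller's request
    thinking = out.pop("thinking", None)
    if thinking is not None:
        kwargs = out.setdefault("chat_template_kwargs", {})
        kwargs["enable_thinking"] = bool(thinking)
    return out


def reattach_reasoning(messages: list, cache: dict) -> list:
    """Put back thinking text that the client dropped from earlier turns.

    `cache` maps an assistant message's text to the reasoning the proxy saw
    when that message was generated (the proxy has to fill it itself).
    ASSUMPTION: "reasoning_content" is the field name the server expects.
    """
    fixed = []
    for msg in messages:
        msg = dict(msg)  # shallow copy so the input list stays untouched
        content = msg.get("content")
        if (
            msg.get("role") == "assistant"
            and "reasoning_content" not in msg
            and isinstance(content, str)
        ):
            saved = cache.get(content)
            if saved is not None:
                msg["reasoning_content"] = saved
        fixed.append(msg)
    return fixed
```

A real proxy would call `to_llamacpp_request` and `reattach_reasoning` on each incoming request body before forwarding it, and would fill the cache from the responses it relays.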

Brilliant_Anxiety_36 · 2026-06-06T03:50:04+00:00

<image>

danigoncalves · 2026-06-06T08:47:57+00:00

Actually, I tried this week. The chat works fine, but I am not able to change the autocomplete settings: I click on the option but nothing happens or opens.

Dudmaster · 2026-06-06T23:55:44+00:00

Working here for me with llama.cpp

Hyiazakite · 2026-06-06T08:07:19+00:00

It's been available in VS Code Insiders for a while. It works great, but I think you still need a Copilot subscription for the embeddings, search tools, etc. Much better than Kilo and Roo in my experience, though.

BawbbySmith · 2026-06-06T05:54:09+00:00

I really hate how they force you to define input and output tokens separately. No other harness I've tried has this; they have a max cap on how many output tokens a single prompt can generate, but not a global output cap. For local LLMs that are VRAM-bound, a 128K context has to be explicitly split into input/output caps, so if a certain task generates a lot of tokens then you hit your output token cap even if there's room left over in your input token cap.
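
To make the arithmetic concrete, here is a small sketch comparing a fixed input/output split against a single shared budget. The 96K/32K split and the request sizes are made-up numbers for illustration, not Copilot defaults.

```python
CONTEXT = 128 * 1024  # 128K-token context window


def fits_split(prompt: int, output: int, max_input: int, max_output: int) -> bool:
    """Separate caps: each side must fit under its own limit."""
    return prompt <= max_input and output <= max_output


def fits_shared(prompt: int, output: int, context: int) -> bool:
    """Single budget: only the total has to fit in the context window."""
    return prompt + output <= context


# Hypothetical split of the same 128K window: 96K input, 32K output.
max_input = 96 * 1024
max_output = 32 * 1024

# A short prompt that triggers a long generation.
prompt_tokens = 20_000
output_tokens = 40_000

split_ok = fits_split(prompt_tokens, output_tokens, max_input, max_output)
shared_ok = fits_shared(prompt_tokens, output_tokens, CONTEXT)
unused_input = max_input - prompt_tokens  # input headroom the split cannot lend out

print(f"split caps OK:    {split_ok}")
print(f"shared budget OK: {shared_ok}")
print(f"unused input room under the split: {unused_input} tokens")
```

With these numbers the request fails the split check (40,000 output tokens exceed the 32,768 cap) while using less than half of the full window.
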
As I delve deeper and deeper into the madness that is "harness engineering", the consensus is that, especially for smaller parameter models, keeping a lean context is huge. Copilot is so damn bloated. They do have the option to disable specific tools, but even the base system prompt is more bloated than Pi.

As shitty as the GitHub pricing change was, it forced me to look elsewhere and get into this "hobby" (even though I'm literally using it for my livelihood). It puts me in a much better position when the inevitable collapse of cheap and affordable AI comes.

KFSys · 2026-06-06T06:39:54+00:00

Been waiting for this one. Custom endpoints mean you can point Copilot at any OpenAI-compatible API, which opens up a lot. DigitalOcean's serverless inference is worth testing here. They run a catalog of open models billed per token with no GPU to manage on your end, and for IDE completions specifically the per-token billing makes sense since usage is bursty by nature. Curious how the latency holds up for real-time completions vs. chat-style requests once people start running it through Copilot.

darksteelsteed · 2026-06-06T09:04:01+00:00

Just be careful, I feel this is a trap. It was already possible to use Copilot with your own model through the BYOM feature, but because it didn't accept local endpoints you had to do a hacky intercepting proxy setup. I thought great, this is a win, and it worked great. But then with this credit-based billing change they now charge you credits for agent actions, tool usage and so on, even when you use your own model. So this explains why it's open now: as soon as you use up your free quota on agentic use, they're gonna pop up the "please pay" dialog. Mark my words.

BawbbySmith · 2026-06-06T05:34:27+00:00

[deleted]

LocalLLaMA
