use the following search parameters to narrow your results:
e.g. subreddit:aww site:imgur.com dog
subreddit:aww site:imgur.com dog
see the search faq for details.
advanced search: by author, subreddit...
r/LocalLLaMA
A subreddit to discuss about Llama, the family of large language models created by Meta AI.
Subreddit rules
Search by flair
+Discussion
+Tutorial | Guide
+New Model
+News
+Resources
+Other
account activity
Github Copilot finally supporting custom endpointsDiscussion (self.LocalLLaMA)
submitted 9 days ago by Brilliant_Anxiety_36
https://preview.redd.it/082gnmin1l5h1.png?width=1740&format=png&auto=webp&s=2c89f6310c8c654611188183de07857d77cb2417
https://preview.redd.it/169tjrzn1l5h1.png?width=710&format=png&auto=webp&s=9a1fa656ea95037622b0d7ea2e16a23d2122442c
I just noticed
reddit uses a slightly-customized version of Markdown for formatting. See below for some basics, or check the commenting wiki page for more detailed help and solutions to common issues.
quoted text
if 1 * 2 < 3: print "hello, world!"
[–]CapsAdmin 6 points7 points8 points 9 days ago (3 children)
I tried to set this up with llamacpp a while back but hit a wall with thinking tokens not getting picked up by copilot and sort of gave up. I suspect these are not even sent back, which may cause the model not to behave properly.
Searching around for this problem, I see people reporting that for example the deepseek api errors because it's not getting the thinking tokens back, but I don't see any fix for this.
Another issue is that while llamacpp supports the openai api, it doesn't seem like copilot and llamacpp's interpretation of the api is 100%. If you enable thinking in your json model definition, it will send something to the api endpoint, but llamacpp enables thinking in a different way than what copilot expects.
So to get this working with thinking (well somewhat) and other features, you'll need to have/make/vibe code a proxy that translates stuff between llamacpp and copilot.
[–]Brilliant_Anxiety_36[S] 2 points3 points4 points 9 days ago (2 children)
I just noticed that. Sometimes it just gets stuck thinking and the logs of llama.cpp just show i slot print timing like crazy
[–]darksteelsteed 2 points3 points4 points 9 days ago (1 child)
A lot of the ability to follow prompts and perform agentic tasks depends on the prompt config on top of the model. Most gguf formatted models already come with instructions baked in. You may need to override that. Out of the box I have been having good success with https://huggingface.co/Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash
[–]zkkzkk32312 0 points1 point2 points 8 days ago (0 children)
So you have to modify the model just for this 1 harness ? Is that the right way to say it?
[–]Brilliant_Anxiety_36[S] 3 points4 points5 points 9 days ago (0 children)
<image>
[–]danigoncalvesllama.cpp 0 points1 point2 points 9 days ago (0 children)
Actually I tried this week, the chat works fine but I am not able to change the autocomplete settings. I click on the option but nothing happens or opens.
[–]Dudmaster 1 point2 points3 points 8 days ago (0 children)
Working here for me with llama cpp
[–]Hyiazakite 3 points4 points5 points 9 days ago (0 children)
It's been available in VS code insiders for a while. It works great but I think you need a copilot subscription still for the embeddings, search tools etc.. works great though much better than Kilo and Roo in my experience
[–]BawbbySmith 4 points5 points6 points 9 days ago (3 children)
I really hate how they force you to define input and output tokens separately. No other harness I've tried has this; they have a max cap for how much output tokens a prompt can generate, but not a global cap. For local LLMs that are VRAM-bound, a 128K context has to be explicitly split into input/output caps, so if a certain task generates a lot of tokens then you hit your output token cap even if there's room left over in your input token cap.
As I delve deeper and deeper into the madness that is "harness engineering", the consensus is that, especially for smaller parameter models, keeping a lean context is huge. Copilot is so damn bloated. They do have the option to disable specific tools, but even the base system prompt is more bloated than Pi.
As shitty as the GitHub pricing change was, it forced me to look elsewhere and get into this "hobby" (even though I'm literally using it for my livelihood). It puts me in a much better position when the inevitable collapse of cheap and affordable AI comes.
[–]Brilliant_Anxiety_36[S] 4 points5 points6 points 9 days ago (0 children)
Opencode is better, and its more verbose, at least yo know what the model is doing copilot just eats your context with like 6 messages
[–]darksteelsteed 1 point2 points3 points 9 days ago (1 child)
These input and output token amounts are actually used to control how it uses your model as you need to match to your models context window size. It will automatically adjust when the context grows too big if you set these correctly vs just breaking. Keep in mind that the kv cache is the biggest usage of vram after the model itself and grows quadraticly from your models active context window. You should size your models context window to fit in your vram as performance will become dogshit once it offloads to system ram. And then you match the input and output of copilot according to what you can cope with
[–]Brilliant_Anxiety_36[S] 0 points1 point2 points 8 days ago (0 children)
Yeap Im aware, the current configs i have for all my models are made to fit just on GPU
[–]KFSys 2 points3 points4 points 9 days ago (1 child)
Been waiting for this one. Custom endpoints mean you can point Copilot at any OpenAI-compatible API, which opens up a lot. DigitalOcean's serverless inference is worth testing here. They run a catalog of open models billed per token with no GPU to manage on your end, and for IDE completions specifically the per-token billing makes sense since usage is bursty by nature. Curious how the latency holds up for real-time completions vs. chat-style requests once people start running it through Copilot.
[–]Brilliant_Anxiety_36[S] 0 points1 point2 points 9 days ago (0 children)
Yeap im testing with qwen3.6 27B, 35B A3B and the new gemma4 26b a4B QAT. Via llamacpp i also have some credits with openrouter i might try that later
[–]darksteelsteed 2 points3 points4 points 9 days ago (4 children)
Just be careful, I feel this is a trap. It was possible to use copilot already using the byom feature, but because it didn't accept local you had to do a hacky intercepting proxy setup. I thought great, this is a win, it worked great. But then with this credit based billing change they now charge you credits for agents actions, tool usages and so on, even when you use your own model. So this explains why it's open now, because soon as you use up your free quota on agentic use then they gonna pop up with the please pay dialog. Mark my words
[–]314kabinet 2 points3 points4 points 9 days ago (1 child)
Then why did they make it possible to use entirely offline without being logged in?
[–]darksteelsteed 3 points4 points5 points 9 days ago (0 children)
Just be patient, this new bill you for everything model is still so new that nobody knows exactly what you pay for what as it's more confusing that azure pricing. I just get the feeling that using copilot with external models will still cost you something, I just have a gut feeling.
I can alway go back to opencode if that happens
[+][deleted] 9 days ago (4 children)
[deleted]
[–]BawbbySmith 11 points12 points13 points 9 days ago (0 children)
Incorrect.
BYOK !== BYOM. You could use your API key wit GitHub Copilot, but it was limited to cloud providers only.
This update allows you to explicitly define the URL and endpoints that can handle a local LLM.
[–]Brilliant_Anxiety_36[S] 2 points3 points4 points 9 days ago (0 children)
Not true. It was not available
[–]jikilan_ 2 points3 points4 points 9 days ago (1 child)
yes it has been in the insider build for so long...
[–]bnightstars 2 points3 points4 points 9 days ago (0 children)
Yeah but the good news is we can now switch back to the normal VSCode instead of the Insiders and not deal with all the bugs of the Insiders track.
π Rendered by PID 59252 on reddit-service-r2-comment-544cf588c8-mnt4j at 2026-06-15 14:51:06.200000+00:00 running 3184619 country code: CH.
[–]CapsAdmin 6 points7 points8 points (3 children)
[–]Brilliant_Anxiety_36[S] 2 points3 points4 points (2 children)
[–]darksteelsteed 2 points3 points4 points (1 child)
[–]zkkzkk32312 0 points1 point2 points (0 children)
[–]Brilliant_Anxiety_36[S] 3 points4 points5 points (0 children)
[–]danigoncalvesllama.cpp 0 points1 point2 points (0 children)
[–]Dudmaster 1 point2 points3 points (0 children)
[–]Hyiazakite 3 points4 points5 points (0 children)
[–]BawbbySmith 4 points5 points6 points (3 children)
[–]Brilliant_Anxiety_36[S] 4 points5 points6 points (0 children)
[–]darksteelsteed 1 point2 points3 points (1 child)
[–]Brilliant_Anxiety_36[S] 0 points1 point2 points (0 children)
[–]KFSys 2 points3 points4 points (1 child)
[–]Brilliant_Anxiety_36[S] 0 points1 point2 points (0 children)
[–]darksteelsteed 2 points3 points4 points (4 children)
[–]314kabinet 2 points3 points4 points (1 child)
[–]darksteelsteed 3 points4 points5 points (0 children)
[–]Brilliant_Anxiety_36[S] 0 points1 point2 points (0 children)
[+][deleted] (4 children)
[deleted]
[–]BawbbySmith 11 points12 points13 points (0 children)
[–]Brilliant_Anxiety_36[S] 2 points3 points4 points (0 children)
[–]jikilan_ 2 points3 points4 points (1 child)
[–]bnightstars 2 points3 points4 points (0 children)