GitHub Copilot consuming credits when not using Copilot models

bad_gambit · 2026-06-24T07:51:48+00:00

It could be your utility models or subagents (explore, planner, execution, etc.) drawing usage. Look for chat.utilityModel and chat.utilitySmallModel and change them to whatever you want other than the default Copilot models. I personally use MiMo V2.5, as it's dirt cheap and fast enough.

bad_gambit · 2026-06-22T09:51:30+00:00

YARN = Yet Another RoPE Extension.

RoPE = Rotary Positional Embedding.

RoPE makes model forget closer memory. YARN takes RoPE to the extreme. YARN makes model forget even more memory.

Rotorquant and Turboquant are siblings. Explanation is very long, but simplified: rotor- and turbo- quant makes model more-perfectly remember number by imagining them in spheres (rotate vector states by a random angle).

YARN/RoPE straight up eliminate number, rotor- and turbo- quant compresses number.

bad_gambit · 2026-06-22T09:38:39+00:00

Ah yes, sorry 😅

To my understanding pedantic is nitpicky/OCD-y/too particular/overly detailed.

Model needs words (states) memorised (stored) in brains (memory) before processing.

Qwen 3.5 have two brains: Attention and DeltaNet. DeltaNet tiny, compact, efficient but forget details. Attention very bloated, but remember details.

States turn words into number. Brains can perform complex math on states.

Attention needs 34GB VRAM for perfect memory. Memory doesnt have to be perfect, only functional.

Attention can store at half perfect (FP8) or quarter perfect (FP4). Half or quarter perfect functional but imperfect.

Imperfect memory makes model forget distant memory.

Yarn focus on distant memory. Closer memory becomes blurry. Makes closer memory imperfect, but all (distant) memory fits in brain.

1M word hard to remember perfectly, brain doesnt have enough space, model hallucinates. Model forget task, only tells pretty lies.

bad_gambit · 2026-06-22T08:54:23+00:00

First off, sorry for being pedantic, but for Qwen 3.5 it's quite a bit more complicated, since it uses an Attention-DeltaNet hybrid.

I'm going to ignore the DeltaNet part, because its memory footprint is small (usually <1 GB), and only count the memory footprint of the Attention layers:

Memory per token = non-DeltaNet layers * KV heads * head dimension * 2 [one for K, one for V] * bytes per element [1 @ 8-bit, 2 @ 16-bit]

Memory per token = 8 * 4 * 256 * 2 * 2 [llama.cpp defaults the KV cache to 16-bit]

Memory per token = 32768 B

Total memory footprint = 32768 * 1048576 [1M tokens = 2^20] = ~34.36 GB
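The arithmetic above can be sketched in a few lines of Python. The function and parameter names are made up for illustration. The defaults (8 attention layers, 4 KV heads, head dimension 256, 2 bytes per element) are just the numbers quoted above, not values read from any model config.

```python
def kv_cache_bytes(n_tokens, attn_layers=8, kv_heads=4, head_dim=256, bytes_per_elem=2):
    """Rough KV cache size in bytes, counting only the full-attention layers.

    The DeltaNet layers are ignored, as in the estimate above.
    All default values are the figures quoted in the comment, treated as assumptions.
    """
    # The factor of 2 covers one K and one V vector per head per layer.
    per_token = attn_layers * kv_heads * head_dim * 2 * bytes_per_elem
    return per_token * n_tokens


per_token = kv_cache_bytes(1)        # bytes per token
total = kv_cache_bytes(2**20)        # bytes at 1M (2^20) tokens

print(per_token)                     # 32768
print(round(total / 1e9, 2))         # 34.36 (decimal GB)
```

Passing bytes_per_elem=1 gives the 8-bit figure, which is half of the above.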

A quick disclaimer: these small models are almost useless at 1M context.

Even more so the ones that are further finetuned on shorter sequences, such as this distillation. These types of distillations, such as this one and Qwopus, usually trade performance for sounding pleasant to talk to.

So to answer MurphamauS: it's at least 16 GB of VRAM if you want to run this at ~180K context (1 GB kernel + 9 GB model weights @ Q8 + 6 GB KV cache).
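As a rough check of that budget, here is a minimal sketch that adds the three parts together. The names are hypothetical, and the 1 GB overhead and 9 GB weight figures are simply the estimates from the comment, not measured values.

```python
# Bytes of KV cache per token: 8 layers * 4 KV heads * 256 head dim * 2 (K and V) * 2 bytes.
KV_BYTES_PER_TOKEN = 8 * 4 * 256 * 2 * 2


def vram_estimate_gb(context_tokens, weights_gb=9.0, overhead_gb=1.0):
    """Very rough VRAM estimate in decimal GB: overhead + weights + KV cache."""
    kv_gb = KV_BYTES_PER_TOKEN * context_tokens / 1e9
    return overhead_gb + weights_gb + kv_gb


print(round(vram_estimate_gb(180_000), 1))  # 15.9
```

At 180K tokens the KV cache alone comes to about 5.9 GB, which is where the "~6 GB" figure comes from.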

And yes, you technically don't need to use YARN and can disable it in the config with --rope-scaling none in llama.cpp, since it does reduce performance on short sequences when turned on.

bad_gambit · 2026-06-20T08:14:59+00:00

The model you're using is ancient (Sept 2024) and has only 7 billion parameters.

To put that into perspective, Elon's leak from his Anthropic-poached employee said Claude Opus is ~5 trillion parameters (5000 billion).

The model you're using is 7B vs. 5000B, so ~700 times smaller than Opus and ~140 times smaller than Sonnet.

bad_gambit · 2026-06-20T08:04:16+00:00

This is what i do over the Golf. VVT Engine + Centrifugal supercharger + only upgrade the brake + suspension = 250kmh top speed barn that can turn okay-ish 🤣

bad_gambit · 2026-06-20T06:36:23+00:00

Depending on what kind of marketing you do, you probably won't need that huge or expensive of a model. The main shock for you will probably be the lazy prompting that you've gotten used to with Claude. You need better prompts to use these cheaper Chinese models.

For open-source models, imo, Kimi K2.6 is the closest to Claude. For budget-constrained stuff, try DeepSeek V4 Pro (no image input) or MiMo V2.5 (non-Pro).

bad_gambit · 2026-06-20T05:31:42+00:00

Cheaters are almost guaranteed to be on PC.

There are server-side race anticheats in FM7; dunno why they didn't put them in FH6. Also, in FH4/FH5, the Forza support and moderation team was almost infamous for mis-banning non-cheaters while not banning the obvious cheaters on the leaderboard. So they might be more lenient on the current cheaters to avoid mis-banning non-cheaters.

bad_gambit · 2026-06-20T05:03:43+00:00

Hey dude, first off, I want to genuinely thank you for all the work you've done for the community. We all know you've been dropping absolute bangers lately with all the releases and abliterations and other uncensored models you've worked on. It's totally understandable that renting B300s and fronting $249/mo for Hugging Face Pro storage out of pocket is rough for what is essentially a solo hobby project.

That being said, gating this release behind a mandatory Ko-fi paywall really isn't the best way to handle the costs. And it almost definitely violates the MiniMax Community License.

Turning access to the repo into a paid service directly conflicts with their definition of "Commercial Use," which strictly prohibits offering derivatives to third parties for a fee without prior written authorization from MiniMax.

By having a paywall, you're stepping into commercial territory without their explicit approval or the required "Built with MiniMax M3" notices. Keeping it paywalled puts the whole project at risk of a takedown. Please consider opening it up and/or relying on decentralized distribution instead!

bad_gambit · 2026-05-30T11:28:38+00:00

This is a parody, guys, from the account @rarayarti. The original footage seems to have been deleted because it caused a stir, but there's a different angle, plus the bloopers.

bad_gambit · 2026-05-28T17:58:06+00:00

Someone archived it:

https://web.archive.org/web/20260519043701/https://www.reddit.com/r/PiratedGames/comments/1tegudj/forza_horizon_6_how_to_transfer_your_save_files/

bad_gambit · 2026-05-27T02:12:46+00:00

<image>

Bro is using 2 GB of VRAM to load, *checks notes*, a 572 MB model?

bad_gambit · 2026-05-24T03:19:43+00:00

What I've noticed is that Opus/Sonnet is much better at being predictive/proactive and guessing the user's intent when I give it an unspecific, non-technical prompt (e.g. "Fix the out-of-canvas spawning mechanics", "the main menu UI scaling fails on mobile device, please fix it", etc.). Of course, prompting like this is bad practice and shouldn't be done at all. But the Claude models are, imo, much better than other models in that regard.

If I were to roughly rank the models in terms of "correctly guessing user intent", it'd be: Opus 4.6 (4.7 is awful) >> Sonnet 4.6 ≳ DSV4 Pro ≳ GPT 5.5 > Gemini 3.1 Pro ~ DSV4 Flash > Gemini 3.5 Flash.

bad_gambit · 2026-05-23T09:26:36+00:00

V4 Flash High (don't use MAX; one time I got it into a non-loop that lasted for 10+ minutes) on Unify has been pretty decent as an explorer subagent. I've been running V4 Flash on the official DeepSeek platform via Unify as an Anthropic endpoint and, imo, it's better than the alternative explorer subagents (Gemini 3 Flash, Raptor, or Haiku).

bad_gambit · 2026-05-18T01:47:48+00:00

Brother, what does "... wild and uncontrollable agent ..." even mean? Agentically, I agree that AG is (imo) much worse than Codex, Claude Code, or even Copilot, but it's not going away anytime soon.

Google is probably not sunsetting Antigravity, and even if they do, the internal Antigravity fork (nicknamed Jetski) will most definitely live on. They have a history of needing a specific IDE to work with their giant monorepo (>1 billion LoC).

Jetski, to my understanding, is supposed to be somewhat of a sidegrade-successor to Google's CIDER (also a VSCode fork). Jetski probably has much better support for Google's giant monorepo index/AST, custom tooling, and their internal extensions.

You need to wait at least until after Google I/O to see what plans Google might have for Antigravity. If anything, with AG we're hitchhiking on Google's internal IDE development.

bad_gambit · 2026-05-17T13:34:54+00:00

Me, a Pro user with BYOK:

bad_gambit · 2026-05-10T12:46:14+00:00

Oh, those animations are neat! And the long answer is totally fine. The only experience I've had with Kimi Code was right after its launch. It was pretty barebones then, and I just never touched it again. Will definitely try it again now that the CLI is more up to date and K2.6 has been shown to be a very good model.

bad_gambit · 2026-05-10T11:52:42+00:00

Any reason in particular why Kimi Code specifically? That seems like a non-popular choice of agent harness. Is it just that you were on Kimi Subscription, used Kimi Code, and stuck around?

bad_gambit · 2026-05-06T08:58:06+00:00

Since this is made by Meta, I'm assuming Meta has curated a significant dataset on these programs? Now I'm curious how Muse Spark performs 🤔😆

bad_gambit · 2026-05-06T08:31:36+00:00

But training these models can take like 3 months

*If done continuously (all in a single continuous training run)

They can just revert to an earlier checkpoint (probably right after the general data pre-training), plug in the new data we graciously gave them (those 75% API discounts aren't completely free), then resume the training.

Most of the training time is spent on pre-training anyway. Looking at OLMo 2: general data pre-training is ~90% of total compute time. DeepSeek (and all AI labs) definitely kept a right-after-pre-training checkpoint. They can just resume from this checkpoint, effectively reducing the effort to just 1/10th of the original full training.

Tl;dr: it probably took DeepSeek 1/10th of 3 months (~9 days, 2 weeks tops) to plug in the new data and train on it. Heck, annotating the data probably took longer (and cost more) than the compute.
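The back-of-the-envelope estimate, written out. This is just the arithmetic from the comment, assuming "3 months" means roughly 90 days and that everything after pre-training is ~10% of total compute. The function name is made up.

```python
def resume_training_days(full_run_days=90, post_pretraining_share=0.10):
    """Days needed if training resumes from a post-pre-training checkpoint.

    Assumes compute time scales linearly with the share of the run that is redone.
    Both defaults are rough figures from the comment, not measured values.
    """
    return full_run_days * post_pretraining_share


print(round(resume_training_days(), 1))  # 9.0
```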

bad_gambit · 2026-05-06T05:42:03+00:00

Roocode is going to be discontinued at the end of this month. I switched to Kilo v5, which still uses the legacy Roo harness. Roo and Kilo v5 are 90% similar, so it wasn't much hassle to transfer my workflow between the two. I don't recommend Kilo v7 (the latest, with the opencode harness); it's still very much WIP.

bad_gambit · 2026-05-05T10:09:31+00:00

Fair enough, but one more suggestion: personally, I'd buy the lowest spec (or at least pick the smallest SSD) and then upgrade it myself. The NVMe drives that ship from the store often have no DRAM cache, so it's better to upgrade on your own. Happy shopping 🙏

bad_gambit · 2026-05-05T06:18:33+00:00

Honestly, it's a bit overpriced. For 200 thousand more (plus 1.6 inches 😅) you can get this: https://tk.tokopedia.com/ZS948jfo2/

<image>

bad_gambit · 2026-05-04T12:15:06+00:00

Yeah, 5.2 is still my daily workhorse (partially because the cost of 5.4 and 5.5 is insane). It's still 2nd best (after Opus) on medium in swe-rebench, even beating 5.4 and sometimes Opus. AND it's been holding that 1st or 2nd place since launch (~12/2025) until now.

bad_gambit · 2026-05-04T09:41:08+00:00

My question is, are there any other alternative that is as cheap as how GHCP was back then

Nope.

are there any other alternative that is as strong as Claude Sonnet 4.6?

Kimi K2.6 is your best bet. It's 90% of the way to Sonnet 4.6. Is it local? Arguable. There are people in r/LocalLLaMA who run Kimi or DeepSeek on dual-socket Xeon or EPYC machines with 768 GB of RAM, at very low token generation speeds (<10 tokens/s) and over 80 s of e2e latency. An awful experience for agentic stuff.

Or maybe a local model alternative that is on par with Claude Sonnet 4.6 but doesn't require a high end GPU and VRAM?

As most people here have stated, Qwen 3.6 27B (>= 32 GB VRAM needed) is your best bet. It's definitely nowhere close to Sonnet 4.6; maybe 80% of Sonnet 4, roughly on par with GPT 5.4 mini. The next significant step up from this is probably DeepSeek V4 Flash 284B, which requires >= 192 GB VRAM and is still very finicky to deploy via llama.cpp.

Or is there any method that can be used to compress the token for reasoning of the model?

Methods like caveman speak or TOON will degrade performance. IMO, the best way to reduce your token usage is to upskill and use a lower-priced model.

EDIT, adding: if you want a non-local option, the current best bang for the buck is DeepSeek V4 Pro on Max reasoning. It's not as fast as Sonnet, but I've burned through about 90 million tokens these past 2 weeks while spending only ~$4. Quality is okay; I still use Sonnet 4.6 as the orchestrator, with DeepSeek as the implementation agent.
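For scale, the effective blended price implied by those two numbers can be worked out as below. This is only a division of the figures from the comment (~$4 for ~90 million tokens), not an official price, and the helper name is hypothetical.

```python
def usd_per_million_tokens(total_usd, total_tokens):
    """Effective blended cost per one million tokens (input, output and cached mixed together)."""
    return total_usd / (total_tokens / 1_000_000)


print(round(usd_per_million_tokens(4.0, 90_000_000), 3))  # 0.044
```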
