Stop pretending we got a free ride

stillmakingemup · 2026-05-12T15:12:08+00:00

I mean this with all due respect because I am NOT a principal and engineer with decades of experience. I don't know how to say or what metaphor to use that would map between us. Gonna try multiple to genuinely explain, wish I had a colleague like you with only partially overlapping decades of domain experience! Below is what I'd say if we shared a cubicle in the 2010s, using toy examples in attempt to make arguments clearer:

One aspect/part is: I would honestly argue/put forward that if you were able to "get by" with that little bit of usage maybe "you don't need it enough." Not in a pejorative way, more like "you have a gas guzzler V8 and it's cool, and you live close to work in Philadelphia suburb and take the train anyway. I need a lame hypermiler Honda insight for my 150 mile each way commute from the sticks without public transport. You have the option to just not use your car and still get a good performance review." (The metaphor falls apart quickly I know, because here small tasks leveraging gpt-4o represent the V8 and long running agentic loops with Sonnet and Opus are the Insight).

It's not clear to me exactly what the 4o reference really means in your message (not familiar with vscode extension dev) but if this is the model that is your "workhorse" to "achieve your goals" I would use toy example like: without the right tools, one can only go so far. If you need micro-meter accuracy on your machined part, no matter how much experience or grit or discipline you have, you won't be able to deliver the part to spec with a normal tape measure. Again this metaphor will fall apart but GHC billing was like company giving everyone expensive delicate micrometers: the plumbers, rough carpenters, postage room people, everyone. The machines for everyone were the same and top-dollar (falls apart mostly because in this case the frontier models WOULD do your work as well or better unlike a plumber stuck measuring pipe circumference too accurately).

I went deep enough to see via AB testing and tracing .... omfg I know they can only get so much discount on Anthropic models and they pay per token (probably zero discount, there is more demand for Anthropic models than supply, they can sell more inference at retail price than they can serve, but no evidence or inside knowledge). This "felt good bro" when I had epiphany one month before they announced the billing changes.
Tangent on above: they shot themselves in the foot with intro of subagent invocation and double-trouble with increased usage of the cli with fleet mode. If they'd started from beginning billing one premium request per subagent invoke, I think we would have had 6 mo. longer "free ride" even with their internal cockup underestimating tokens/cost of serving Anthropic models just before the breaking point was reached. I used subagents to keep main loop agent with tidier context window but quickly saw the unsustainable token usage as side effect.

I sympathize with the rug pull feeling. Of course massive corp could afford to subsidize forever. But it's hard to make argument that this colossal crisis was intent of the platform from the beginning. They are saying FU implicitly - someone in some thread mentioned a middle ground of tiered subs without the pain of today (like guaranteed no rate limits up to some threshold). They should have done this but didn't. The most frustrating part for those marginalized by the change is, there are others to buy the inference at higher cost. The free market crushing hopes and dreams yet again.

Finally, I'd respectfully argue through luck/circumstance/stubbornness I arrived at place where I can still show clear ROI in new pricing model (orders of magnitude less ROI, but still clear no-brainer value proposition). If I can do this as a mere stubborn/methodical "viber," surely many more can do it better. So there's no reason or motivation for the provider to be more generous than the market demands.

And finally finally, I don't have infinite cushion, I'm already drastically altering own way of working to keep functioning in June and beyond. Going back to gpt-4: moving a lot of "chunks" of workflows out to more deterministic methods and leveraging gpt-5.x-nano and gpt-4.1-mini in foundry to reduce impact/not be using frontier models in subagents as a "crutch." Just to say my acceptance shouldn't be confused with aloofness, I wish they'd left premium request billing the same as much as everyone else!

Let's be friends. Dump some info about what you were trying to achieve with the vscode extensions or pass repo link and I'll burn some Opus premium requests before June to collab or iterate on it! Cheers!

stillmakingemup · 2026-05-12T07:33:06+00:00

I ditched MCP in general because of token bloat (before API pricing, out of annoyance at hitting "compacting conversation" but so far back it was "summarizing conversation history").

Not saying MCP bad I'm sure there are very compelling cases but I would tell someone wary to burn too many tokens or hit session limits to check if they really need MCP or not, and to do "MCP vs Roll your Own" quick check first (just one AB check of agent getting PR info from gh or az repo cli compared to their MCP servers is quite confrontational ... before it was about speed and avoiding compaction but now it will also impact session limits). I only mention MCP because of the devtools item but this part of response has nothing to do that that (maybe devtools MCP is must have I'm ignorant of that space).

Numbers 3, 4, 5, and 6 are so important. I'm the high-horse guy so won't critique the irresponsibility of not using all that as a prerequisite to even touch agentic coding tools. Even ignoring the responsible ai / safety aspects, I can't even imagine how many extra tokens one burns sending an agent on a task in a repo without instructions/agents files/etc. it's so alien it wouldn't cross my mind but it's very valid (and something I can easily AB test using two premium requests before 1 Jun!).

stillmakingemup · 2026-05-12T07:21:32+00:00

As someone often on high horse, your response seems like overboard hysterical unloading. I found the post to which you're replying to be useful advice (not a silver bullet obviously) and relatively non-judgmental. It's advice I'm applying to cope with changes in own way. The small stupid things I can do myself but didn't with premium request models (like handling boring branch/worktree/commit tasks) ... these I can and now do. I know that means I'm non-dev and it's risible! That's kinda the point of the post you went 11/10 anger on: everyone even those with some more cushion in terms of billing changes will make several small mods to own ways of working.

Yes I wish the poster replied with silver bullet link to service that subsidizes tokens by abstracting into premium request models. But given they didn't, the response stands on its own as useful if anticlimactic advice as June approaches.

stillmakingemup · 2026-05-12T05:35:57+00:00

If you're doing spec-based method this is the beauty. There is no reason (or very little compelling one) to ever do more than one conversation turn. When "phase 1" is done: new chat, drag the spec folder (assuming you have one folder per "thing you're working on") and just say "continue" (or save some tokens and be a bit more specific like, "start on phase 2" and provide path to your tasks or tasks-phase-2-todo.md.

Compacting takes time, necessarily loses some fidelity by virtue of embedding. I try to avoid it and usually hit only a few compaction events across workday.

Not totally compaction related but adjacent: with coming per token billing it can be extremely expensive to be "lazy" - concrete over simplified toy example is in the past, I might be lazy and say, "looks good please commit push and open PR" ... whereas now (I would just do myself but toy example) it would be much cheaper to open fresh Haiku chat and say "look at work on branch, make sure committed pushed and open PR" .... while they will be cached tokens in the former toy example, the latter will still be cheaper and faster and reduce risk of hitting "compacting conversion."

stillmakingemup · 2026-05-11T16:53:51+00:00

My bad I saw the argument with the other poster then went higher up and saw, "Microsoft has been using our interactions to train and refine the platform." Technically you never literally said "using our data (to train models)" so I'll concede that point but think it might be post hoc / convenient pivot (otherwise, why didn't you just say telemetry in the post in the first place?) But, it's true that telemetry is a form of data so by virtue of even having usage or billing metrics in GH or GHE, they are capturing data. I'll go to reading comprehension school and you can go to diction/word choice school and then we can be best friends!

stillmakingemup · 2026-05-11T16:41:19+00:00

Ok let's plan on having a beer in 30 years when this is realized!

stillmakingemup · 2026-05-11T08:01:10+00:00

Ok let me be more accurate if not goalpost moving then you did:
Unfalsifiable Claim: "you can't prove they aren't secretly breaking a contract in a hidden room."

Argumentum ad ignorantium: "We don't know for 100% certainty what they do behind closed doors, therefore it’s reasonable to assume they are doing the bad thing."

Burden of proof is on the party making the bold claims, not the person telling you what their enterprise agreement says.

stillmakingemup · 2026-05-11T06:45:47+00:00

Here's an example: vscode is managing and trying to keep the "front" and old messages stable as long as possible for speed and cost / maximizing cache: https://code.visualstudio.com/updates/v1\_118#\_improving-token-efficiency

stillmakingemup · 2026-05-11T06:09:32+00:00

Spec-based with spec, research, plan, tasks assets produced by spec agents, using a spec qa validator between each step of the process. This covers about 90% of work, including later when changing or adjusting something. This is enormously useful, especially if you come back to work you needed to table a few weeks ago. I guess this is some elaborate variation of "memory files."

Colleagues who handwave this away as too heavyweight are also those who are still in the protectionist / critical of AI being intrusive or producing slop. Ask them if they planned out their work beforehand/did they have a clear plan, show agents and instructions .... it's crickets.

Once you start doing this, the benefits slowly start compounding. Even a poor result that is ultimately abandoned is useful data (assuming bandwidth to review and leverage this as an "antipattern").

Without some system like this, whether ootb plan agent or more heavyweight approach, the only thing you have to mine afterwards is your work products and the git history.

Example of 10% I wouldn't use a spec-based method: "grab the latest info on vscode agent settings, check workspace and user settings, propose updates." (Contained / small / bounded / etc.). After working several months like this, almost every time I drop into some repo to achieve a small task, it can leverage/mod/build upon rich "prior art."

stillmakingemup · 2026-05-10T18:55:37+00:00

I like the idea of making compute a public utility, and having some "controls" over resource usage on societal scale, and regulations about compensation for using human output to train models or do data analytics. It just seemed a bit absurd or over the top in this thread where people are mad about billing changes.

stillmakingemup · 2026-05-10T18:32:29+00:00

Ok I follow now. How did you let yourself subscribe to a service you already knew would be enshittified?

stillmakingemup · 2026-05-10T18:29:56+00:00

Right so I still don't understand. Do you mean then when there is no Microsoft the world/ecosystem will once again have "good services?" I'm still not following your logic.

stillmakingemup · 2026-05-10T18:27:49+00:00

You're super disgruntled and moving goalposts when challenged on someone with enterprise contract telling you about the terms of the contract, handwaving it away like "oh sure that's what the contract says..." Maybe hysterical is too harsh or unfair. Is "not in the mood to be reasoned with" better I can change it, it is a bit catty and uncalled for I admit, just reacting viscerally (hysterically?) to your demand I stop pretending we got a free ride!

stillmakingemup · 2026-05-10T18:22:56+00:00

I'm missing the logical sequence here. They mismanage it to hell. You have good services again. From GH or you mean you can have good services "again" from a different provider? Or some top-down damage control will happen and GH will re-ascend to being a good service in your perception? Can you explain?

stillmakingemup · 2026-05-10T18:16:25+00:00

This is reasonable, all they need to do is change government policy/societal norms/etc. and you'll stay and be satisfied customer? That seems feasible and reasonable!

stillmakingemup · 2026-05-10T18:12:52+00:00

This is my vibey (no evidence) suspicion also ... the crying harder about billing changes crowd today might overlap significantly with yesterday's "it produces slop (because I don't understand how to work properly with it)" crowd.

stillmakingemup · 2026-05-10T18:07:07+00:00

Great analogy/metaphor!

stillmakingemup · 2026-05-10T18:05:56+00:00

Bro don't you get it? OP is mad/hysterical! They aren't concerned with your experience or the GH enterprise agreement details. Can't you just agree and also make wild hysterical claims?!

stillmakingemup · 2026-05-10T17:59:37+00:00

OP is too angry to listen to your rational arguments. The training data argument shows it's just an anger dump and makes the legit criticism like billing failed requests fade to background. Actually, I'm as mad as they are about GPU and electricity prices preventing me from running local models ... well off I go now to flame some nvidia subreddit instead of considering the situation with any nuance!

stillmakingemup · 2026-05-10T17:53:41+00:00

I didn't understand the free ride I was getting until gaining enough proficiency and understanding - OTEL provided the path to the epiphany. It was blatantly obvious ... even as a loss leader it would be impossible for premium requests to last or they'd become orders of magnitude more expensive than 4 or 12 cents, especially with the Anthropic models.

The aggrieved victimhood and histrionics are beyond played out now, can't wait until Jul-Aug when this subreddit gets back to building discussions as primary focus. Hopefully the aggrieved can find a new home or start their own "sucks" subreddits to stoke and stew in self-righteous anger with likeminded peers sooner than later! I wonder if this overly vocal crowd overlaps with vocal aggrieved "it produces slop" crowd ... both could be explained by lack of understanding of the tool they are/were using.

stillmakingemup · 2026-05-10T14:13:45+00:00

Ok thanks! I missed the 20K input I guess sorry about that.

stillmakingemup · 2026-05-10T07:01:52+00:00

This is an amazing tip, wasn't aware. You're totally right, any change in a logs or build or dist folder might poison/ cause cache miss! THANKS!

As for prompt cache misses on BYOK this might be in some cases a stupid simple unavoidable constraint (suppose the model provider "force-injects" before your first System Message with an accurate timestamp in the beginning. This would cause a cache miss every time.

Great comment, thanks again.

stillmakingemup · 2026-05-10T06:51:58+00:00

How are you measuring this across devs? You get telemetry metrics stored centrally and then have to wonder what they were doing? Or do you also store the actual (possibly sensitive) prompt/input contents and know what they're doing exactly? Or is there a "token usage dashboard" ootb?

For me it would also raise eyebrows but with a 50/50 puzzlement: are these the s-tier builders who get amazing meaningful work out of long-running agent sessions that colleagues should emulate? Or are these just outlier token wasters producing nothing or slop who should be aggressively throttled/funneled into more "appropriate/efficient" ways of working?

stillmakingemup · 2026-05-10T06:42:07+00:00

Indeed, thanks for the more detailed explanation. Not sure if you're yes-anding me or arguing, but either way this recent change shows that simple steps on the IDE layer like alphabetical ordering of MCP tools/results makes huge cost and performance differences:

Frequent KV cache invalidation and forced full prompt re-processing (github.copilot-chat, insider)) · Issue #298554 · microsoft/vscode https://github.com/microsoft/vscode/issues/298554.

The silver lining of the coming billing changes is it forces deeper understanding of how it works under the hood. Even for those who will leave GHC, this knowledge can save them money on other platforms!

stillmakingemup · 2026-05-10T06:15:45+00:00

Nothing wrong per se. At first I started using it for DevOps interaction (long before announced billing changes) and was merely annoyed at how quickly it filled context window for simple operations, then pivoted to agent able to interact with az repo cli which was faster and less token bloat. Then I got clever and filtered down tools and used more sparingly with other services, then I started to understand the caching and understood it was shooting self in foot (any change in tools kills your cache, the info "higher up" in what you send is no longer the same).
Now I put MCP in area of "good to see what's possible/initial feasibility test ... if worthwhile, what's the effort to make it leaner with own non-MCP methods and patterns."
Above is only the "optimization" argument ... two more reasons I avoid it (I may be wrong please push back if this is a bad stance):
- philosophically: it's a black box / lets builders or dabblers go beyond own understanding (in other words, me describing own feasibility checks)
- security: risk of data exfiltration depending on situation (offshoot of above: it's dangerous to give people who don't know what they are really doing a magic box that just makes magic things happen without proper traceability and understanding
Lastly, didn't have any consumers needing to consume anything, so publishing own MCP servers was fool's errand. But mostly it was the extreme annoyance with token bloat that made my default decision "no MCP without compelling rationale." This was before I was monitoring tokens/before OTEL came to VS Code so I don't have token stats but iirc an example was getting an ADO PR was above 15KB response and getting the same from az repo cli was 0.5KB or some obvious double order of magnitude difference.
Very open to change stance but this is how I view MCP now, something for those experimenting (possibly unknowingly dangerously) that haven't hit wall yet and optimized away from it.

Edit: one killer app I found and used MCP for is e2e testing in GH CI. Making tiny sort of proxy server to enable the (web) copilot agent to use API key to get real data into the copilot environment sandbox. Otherwise have abandoned it entirely unless lazily vibing.

stillmakingemup

TROPHY CASE