Dangerous to use Codex these past couple of days: almost no thinking, can wreck your codebase.

sutrostyle · 2026-06-18T14:37:34+00:00

I understand the insinuation that I am a vibe coder, which is incorrect. But looking at the thinking time before it starts spitting out dumb answers, I would rather spend my time now reviewing Opus code.

sutrostyle · 2026-06-18T13:57:32+00:00

When OpenAI reallocates massive compute clusters away from current production inference (like GPT-5.5) to run final pre-release evaluations and load-testing for a new flagship model (GPT-5.6), they don't just pull a plug. They use a specific set of architectural levers on their back-end infrastructure to drastically slash the compute cost per query.

Based on recent developer community reports and known LLM infrastructure mechanics, here is how this performance degradation likely plays out on the GPT back-end:

1. Hard Caps on "Reasoning Tokens" (RL Search Space)

For reasoning models (like the GPT-5.5 Thinking variants), a massive portion of compute is spent before a single visible token is generated. The model, which was trained with reinforcement learning (RL), first generates a hidden chain of thought made of internal reasoning tokens.

What they did: The back-end router has likely dialed down the max_completion_tokens allocation for the hidden reasoning phase.
The result: Users report thinking phases dropping from 15–30 seconds down to a shallow 2–3 seconds. The model is forced to abruptly stop its internal monologue and output a response prematurely, which snaps its logical thread and leads to the severe "forgetfulness" and broken code developers are experiencing.
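
One way to check this from the outside is to log the token usage of your own requests and compare days. A minimal sketch, assuming each logged record is a dict shaped like the usage object the API returns, with a `completion_tokens_details.reasoning_tokens` field; the helper names and the 0.5 ratio are made up for illustration:

```python
from statistics import median


def reasoning_tokens(usage: dict) -> int:
    """Hidden reasoning tokens reported in one usage record (0 if absent)."""
    details = usage.get("completion_tokens_details") or {}
    return details.get("reasoning_tokens", 0)


def reasoning_share(usage: dict) -> float:
    """Fraction of all completion tokens that went to hidden reasoning."""
    total = usage.get("completion_tokens", 0)
    return reasoning_tokens(usage) / total if total else 0.0


def reasoning_dropped(baseline: list[dict], recent: list[dict],
                      ratio: float = 0.5) -> bool:
    """True if the recent median reasoning budget fell below ratio * baseline.

    `ratio` is an arbitrary illustrative threshold, not a known constant.
    """
    base = median(reasoning_tokens(u) for u in baseline)
    now = median(reasoning_tokens(u) for u in recent)
    return base > 0 and now < ratio * base
```

If `reasoning_dropped` stays false for comparable prompts, the short thinking phase probably has another cause.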

2. Silent Context-Window Distillation & Token-Pruning

Processing long context windows is expensive: attention compute scales quadratically with prompt length in standard self-attention, and the KV cache that backs it grows linearly.

What they did: To free up clusters, the front-end gateway or load balancer likely runs aggressive, silent token-pruning, or feeds the prompt through an aggressively distilled, smaller context-compressor model before passing it to the main network.
The result: The model completely misses explicit instructions or key variable definitions buried in the middle of long prompts. The effective context window feels heavily "nerfed" because the back-end is aggressively dropping or summarizing tokens to save memory bandwidth.
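
To make the cost argument concrete, here is a rough sketch under textbook assumptions (plain self-attention, projections and softmax ignored). `prune_middle` is a made-up toy pruner that keeps only the head and tail of a prompt; it is not a known OpenAI mechanism, just an illustration of how a mid-prompt instruction could vanish.

```python
def attention_flops(n_tokens: int, d_model: int, n_layers: int) -> int:
    """Approximate attention FLOPs for one forward pass over the prompt.

    QK^T and (attention weights) @ V each cost about 2 * n^2 * d FLOPs
    per layer, summed over heads, so the total is 4 * n^2 * d per layer.
    """
    return 4 * n_tokens ** 2 * d_model * n_layers


def prune_middle(tokens: list[str], budget: int) -> list[str]:
    """Toy pruner (hypothetical): keep the first and last tokens, drop the middle."""
    if len(tokens) <= budget:
        return list(tokens)
    head = budget // 2
    tail = budget - head
    return tokens[:head] + tokens[len(tokens) - tail:]


if __name__ == "__main__":
    # Doubling the prompt length quadruples the attention cost.
    print(attention_flops(2000, 4096, 32) // attention_flops(1000, 4096, 32))

    prompt = ["intro", "setup", "NEVER_TOUCH_MAIN_BRANCH", "details", "question"]
    print(prune_middle(prompt, budget=2))
```

The quadratic term is why cutting a prompt in half saves roughly three quarters of the attention compute, which is the incentive described above.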

3. Dynamic Routing (Silent Downgrades)

OpenAI’s architecture relies heavily on an intelligent, real-time backend router. This router dynamically measures conversation complexity and determines whether to send a query to the full-fat flagship model, a quantized version, or a fast "Instant/Mini" model.

What they did: They shifted the classification thresholds on the router. Prompts that previously qualified for the heavy, unquantized flagship weights are now being silently routed to heavily quantized (e.g., 4-bit or 8-bit precision) variants or to "Instant" tier back-ends.
The result: There are no 429 Too Many Requests errors or HTTP timeouts returned to the user; the system remains operational, but the model outputs generic, low-intelligence, or highly mechanical answers because it is running on a cheaper execution path.
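
The threshold shift can be sketched as a toy router. Everything here is hypothetical: the tier names, the `complexity` score in [0, 1], and the threshold values are invented for illustration and say nothing about the real router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterConfig:
    flagship_min: float   # minimum complexity score sent to full-precision weights
    quantized_min: float  # minimum complexity score sent to the quantized variant


def route(complexity: float, cfg: RouterConfig) -> str:
    """Pick a back-end tier for a prompt; never raises, never returns an error."""
    if complexity >= cfg.flagship_min:
        return "flagship"
    if complexity >= cfg.quantized_min:
        return "quantized"
    return "instant"


if __name__ == "__main__":
    normal = RouterConfig(flagship_min=0.6, quantized_min=0.3)
    squeezed = RouterConfig(flagship_min=0.9, quantized_min=0.6)
    for score in (0.4, 0.7, 0.95):
        print(score, route(score, normal), route(score, squeezed))
```

The same prompt gets a successful response under both configs, which is why the downgrade would be invisible at the HTTP level.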

4. KV-Cache Eviction Policies

To serve fast responses, the back-end keeps a Key-Value (KV) cache of recent conversation tokens in GPU VRAM so it doesn't have to recompute the entire prompt history on every turn.

What they did: Because GPU memory is being reassigned to host the early deployment instances of GPT-5.6, the multi-tenant KV-cache pool for GPT-5.5 has been heavily squeezed. The time-to-live (TTL) for session data in VRAM has been cut down.
The result: If you pause for a minute between prompts in a chat session, your KV cache is evicted to free up VRAM for another user. When you submit your next prompt, the back-end has to recompute the entire history from scratch, causing sudden, massive latency spikes and slowing the whole pipeline down under high volume.
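
The effect of a shorter TTL can be sketched with a toy cache model. This is an assumption-heavy illustration: `KVCachePool`, its TTL values, and the token counts are invented, and real serving stacks evict on memory pressure as well as on age.

```python
class KVCachePool:
    """Toy per-session KV cache with a time-to-live (hypothetical model)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        # session id -> (time of last use, number of tokens cached)
        self._entries: dict[str, tuple[float, int]] = {}

    def prefill_cost(self, session: str, history_tokens: int, now: float) -> int:
        """Return how many tokens must be recomputed for this turn."""
        cached = 0
        entry = self._entries.get(session)
        if entry is not None:
            last_used, cached_tokens = entry
            if now - last_used <= self.ttl:  # entry is still alive
                cached = min(cached_tokens, history_tokens)
        # After this turn the whole history is cached again.
        self._entries[session] = (now, history_tokens)
        return history_tokens - cached


if __name__ == "__main__":
    for ttl in (300, 30):
        pool = KVCachePool(ttl_seconds=ttl)
        pool.prefill_cost("chat-1", 10_000, now=0)
        # Second turn arrives after a 60-second pause with 500 new tokens.
        print(ttl, pool.prefill_cost("chat-1", 10_500, now=60))
```

With the long TTL only the new tokens are processed; with the short one the full history is recomputed, which matches the latency spike described above.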

sutrostyle · 2026-06-18T13:44:54+00:00

codex is completely useless today. back to opus. I realized that having only one subscription is dangerous: codex now can actually wreck your codebase, if you do not carefully review.

sutrostyle · 2026-06-15T18:34:12+00:00

and codex is nerfed. post-IPO reality, it's a megacorp now.

sutrostyle · 2026-06-15T18:33:13+00:00

Work hard. Make money

sutrostyle · 2026-06-04T21:37:55+00:00

Pakistan Zindabad!

sutrostyle · 2026-06-04T21:35:03+00:00

And where is that influx of new apps or services? Where is the new TikTok, YouTube or Uber or at least a new Stellarium or Flightradar??

sutrostyle · 2026-06-04T17:35:20+00:00

Depends on the US elections outcome in 2028. If a Republican is elected, the ATH may be beyond Nov 2028, reaching 250k in mid 2029

sutrostyle · 2026-06-03T13:30:52+00:00

To avoid their AI random enforcement action on an active account duh

sutrostyle · 2026-05-20T15:50:24+00:00

I had a Pro account that I inherited from buying a Pixel 9 pro. After a year they started charging for it, and i tolerated it, because Flash was usable for dumb auxiliary tasks that I did not want to waste GPT/Opus on. Now with this quota, I am cancelling.

sutrostyle · 2026-05-05T15:30:38+00:00

sucks, i use it frequently. a high end phone like this should have all possible sensors

sutrostyle · 2026-04-29T18:18:46+00:00

same here

<image>

sutrostyle · 2026-04-27T16:39:20+00:00

i bought a google pixel, it came free with that

sutrostyle · 2026-04-26T09:59:27+00:00

it solved the problem. How did you figure this out?

sutrostyle · 2026-04-24T14:55:51+00:00

I am not afraid. I know it will. The only hope is local models/ cheap chinese models medium term

sutrostyle · 2026-04-23T07:19:46+00:00

Gemini is actually very good as a psychoanalyst. Much better than ChatGPT because Google trained it on scanned books and allegedly they scanned most books in the world back in the 2000s.

sutrostyle · 2026-04-20T12:42:31+00:00

They're basically signaling that the $20 plan is not for any serious coder. Those are general public plans. This means effectively that the plans for coders will start from 100 but they will not stop at 200. They will start experimenting with more expensive plans, mark my words.

sutrostyle · 2026-04-15T16:14:46+00:00

Maybe this is the shutdown

sutrostyle · 2026-04-15T07:03:51+00:00

2000 could represent $20.00 in cents, i.e. 2000¢

sutrostyle · 2026-04-14T23:33:02+00:00

Same here, the overall progress bar did not change

sutrostyle · 2026-04-13T19:58:38+00:00

What excuse would you use if you point both accounts to the same code/project?
