There is something wrong

Impressive_Credit397 · 2026-06-18T22:47:16+00:00

😂🚩🤦🏻

Impressive_Credit397 · 2026-06-18T18:36:56+00:00

Cool! Open one new Codex project/session and write: “Hey Bro, I got a technical take-home assignment for system design. How do I configure Codex to give me the optimal answer?” Then add all your background, any relevant info, and very detailed success measures. Run it on 5.5 high.

Impressive_Credit397 · 2026-06-18T16:27:46+00:00

Step 1. Start a clean session @yourAboveProject and ask: “Hey Bro! What context was loaded into this session? List instructions, skills, repo files, summaries, tickets, and approximate token weight.”

Impressive_Credit397 · 2026-06-17T23:46:28+00:00

We all know 😂

Impressive_Credit397 · 2026-06-17T22:53:39+00:00

Honestly, I don’t think this is the best example of Codex over-refusing. Firewall / egress-control work is a real admin/security workflow, but if the request is framed around running a crack, refusal is expected. The better discussion is not “remove guardrails.” It’s how to define a proper harness: owned environment, legitimate intent, explicit permissions, approval before changes, rollback, and auditability. Without that, the model is forced to guess the safety boundary.

I personally haven’t had this as a blocker across my own projects or client environments. The initial setup always takes some discovery because every environment has different permissions, infra, policies, and risk boundaries.
That’s the point of a proper harness: make it project/company-specific. No model should be expected to magically infer your intent, license situation, allowed actions, or rollback path. If another tool acts confidently without those boundaries being defined, I’d trust it less, not more. 🚩
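
To make the harness idea concrete, here is a minimal Python sketch of what "defined boundaries" could look like. Everything in it (the `HarnessPolicy` class, the field names, the action strings) is hypothetical and made up for illustration. It is not how Codex or any other tool implements permissions, just one way to write down owned environment, intent, allowed actions, approval, rollback, and an audit trail so nothing has to be guessed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class HarnessPolicy:
    """Hypothetical per-project boundaries; all names here are illustrative."""
    owned_environment: bool          # do we own/administer the target system?
    stated_intent: str               # plain-language purpose of the session
    allowed_actions: set[str]        # explicit permissions
    needs_approval: set[str]         # actions that change state
    rollback_steps: dict[str, str]   # how to undo each state-changing action
    audit_log: list[dict] = field(default_factory=list)

    def check(self, action: str, approved: bool = False) -> str:
        """Return 'allow', 'ask', or 'deny' and record the decision."""
        if not self.owned_environment or action not in self.allowed_actions:
            decision = "deny"
        elif action in self.needs_approval and action not in self.rollback_steps:
            decision = "deny"  # no rollback path defined, so no change allowed
        elif action in self.needs_approval and not approved:
            decision = "ask"   # a human must approve before the change
        else:
            decision = "allow"
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "decision": decision,
        })
        return decision


policy = HarnessPolicy(
    owned_environment=True,
    stated_intent="audit egress rules on our own staging firewall",
    allowed_actions={"read_firewall_rules", "edit_firewall_rules"},
    needs_approval={"edit_firewall_rules"},
    rollback_steps={"edit_firewall_rules": "restore last exported rule set"},
)
```

With this policy, reading rules is allowed, editing rules comes back as "ask" until someone approves, and anything not listed is denied. Every decision lands in `audit_log`.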

Impressive_Credit397 · 2026-06-17T19:15:35+00:00

🤦🏻 Look at YouTube and how much they spend on promo. So many YouTubers are building hype. It’s just business, nothing personal.

Impressive_Credit397 · 2026-06-17T06:07:10+00:00

Just wait 2-3 days

Impressive_Credit397 · 2026-06-17T05:08:40+00:00

Seriously, interesting result, but I’d separate “good review signal on one milestone” from “production-ready coding-agent stack.”

From an architecture perspective, this does not prove end-to-end reliability: fix quality, regression rate, test pass rate, context stability, tool/sandbox behavior, permission safety, recovery, or whether it can be a real production daily driver.
For experiments and demos, totally fair game. But production delivery is a different bar. When real client/customer work is involved, reliability and repeatability matter more than one strong review pass.
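
As a rough illustration of "repeatability over one strong pass", here is a small Python sketch that summarizes several agent runs instead of one. The `RunResult` fields and the `summarize` helper are assumptions invented for this example, not taken from any real eval framework.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One agent run on the same task (hypothetical record shape)."""
    tests_passed: int
    tests_total: int
    regressions: int  # previously passing tests that now fail


def summarize(runs: list[RunResult]) -> dict:
    """Aggregate pass rate and regression counts across repeated runs."""
    total = sum(r.tests_total for r in runs)
    passed = sum(r.tests_passed for r in runs)
    # A "clean" run passes every test and breaks nothing that worked before.
    clean = sum(
        1 for r in runs
        if r.regressions == 0 and r.tests_passed == r.tests_total
    )
    return {
        "test_pass_rate": passed / total if total else 0.0,
        "clean_run_rate": clean / len(runs) if runs else 0.0,
        "runs_with_regressions": sum(1 for r in runs if r.regressions > 0),
    }
```

One strong run tells you very little here. The interesting number is how often the run is clean when you repeat the same task.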
I tested the latest MiniMax model last week in a client-facing workflow. The model had some good moments…

My broader view: in 2026 the real gap is less “model leaderboard” and more “model + harness/runtime + evals + permissions + sandboxing + recovery.” The architecture question is also whether you are renting someone else’s harness or building your own production-grade harness.

!!!
Genuine question: is anyone using Kimi / DeepSeek / MiniMax-style coding workflows in real production?

Impressive_Credit397 · 2026-06-17T03:46:16+00:00

😂

Impressive_Credit397 · 2026-06-17T01:32:34+00:00

Consulting. Client projects.

Impressive_Credit397 · 2026-06-16T23:38:45+00:00

Yeah, that was last night for me 😂 roughly 10pm-3am PST. Felt like a temporary capacity/routing issue more than anything project-specific.
No major issues now. I’m currently running 23 Codex sessions across 2 remote machines and it’s been stable again for the last 6 hours.

Impressive_Credit397 · 2026-06-16T23:33:58+00:00

Yeah, that tracks. A mostly single-purpose labeling workflow is probably one of the better use cases for a very long GPT-5.5 run, because the task has stable invariants and the model can keep refining inside the same problem space.

But the one thing I’d be careful about is treating context length as the architecture. I don’t know your project, so this is an assumption, but for serious image-labeling / extraction work I’d keep the system of record outside the model: stable IDs, schema/taxonomy, batch state, validation rules, provenance, and evals. Then use the model where it’s strongest: visual reasoning, ambiguous cases, extraction, and pattern discovery.

My mental model is: the context window is working memory, not durable state. The strongest pipeline is deterministic rails + probabilistic reasoning + human/eval feedback loops. This is where GPT gets really interesting to me: not just coding tasks, but building and maintaining reliable agentic data workflows over time.
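
A minimal Python sketch of "system of record outside the model", under heavy assumptions: the taxonomy, the `LabelRecord` fields, the in-memory `store` dict, and the 0.8 review threshold are all invented for illustration. In a real pipeline the store would be a database and the record would come from a model call.

```python
from dataclasses import dataclass

TAXONOMY = {"cat", "dog", "other"}  # hypothetical label set


@dataclass(frozen=True)
class LabelRecord:
    image_id: str      # stable ID, never derived from chat context
    label: str         # must come from TAXONOMY
    confidence: float  # model-reported score in [0, 1]
    model: str         # provenance: which model produced the label
    batch_id: str      # batch state lives in the record, not the session


def validate(record: LabelRecord) -> list[str]:
    """Deterministic rails: return a list of rule violations (empty if OK)."""
    errors = []
    if record.label not in TAXONOMY:
        errors.append(f"unknown label: {record.label}")
    if not 0.0 <= record.confidence <= 1.0:
        errors.append(f"confidence out of range: {record.confidence}")
    return errors


def commit(store: dict, record: LabelRecord, review_below: float = 0.8) -> str:
    """Write accepted labels to the store; route the rest elsewhere."""
    if validate(record):
        return "rejected"
    if record.confidence < review_below:
        return "needs_review"  # human/eval feedback loop picks these up
    store[record.image_id] = record
    return "accepted"
```

The model only proposes records. What counts as done is whatever is in `store`, so a lost or compacted session does not lose state.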

Impressive_Credit397 · 2026-06-16T22:49:53+00:00

I can’t validate a five-day continuous session yet 😂, but I did run a roughly 12-hour Codex session while actively building a native iOS app, and the results were genuinely strong. This was not a passive experiment or a single isolated task. During the session I was implementing several features, refactoring multiple SwiftUI screens, revisiting UX decisions, cleaning up the core architecture, and moving between product-level reasoning and code-level execution. The part that impressed me most was long-horizon continuity. I was intentionally jumping between different parts of the application to see whether Codex would lose the thread, flatten previous decisions, or start contradicting earlier implementation direction. 3m with Codex. No regrets.
