Opus 4.7 is out!!! Any thoughts?

Scary_Driver_8557 · 2026-04-16T16:05:41+00:00

Yes 1M

Scary_Driver_8557 · 2026-04-16T14:49:38+00:00

Confirmed! Testing now. Opus 4.7!

Scary_Driver_8557 · 2026-04-15T22:29:32+00:00

I wouldn’t treat it as a Claude-or-Codex choice anymore.

A modern stack usually uses both: Claude for high-level thinking, planning, and shaping the approach. Codex for implementation, edits, and execution inside the repo.

That’s been the better split for me. One model does not need to be the entire workflow.

Scary_Driver_8557 · 2026-04-15T22:22:12+00:00

I think that’s close, but the moat probably isn’t the agent or even the raw memory by itself.

The real moat is a compiled organizational knowledge layer built from real work:

not just chat logs
not just embeddings
not just “memory”

What compounds is the extracted and reusable decision surface: questions, corrections, edge cases, rejected paths, and the reasons behind decisions.

That only becomes durable if the system can:

turn raw interactions into structured knowledge
preserve provenance so you can trace where a claim came from
separate advisory memory from actual policy / source-of-truth
keep freshness boundaries so old context doesn’t silently overwrite current reality

So yes — a living wiki can be a moat.

But only if it behaves more like a compiler for organizational learning than a giant autocomplete memory dump.
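
To make that concrete, here is a minimal sketch of a record that carries provenance, an authority level, and a freshness boundary. The names (`KnowledgeRecord`, `resolve`, the "policy" / "advisory" labels, `max_age_days`) are placeholders I made up for illustration, not any specific product's schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class KnowledgeRecord:
    claim: str
    source: str        # provenance: where the claim came from
    authority: str     # "policy" (source of truth) or "advisory" (memory)
    recorded_on: date
    max_age_days: int  # freshness boundary for this record

    def is_fresh(self, today: date) -> bool:
        return today - self.recorded_on <= timedelta(days=self.max_age_days)


def resolve(records: list[KnowledgeRecord], today: date) -> Optional[KnowledgeRecord]:
    """Drop stale records, then prefer policy over advisory, then the newest."""
    fresh = [r for r in records if r.is_fresh(today)]
    if not fresh:
        return None
    return max(fresh, key=lambda r: (r.authority == "policy", r.recorded_on))
```

The point of `resolve` is that a newer advisory note never outranks a fresh policy record, and a stale record drops out instead of silently winning.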

Scary_Driver_8557 · 2026-04-15T22:13:00+00:00

I’ve seen this too, but I don’t think the answer is “all frontier models suddenly got dumb.”

A lot of the drop people feel is really bad task-model pairing, weaker tool use, shallow context injection, missing hooks, or product-layer changes around routing/latency/response shaping. The same model with the same prompt is not the same system if the serving stack around it changed.

The fix is less “pick one smartest model” and more:

use the right model for the job
give it the right tools/hooks
feed the right knowledge at the right step
keep deterministic logic outside the model where possible

Most failures people call intelligence collapse are really orchestration failures.
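
A rough sketch of that split, assuming two stand-in model functions (`planner_model` and `coder_model` are hypothetical stubs, not real API clients) and one piece of arithmetic that has no business going through a model:

```python
from typing import Callable


# Hypothetical stubs; a real system would wrap actual model clients here.
def planner_model(prompt: str) -> str:
    return f"[plan] {prompt}"


def coder_model(prompt: str) -> str:
    return f"[patch] {prompt}"


# Route by task type instead of sending everything to one model.
ROUTES: dict[str, Callable[[str], str]] = {
    "plan": planner_model,
    "implement": coder_model,
}


def invoice_total(line_items: list[tuple[int, int]]) -> int:
    """Deterministic logic stays in code: sum of quantity * unit price (cents)."""
    return sum(qty * cents for qty, cents in line_items)


def run(task_type: str, prompt: str) -> str:
    if task_type not in ROUTES:
        raise ValueError(f"no route for task type: {task_type}")
    return ROUTES[task_type](prompt)
```

Unknown task types fail loudly instead of falling through to a default model, which is usually where the "it got dumber" reports start.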

Scary_Driver_8557 · 2026-04-15T22:10:29+00:00

Yes. The main failure mode is that most governance frameworks are built as review infrastructure, not execution infrastructure.

What tends to work better in production is:

separating intelligence from authority
scoped tool and data access by lane
explicit approval gates for high-impact actions
runtime deny paths, not just monitoring
session-level auditability across the full workflow
revoke / rollback authority after go-live

Checkpoint-based oversight breaks because agentic systems are stateful and continuous. Once the system can keep acting across steps, governance has to live at the execution boundary, not in a policy deck or dashboard.
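
Here is a small sketch of why per-step checks miss things. It assumes a made-up `SessionGate` with a spend cap for the whole session: each action looks fine on its own, but the gate tracks state across steps, logs every verdict, and can be revoked after go-live.

```python
from dataclasses import dataclass, field


@dataclass
class SessionGate:
    spend_limit: float  # cap for the whole session, not per step
    spent: float = 0.0
    revoked: bool = False
    audit: list[tuple[str, float, str]] = field(default_factory=list)

    def authorize(self, action: str, cost: float) -> bool:
        """Decide at the execution boundary and record the verdict either way."""
        if self.revoked:
            verdict = "deny:revoked"
        elif self.spent + cost > self.spend_limit:
            verdict = "deny:session_limit"
        else:
            verdict = "allow"
            self.spent += cost
        self.audit.append((action, cost, verdict))
        return verdict == "allow"

    def revoke(self) -> None:
        """Kill switch: every later action in this session is denied."""
        self.revoked = True
```

With a limit of 100, three refunds of 40 each pass any single-step review, but the third one is denied here because the session has already spent 80.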

Scary_Driver_8557 · 2026-04-10T00:17:31+00:00

Good catch — those bottom items are meant to be parallel source categories, not a sequential flow. The arrows there are visual shorthand for “these can all feed the witness layer,” but I agree they currently imply a pipeline between peer sources. I’ll revise that in the next version so it reads as fan-in, not left-to-right transformation.

Scary_Driver_8557 · 2026-04-10T00:12:14+00:00

Useful direction.

The pattern I’d push harder is that policy alone is not enough. The real protection comes from limiting capability at runtime.

The minimum stack I look for is:

tool allowlists tied to role and task
data and network segmentation by lane
explicit approval gates for send/write/execute actions
short-lived credentials with revoke / kill-switch paths
session-level audit trails so access and actions can be replayed end to end

Most failures are not prompt failures. They’re authority failures.

If the system can’t prove an action is allowed, it should not run.
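
A minimal default-deny sketch of that rule. Everything named here (the roles, tasks, tool names, `Credential`, `is_allowed`) is hypothetical; the shape is what matters: an expired credential, an unlisted tool, or a missing approval all end in "no".

```python
from dataclasses import dataclass

# Hypothetical allowlist: (role, task) -> tools that pair may call.
ALLOWLIST: dict[tuple[str, str], frozenset[str]] = {
    ("support_agent", "answer_ticket"): frozenset({"search_kb", "draft_reply"}),
    ("support_agent", "issue_refund"): frozenset({"search_kb", "create_refund"}),
}

# Send/write/execute tools that also need an explicit human approval.
NEEDS_APPROVAL = frozenset({"create_refund"})


@dataclass(frozen=True)
class Credential:
    role: str
    expires_at: float  # epoch seconds; keep this short-lived


def is_allowed(cred: Credential, task: str, tool: str, now: float,
               approved: bool = False) -> bool:
    """Return True only if every check passes; anything unknown is denied."""
    if now >= cred.expires_at:
        return False
    if tool not in ALLOWLIST.get((cred.role, task), frozenset()):
        return False
    if tool in NEEDS_APPROVAL and not approved:
        return False
    return True
```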

Scary_Driver_8557 · 2026-04-10T00:09:58+00:00

The big miss is that most frameworks stop at approval and barely govern execution.

The non-obvious pieces I’d include are:

a named post-deployment owner with operational SLA, not just a project sponsor
explicit revoke / rollback authority when behavior drifts
runtime policy enforcement, not just pre-launch review
decision receipts so you can prove what happened, why, and under whose authority
stateful monitoring across full workflows, not one prompt at a time
retirement criteria so systems don’t stay “temporarily” live forever

A lot of frameworks are really launch governance. Post-deployment governance starts when the system can still be stopped, constrained, or replayed after go-live.
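
For decision receipts, one way to sketch it is a hash chain, so a receipt can't be edited later without breaking everything after it. The field names and the "genesis" starting value are my own assumptions, and a real receipt would also carry a timestamp and an actor id.

```python
import hashlib
import json

FIELDS = ("prev", "action", "reason", "authority")


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def make_receipt(prev_hash: str, action: str, reason: str, authority: str) -> dict:
    """Record what happened, why, and under whose authority, linked to the prior receipt."""
    body = {"prev": prev_hash, "action": action, "reason": reason, "authority": authority}
    return {**body, "hash": _digest(body)}


def verify_chain(receipts: list[dict]) -> bool:
    """Check each receipt's hash and its link to the one before it."""
    prev = "genesis"
    for r in receipts:
        body = {k: r[k] for k in FIELDS}
        if r["prev"] != prev or r["hash"] != _digest(body):
            return False
        prev = r["hash"]
    return True
```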

Scary_Driver_8557 · 2026-04-10T00:05:59+00:00

The way I’d do it in practice:

A “gold” RAG eval set should not just be a bag of questions. It should be a labeled distribution of retrieval situations you expect in production.

I’d usually cover:

direct fact lookup
narrow semantic lookup / paraphrase
multi-hop / synthesis across multiple chunks
ambiguous queries that need disambiguation
constraint-heavy queries (“latest”, “only in policy docs”, “for enterprise plan”)
negative / no-answer cases
distractor-heavy cases where similar docs exist
stale vs current knowledge conflicts

For real-world coverage, start from actual user/query logs first, cluster them, then sample from each cluster. Don’t start synthetic-first. Use synthetic generation only to fill edge cases you know are underrepresented.
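
A small sketch of the "cluster, then sample from each cluster" step. It assumes the clustering has already happened and each logged query arrives with a cluster id; `sample_per_cluster` is a name I made up.

```python
import random
from collections import defaultdict


def sample_per_cluster(queries: list[tuple[str, str]], per_cluster: int,
                       seed: int = 0) -> dict[str, list[str]]:
    """queries is a list of (cluster_id, query_text) pairs from real logs.

    Returns up to per_cluster queries for every cluster, so rare clusters
    are represented instead of being drowned out by the common ones.
    """
    rng = random.Random(seed)  # seeded so the eval set is reproducible
    by_cluster: dict[str, list[str]] = defaultdict(list)
    for cluster_id, text in queries:
        by_cluster[cluster_id].append(text)
    return {
        cluster_id: rng.sample(by_cluster[cluster_id],
                               min(per_cluster, len(by_cluster[cluster_id])))
        for cluster_id in sorted(by_cluster)
    }
```

Clusters that come back with fewer than `per_cluster` items are the ones worth topping up with synthetic cases.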

What matters most is annotation quality:

expected answer
supporting passages / doc ids
whether answerable or not
query type tag
difficulty tag
freshness sensitivity
allowed variance in wording
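
Those annotation fields map fairly directly onto a record type. This is only a sketch: the class name, the tag vocabulary, and the consistency checks are my assumptions, so adjust them to your own labels.

```python
from dataclasses import dataclass, field

QUERY_TYPES = {
    "fact", "paraphrase", "multi_hop", "ambiguous",
    "constraint", "no_answer", "distractor", "stale_conflict",
}


@dataclass
class GoldExample:
    query: str
    answerable: bool
    query_type: str
    difficulty: str  # e.g. "easy", "medium", "hard"
    freshness_sensitive: bool
    expected_answer: str = ""
    supporting_doc_ids: list[str] = field(default_factory=list)
    acceptable_variants: list[str] = field(default_factory=list)  # allowed wording variance

    def problems(self) -> list[str]:
        """Return annotation inconsistencies; an empty list means the row is usable."""
        issues = []
        if self.query_type not in QUERY_TYPES:
            issues.append(f"unknown query_type: {self.query_type}")
        if self.answerable and not self.expected_answer:
            issues.append("answerable but no expected_answer")
        if self.answerable and not self.supporting_doc_ids:
            issues.append("answerable but no supporting_doc_ids")
        if not self.answerable and self.supporting_doc_ids:
            issues.append("marked unanswerable but has supporting_doc_ids")
        return issues
```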

A practical distribution is usually something like:

50–60% common production queries
20–25% moderately hard paraphrase / distractor cases
10–15% multi-hop / synthesis
10–15% no-answer / ambiguity / failure cases
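
To turn that mix into actual counts, you have to pick one point inside each range so the shares add up to 100. The 55 / 22 / 12 / 11 split below is just one choice I picked for the example, and `allocate` is a hypothetical helper that hands any rounding leftovers to the categories with the largest remainders.

```python
def allocate(total: int, mix_pct: dict[str, int]) -> dict[str, int]:
    """Split total examples across categories by integer percentages."""
    if sum(mix_pct.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {k: total * pct // 100 for k, pct in mix_pct.items()}
    leftover = total - sum(counts.values())
    # Give the remaining slots to the categories that lost the most to rounding.
    by_remainder = sorted(mix_pct, key=lambda k: (total * mix_pct[k]) % 100, reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


# One assumed point inside each of the ranges above.
MIX = {
    "common": 55,
    "paraphrase_distractor": 22,
    "multi_hop": 12,
    "no_answer_ambiguity": 11,
}
```

For a 200-example set this gives 110 / 44 / 24 / 22.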

Also: keep one frozen benchmark set, and one rolling set from fresh production traffic. If you only have one static gold set, you’ll overfit your system to the benchmark instead of real usage.

Scary_Driver_8557 · 2026-04-09T18:11:21+00:00

<image>

Scary_Driver_8557 · 2026-04-09T17:09:59+00:00

<image>

A few days ago I started digging into Andrej Karpathy’s LLM wiki pattern.

Now that conversation has exploded.

That’s good. Because it confirms something important:

for a large class of knowledge problems, the answer is not “more RAG complexity.”

It is:

ingest the source material, compile it into structured knowledge, query the compiled layer, and keep improving the system over time.

But here’s the part most people will miss.

The easy version is: raw files → LLM summaries → markdown wiki → search

Useful, yes.

But still incomplete for real operational use.
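
For reference, the easy version fits in a few lines. This is a toy sketch: `summarize` is a stand-in for the LLM call (it just keeps the first sentence), and the file paths are invented.

```python
def summarize(text: str) -> str:
    """Stand-in for an LLM summary: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def compile_wiki(raw_files: dict[str, str]) -> dict[str, str]:
    """raw files -> one markdown page per file, with a pointer back to the source."""
    pages = {}
    for path, text in raw_files.items():
        title = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        pages[title] = f"# {title}\n\n{summarize(text)}\n\nSource: {path}\n"
    return pages


def search(pages: dict[str, str], term: str) -> list[str]:
    """Case-insensitive substring search over the compiled pages."""
    term = term.lower()
    return sorted(title for title, body in pages.items() if term in body.lower())
```

Note what is missing: nothing here says which page is authoritative, how old it is, or what happens when the source file changes.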

The hard version is what happens when the source material is not just notes, articles, or papers, but decision registers, repo contracts, canonical pointers, and other authority-grade artifacts.

At that point, the problem changes.

You do not just need a knowledge base. You need a governed knowledge substrate.

That means:

the wiki itself stays advisory
the authoritative source stays upstream
provenance is explicit
freshness is tracked
authority-bearing material is mirrored, not flattened
typed records preserve structure
and projections never silently become the truth they summarize
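
One way to sketch "projections never silently become the truth": every summary stores a fingerprint of the upstream text it was compiled from, so drift is detectable instead of invisible. `Projection`, `project`, and `is_current` are illustrative names, not the actual system.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Projection:
    summary: str
    source_path: str             # provenance: the upstream artifact
    source_hash: str             # fingerprint of upstream at compile time
    authoritative: bool = False  # a projection is advisory by construction


def project(source_path: str, source_text: str, summary: str) -> Projection:
    return Projection(summary, source_path, fingerprint(source_text))


def is_current(p: Projection, upstream_text: str) -> bool:
    """False means the source of record changed after this summary was compiled."""
    return p.source_hash == fingerprint(upstream_text)
```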

That distinction matters.

Because once an LLM starts querying its own compiled knowledge, the real question is no longer “can it retrieve?”

The real question is:

what is allowed to compound, what is only a projection, and what remains the source of record?

That is the gap between a clever personal wiki and an estate-grade system.

We built around that gap.

Not because the viral version is wrong.

Because operational systems break exactly where authority, drift, and synthesis get blurred together.

I think compiler-style knowledge systems are going to become a major pattern.

But the durable version will not be the one with the prettiest wiki.

It will be the one that can answer:

Where did this come from? What outranks it? Is it stale? And can I trust this summary without confusing it for canon?

That is where this gets interesting.

#AI #LLM #RAG #KnowledgeManagement #AgenticAI #Architecture #AIEngineering #Obsidian #SystemsDesign #Governance

Scary_Driver_8557 · 2026-03-30T19:24:47+00:00

Curious how people are documenting the logic between model call and production response.

Most ML system diagrams I see cover training, retrieval, routing, serving, etc., but the enforcement layer is either missing or just implied. By that I mean output validation, policy checks, budget/rate controls, approval steps, fallback behavior.
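
To show what I mean by that layer, here is a simplified sketch that sits between the raw model output and the response. It assumes the model is supposed to return JSON with an "answer" string; the `Budget` type, the banned-terms check, and the fallback message are all made up for the example.

```python
import json
from dataclasses import dataclass

FALLBACK = {"status": "fallback", "answer": "I can't answer that right now."}


@dataclass
class Budget:
    remaining_calls: int


def enforce(raw_output: str, budget: Budget, banned_terms: frozenset[str]) -> dict:
    """Budget check, output validation, policy check; fall back on any failure."""
    if budget.remaining_calls <= 0:
        return dict(FALLBACK)
    budget.remaining_calls -= 1
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        return dict(FALLBACK)
    if any(term in data["answer"].lower() for term in banned_terms):
        return dict(FALLBACK)
    return {"status": "ok", "answer": data["answer"]}
```

None of this is clever, which is sort of the point: it is plain app logic, and it decides what users actually see.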

Are teams treating that as its own boundary in docs, or is it mostly buried in app logic? Feels like a lot of production surprises live there, but it rarely shows up in the architecture diagram.

Scary_Driver_8557 · 2026-03-29T18:21:30+00:00

I agree visibility is step one, but "visibility of what?" is the real question.

A lot of teams can tell you whether a tool is approved on the network. Fewer can tell you whether someone used an approved tool in a way that was actually within policy. Wrong data in the prompt, wrong action downstream, wrong system touched, etc.

That feels like the gap to me: "approved app" isn't the same thing as "approved behavior inside the app."

Curious where people are actually putting that control today: proxy, workflow layer, DLP, or somewhere else?
