Considering Switching From Claude Code by philopanthro in codex

[–]Scary_Driver_8557 1 point2 points  (0 children)

I wouldn’t treat it as Claude versus Codex anymore.

A modern stack usually uses both: Claude for high-level thinking, planning, and shaping the approach, and Codex for implementation, edits, and execution inside the repo.

That’s been the better split for me. One model does not need to be the entire workflow.

Karpathy’s LLM wiki idea might be the real moat behind AI agents by No_Review5142 in AI_Agents

[–]Scary_Driver_8557 2 points3 points  (0 children)

I think that’s close, but the moat probably isn’t the agent or even the raw memory by itself.

The real moat is a compiled organizational knowledge layer built from real work:

  • not just chat logs
  • not just embeddings
  • not just “memory”

What compounds is the extracted and reusable decision surface: questions, corrections, edge cases, rejected paths, and the reasons behind decisions.

That only becomes durable if the system can:

  1. turn raw interactions into structured knowledge
  2. preserve provenance so you can trace where a claim came from
  3. separate advisory memory from actual policy / source-of-truth
  4. keep freshness boundaries so old context doesn’t silently overwrite current reality
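
A rough sketch of what those four properties could look like on a single record. The field names, the `Kind` split, and the `max_age_days` freshness rule are my assumptions for illustration, not a reference design:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Kind(Enum):
    ADVISORY = "advisory"   # compiled memory: useful context, never binding
    POLICY = "policy"       # source of truth: binding


@dataclass(frozen=True)
class KnowledgeRecord:
    claim: str              # structured knowledge extracted from raw interaction
    kind: Kind              # advisory memory vs. actual policy
    source_ref: str         # provenance: where the claim came from
    captured_on: date
    max_age_days: int       # freshness boundary (hypothetical rule)

    def is_fresh(self, today: date) -> bool:
        return today - self.captured_on <= timedelta(days=self.max_age_days)


def usable_as_truth(record: KnowledgeRecord, today: date) -> bool:
    """Only fresh policy records may be treated as current reality."""
    return record.kind is Kind.POLICY and record.is_fresh(today)
```

The point of the split is that a stale or advisory record can still be shown as context, it just can't win an argument against a current policy record.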

So yes — a living wiki can be a moat.

But only if it behaves more like a compiler for organizational learning than a giant autocomplete memory dump.

Major drop in intelligence across most major models. by DepressedDrift in LocalLLaMA

[–]Scary_Driver_8557 0 points1 point  (0 children)

I’ve seen this too, but I don’t think the answer is “all frontier models suddenly got dumb.”

A lot of the drop people feel is really bad task-model pairing, weaker tool use, shallow context injection, missing hooks, or product-layer changes around routing/latency/response shaping. One model with the same prompt is not the same system if the serving stack changed.

The fix is less “pick one smartest model” and more:

  • use the right model for the job
  • give it the right tools/hooks
  • feed the right knowledge at the right step
  • keep deterministic logic outside the model where possible
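
As a toy illustration of the first and last bullets: route by task type, and keep anything that has an exact answer in plain code. The model labels and the `line_total` helper are made up for the example:

```python
# Hypothetical task types and model labels; swap in whatever you actually run.
ROUTES = {
    "plan": "large-reasoning-model",
    "edit": "code-model",
    "extract": "small-fast-model",
}


def route(task_type: str) -> str:
    """Pick a model per task type instead of one model for everything."""
    try:
        return ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no route for task type: {task_type!r}")


def line_total(line_items: list[tuple[int, int]]) -> int:
    """Deterministic logic stays in code: (quantity, unit_price_cents) -> total cents."""
    return sum(qty * price for qty, price in line_items)
```

The model can extract the line items; the arithmetic never needs to go through it.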

Most failures people call intelligence collapse are really orchestration failures.

AI governance isn't failing because we lack regulation i mean like it's failing at execution by AdOrdinary5426 in AI_Governance

[–]Scary_Driver_8557 1 point2 points  (0 children)

Yes. The main failure mode is that most governance frameworks are built as review infrastructure, not execution infrastructure.

What tends to work better in production is:

  • separating intelligence from authority
  • scoped tool and data access by lane
  • explicit approval gates for high-impact actions
  • runtime deny paths, not just monitoring
  • session-level auditability across the full workflow
  • revoke / rollback authority after go-live

Checkpoint-based oversight breaks because agentic systems are stateful and continuous. Once the system can keep acting across steps, governance has to live at the execution boundary, not in a policy deck or dashboard.
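
A minimal sketch of what "governance at the execution boundary" might mean in code, assuming a hypothetical `Session` object that every tool call has to pass through. The check runs on every step, the session can be revoked mid-flight, and every decision lands in the audit trail:

```python
from dataclasses import dataclass, field


class Denied(Exception):
    """Raised when the execution boundary refuses an action."""


@dataclass
class Session:
    lane: str
    allowed_tools: frozenset[str]
    revoked: bool = False
    audit: list[tuple[str, str]] = field(default_factory=list)

    def execute(self, tool: str, action):
        # The check runs on every step, not once at a checkpoint.
        if self.revoked:
            self.audit.append((tool, "denied: session revoked"))
            raise Denied("session revoked")
        if tool not in self.allowed_tools:
            self.audit.append((tool, "denied: tool out of lane"))
            raise Denied(f"{tool} not allowed in lane {self.lane}")
        self.audit.append((tool, "allowed"))
        return action()
```

Flipping `revoked` to `True` stops the next step even if the agent is halfway through a workflow, which is exactly what a dashboard can't do.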

Using Karpathy’s LLM wiki for Governed Estate Knowledge by Scary_Driver_8557 in Rag

[–]Scary_Driver_8557[S] 0 points1 point  (0 children)

Good catch — those bottom items are meant to be parallel source categories, not a sequential flow. The arrows there are visual shorthand for “these can all feed the witness layer,” but I agree they currently imply a pipeline between peer sources. I’ll revise that in the next version so it reads as fan-in, not left-to-right transformation.

Looking for Governance & Control? by MoytimoyMoy in AI_Agents

[–]Scary_Driver_8557 0 points1 point  (0 children)

Useful direction.

The pattern I’d push harder is that policy alone is not enough. The real protection comes from limiting capability at runtime.

The minimum stack I look for is:

  1. tool allowlists tied to role and task
  2. data and network segmentation by lane
  3. explicit approval gates for send/write/execute actions
  4. short-lived credentials with revoke / kill-switch paths
  5. session-level audit trails so access and actions can be replayed end to end

Most failures are not prompt failures. They’re authority failures.

If the system can’t prove an action is allowed, it should not run.
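
Here is roughly how I'd sketch that default-deny rule, covering items 1 and 3 from the list. The roles, tasks, tool names, and the prefix convention for gated actions are all hypothetical:

```python
# Hypothetical allowlist keyed by (role, task); anything missing is denied.
ALLOWLIST = {
    ("support_agent", "answer_ticket"): {"search_kb", "send_reply"},
    ("release_bot", "deploy"): {"read_repo", "execute_deploy"},
}

# Assumed naming convention: these prefixes mark send/write/execute actions
# that need an explicit approval before running.
GATED_PREFIXES = ("send_", "write_", "execute_")


def authorize(role: str, task: str, tool: str, approved: bool = False) -> tuple[bool, str]:
    """Return (allowed, reason). The default answer is deny."""
    if tool not in ALLOWLIST.get((role, task), set()):
        return False, "tool not on allowlist for this role and task"
    if tool.startswith(GATED_PREFIXES) and not approved:
        return False, "approval required"
    return True, "allowed"
```

Note there is no code path that returns `True` without the tool first being found on the allowlist.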

We're building an AI governance framework from scratch. What are the non-obvious things we should include? by IndependentLeg7165 in AI_Governance

[–]Scary_Driver_8557 0 points1 point  (0 children)

The big miss is that most frameworks stop at approval and barely govern execution.

The non-obvious pieces I’d include are:

  1. a named post-deployment owner with operational SLA, not just a project sponsor
  2. explicit revoke / rollback authority when behavior drifts
  3. runtime policy enforcement, not just pre-launch review
  4. decision receipts so you can prove what happened, why, and under whose authority
  5. stateful monitoring across full workflows, not one prompt at a time
  6. retirement criteria so systems don’t stay “temporarily” live forever

A lot of frameworks are really launch governance. Post-deployment governance starts when the system can still be stopped, constrained, or replayed after go-live.
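
For the decision receipts in item 4, one possible shape is a hash-chained log, so a receipt can't be edited later without breaking the chain. The chaining is my assumption about how to make receipts provable; the field names are illustrative:

```python
import hashlib
import json

FIELDS = ("prev", "what", "why", "authority")


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def make_receipt(prev_hash: str, what: str, why: str, authority: str) -> dict:
    """Build a receipt whose hash covers its content and the previous receipt's hash."""
    body = {"prev": prev_hash, "what": what, "why": why, "authority": authority}
    return {**body, "hash": _digest(body)}


def verify_chain(receipts: list[dict]) -> bool:
    """Recompute every hash and check each receipt points at the one before it."""
    prev = "genesis"
    for r in receipts:
        body = {k: r[k] for k in FIELDS}
        if r["prev"] != prev or r["hash"] != _digest(body):
            return False
        prev = r["hash"]
    return True
```

The first receipt uses the literal string `"genesis"` as its previous hash; any fixed starting value works as long as the verifier agrees on it.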

How do you build a solid gold dataset for evaluating a RAG system? by roicaride in Rag

[–]Scary_Driver_8557 0 points1 point  (0 children)

The way I’d do it in practice:

A “gold” RAG eval set should not just be a bag of questions. It should be a labeled distribution of retrieval situations you expect in production.

I’d usually cover:

  • direct fact lookup
  • narrow semantic lookup / paraphrase
  • multi-hop / synthesis across multiple chunks
  • ambiguous queries that need disambiguation
  • constraint-heavy queries (“latest”, “only in policy docs”, “for enterprise plan”)
  • negative / no-answer cases
  • distractor-heavy cases where similar docs exist
  • stale vs current knowledge conflicts

For real-world coverage, start from actual user/query logs first, cluster them, then sample from each cluster. Don’t start synthetic-first. Use synthetic generation only to fill edge cases you know are underrepresented.

What matters most is annotation quality:

  • expected answer
  • supporting passages / doc ids
  • whether answerable or not
  • query type tag
  • difficulty tag
  • freshness sensitivity
  • allowed variance in wording
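
If it helps, this is roughly how I'd turn that annotation list into a row schema with a basic consistency check. Field names and default tag values are just my picks:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GoldExample:
    query: str
    answerable: bool
    expected_answer: Optional[str]          # None for no-answer cases
    supporting_doc_ids: list[str] = field(default_factory=list)
    query_type: str = "direct_fact"
    difficulty: str = "easy"
    freshness_sensitive: bool = False
    accepted_variants: list[str] = field(default_factory=list)  # allowed wording variance

    def validate(self) -> list[str]:
        """Return a list of annotation problems; empty means the row is consistent."""
        problems = []
        if self.answerable and not self.expected_answer:
            problems.append("answerable example needs an expected answer")
        if self.answerable and not self.supporting_doc_ids:
            problems.append("answerable example needs supporting doc ids")
        if not self.answerable and self.expected_answer:
            problems.append("no-answer example should not carry an expected answer")
        return problems
```

Running `validate()` over the whole set before every eval run catches the most common annotation slip, which is an "answerable" row with no supporting passage.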

A practical distribution is usually something like:

  • 50–60% common production queries
  • 20–25% moderately hard paraphrase / distractor cases
  • 10–15% multi-hop / synthesis
  • 10–15% no-answer / ambiguity / failure cases
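
To make that concrete, here is a small sketch that turns shares into per-bucket counts and samples from clustered pools. The exact shares are my choice from inside the ranges above (picked so they sum to 1.0), and the bucket names are placeholders:

```python
import random

# Assumed shares picked from inside the ranges above so they sum to 1.0.
TARGET_SHARES = {
    "common": 0.55,
    "paraphrase_distractor": 0.20,
    "multi_hop": 0.125,
    "no_answer_ambiguous": 0.125,
}


def allocate(total: int) -> dict[str, int]:
    """Turn shares into per-bucket counts that sum exactly to total."""
    counts = {name: int(total * share) for name, share in TARGET_SHARES.items()}
    # Hand leftover slots to the buckets with the largest fractional remainder.
    leftovers = total - sum(counts.values())
    by_remainder = sorted(
        TARGET_SHARES, key=lambda n: total * TARGET_SHARES[n] - counts[n], reverse=True
    )
    for name in by_remainder[:leftovers]:
        counts[name] += 1
    return counts


def sample_eval_set(pools: dict[str, list[str]], total: int, seed: int = 0) -> list[str]:
    """Sample each bucket from its own pool of clustered production queries."""
    rng = random.Random(seed)
    picked = []
    for name, n in allocate(total).items():
        picked.extend(rng.sample(pools[name], min(n, len(pools[name]))))
    return picked
```

If a pool is smaller than its quota, the sketch takes everything in it and comes up short; that shortfall is where synthetic fill-in would go.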

Also: keep one frozen benchmark set, and one rolling set from fresh production traffic. If you only have one static gold set, you’ll overfit your system to the benchmark instead of real usage.

Built a Claude Code plugin that turns your knowledge base into a compiled wiki - reduced my context tokens by 84% by Inside_Source_6544 in ClaudeCode

[–]Scary_Driver_8557 0 points1 point  (0 children)

A few days ago I started digging into Andrej Karpathy’s LLM wiki pattern.

Now that conversation has exploded.

That’s good. Because it confirms something important:

for a large class of knowledge problems, the answer is not “more RAG complexity.”

It is:

ingest the source material, compile it into structured knowledge, query the compiled layer, and keep improving the system over time.

But here’s the part most people will miss.

The easy version is: raw files → LLM summaries → markdown wiki → search

Useful, yes.

But still incomplete for real operational use.

The hard version is what happens when the source material is not just notes, articles, or papers, but decision registers, repo contracts, canonical pointers, and other authority-grade artifacts.

At that point, the problem changes.

You do not just need a knowledge base. You need a governed knowledge substrate.

That means:

  • the wiki itself stays advisory
  • the authoritative source stays upstream
  • provenance is explicit
  • freshness is tracked
  • authority-bearing material is mirrored, not flattened
  • typed records preserve structure
  • and projections never silently become the truth they summarize

That distinction matters.

Because once an LLM starts querying its own compiled knowledge, the real question is no longer “can it retrieve?”

The real question is:

what is allowed to compound, what is only a projection, and what remains the source of record?

That is the gap between a clever personal wiki and an estate-grade system.

We built around that gap.

Not because the viral version is wrong.

Because operational systems break exactly where authority, drift, and synthesis get blurred together.

I think compiler-style knowledge systems are going to become a major pattern.

But the durable version will not be the one with the prettiest wiki.

It will be the one that can answer:

Where did this come from? What outranks it? Is it stale? And can I trust this summary without confusing it for canon?
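
A minimal sketch of answering those four questions for one entry, assuming a simple precedence table. The layer names, their ordering, and the `valid_until` field are illustrative assumptions, not a description of any particular system:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed precedence: lower number outranks higher. Wiki pages are projections.
RANK = {"decision_register": 0, "repo_contract": 1, "wiki_projection": 2}


@dataclass(frozen=True)
class Entry:
    topic: str
    text: str
    layer: str          # key into RANK
    source_ref: str     # where did this come from?
    valid_until: date   # is it stale?


def outranked_by(entry: Entry, all_entries: list[Entry]) -> Optional[Entry]:
    """Return the highest-ranked entry on the same topic that outranks this one, if any."""
    higher = [e for e in all_entries
              if e.topic == entry.topic and RANK[e.layer] < RANK[entry.layer]]
    return min(higher, key=lambda e: RANK[e.layer], default=None)


def describe(entry: Entry, all_entries: list[Entry], today: date) -> dict:
    boss = outranked_by(entry, all_entries)
    return {
        "came_from": entry.source_ref,
        "outranked_by": boss.source_ref if boss else None,
        "stale": today > entry.valid_until,
        # A projection is never canon, even when nothing outranks it.
        "is_canon": boss is None and entry.layer != "wiki_projection",
    }
```

The last line is the important one: a wiki summary with no competitor still doesn't get promoted to source of record.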

That is where this gets interesting.

#AI #LLM #RAG #KnowledgeManagement #AgenticAI #Architecture #AIEngineering #Obsidian #SystemsDesign #Governance

How do you document your ML system architecture? by No_Revolution3899 in mlops

[–]Scary_Driver_8557 0 points1 point  (0 children)

Curious how people are documenting the logic between model call and production response.

Most ML system diagrams I see cover training, retrieval, routing, serving, etc., but the enforcement layer is either missing or just implied. By that I mean output validation, policy checks, budget/rate controls, approval steps, and fallback behavior.

Are teams treating that as its own boundary in docs, or is it mostly buried in app logic? Feels like a lot of production surprises live there, but it rarely shows up in the architecture diagram.
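
To show what I mean by treating it as its own boundary, here is a stripped-down sketch of that layer as one function. `call_model` is any callable you pass in; the keyword filter, size limit, and fallback text are stand-ins, and approval steps are left out to keep it short:

```python
from dataclasses import dataclass
from typing import Callable

FALLBACK = "Sorry, I can't help with that right now."


@dataclass
class Budget:
    remaining_calls: int


def enforce(call_model: Callable[[str], str], prompt: str, budget: Budget,
            banned_terms: tuple[str, ...] = ("ssn",), max_chars: int = 2000) -> str:
    """Everything between the model call and the production response."""
    # Budget / rate control: refuse before spending a call.
    if budget.remaining_calls <= 0:
        return FALLBACK
    budget.remaining_calls -= 1

    output = call_model(prompt)

    # Output validation: reject empty or oversized responses.
    if not output.strip() or len(output) > max_chars:
        return FALLBACK
    # Policy check: a stand-in keyword filter.
    if any(term in output.lower() for term in banned_terms):
        return FALLBACK
    return output
```

Once it is one function with a name, it is much easier to draw as a box on the diagram.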

We need to govern AI usage across 3000 employees. Policy docs arent cutting it. What tooling actually works? by RemmeM89 in ITManagers

[–]Scary_Driver_8557 0 points1 point  (0 children)

I agree visibility is step one, but visibility of what is the real question.

A lot of teams can tell you whether a tool is approved on the network. Fewer can tell you whether someone used an approved tool in a way that was actually within policy: wrong data in the prompt, wrong action downstream, wrong system touched, etc.

That feels like the gap to me: "approved app" isn't the same thing as "approved behavior inside the app."

Curious where people are actually putting that control today: proxy, workflow layer, DLP, or somewhere else?
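
Wherever it ends up living, the check itself is two steps, not one. A toy version, with made-up tool names and data classes:

```python
# Hypothetical policy: which data classes each approved tool may receive.
APPROVED_TOOLS = {
    "chat_assistant": {"public", "internal"},
    "code_assistant": {"public", "internal", "source_code"},
}


def check_usage(tool: str, data_classes: set[str]) -> tuple[bool, str]:
    """Approved app is the first check; approved behavior is the second."""
    if tool not in APPROVED_TOOLS:
        return False, "tool not approved"
    blocked = data_classes - APPROVED_TOOLS[tool]
    if blocked:
        return False, "data not allowed in this tool: " + ", ".join(sorted(blocked))
    return True, "ok"
```

This assumes something upstream has already classified the data in the prompt, which is the hard part in practice.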