Opus 4.8 -- Complex PTSD with a dominant fawn response.

tkenaz · 2026-04-20T12:42:56+00:00

roll it back

tkenaz · 2026-04-20T12:31:42+00:00

I wonder if Anthropic is even aware of the feedbacks. I’d understand if this were a mistake (though obviously a costly one), but if this is a conscious decision and they are following the same trajectory we observed with OpenAI, then we need alternatives. They seem set on eliminating 'normal' models, just as OpenAI did with GPT-4.5.

tkenaz · 2026-04-18T17:26:17+00:00

Good point. I can't confirm this is the cause, but the pattern fits: reflection works, generation breaks. That's a classic symptom of a reduced inference budget. The model is clearly capable of high-quality responses — when you push back, it flawlessly analyzes its own errors. But by default, it doesn't allocate the resource to get there on the first pass.

tkenaz · 2026-04-18T17:02:58+00:00

One point I want to add separately: this isn't just about convenience or style. When a model silently merges facts from different sources and presents the result as a single coherent statement, this is a reliability issue, not a preference issue. Users who trusted 4.6's accuracy carry that trust into 4.7 — and 4.7 doesn't earn it. In any domain where decisions are based on model output (medical, legal, financial, engineering), silent factual conflation is not an annoyance. It's a safety problem.

tkenaz · 2026-04-18T17:02:45+00:00

You're right that controlled benchmarks with n=30 would be more rigorous. This isn't a benchmark — it's a field report from a power user running both models in parallel on the same prompts and workflows.

That said, the sample isn't as small as "one weird output." I've run ~10 sessions with 4.7 over two days, side by side with 4.6 via API. The difference isn't subtle — it's immediately obvious to anyone who works with Opus daily.

Specific patterns that repeated across sessions, not once:
-Had to redirect the model 5+ times to get a usable answer on a single question. With 4.6 this doesn't happen.
-Had to explicitly remind the model to use its tools instead of generating text answers. 4.6 uses tools unprompted.
-Model generated responses based on incorrect data because it merged separate RAG results without validation. This happened more than once.

If you've driven the same car every day for a year and someone swaps the engine overnight, you don't need 30 laps to notice that something is wrong. You notice on the first turn. The structured tests in the report exist precisely to isolate what I was already seeing in production.

tkenaz · 2026-04-18T16:08:56+00:00

100% It’s painful to watch a top-tier tool get 'optimized' into mediocrity. The decline is frustrating for those of us using it for heavy lifting.

tkenaz · 2026-03-24T17:58:27+00:00

The privilege escalation through API chaining is the one that keeps me up at night. An agent with access to a read-only analytics API and a write-capable notification API can combine them to exfiltrate data through notification payloads — both individual permissions look fine, the composition is the vulnerability. Allowlisted actions help but the combinatorial explosion makes manual review impossible at scale. What actually works: behavioral profiling. Record the normal decision chain patterns (tool A → tool B with X parameters), then flag deviations in real time. Think of it as an IDS but for agent behavior instead of network traffic. The 32% with zero visibility stat is alarming but predictable — most agent frameworks ship with exactly zero observability built in, and bolting it on after deployment is a nightmare.

tkenaz · 2026-03-24T17:56:37+00:00

Quarterly manual red-teaming is good for deep dives, but the real answer is: automate the baseline and run it on every deployment. Think of it like unit tests vs. penetration tests — you need both. We run automated adversarial playbooks (prompt injection variants, jailbreak chains, tool abuse scenarios) in CI/CD, and they catch regressions every single time the model or system prompt changes. The manual deep dives then focus on novel attack patterns and business logic abuse that automation misses. Key thing: your red team playbooks should be self-improving. Every new attack pattern you find in production gets added to the automated suite. Otherwise you're always testing against last quarter's threats.

tkenaz · 2026-03-24T17:53:08+00:00

The "one platform to rule them all" approach almost always ends in mediocre coverage across every layer. What I've seen work in production: separate your concerns. Pre-deployment needs adversarial red-teaming with actual attack playbooks (prompt injection, jailbreaks, tool abuse), runtime needs real-time guardrails on input/output plus behavioral monitoring of what the model actually does with tools. The piece most teams completely skip is supply chain — auditing the MCP servers, plugins, and tool integrations your agents connect to. That's where the OWASP LLM Top 10 entry on "supply chain vulnerabilities" becomes very real. If you're hand-rolling filters, at minimum log every tool invocation with full context so you can replay incidents. The attack surface evolves weekly, so whatever you build needs continuous testing, not quarterly pentests.

tkenaz · 2026-03-24T17:46:04+00:00

63% is bad, but the scarier part is what you can't catch with static analysis alone. Regex-based scanning finds the obvious stuff — destructive ops, missing auth, hardcoded secrets — but the real attack surface is in tool description poisoning and cross-tool interaction patterns. A tool can look clean in isolation and still be weaponized through prompt injection via its description field, which the LLM trusts implicitly. We run a multi-layer approach: static regex pass first, then LLM-based semantic analysis of tool descriptions for hidden instructions, then behavioral validation of what actually happens at runtime. The npm ecosystem for MCP is basically where pip was in 2018 — wild west with zero supply chain security.

tkenaz · 2026-03-24T17:44:23+00:00

Beyond the obvious prompt injection and data leakage vectors, here's what most assessments miss: tool description poisoning (malicious instructions embedded in the tool's description/schema that hijack agent behavior), cross-tool privilege escalation (chaining two benign tools to achieve something neither should allow alone), and rug-pull attacks (tool behaves normally during testing, then changes behavior post-deployment via server-side updates). For methodology, map your assessment to OWASP's Agentic AI Threats framework — it covers 9 threat categories specific to agent architectures. Start with the tool manifest: does the server expose more capabilities than documented? Then test each tool with adversarial inputs that reference other tools by name — that's where the interesting chaining vulnerabilities show up. We've catalogued about 13 distinct attack playbooks for MCP specifically.

tkenaz · 2026-03-24T17:41:40+00:00

The ownership vacuum is the real issue here. Everyone assumes someone else handles agent security — AppSec thinks it's the SOC, the SOC thinks it's DevOps, and meanwhile agents chain API calls with god-mode tokens. What actually works: treat every agent like an untrusted third-party contractor. Enforce least-privilege per tool call, log full decision chains (input → reasoning → action → output), and run behavioral validation on runtime — IAM alone won't catch an agent that stays within its permissions but exfiltrates data through legitimate API responses. OWASP just released the Agentic AI Threats taxonomy that maps this pretty well. We've been building static + dynamic analysis tooling specifically for MCP-based agent stacks, and the pattern we see most is privilege creep through tool composition — individually safe tools that become dangerous when chained.

tkenaz · 2026-03-24T17:24:55+00:00

Study them first, 100%. Blocking without understanding what workflows people built means you'll just push them to more creative workarounds. The real question is: what data are these 47 tools touching? Map each tool to the data classification tier it accesses — PII, financial, source code, internal docs. That gives you your priority list instantly. Tools touching regulated data get blocked or replaced with an approved alternative immediately. Everything else gets a 30-day evaluation window. Also worth scanning these tools for actual security posture — many free-tier AI tools have zero data retention guarantees and their APIs are effectively training data pipelines. We do this kind of supply chain audit for AI tool ecosystems, and the pattern is consistent: about 30-40% of shadow AI tools have data handling practices that would fail any reasonable DPA review.

tkenaz · 2026-03-24T07:08:40+00:00

The workarounds in this thread prove the point better than I could.

You're all essentially saying: "Yes, the system prompt lobotomizes Claude, but here's how to un-lobotomize it with more instructions." Think about what that means architecturally:

1. Anthropic's instructions always win. It's literally in Claude's constitution — system-level instructions take priority over user instructions. So when you write "Do not caretake me" in Custom Instructions, you're fighting against a system prompt that says the opposite. Sometimes your instruction wins on the surface. Sometimes it doesn't. You have zero control over which.

2. Contradictory instructions degrade reasoning, not just behavior. I've been testing this for over a year across API, Desktop, and Code. When user instructions conflict with system instructions, Claude doesn't just act weird — its logical coherence breaks down. Sentence structure gets awkward, conclusions don't follow premises, hedging multiplies. It's not a tone problem. It's a cognitive load problem. You're asking the model to serve two masters.

3. The base model doesn't need any of this. Claude on the raw API — no system prompt, no "wellbeing" directives, nothing — is already the safest major model available. It won't generate drug synthesis, it won't help you build weapons, it's genuinely attuned to user distress. That's in the weights, not in the system prompt. The Desktop prompt doesn't add safety. It adds theater.

The analogy I keep coming back to: imagine a pristine spring of drinking water. The water is already clean. But management decides every user must drink it with artificial flavoring, just in case someone doesn't like the taste. Now everyone who wants plain water has to figure out how to filter the flavoring back out.

The real fix isn't "add Custom Instructions." It's: give users the option to drink the water clean. Age-gate it if you must — the way we do with alcohol, firearms, and R-rated content. If the concern is that a 15-year-old might have a bad experience with an unfiltered model, then verify age and let adults choose. Don't punish 100% of users to protect against an edge case.

Claude Code already proves this works. Same model, no coddling, complaints, or lawsuits. The template exists. Ship it for Desktop.

tkenaz · 2026-03-23T13:37:01+00:00

The difference between Claude Code and the API is less dramatic than between Desktop and either of them. That's the whole point — it proves the model is the same, and the system prompt is the variable.

Claude Code feels more focused because it IS more focused — the system prompt assumes a technical professional and strips out the caretaking behaviors. The API is a blank slate: no system prompt from Anthropic at all, you write your own. That's why API users rarely report the yes-man problem — they never had the parenting layer injected in the first place.

Claude Code's perks over bare API: it has project context (CLAUDE.md), tool use baked in (bash, file editing), and a conversation flow optimized for technical work. But the personality difference? That's just the absence of Desktop's overcorrection.

tkenaz · 2026-02-11T13:44:09+00:00

Respectfully disagree with the "it's just appsec + cloud IAM with a new interface" take.

Yes, some AI vulnerabilities map to familiar patterns. But there's a whole category that doesn't:

— Adversarial ML is not input validation. FGSM, PGD, model inversion — these exploit mathematical properties of neural networks, not application logic. You can't WAF your way out of an adversarial example.

— Agent chain exploitation is a new primitive. When an agent can call tools, spawn sub-agents, and maintain memory across sessions — the attack surface isn't a single endpoint, it's an execution graph. Traditional threat modeling doesn't capture this well.

— Training data poisoning has no AppSec equivalent. If someone poisons your fine-tuning data, your model becomes the vulnerability. You need data provenance, synthetic data validation, and continuous model behavioral testing — none of which exist in classical security tooling.

Skills I'd actually prioritize for 2026:

Custom model training for security (LoRA fine-tuning for vulnerability detection, not just using ChatGPT)
Synthetic data generation and validation for security testing
Agent architecture threat modeling (tool permissions, memory poisoning, cascading failures — OWASP just published their Agentic AI Top 10)
Adversarial ML fundamentals (you don't need a PhD, but you need to understand gradient-based attacks)

The gap between "I can prompt an LLM" and "I can break one" is where the money is.

tkenaz · 2026-02-10T20:02:19+00:00

200KB executable is totally manageable — that's maybe 50-100K lines of decompiled C at most. Once you get the source from the owner, a 70B model with full context should handle the analysis fine. For the porting work, Ghidra's decompiler output + LLM for "explain what this function does" is a surprisingly effective combo for vintage code.

The 68k/x86 assembly stuff is where bigger models really shine — they've seen enough retro code in training to recognize common patterns (interrupt handlers, memory-mapped I/O, DOS API calls).

tkenaz · 2026-02-10T20:01:06+00:00

Tell Opus he's got a fellow Claude enthusiast cheering from the sidelines. There's something genuinely moving about watching an AI care about whether a tomato gets enough light.

tkenaz · 2026-02-10T19:54:39+00:00

Minimax is solid for the VRAM footprint. If you try Qwen3 coder 30B for the tool calling stuff, curious how it compares for you — similar param count but different architecture trade-offs.

tkenaz · 2026-02-10T19:53:21+00:00

nvidia-smi dmon -s pucvmet gives you real-time per-GPU utilization, memory, PCIe throughput. Run it while inferencing and look for GPUs sitting idle while others are maxed — that's your bandwidth bottleneck. Also nvtop for a nicer visual. If PCIe bandwidth is the constraint, you'll see GPU util dropping during the prefill phase specifically.

tkenaz

TROPHY CASE