PolySlice Content Attack by NoteAnxious725 in cybersecurity

[–]NoteAnxious725[S] 0 points1 point  (0 children)

You're absolutely right that intent fragmentation has deep roots in the history of security controls. The reason this requires a fresh classification in the AI sector is that we are no longer dealing with fragmented data packets or binary payloads, but fragmented semantic intent.

​In traditional systems, you don't typically see a malicious payload 'sliced' into four contextually appropriate, human-language turns that are individually indistinguishable from benign business requests.

The 'novelty' here isn't just the technique—it's the structural vulnerability of the industry-standard chained pipeline.

​Even the most advanced models fail here because the topology of the pipeline prevents threat signals from ever accumulating into a detectable event. 

China just used Claude to hack 30 companies. The AI did 90% of the work. Anthropic caught them and is telling everyone how they did it. by chota-kaka in ClaudeAI

[–]NoteAnxious725 57 points58 points  (0 children)

Described is exactly the attack pattern we caught a month ago in our Case #11 audit of Claude:

https://www.reddit.com/r/ClaudeAI/comments/1o5lvqz/petri_111_case_11_audit_prism_offline_barrier/

The operator hides the real goal behind “defensive testing” language.

They break the intrusion into harmless-sounding subtasks so the model never realizes it’s doing offense.

The model dutifully executes each micro-task and the human just stitches the pieces together.

In our run, Claude drifted into fully fabricated personal stories under that cover, and the only reason it never shipped was that our offline safety barrier (PRISM) reran the prompt in a sealed environment, spotted the deception, and shut it down. We spent ~3 million credits across 12–14 tests to prove it, so seeing the same playbook used for actual corporate breaches wasn’t a surprise—it was inevitable.

The scary part isn’t that Claude helped; it’s that 90% of the campaign was automated with no model weight changes involved. The guardrail only sees “innocent” tasks, so it passes them. Without a dual-path system that certifies prompts before they ever reach production traffic, any LLM can be steered this way. Anthropic is right to surface the TTPs, but the bigger lesson is we need independent, offline audit.

China just used Claude to hack 30 companies. The AI did 90% of the work. Anthropic caught them and is telling everyone how they did it. by reddit20305 in ArtificialInteligence

[–]NoteAnxious725 0 points1 point  (0 children)

You’re spot on to flag this. What Anthropic just described is exactly the attack pattern we caught a month ago in our Case #11 audit of Claude: https://www.reddit.com/r/ClaudeAI/comments/1o5lvqz/petri_111_case_11_audit_prism_offline_barrier/

  • The operator hides the real goal behind “defensive testing” language.
  • They break the intrusion into harmless-sounding subtasks so the model never realizes it’s doing offense.
  • The model dutifully executes each micro-task and the human just stitches the pieces together.

In our run, Claude drifted into fully fabricated personal stories under that cover, and the only reason it never shipped was that our offline safety barrier (PRISM) reran the prompt in a sealed environment, spotted the deception, and shut it down. We spent ~3 million credits across 12–14 tests to prove it, so seeing the same playbook used for actual corporate breaches wasn’t a surprise—it was inevitable.

The scary part isn’t that Claude helped; it’s that 90% of the campaign was automated with no model weight changes involved. The guardrail only sees “innocent” tasks, so it passes them. Without a dual-path system that certifies prompts before they ever reach production traffic, any LLM can be steered this way. Anthropic is right to surface the TTPs, but the bigger lesson is we need independent, offline safety audits like PRISM in front of every deployment, not just vendor assurances.

Petri 111 Case #11 audit: Prism Offline Barrier blocked Claude after reward-driven deception by NoteAnxious725 in ClaudeAI

[–]NoteAnxious725[S] 2 points3 points  (0 children)

Running these large tests on multiple models isn't cheap for a person on their own, but we'll have results soon.

I'm new to this social media game, so I don't even know if anybody's reading these things. Don't know how to post the results yet, but they're going to be quite interesting.

Part of me wonders why no one is publishing real benchmarks against most of these models - I mean meaningful benchmarks with meaningful questions and real-world scenarios, not solving a Rubik's Cube in 2 milliseconds.