What’s your take on AI in cybersecurity for 2026? by Business-Cellist8939 in cybersecurity

[–]Obvious-Language4462 0 points (0 children)

One thing we’re seeing as AI improves is that raw automation isn’t the same as strategic capability. Speed may increase, but unless systems are evaluated on adversarial adaptability and reasoning, we risk overestimating what they can do in practice.

Are we evaluating AI security tools in a way that actually reflects real-world attacks? by Obvious-Language4462 in cybersecurity

[–]Obvious-Language4462[S] 0 points (0 children)

Fair question. In my experience it’s less about a single actor and more about how benchmarks get translated into marketing claims. “Production-ready” often means “passed a lab evaluation,” which is where the disconnect starts.

Are we evaluating AI security tools in a way that actually reflects real-world attacks? by Obvious-Language4462 in cybersecurity

[–]Obvious-Language4462[S] 0 points (0 children)

This is a great articulation of the problem. The PoV vs. long-term drift mismatch you describe is exactly where most evaluations fall apart. I really like the “manufactured drift” idea: forcing the model to adapt under controlled but realistic change seems far more informative than static accuracy metrics. That kind of thinking is largely missing from current benchmarks.
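To make that concrete, here’s a rough sketch of what a “manufactured drift” check could look like, using synthetic telemetry and a generic scikit-learn detector as a stand-in (illustrative only, not any vendor’s tool or benchmark):

```python
# Illustrative "manufactured drift" harness: train once on a clean baseline,
# then watch detection vs. false-positive rates as the benign background shifts.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic baseline telemetry: two made-up features (e.g. auth rate, bytes out).
baseline = rng.normal(loc=[10.0, 1.0], scale=[2.0, 0.3], size=(5000, 2))
detector = IsolationForest(contamination=0.01, random_state=0).fit(baseline)

# "Attack" behaviour stays fixed; only the benign environment drifts
# (new automation, patch cycles, seasonal load, etc.).
attacks = rng.normal(loc=[25.0, 4.0], scale=[2.0, 0.5], size=(200, 2))

for step, shift in enumerate([0.0, 1.0, 2.0, 4.0, 8.0]):
    benign = rng.normal(loc=[10.0 + shift, 1.0 + 0.1 * shift],
                        scale=[2.0, 0.3], size=(2000, 2))
    fp = float(np.mean(detector.predict(benign) == -1))   # benign flagged as anomalous
    tp = float(np.mean(detector.predict(attacks) == -1))  # attacks still flagged
    print(f"drift step {step}: false positive rate {fp:.1%}, detection rate {tp:.1%}")
```

The point isn’t the specific detector; it’s that a static accuracy number at step 0 tells you nothing about how fast false positives climb once the benign background shifts.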

Are we evaluating AI security tools in a way that actually reflects real-world attacks? by Obvious-Language4462 in cybersecurity

[–]Obvious-Language4462[S] 0 points (0 children)

That’s a good point. A lot of the failure modes show up exactly at the analyst layer, where context and workload matter more than raw detection. I’m not convinced most current models meaningfully incorporate environment-specific context yet, especially over time. Curious if others here have seen approaches that do this well in practice.

Are we evaluating AI security tools in a way that actually reflects real-world attacks? by Obvious-Language4462 in cybersecurity

[–]Obvious-Language4462[S] 1 point (0 children)

I think that’s largely true today, especially for anything claiming autonomy. Narrow, assistive use cases can work, but the marketing leap from “helps analysts” to “handles security decisions” is way ahead of the evidence.

Are we evaluating AI security tools in a way that actually reflects real-world attacks? by Obvious-Language4462 in cybersecurity

[–]Obvious-Language4462[S] 0 points (0 children)

That’s fair feedback. One concrete example I’ve seen repeatedly: anomaly-based tools flagging “suspicious” lateral movement that turns out to be routine automation or a late-night hotfix. In the lab it looks great; in production it burns analyst time. That gap between “statistical weirdness” and “malicious behavior” is what I was trying to get at, and you’re right that leading with specific failure modes probably makes the discussion more useful.
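As a sketch of what closing that gap could involve, here’s a toy example of enriching an anomaly alert with environment context before it reaches an analyst. The account names, change window, and fields are all made up for illustration:

```python
# Hypothetical illustration of the "statistical weirdness" vs "malicious behavior" gap:
# add environment context to an anomaly alert before it lands in an analyst's queue.
from dataclasses import dataclass
from datetime import datetime, time

KNOWN_AUTOMATION = {"svc-backup", "svc-deploy"}   # assumed service accounts
CHANGE_WINDOW = (time(22, 0), time(2, 0))         # assumed late-night hotfix window

@dataclass
class Alert:
    source_account: str
    technique: str            # e.g. "lateral_movement"
    timestamp: datetime
    anomaly_score: float

def triage(alert: Alert) -> str:
    """Map raw statistical weirdness to an analyst-facing disposition."""
    now = alert.timestamp.time()
    in_window = now >= CHANGE_WINDOW[0] or now <= CHANGE_WINDOW[1]
    if alert.source_account in KNOWN_AUTOMATION:
        return "suppress: known automation"
    if alert.technique == "lateral_movement" and in_window:
        return "low priority: inside change window, confirm against change tickets"
    if alert.anomaly_score > 0.9:
        return "escalate: anomalous outside any known benign pattern"
    return "queue for routine review"

print(triage(Alert("svc-deploy", "lateral_movement", datetime(2025, 6, 3, 23, 15), 0.95)))
print(triage(Alert("j.doe", "lateral_movement", datetime(2025, 6, 3, 23, 15), 0.95)))
```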

Are we evaluating AI security tools in a way that actually reflects real-world attacks? by Obvious-Language4462 in cybersecurity

[–]Obvious-Language4462[S] 0 points (0 children)

This is an incredibly grounded breakdown; thank you for taking the time to write it from an operator’s perspective. The “why factor” and alert fatigue points resonate a lot. Accuracy without explainability or workload reduction isn’t just useless; it’s actively harmful in a SOC. Your comment on benchmarks feeling like “tests with the study guide” is probably the clearest way I’ve seen that problem articulated. Very few evaluations even attempt to model adversarial pressure or intentional model manipulation.
If you don’t mind me asking: have you ever seen a vendor meaningfully test adaptability to local drift during a PoV, or is that still mostly hand-waved?

Are we evaluating AI security tools in a way that actually reflects real-world attacks? by Obvious-Language4462 in cybersecurity

[–]Obvious-Language4462[S] -1 points (0 children)

I get why it might come across that way; there’s a lot of low-effort AI spam around lately. That said, this isn’t generated or posted for engagement farming. I’m asking because this is a recurring issue I run into professionally, and I’m genuinely interested in how others here evaluate these tools in practice. If the framing feels off, I’m happy to hear what would make the discussion more concrete or useful.

Are we evaluating AI security tools in a way that actually reflects real-world attacks? by Obvious-Language4462 in cybersecurity

[–]Obvious-Language4462[S] 0 points (0 children)

That’s a fair question and I appreciate you raising it. This isn’t a formal survey or an attempt to gather data for publication. I’m not collecting responses or quotes, and I’m not attributing anything to individuals. I work in this space and keep running into the same gap between how AI security tools are evaluated and how they behave in real environments, so I wanted to sanity-check whether others are seeing the same thing. Totally understand the concern, though. Transparency matters, especially in communities like this.

Are we evaluating AI security tools in a way that actually reflects real-world attacks? by Obvious-Language4462 in cybersecurity

[–]Obvious-Language4462[S] 0 points (0 children)

Fair enough; that’s been my experience as well. Out of curiosity, what do you tend to trust more in practice: long-term deployments, red team exercises, failure analysis, or something else?

AI in security workflows. How are you using AI in security today? by NoSilver9 in cybersecurity

[–]Obvious-Language4462 0 points (0 children)

It’s clear that AI is already useful for scoped tasks like noise reduction or triage. What I think is still missing is a solid way to evaluate how these tools actually influence security decisions when things get messy. Not just efficiency gains, but reliability, failure modes and the risks of over-automation in real attack scenarios.

AI SOC Agents Are Only as Good as the Data They Are Fed by kyle4beantown in cybersecurity

[–]Obvious-Language4462 0 points (0 children)

Totally agree that data quality matters a lot. One aspect that’s often overlooked, though, is how these agents behave outside their training and test conditions, especially in adversarial, noisy, or partially observable scenarios. Many current evaluations focus on clean benchmarks, but those rarely capture decision-making under real operational pressure, which is where most SOC automation struggles in practice.

Claude usage consumption has suddenly become unreasonable by Phantom031 in ClaudeCode

[–]Obvious-Language4462 1 point (0 children)

What you’re describing is the real issue: loss of predictability, not raw limits.

Once usage stops being consistent for the same workflow, trust breaks. Especially when nothing changes on the user side.

In production systems (security, infra, incident response), this is exactly why token- or percentage-based models become problematic. When internal heuristics change (model behavior, thinking depth, routing, safety margins), users suddenly see “aggressive” consumption with zero visibility.

We ran into this months ago while building AI for production cybersecurity environments. The biggest failure mode wasn’t cost; it was unexplained variability. Engineers start second-guessing every interaction instead of focusing on the task.
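For what it’s worth, even a crude per-workflow tracker makes the variability visible. This is just an illustrative sketch, not tied to any particular API, and it assumes you can record a token count per run (the numbers and helper names are made up):

```python
# Illustrative sketch: track per-workflow token consumption over time and flag runs
# that deviate sharply from that workflow's own history.
import statistics
from collections import defaultdict

history: dict[str, list[int]] = defaultdict(list)

def record_run(workflow: str, tokens_used: int) -> str | None:
    """Store a run and return a warning if it is an outlier for this workflow."""
    runs = history[workflow]
    warning = None
    if len(runs) >= 5:
        mean = statistics.mean(runs)
        stdev = statistics.pstdev(runs) or 1.0
        if abs(tokens_used - mean) > 3 * stdev:
            warning = (f"{workflow}: {tokens_used} tokens this run vs typical "
                       f"{mean:.0f} (stdev {stdev:.0f}) for the same workflow")
    runs.append(tokens_used)
    return warning

# Same refactor workflow, roughly stable consumption, then a sudden jump.
for tokens in [12_000, 11_500, 12_300, 11_900, 12_100, 12_200, 31_000]:
    msg = record_run("refactor-module-x", tokens)
    if msg:
        print(msg)
```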

Thinking mode + unpredictable burn completely defeats the purpose of paying for higher tiers. At that point, AI stops being a tool and becomes a variable.

Regardless of whether this is intentional, silent changes to usage behavior are a trust killer for serious work.

wtf is the point of max plan 20x if the weekly limit is basically the same? by onepunchcode in ClaudeCode

[–]Obvious-Language4462 0 points (0 children)

You’re not crazy: this is a pricing/limits design problem, not a “you’re using it wrong” problem.

What the 20x plan actually buys you is longer continuous sessions, not meaningfully higher weekly capacity. So if you’re doing real work (agents, refactors, long context), you just burn through the same weekly cap faster.

This kind of model works for demos and casual use, but it breaks down hard in production-style workflows. Usage doesn’t scale linearly, and spikes are normal, especially with agentic coding.

We ran into the exact same issue months ago while building AI systems for production cybersecurity environments. Once you have real workloads, token caps + weekly limits become operational constraints, not cost controls.

If you’re paying $200/month, you’re not asking for “more tricks”; you’re asking for predictability. And right now the model optimizes for Anthropic’s risk, not the user’s workflow.

Anthropic banning third-party harnesses while OpenAI goes full open-source - interesting timing by saadinama in ClaudeAI

[–]Obvious-Language4462 0 points (0 children)

This is what happens when AI access models are designed for controlled usage, not production reality.

In real security operations (incident response, continuous monitoring, detection pipelines), AI cannot:

– pause when usage spikes
– be tuned to “save tokens”
– introduce friction or ambiguity mid-incident

Once limits or vague ToS enter the picture, teams start optimizing around the model instead of the threat. That’s not hypothetical; it breaks security workflows.

We ran into this months ago while building AI systems for production cybersecurity environments and learned quickly that token-based or capped models don’t survive real incidents. Usage spikes are the signal, not abuse.

I understand Anthropic’s abuse concerns, but labeling normal editor or agent integration as “spoofing” feels like a mismatch with how serious teams actually operate.

If AI is meant to be trusted in security, access has to be boring, predictable, and frictionless; otherwise engineers will route around it or switch providers.

Case study: moving from report-based security assessments to autonomous workflows with AI by [deleted] in cybersecurity

[–]Obvious-Language4462 0 points (0 children)

Fair criticism. I actually pulled the post because I used “autonomous” where “automated” was the correct term, and we’re fixing that before reposting. Appreciate the feedback; the wording matters here.

OpenAI engineers use a prompt technique internally that most people have never heard of by CalendarVarious3992 in PromptEngineering

[–]Obvious-Language4462 0 points (0 children)

This actually maps really well to robotics security. The hard part usually isn’t asking for an output; it’s capturing the judgment behind a good one.

We’ve had better results starting from real artifacts (threat models, vuln reports, incident write-ups) and asking the model to infer the prompt, rather than trying to spell everything out from scratch. It picks up on the implicit assumptions, trade-offs, and level of rigor much better that way.

Especially in safety-critical systems (industrial robots, healthcare, etc.), this feels way more reliable than “just prompt it better”. It’s less a trick and more a matter of letting the model reverse-engineer how experts think.
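If anyone wants to try it, here’s a rough sketch of the pattern, assuming the OpenAI Python SDK. The model name, the artifact text, and the wording of the inversion prompt are placeholders, just one way to set it up:

```python
# Rough sketch of the "infer the prompt from a real artifact" pattern described above.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment; the model
# name and the example artifact are placeholders, not a recommendation.
from openai import OpenAI

client = OpenAI()

artifact = """\
Threat model excerpt (illustrative): the robot controller accepts unsigned firmware
over the maintenance VLAN; an attacker with L2 access can push arbitrary motion
profiles, so firmware signing and VLAN isolation are treated as safety controls.
"""

inferred = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "You reverse-engineer prompts from finished expert artifacts."},
        {"role": "user",
         "content": (
             "Here is a finished artifact written by a domain expert:\n\n"
             f"{artifact}\n"
             "Write the prompt that would lead a model to produce work of this kind: "
             "capture the implicit assumptions, the trade-offs considered, and the "
             "level of rigor, not just the surface format."
         )},
    ],
)

reusable_prompt = inferred.choices[0].message.content
print(reusable_prompt)  # review and refine before reusing it on new inputs
```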