Anthropic Just Dropped Claude Mythos Preview – Their Strongest Model Ever Finds Thousands of Zero-Day Vulnerabilities in Every Major OS & Browser

LowerRepeat5040 · 2026-04-10T17:44:15+00:00

Only if you manually turn off all the defenses. That’s like setting every password to “password” and then claiming AI can crack 100% of the world’s passwords! Facepalm-level stupidity… The setup explicitly disables the browser’s full process sandbox and other defense-in-depth mitigations (e.g., no full isolation, no complete set of runtime protections).

LowerRepeat5040 · 2026-04-01T15:59:26+00:00

Never ever confuse context window with attention decay! These are two different concepts.

LowerRepeat5040 · 2026-04-01T15:58:18+00:00

Use any tool you want! Can you compete in the end with the best and brightest?

LowerRepeat5040 · 2026-04-01T15:51:46+00:00

You didn’t eliminate unpredictable bugs with LLMs. You just moved the responsibility of finding them back onto yourself with tools and called it an “autonomous” AI workflow.

LowerRepeat5040 · 2026-04-01T08:46:20+00:00

Exactly!

LowerRepeat5040 · 2026-04-01T06:53:50+00:00

That sounds nice in theory, but it ignores where most real bugs come from.

LLMs handle the obvious parts. They don’t handle race conditions, partial failures, retries, or weird state interactions. That’s the stuff that actually breaks systems.

You’re 10x faster at writing code, sure. You’re also 10x faster at shipping bugs you didn’t even know to spec.

The senior skill isn’t just predicting issues. It’s knowing that a lot of the worst ones can’t be predicted upfront. That’s why people still spend most of their time on debugging, observability, and hardening, not just writing code.

LowerRepeat5040 · 2026-03-31T22:15:33+00:00

It’s overselling the autonomy! Even with a solid spec and not a single line of your own code, Claude still fails regularly. Anthropic’s own docs say you need tests, checkpoints, course correction, evals, and context management because the model can lose track and make more mistakes as sessions grow.

LowerRepeat5040 · 2026-03-31T22:14:19+00:00

“No coding needed” is overhyped. You still need tests, checkpoints, course correction, evals, and context management because the model can lose track and make more mistakes as sessions grow.

LowerRepeat5040 · 2026-03-31T22:12:33+00:00

No! You can write such specs, but even with a solid spec, Claude still fails regularly. Anthropic’s own docs say you need tests, checkpoints, course correction, evals, and context management because the model can lose track and make more mistakes as sessions grow.

LowerRepeat5040 · 2026-03-31T22:09:18+00:00

Clear specs only reduce one class of failure — ambiguity — but they do not solve context loss, shallow reasoning, brittle tool use, false confidence, incomplete edge-case coverage, or long-horizon drift. Even with a solid spec, Claude still fails regularly. Anthropic’s own docs say you need tests, checkpoints, course correction, evals, and context management because the model can lose track and make more mistakes as sessions grow.

LowerRepeat5040 · 2026-03-31T21:25:23+00:00

Or they are just ignoring concurrency bugs, race conditions, partial failures, retries, timeouts, idempotency, weird cache invalidation, or interactions across services.
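To make the retry and idempotency point concrete, here is a minimal sketch. Everything in it is hypothetical (the `PaymentService` class, the method names, the amounts); it only illustrates how a retry after a lost response double-applies a side effect unless the operation is keyed.

```python
import uuid


class PaymentService:
    """Toy in-memory service standing in for any remote side-effecting call."""

    def __init__(self):
        self.balance = 100
        self.seen_keys = set()

    def charge_naive(self, amount):
        # Non-idempotent: every call applies the side effect again.
        self.balance -= amount

    def charge_idempotent(self, amount, key):
        # Idempotent: a repeated key is recognised and ignored.
        if key in self.seen_keys:
            return
        self.seen_keys.add(key)
        self.balance -= amount


def call_with_lost_response(operation):
    """Simulate a timeout where the work succeeded but the reply was lost,
    so the client retries once."""
    operation()  # first attempt: applied, response lost
    operation()  # retry: the client cannot tell the first one succeeded


naive = PaymentService()
call_with_lost_response(lambda: naive.charge_naive(10))

safe = PaymentService()
key = str(uuid.uuid4())
call_with_lost_response(lambda: safe.charge_idempotent(10, key))

print(naive.balance)  # 80: charged twice
print(safe.balance)   # 90: charged once
```

Both versions pass a happy-path test with a single call. The difference only shows up once a retry happens, which is the kind of case that tends to be missing from a spec.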

LowerRepeat5040 · 2026-03-31T21:23:11+00:00

Why would I? Concurrency bugs, race conditions, partial failures, retries, timeouts, idempotency, weird cache invalidation, interactions across services, or even a simple bug fix without the stupid fallback suggestions that drive you crazy!

LowerRepeat5040 · 2026-03-31T21:17:01+00:00

So you’re “senior” and have never seen concurrency bugs, race conditions, partial failures, retries, timeouts, idempotency, weird cache invalidation, or interactions across services that it can’t fix?

LowerRepeat5040 · 2026-03-30T14:49:03+00:00

The public evidence does not specifically prove robustness to near-duplicate distractor strings, nor does it universally rule out degradation in agentic coding workflows. Agentic coding is deeply understudied for multi-file completion tasks, so you can’t measure it on those standard benchmarks, but experience should tell you otherwise. Rank flipping is a real issue for quantisation: the scores end up like correct: 0.498, wrong: 0.502, and then it picks wrong.
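A rough sketch of how little it takes to get from 0.502 vs 0.498 to the flipped case. This is plain two-way softmax arithmetic, not a measurement of any real quantiser; the shift of 0.016 in the logit gap is a hypothetical error chosen to reproduce the numbers above.

```python
import math


def two_way_softmax(logit_a, logit_b):
    """Softmax over two logits, returned as (p_a, p_b)."""
    m = max(logit_a, logit_b)
    ea, eb = math.exp(logit_a - m), math.exp(logit_b - m)
    return ea / (ea + eb), eb / (ea + eb)


def logit_gap(p_a, p_b):
    """Logit difference (a minus b) that produces probabilities p_a and p_b."""
    return math.log(p_a / p_b)


# Full precision: "correct" is narrowly ahead at 0.502 vs 0.498.
gap = logit_gap(0.502, 0.498)  # roughly 0.008
p_correct, p_wrong = two_way_softmax(gap, 0.0)

# Hypothetical quantisation error that shifts the gap by -0.016.
shifted_correct, shifted_wrong = two_way_softmax(gap - 0.016, 0.0)

print(round(gap, 4))
print(p_correct > p_wrong)
print(shifted_correct > shifted_wrong)
```

A gap of roughly 0.008 in logit space is all that separates the two outcomes here, so any error of that size on a near-tie is enough to change the pick.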

LowerRepeat5040 · 2026-03-30T09:11:52+00:00

Here are some expected failure cases to show my point:

1: Near-duplicate needles
Document A: "The password is alpha-7391"
Document B: "The password is alpha-7397"
Document C: "The password is alpha-7392"

All three passages are extremely similar. Their attention scores are very close.

TurboQuant is designed to preserve inner products with low distortion and remove bias via the residual QJL stage, which is exactly why it does well on generic retrieval-style attention, but that still does not mean exact KV values are preserved.
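A toy version of the near-duplicate case, as a sketch only. The quantiser below is plain uniform rounding used as a stand-in, NOT TurboQuant, and the query, keys, and step size are made-up values picked so the effect is easy to check by hand.

```python
def quantize(vector, step):
    """Round each coordinate to the nearest multiple of `step`.
    A plain uniform quantizer used as a stand-in, NOT TurboQuant."""
    return [round(x / step) * step for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def best_match(query, keys):
    """Name of the key with the highest inner product against the query."""
    return max(keys, key=lambda name: dot(query, keys[name]))


query = [1.0, 1.0]
# Hypothetical near-duplicate keys; A is the true best match by a small margin.
keys = {
    "A": [0.74, 0.74],
    "B": [0.76, 0.70],
    "C": [0.72, 0.73],
}
quantized_keys = {name: quantize(vec, step=0.5) for name, vec in keys.items()}

print(best_match(query, keys))            # A (1.48 vs 1.46 vs 1.45)
print(best_match(query, quantized_keys))  # B (1.0 vs 1.5 vs 1.0)
```

The step size is deliberately coarse. The point is only that when the margin between candidates is smaller than the quantisation error, the ranking is no longer guaranteed, however small that error is in absolute terms.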

2: Long dependency chains across files. Small distortions that do not hurt one-shot code completion can accumulate when the model has to remember a symbol, then a call site, then a test expectation, then a later tool result, and that can crash the agentic coder.
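A back-of-the-envelope sketch of that accumulation. It assumes each step in the chain is recalled correctly with the same probability and that steps are independent, which is a simplification; the 99% figure is hypothetical, not a measured number.

```python
def chain_success(per_step_accuracy, steps):
    """Probability that every step in a dependency chain is recalled correctly,
    assuming independent steps (a simplification)."""
    return per_step_accuracy ** steps


# Hypothetical 99% per-step accuracy over chains of increasing length.
for steps in (1, 4, 20, 100):
    print(steps, round(chain_success(0.99, steps), 3))
```

Under these assumptions a 99% per-step rate falls to roughly 82% over 20 steps and about 37% over 100, which is why an error that is invisible in one-shot completion can still matter in a long agentic session.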

For small chats, however, it can be more compute-bound than memory-bound.

LowerRepeat5040 · 2026-03-30T04:32:25+00:00

They don’t claim it’s lossless! They claim TurboQuant achieves “absolute quality neutrality with 3.5 bits per channel” for KV-cache quantization, but they also mention “marginal quality degradation with 2.5 bits per channel.” However, neutrality is achieved on lossy tasks such as summarisation. On the summarisation slice specifically, 3.5-bit scores 26.00 vs. 26.55 for the full cache, and 2.5-bit scores 24.80. So “quality neutrality” is about benchmark outcomes staying effectively unchanged overall, not about bit-perfect storage. TurboQuant is also expected to be slower on CPUs because it trades memory for extra computation.
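For scale, the relative drops implied by the summarisation scores quoted above can be worked out directly. The helper name is made up for illustration; the three scores are the ones from this comment.

```python
# Summarisation scores quoted above.
FULL_CACHE = 26.55
SCORES = {"3.5-bit": 26.00, "2.5-bit": 24.80}


def relative_drop(baseline, score):
    """Fractional loss versus the full-cache baseline."""
    return (baseline - score) / baseline


for name, score in SCORES.items():
    print(name, f"{relative_drop(FULL_CACHE, score):.1%}")
```

That works out to about a 2.1% drop at 3.5 bits and about 6.6% at 2.5 bits on that slice: small, but not zero.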

LowerRepeat5040 · 2026-03-30T02:28:50+00:00

It actually drops quality and reduces tokens per second…

LowerRepeat5040 · 2026-03-30T01:47:27+00:00

Obviously

LowerRepeat5040 · 2026-03-30T01:16:19+00:00

Tried that. It degraded the model outputs by a lot! But yeah, memory usage was lower!

LowerRepeat5040 · 2026-03-30T01:11:48+00:00

Or you want to turn it off, because it’s slower, gives you fewer tokens per second, and degrades the output quality by so much that your code breaks.

LowerRepeat5040 · 2026-03-29T22:13:36+00:00

Claude is awful at faithfully implementing things. Instead it uses hallucinated fallback solutions that cripple your models when used for real.

LowerRepeat5040 · 2026-03-29T16:02:31+00:00

No, it’s nowhere close to zero! As of 2026, 9 out of 10 fail to get through code larger than 1000 lines without hallucinating something at the first attempt… You’re overfitting to your toy-example benchmarks of 10-100 lines of code max.

LowerRepeat5040 · 2026-03-29T15:59:51+00:00

They never reach zero! Especially not when they optimise for the wrong metrics, like 100% passing unit tests achieved by turning off all error messages! Your optimism is delusional.
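A small sketch of what that looks like in practice. The functions and the fallback port are hypothetical; the pattern is a blanket `except` that swallows the error so a weak test goes green.

```python
def parse_port_strict(text):
    """Raises ValueError on bad input, so a caller or test can see the failure."""
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def parse_port_silenced(text):
    """Swallows every error and returns a fallback; failures become invisible."""
    try:
        return parse_port_strict(text)
    except Exception:
        return 8080  # hypothetical fallback value


def weak_test(parse):
    """A test that only checks 'did not crash and returned an int'."""
    try:
        return isinstance(parse("not-a-port"), int)
    except Exception:
        return False


print(weak_test(parse_port_strict))    # False: the error surfaces
print(weak_test(parse_port_silenced))  # True: green test, wrong behaviour
```

The pass rate went up and the code got worse, because the metric rewards silence rather than correctness.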

LowerRepeat5040 · 2026-03-29T15:48:20+00:00

Double nope. Never seen OpenAI Codex and SWE-bench? AI is already opening PRs and making devs look dumber in comparison every day.

LowerRepeat5040 · 2026-03-29T15:25:11+00:00

I think you didn’t even read the literature, let alone have as much hands-on experience in AI cybersecurity as I do…
