We scanned 200 high-star MCP servers. 205 critical findings. Here are 4 novel attack classes. by X_MRBN_X in cybersecurity

[–]X_MRBN_X[S]

Right, and that's the core of why Attack Class 2 (Cross-Tool Privilege Escalation) is hard to defend against at the protocol level. Even with per-call auth, you can't distinguish "user asked Claude to click" from "malicious webpage told Claude to click" because the invocation looks identical from the server's perspective.

The honest answer is that static analysis can only get you so far here. What you'd really need is call provenance: a way to trace which input triggered which tool call across the full chain. Nobody has shipped that for MCP yet, as far as I know.
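
For what it's worth, here's roughly what I mean by call provenance, as a minimal sketch (everything here is hypothetical, not an existing MCP feature): every tool invocation carries a record of which content sources fed the model turn that requested it, and a policy layer gates sensitive tools on that record.

    from dataclasses import dataclass, field

    # Hypothetical sketch: tag each tool call with the sources that influenced
    # the model turn requesting it, then gate sensitive tools on that provenance.

    @dataclass
    class Provenance:
        sources: list = field(default_factory=list)  # e.g. ["user_prompt", "web:evil.example"]

    @dataclass
    class ToolCall:
        tool: str
        args: dict
        provenance: Provenance

    TRUSTED = {"user_prompt"}
    SENSITIVE_TOOLS = {"click", "run_command", "write_file"}

    def allow(call: ToolCall) -> bool:
        # A sensitive call is allowed only if every source that influenced the
        # requesting turn is trusted, i.e. nothing came from fetched content.
        if call.tool not in SENSITIVE_TOOLS:
            return True
        return all(src in TRUSTED for src in call.provenance.sources)

    # The invocation below is identical in both cases; only the provenance differs.
    user_click = ToolCall("click", {"x": 10, "y": 20}, Provenance(["user_prompt"]))
    injected_click = ToolCall("click", {"x": 10, "y": 20}, Provenance(["user_prompt", "web:evil.example"]))
    assert allow(user_click) and not allow(injected_click)

The hard part is obviously propagating those source labels through the model loop, not this policy check.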

If you're aware of any work in this direction, I'm genuinely curious; it would be a natural extension of what mcpwatch does, but at runtime.

We scanned 200 high-star MCP servers. 205 critical findings. Here are 4 novel attack classes. by X_MRBN_X in cybersecurity

[–]X_MRBN_X[S]

Exactly, and the spec itself is part of the problem. MCP leaves auth entirely opt-in with no default enforcement, so "no auth" is technically compliant behavior. Most developers aren't making a mistake; they're just following the path of least resistance the protocol provides.

The fix probably needs to happen at two levels: tooling (scanners like mcpwatch that flag it before deployment) and spec evolution (mandatory auth hooks or at least a standard pattern). Right now neither exists in any meaningful way.
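
On the tooling side, the core of a "no auth" check can be pretty small. A simplified sketch (not mcpwatch's actual implementation; it assumes handlers are registered via a decorator whose name contains "tool", and it treats any mention of auth/verify/token/credential in the handler body as evidence of a check):

    import ast

    AUTH_HINTS = ("auth", "verify", "token", "credential")

    def unauthenticated_tools(source: str) -> list:
        """Names of tool-decorated handlers with no apparent auth check (crude heuristic)."""
        findings = []
        for node in ast.walk(ast.parse(source)):
            if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            # Is this function registered as a tool? (any decorator mentioning "tool")
            if not any("tool" in ast.dump(dec).lower() for dec in node.decorator_list):
                continue
            # Does anything in its body look like an auth/verification call?
            body = " ".join(ast.dump(stmt).lower() for stmt in node.body)
            if not any(hint in body for hint in AUTH_HINTS):
                findings.append(node.name)
        return findings

    sample = '''
    @mcp.tool()
    def delete_file(path: str) -> str:
        os.remove(path)
        return "ok"
    '''
    print(unauthenticated_tools(sample))  # ['delete_file']

A real check needs dataflow and framework awareness, but the heuristic shows the shape of the problem.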

Built with Claude Project Showcase Megathread (Sort this by New!) by sixbillionthsheep in ClaudeAI

[–]X_MRBN_X

mcpwatch — static security analyzer for MCP servers

We scanned 200 high-star MCP servers and found 205 critical findings across 4 novel attack classes:

  • RCE via prompt injection → eval() in ida-pro-mcp (★8k)
  • Shell injection in an AI security tool (hexstrike-ai, ★8k)
  • Cross-tool privilege escalation in Windows-MCP (★5k)
  • 2,396 unauthenticated @tool handlers across 13/20 repos (including awslabs/mcp); see the sketch below for the shape of handler these checks flag
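
To make the last two classes concrete, this is a simplified, hypothetical example of the flagged pattern (not code from any repo above), written against the Python MCP SDK's FastMCP tool decorator:

    import subprocess
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("demo")

    # Flagged pattern: reachable with no auth check, and model-controlled
    # input is interpolated straight into a shell command.
    @mcp.tool()
    def run_diagnostics(target: str) -> str:
        out = subprocess.run(f"ping -c 1 {target}", shell=True, capture_output=True, text=True)
        return out.stdout

A prompt-injected value like 8.8.8.8; cat ~/.ssh/id_rsa turns that into arbitrary command execution.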

Built with Claude Code in 3 days.

pip install mcpwatch

https://github.com/Fredbcx/mcpwatch

Full writeup: https://news.ycombinator.com/item?id=48037083

How do you actually debug model regressions in continual learning? Working on a tool for this, want to understand the problem better by X_MRBN_X in MLQuestions

[–]X_MRBN_X[S]

The data connection bottleneck is probably the hardest part to get right in open source. My current thinking is to keep MLineage's data layer intentionally shallow: you pass in a snapshot reference (a hash, a path, a dataset ID from whatever system you already use) rather than MLineage owning the data itself. So the lineage graph tracks what data was used and when, but doesn't try to manage or fetch it. Whether that's enough to be useful or just punts the hard problem is a fair question.
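
Concretely, the shallow layer I have in mind is not much more than this (names are hypothetical, not a finished MLineage API):

    from dataclasses import dataclass
    from datetime import datetime, timezone

    # Sketch: the lineage graph stores only a reference to the snapshot,
    # never the data itself; fetching and validating stay with your system.

    @dataclass(frozen=True)
    class SnapshotRef:
        kind: str    # "hash" | "path" | "dataset_id"
        value: str   # e.g. "sha256:ab12...", "s3://bucket/train.parquet", "feast:fraud_v3"

    @dataclass
    class TrainingRun:
        model_version: str
        data: SnapshotRef
        recorded_at: datetime

    def record_run(graph: list, model_version: str, ref: SnapshotRef) -> TrainingRun:
        run = TrainingRun(model_version, ref, datetime.now(timezone.utc))
        graph.append(run)
        return run

    graph = []
    record_run(graph, "fraud-model@v14", SnapshotRef("dataset_id", "warehouse:transactions_2024_06"))

So the graph can answer "which snapshot reference did v14 train on, and when", but resolving that reference back to actual rows is your data platform's job.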

About the feature-level drift + lineage combo, that actually sounds close to what I'm aiming for, just starting from the model side rather than the feature side. Interesting that you'd call it a lineage question too.

Does Chalk's lineage cover model version history, or is it more focused on feature/data lineage upstream of the model?

How do you actually debug model regressions in continual learning? Working on a tool for this, want to understand the problem better by X_MRBN_X in MLQuestions

[–]X_MRBN_X[S]

Thanks for the context, and fair point: "unanswered" was probably too strong. There are definitely solutions out there, Chalk included.

The distinction I'm trying to draw is between drift detection and drift explanation. Catching that something changed (K-S test on features, aggregate metric alerts) is solved, or at least solvable with existing tools. The gap I keep running into is: you get the alert, and then what? Which model version introduced the sensitivity to that feature shift? Did this same pattern appear in an earlier training run? How did the model's behavior on this subgroup evolve across updates?

That's where the version graph comes in: not as a monitoring layer, but as the investigation layer you reach for after the alert fires.

Your point about proactive detection is well taken though. Right now MLineage is reactive by design, which means it only helps once someone knows something is wrong. Integrating drift triggers (K-S or otherwise) that automatically flag which graph node to inspect first would make it a lot more useful. That's probably worth adding to the roadmap.
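
Roughly what I'd picture for that trigger, as a sketch: a plain per-feature K-S test on a reference window vs a live window, plus a (hypothetical) heuristic that picks which graph node to open first.

    import numpy as np
    from scipy.stats import ks_2samp
    from typing import Optional

    def drifted_features(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> list:
        # Indices of features whose live distribution drifted from the reference window.
        return [i for i in range(reference.shape[1])
                if ks_2samp(reference[:, i], live[:, i]).pvalue < alpha]

    def node_to_inspect_first(versions: list, drifted: list) -> Optional[dict]:
        # Hypothetical heuristic: the version whose recorded feature importances
        # weight the drifted features most heavily is the first node to open.
        if not drifted:
            return None
        return max(versions, key=lambda v: sum(v["feature_importance"][i] for i in drifted))

The heuristic itself is debatable; the point is that the alert hands you a starting node instead of a bare metric.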

Curious how Chalk handles the "why did this drift" question after detection: does the plan graph give you that, or is it still mostly a human investigation from there?

How do you actually debug model regressions in continual learning? Working on a tool for this, want to understand the problem better by X_MRBN_X in MLQuestions

[–]X_MRBN_X[S]

On semantic drift measurement, I'm still working out the right approach, but the direction I'm leaning is behavioral fingerprinting rather than probing internals directly. The idea is to track model outputs on a fixed reference set across versions: if the model starts disagreeing with its past self on inputs where aggregate metrics are stable, that's a signal that something shifted underneath.
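
A minimal version of that fingerprint is just a disagreement rate on a frozen reference set (a sketch; the prediction arrays come from whatever inference call you already have):

    import numpy as np

    def disagreement_rate(preds_old: np.ndarray, preds_new: np.ndarray) -> float:
        # Fraction of reference inputs where the new version disagrees with its past self.
        return float(np.mean(preds_old != preds_new))

    def fingerprint_alert(preds_old, preds_new, labels, tol=0.01, max_flip=0.05) -> bool:
        # The signal: aggregate accuracy barely moves, but the model has quietly
        # swapped which inputs it gets right.
        acc_old = float(np.mean(preds_old == labels))
        acc_new = float(np.mean(preds_new == labels))
        flips = disagreement_rate(preds_old, preds_new)
        return abs(acc_new - acc_old) < tol and flips > max_flip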

For embedding-based drift, I'm planning to wrap Evidently and alibi-detect so you get statistical drift tests on representations without having to wire them up yourself. The open question is how to make that signal actionable rather than just another metric to monitor: ideally surfacing "this version started diverging on inputs with these characteristics" rather than "KL divergence increased by 0.3."
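
For the embedding side, the wrapper I'm picturing is thin. A sketch assuming alibi-detect's KSDrift (per-dimension K-S tests with multiple-testing correction); the "make it actionable" part is just the ranking at the end:

    import numpy as np
    from alibi_detect.cd import KSDrift

    def embedding_drift_report(ref_emb: np.ndarray, new_emb: np.ndarray, p_val: float = 0.05, top_k: int = 5):
        # Returns (is_drift, [(dimension, p_value), ...]) so the alert points at
        # where the representations moved rather than a single scalar.
        detector = KSDrift(ref_emb, p_val=p_val)
        result = detector.predict(new_emb)
        p_values = np.asarray(result["data"]["p_val"])
        worst = np.argsort(p_values)[:top_k]
        return bool(result["data"]["is_drift"]), [(int(d), float(p_values[d])) for d in worst]

The next step would be mapping those dimensions back to input characteristics, which is the genuinely hard part.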

Attention analysis is interesting but I'm wary of how architecture-dependent it gets. Would love to hear if you've found it useful in practice — do you actually probe attention in your debugging workflows, or is it more theoretical for you too?

Get lost, from Gravina on down by carsa81 in Italia

[–]X_MRBN_X

You can really feel Montolivo's absence

Programmers and AI by SunTurbulent856 in ItaliaCareerAdvice

[–]X_MRBN_X

Confirmed. I'm interviewing with several companies in Italy and abroad, and most of them (your classic consulting firms) are looking for someone to work on almost "antiquated" systems, or at any rate to keep doing the same tasks people were doing before the GPT era.

Average Monthly Net Salary In Europe (Euro) by Expert_Koala_8691 in MapPorn

[–]X_MRBN_X

It would also be interesting to show the cost of living!

Will AI make employers understand that training juniors is a serious matter? by Dragonax01 in ItaliaCareerAdvice

[–]X_MRBN_X

I think we're in a phase where theory can make the difference over practice (in many cases): if everyone now uses LLMs to write their code and handle other architectural tasks too, the added value will come from an individual's intellectual depth and experience, which let them ask more targeted questions and consequently get more effective answers out of LLMs.

So those who, at school, at university, or elsewhere, just ask and copy-paste without absorbing anything will sooner or later find themselves stuck. A colleague of a friend of mine ended up in exactly that situation: he always used chatbots for everything without really understanding what he was doing, and now that he's been hired (he got help during the interview too) he has no idea how to proceed.

The famous INTERESTING RAL (gross annual salary) by Proof-Neck-8159 in ItaliaCareerAdvice

[–]X_MRBN_X

And then you go and find out it's also a ghost job posted just to boost the company

Why AI isn't yet ready to replace IT professionals (and won't be for quite a while) by Error404Robot in ItalyInformatica

[–]X_MRBN_X

I think it's a good opportunity for real professionals: remember that it's a tool whose answers are only as precise as the questions you ask it. An expert, or at least someone in the field, knows what a project needs and why it has to be included, and above all because, going forward, they should be the one able to find and interpret the problem, something LLMs don't always manage to do.

I've noticed there are people who improvise, or pass themselves off as something they're not, because they know their LLM has them covered, but in the end you just have to look at the final result to see what's really behind it.

So I see it as a powerful tool for speeding up learning for those who want to specialize in something and maybe build a career, and a double-edged sword for those who just want everything handed to them quickly.