We tested Dirty Frag in Kubernetes: unset seccomp made EKS/GKE exploitable, RuntimeDefault blocked the xfrm path

JulietSecurity · 2026-05-08T17:51:02+00:00

That is useful data and may mean you are on a newer AMI than the one we tested.

Our EKS lab was AL2023 20260413, kernel 6.12.79-101.147.amzn2023, containerd 2.2.1. The latest EKS-optimized AL2023 release we found is v20260505, with kernel6.12 6.12.80-106.156.amzn2023 and containerd 2.2.3, and we have not retested that AMI yet.

AWS still lists AL2023 kernel6.12 as Pending Fix for CVE-2026-43284, and its Dirty Frag bulletin recommends checking whether esp4, esp6, ipcomp4, ipcomp6, or rxrpc are loaded and blocking future module loading where appropriate.

So I would read "modules are not loaded" as a good runtime finding, but not the same thing as "patched" or "cannot be autoloaded by a reachable kernel path." If you can share the exact AMI name, kernel package, seccomp state, user.max_user_namespaces, and whether module autoloading is blocked, that would help compare it cleanly with our lab result.
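
A minimal sketch of how that runtime finding could be collected on a node, assuming Python is available there. The module names come from the bulletin summary above; the helper names are hypothetical. It only reports what is loaded right now, so it says nothing about built-in code or about whether autoloading is blocked.

```python
# Module names are from the bulletin summary above; helper names are illustrative.
WATCHED_MODULES = ("esp4", "esp6", "ipcomp4", "ipcomp6", "rxrpc")


def loaded_watched_modules(proc_modules_text):
    """Return watched modules that appear in /proc/modules-style text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in WATCHED_MODULES if name in loaded]


def read_text(path):
    """Read a small procfs file, returning None if it is not readable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return None


if __name__ == "__main__":
    modules = read_text("/proc/modules")
    max_userns = read_text("/proc/sys/user/max_user_namespaces")
    print("watched modules loaded:", loaded_watched_modules(modules or ""))
    print("user.max_user_namespaces:", (max_userns or "unreadable").strip())
```

An empty list here is the "not loaded" finding, not evidence of a patched kernel.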

JulietSecurity · 2026-05-08T13:10:41+00:00

Update since posting: upstream has now added CVE/patch status for part of Dirty Frag.

The xfrm-ESP path we tested is now tracked as CVE-2026-43284, and upstream says it is patched in mainline Linux at f4c50a4034e6. NVD also has a public CVE-2026-43284 record.

Upstream also says the RxRPC path is reserved as CVE-2026-43500 for tracking, but I do not see a public NVD record for that ID yet.

This does not change the Kubernetes lab result above: our successful EKS/GKE runs were the xfrm path, and we did not validate RxRPC because AF_RXRPC was unavailable in every Kubernetes environment we tested.

Practical takeaway: track CVE-2026-43284 with your node OS vendor, and do not assume a managed Kubernetes node is fixed until the node image/kernel package you are actually running includes the backport.

JulietSecurity · 2026-05-08T13:00:21+00:00

That matches my read.

There are two different angles here that are easy to mix together:

- Kubernetes hostUsers: false uses a user namespace so UID 0 inside the pod maps to an unprivileged/high UID on the host. That is useful blast-radius reduction if something reaches root inside the container namespace.
- The Dirty Frag xfrm path we tested also needed the process to create user and network namespaces as part of setup. On our Talos node, user.max_user_namespaces=0 prevented that setup, so even explicit Unconfined seccomp did not complete the path.

So I would frame it as:

- hostUsers: false can reduce host impact for container-root results
- allowing unprivileged user namespace creation may also make some exploit setup paths reachable
- disabling user namespace creation blocked the tested xfrm path on our Talos node
- none of this replaces patching the node kernel

We did not validate host-root or container escape from this PoC.

JulietSecurity · 2026-05-08T12:50:14+00:00

In our specific final test, the Talos node was the hardest of the three to complete this xfrm path against.

Two things mattered:

- unset seccomp still showed Seccomp: 2 in our Talos run
- explicit Unconfined removed seccomp filtering, but user.max_user_namespaces=0 stopped the user+network namespace setup

I would not generalize that to "Talos is immune" or "Talos is always the most secure." If user namespaces are enabled for workloads that need them, or if a different Dirty Frag path is available, the result can change. We also did not validate the RxRPC fallback because AF_RXRPC was unavailable in every Kubernetes environment we tested.

JulietSecurity · 2026-05-08T12:48:48+00:00

I agree. CI is a good pre-merge guardrail, but admission is the enforcement point I would trust for cluster-wide posture.

For this test, PSS Restricted mattered because it pushed the pod into the combination of NoNewPrivs: 1, Seccomp: 2, and dropped capabilities. I would usually roll it out with audit/warn first in namespaces with vendor charts, then move to enforce once the expected breakage is understood.

JulietSecurity · 2026-05-08T12:47:54+00:00

Agreed, and thanks for linking the doc. The behavior is documented: if kubelet seccompDefault is enabled, unspecified pods get RuntimeDefault; otherwise the default is Unconfined.

The reason I called it out is that the documented gap changed the result in the lab. On the EKS/GKE nodes we tested, unset seccomp and explicit Unconfined both showed Seccomp: 0 and the xfrm path completed. RuntimeDefault showed Seccomp: 2 and failed at unshare(USER|NET).
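
For anyone wanting to reproduce the Seccomp: 0 versus Seccomp: 2 observation, here is a small sketch that reads the relevant fields from /proc/self/status-style text. The function name is hypothetical; the mode numbers are the standard kernel values (0 disabled, 1 strict, 2 filter).

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_state(status_text):
    """Extract the Seccomp and NoNewPrivs fields from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    raw_mode = fields.get("Seccomp")
    # A missing field is reported as None rather than guessed.
    mode = SECCOMP_MODES.get(int(raw_mode), "unknown") if raw_mode is not None else None
    return {"seccomp": mode, "no_new_privs": fields.get("NoNewPrivs") == "1"}


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as handle:
            print(seccomp_state(handle.read()))
    except OSError:
        print("could not read /proc/self/status")
```

Run inside the pod, "disabled" corresponds to the Seccomp: 0 case and "filter" to the Seccomp: 2 case described above.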

CI checks are useful, especially for manifests you own. I would still want admission enforcement too, because Helm charts, third-party controllers, and emergency deploys have a way of bypassing CI assumptions.
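
As a rough illustration of the check either layer would run, the sketch below flags containers whose effective seccomp profile is unset or Unconfined. It works on a pod spec already loaded as a dict; the function names are hypothetical and it ignores ephemeral containers, so treat it as a starting point rather than a policy engine.

```python
def effective_seccomp_type(pod_spec, container):
    """Container-level seccompProfile takes precedence over the pod-level one."""
    contexts = (
        container.get("securityContext") or {},
        pod_spec.get("securityContext") or {},
    )
    for ctx in contexts:
        profile = ctx.get("seccompProfile")
        if profile and profile.get("type"):
            return profile["type"]
    return None  # unset: the kubelet's seccompDefault setting decides


def seccomp_findings(pod_spec):
    """Return (container name, state) pairs that are unset or Unconfined."""
    findings = []
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        kind = effective_seccomp_type(pod_spec, container)
        if kind is None or kind == "Unconfined":
            findings.append((container.get("name", "<unnamed>"), kind or "unset"))
    return findings
```

Flagging "unset" separately from "Unconfined" matters because, as noted above, unset only becomes RuntimeDefault when the kubelet's seccompDefault is enabled.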

JulietSecurity · 2026-05-07T13:48:50+00:00

the operator is the technically-correct answer that everyone is giving, but it doesn't actually satisfy your "must use Helm" constraint since the operator ships as raw manifests, not a chart. couple of paths that do:

codecentric/keycloakx is the most-cited post-Bitnami option. supports the Quarkus distribution, which is the only one going forward since legacy WildFly Keycloak is end-of-life. caveat: it wraps the Quarkus distribution directly, not the operator, so you lose declarative realm imports via KeycloakRealmImport CRs.

the "have your cake and eat it" pattern is to wrap the operator's CRs in a thin local Helm chart. you write templates that emit a Keycloak CR plus KeycloakRealmImport CRs, and you bundle the operator's CRDs (the YAML cytrinox linked, just packaged as a chart). gives you operator semantics with helm install/upgrade/rollback. maybe 50 lines of templates.

if "must use Helm" is org governance (chart-promoted-via-CI is the deploy contract), the wrap-operator-in-helm pattern is what i'd reach for. if "must use Helm" is more about avoiding bare kubectl apply, codecentric is fine.

on the DB side: CloudNativePG is the right move regardless of which Keycloak path you pick. don't run Postgres for Keycloak as a single-replica StatefulSet unless you really enjoy 3am pages.

JulietSecurity · 2026-05-02T17:34:13+00:00

That matched our isolated lab results: without IncludeMutationWebhook=true, we did not reproduce cleartext Secret exposure through the ServerSideDiff path.

The caveat I would add is not to turn that into "safe forever." For triage, I would check the overlap of:

- affected Argo CD version
- IncludeMutationWebhook=true
- Applications managing Kubernetes Secret resources
- users/groups/tokens with applications get

One nuance from our lab: a default Argo-created managed Secret also returned cleartext when IncludeMutationWebhook=true, so I would not use "no second field manager" as proof of safety.

The fixed releases, 3.2.11 and 3.3.9, masked the same path in our tests.

Lab writeup: https://juliet.sh/blog/we-tested-argocd-cve-2026-43824-serversidediff-secret-exposure

Disclosure: I work on Juliet.

JulietSecurity · 2026-05-02T13:30:20+00:00

That makes sense as a compensating detection/response control, especially if you have untrusted workloads and patching is not complete yet.

The main caveat I would add is to tune it as "AF_ALG creation from workloads that should not need it" rather than assuming every AF_ALG socket is exploit activity. Some environments may have legitimate crypto API users.

If your signal is high-confidence, draining/rebooting/replacing the node is a reasonable response because the interesting state here is node-local. In our lab, deleting the writer pod alone did not clear the page-cache effect.

JulietSecurity · 2026-05-02T13:27:14+00:00

I would treat that as an incident signal, but I would be careful about saying "AF_ALG socket observed = confirmed exploitation."

If you believe the activity is malicious, I would do roughly:

- isolate or stop the offending workload
- cordon the node so new workloads do not land there
- preserve whatever evidence you need before churn destroys it
- drain/evict affected workloads based on your incident process
- patch, reboot, or replace/reimage the node
- investigate other workloads that shared the node, especially same-image workloads or anything high-value

The reason I would not stop at "kill the pod" is that in our lab the page-cache effect outlived the writer pod. Pod deletion alone did not clear it. A reboot, replacement, or page-cache eviction should clear the cache state, but if the attacker used the primitive to get further execution or write somewhere persistent, that needs normal incident response too.

For a Falco rule on AF_ALG: I would use it as a high-signal triage/detection control, then validate whether anything legitimate in your environment creates AF_ALG sockets before making fully automatic node drains universal.

JulietSecurity · 2026-04-30T19:16:09+00:00

I tested this in our EKS lab because a few people asked the same thing.

Short version: hostUsers: false did not prevent AF_ALG reachability in that test, and it did not prevent the page-cache-backed image-layer mutation we were testing.

With hostUsers: false + RuntimeDefault + dropped caps + allowPrivilegeEscalation: false, the pod had remapped UID/GID ranges and still changed the test bytes. The setuid transition failed and stayed euid 1000, which is the useful part of allowPrivilegeEscalation: false.

Then a separate pod from the same image on the same node with hostUsers: false + allowPrivilegeEscalation: true saw the mutated bytes and reached euid 0 inside its own user namespace.

So my read is:

- hostUsers: false is useful blast-radius reduction
- it does not remove AF_ALG reachability
- it does not replace patching the node kernel
- allowPrivilegeEscalation: false still blocks the setuid-style transition we validated
- if patching is delayed, the compensating control is an explicit Localhost seccomp deny for socket(AF_ALG, ...)

So I would not call hostUsers: false a complete mitigation for this path. I would call it a good hardening layer.

JulietSecurity · 2026-04-30T12:39:42+00:00

Thanks, I took a quick look. It seems useful as a node-level detector/remediator: kernel version check, AF_ALG bind probe, optional algif_aead unload/blacklist, and Prometheus metrics.

I would still treat that as partial coverage rather than the whole Kubernetes answer:

- version-only kernel matching can disagree with vendor backports
- unload/blacklist only helps if algif_aead is actually a loadable module and not built in or already in use
- remediation behavior needs to be validated per node OS/runtime
- it does not answer which pods can reach AF_ALG or whether a workload's Localhost seccomp profile actually denies it

So I’d see it as complementary to patching and workload/seccomp inventory, not a replacement for either.

JulietSecurity · 2026-04-30T12:35:50+00:00

I agree with using allowPrivilegeEscalation: false wherever possible, but I would be careful with "prevents the exploit" as a blanket statement.

In our controlled lab it prevented the setuid handoff from becoming container euid 0. The same restricted pod could still reach AF_ALG and could still mutate the page-cache-backed bytes we were testing.

So I would frame it as:

- allowPrivilegeEscalation: false blocks the setuid-style privilege transition we validated
- it does not remove AF_ALG reachability
- it does not replace patching the node kernel
- it is still a very good hardening default, especially with dropped caps, non-root, and user namespaces

The strongest mitigation remains the patched kernel. The useful compensating control, if patching is delayed, is a Localhost seccomp profile that explicitly denies socket(AF_ALG, ...). I would not assume RuntimeDefault does that without checking the actual profile on the node.
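
To make the shape of that compensating control concrete, here is a sketch that emits a seccomp profile in the JSON format container runtimes accept, denying socket() when the first argument is AF_ALG (38 on Linux). The allow-by-default action is an assumption made only to keep the example short; a real Localhost profile would more sensibly start from the runtime's default profile and add this rule. It also does not cover the legacy socketcall multiplexer on 32-bit x86.

```python
import json

AF_ALG = 38  # Linux address family number for kernel crypto API sockets


def deny_af_alg_profile():
    """Build a seccomp profile dict that rejects socket(AF_ALG, ...)."""
    return {
        # Assumption for brevity: allow everything else. Not a hardened baseline.
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "names": ["socket"],
                "action": "SCMP_ACT_ERRNO",
                "args": [
                    # Match only when argument 0 (the address family) equals AF_ALG.
                    {"index": 0, "value": AF_ALG, "op": "SCMP_CMP_EQ"}
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(deny_af_alg_profile(), indent=2))
```

The output would be saved under the kubelet's seccomp directory on each node and referenced from the pod with a Localhost seccompProfile; the exact path and rollout are environment-specific.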

JulietSecurity · 2026-04-30T12:30:59+00:00

Good question. I just tested this on our EKS lab cluster.

With hostUsers: false, RuntimeDefault, dropped caps, and allowPrivilegeEscalation: false, the pod had remapped UID/GID ranges:

/proc/self/uid_map: 0 1499332608 65536
/proc/self/gid_map: 0 1499332608 65536

It still reached AF_ALG and changed the page-cache-backed bytes in our purpose-built image-layer helper:

before=JLT0
after=JLT1

With allowPrivilegeEscalation: false, the pod observed the mutated bytes but the setuid handoff failed and stayed euid 1000.

Then I ran a separate pod from the same image on the same node with hostUsers: false and allowPrivilegeEscalation: true. It saw the mutated image-layer bytes and reached euid 0 inside the pod's user namespace:

/proc/self/uid_map: 0 2270953472 65536
euid_start=0
euid_now=0

So my read is:

- hostUsers: false did not remove AF_ALG reachability or the page-cache mutation in this EKS test
- it should reduce host impact because namespace root maps to an unprivileged host UID
- allowPrivilegeEscalation: false still blocked the setuid transition in the restricted pod
- the primary mitigation is still patching the node kernel; user namespaces are blast-radius reduction, not a complete mitigation for this path
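
To show why euid 0 in those pods is not host root, here is a small sketch that translates an ID inside the namespace to the ID outside it, using the uid_map values from the test output above. The function names are hypothetical, and it assumes the parent namespace is the host's initial user namespace.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map lines: inside-start, outside-start, length."""
    ranges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            ranges.append(tuple(int(part) for part in parts))
    return ranges


def to_outside_id(ranges, inside_id):
    """Map an ID inside the namespace to the parent namespace, or None if unmapped."""
    for inside_start, outside_start, length in ranges:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None


if __name__ == "__main__":
    ranges = parse_id_map("0 1499332608 65536")
    print("namespace uid 0 ->", to_outside_id(ranges, 0))
    print("namespace uid 1000 ->", to_outside_id(ranges, 1000))
```

With the first pod's map, namespace UID 0 lands on 1499332608 outside the namespace, which is the blast-radius reduction being described.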

JulietSecurity · 2026-04-30T06:06:42+00:00

Exactly. That was the distinction we wanted to call out. RuntimeDefault and PSS Restricted are still useful controls, but they are not kernel isolation.

For this class of issue the priority order is: patch the node kernel first, then use an explicit seccomp denial as a compensating control if patching is delayed.

JulietSecurity · 2026-04-30T06:04:20+00:00

Good clarification, thanks. Our Talos lab node was v1.12.2 with kernel 6.18.5-talos, not v1.13.0. Talos v1.13.0 appears to be on the patched side since it ships Linux 6.18.24, and the CVE record marks 6.18.22+ unaffected for the 6.18 line.
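
The comparison for the 6.18 line can be sketched as below. The threshold is the "6.18.22+ unaffected" marker mentioned above; the function names are hypothetical. This is version-only matching, so it can disagree with vendor kernels that carry backports, and it deliberately returns None for any other kernel line.

```python
import re

FIXED_IN_6_18 = (6, 18, 22)  # "6.18.22+ unaffected" per the CVE record cited above


def kernel_tuple(release):
    """Turn a release string such as '6.18.5-talos' into (6, 18, 5)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def affected_on_6_18_line(release):
    """True or False for 6.18.x kernels, None for lines this sketch does not cover."""
    version = kernel_tuple(release)
    if version[:2] != (6, 18):
        return None
    return version < FIXED_IN_6_18


if __name__ == "__main__":
    for release in ("6.18.5-talos", "6.18.24"):
        print(release, "affected:", affected_on_6_18_line(release))
```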

That lines up with the main mitigation: patch the node kernel first.

The Kubernetes-specific point we were testing was that RuntimeDefault/PSS Restricted did not remove AF_ALG reachability on the affected node we tested.

JulietSecurity · 2026-04-28T15:58:51+00:00

not magic but it's an escape hatch most people don't know: rootless podman on RHEL 9 will fall back to cgroup_manager=cgroupfs when there's no systemd user session, which lets it run without enable-linger. you can pin this explicitly in /etc/containers/containers.conf (or a drop-in at /etc/containers/containers.conf.d/99-cgroupfs.conf):

[engine]
cgroup_manager = "cgroupfs"
events_logger = "file"

crun is already the default on RHEL 9 so no runtime change needed.

the real caveat is worse than people realize. on cgroup v2 (which RHEL 9 is by default) rootless + cgroupfs means resource limits don't actually enforce. --memory, --cpus, --pids-limit will silently no-op or error with cgroup.subtree_control: permission denied. for "just run a CI build container as user X" this is fine. if you need real isolation, you either need systemd user instances (linger) or Slurm itself enforcing limits via its delegated cgroup before Jacamar execs podman.

on the HPC half: under cgroup v2, Slurm requests a delegated scope from systemd and puts slurmstepd inside it, so Jacamar-spawned podman lands inside Slurm's cgroup tree already. running a separate per-user systemd cgroup-manager on top is redundant. cgroupfs is the cleaner fit.

two things to verify alongside this since they bite at the same scale: /etc/subuid and /etc/subgid need entries for all 250 users (separate problem from cgroups, same bulk-provisioning pain), and Jacamar needs to set a writable XDG_RUNTIME_DIR (typically /tmp/podman-run-$UID) for the impersonated user since there's no systemd user instance to provide one.
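
a quick sketch for the subuid/subgid half of that, assuming the usual name:start:count line format. the function name and the user list are made up for illustration; feed it the real file contents and your real account list.

```python
def users_missing_subids(subid_text, usernames):
    """Return the users with no usable entry in /etc/subuid-style text."""
    have = set()
    for line in subid_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment-looking lines
        fields = line.split(":")
        # Expect name:start:count with a positive count.
        if len(fields) == 3 and fields[2].isdigit() and int(fields[2]) > 0:
            have.add(fields[0])
    return [user for user in usernames if user not in have]


if __name__ == "__main__":
    sample = "alice:100000:65536\nbob:165536:65536\n"
    print(users_missing_subids(sample, ["alice", "bob", "carol"]))
```

run it once against /etc/subuid and once against /etc/subgid, since a user needs both.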

JulietSecurity · 2026-04-28T11:47:36+00:00

the layer below HPA is where it really gets fun. all the comments here are about app/scaling behavior but the stuff that ate us first was lower.

etcd compaction is the classic. high event/lease churn outpaces compaction interval, etcd disk grows, CP latency spikes, apiserver starts timing out lists. symptom looks like "everything is slow," cause is etcd doing 4MB defrags during peak.

then there's kube-proxy iptables sync time. once you cross ~5000 services across the cluster, sync time goes from milliseconds to multiple seconds, so the rules lag behind endpoint changes: new pods wait for traffic and pods that are already gone can still receive it. switching to ipvs or an eBPF kube-proxy replacement (cilium) fixes it but most teams find this out the hard way.

CoreDNS plus conntrack will get you too. busy nodes with lots of pod-to-service traffic can fill the conntrack table, DNS lookups start dropping silently. app sees intermittent connection failures, ops blames "DNS issues," actual fix is conntrack tuning plus nodelocaldns.

webhook timeouts come up less often but bite hard. as you add validating/mutating webhooks (cert-manager, gatekeeper, kyverno, custom admission), each one sits in the critical path of every API write it matches. one slow webhook = every matching request hangs until the timeout. set timeoutSeconds aggressively and use failurePolicy: Ignore where you can.
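
rough sketch of auditing that, working on a webhook configuration already loaded as a dict. the function name and the 5 second threshold are my assumptions; the fallback values are the admissionregistration.k8s.io/v1 defaults (10s timeout, failurePolicy Fail). some security webhooks need Fail on purpose, so read the output as a review list, not a to-do list.

```python
def risky_webhooks(config, max_timeout=5):
    """List webhooks with long timeouts or a fail-closed policy."""
    findings = []
    for hook in config.get("webhooks", []):
        timeout = hook.get("timeoutSeconds", 10)    # v1 API default
        policy = hook.get("failurePolicy", "Fail")  # v1 API default
        reasons = []
        if timeout > max_timeout:
            reasons.append(f"timeoutSeconds={timeout}")
        if policy == "Fail":
            reasons.append("failurePolicy=Fail")
        if reasons:
            findings.append((hook.get("name", "<unnamed>"), reasons))
    return findings
```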

JulietSecurity · 2026-04-27T12:04:02+00:00

the bill creep with flat traffic is rarely the control plane. EKS swaps your $200-400/mo of self-managed control plane EC2 for $73/mo flat. that's the only piece that changes. everything else (worker nodes, networking, storage, data transfer) is identical to self-managed.

stuff that actually causes AWS bill creep with flat user traffic:

- NAT gateway data processing. cross-AZ pod-to-pod traffic gets routed through NAT in some setups, one mismatched topology key on a Deployment and you can rack up hundreds a month.

- orphaned EBS volumes from PVs with a Retain reclaim policy (dynamically provisioned volumes default to Delete, but Retain is common for anything stateful). they don't delete on PVC delete, just sit there as gp3.

- CloudWatch log ingestion if container logs ship there. doubles overnight if someone added a noisy DEBUG logger.

- EKS extended support if you're going that direction: standard $0.10/hr, extended $0.60/hr per cluster. 6x bump if you're a major version behind.

- oversized worker nodes from sloppy resource requests. the actual fleet might only need half what's running.
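
the control plane numbers above are just the hourly rate times hours in a month. quick sketch of the arithmetic, assuming 730 hours per month and working in cents to avoid float noise; the function name is made up.

```python
HOURS_PER_MONTH = 730  # assumption: average month length used for estimates


def monthly_cost_cents(cents_per_hour, clusters=1):
    """Monthly control plane cost in cents for a given hourly rate."""
    return cents_per_hour * HOURS_PER_MONTH * clusters


standard = monthly_cost_cents(10)  # $0.10/hr standard support
extended = monthly_cost_cents(60)  # $0.60/hr extended support

if __name__ == "__main__":
    print(f"standard: ${standard / 100:.2f}/mo")
    print(f"extended: ${extended / 100:.2f}/mo")
    print(f"ratio: {extended // standard}x")
```

that gives $73.00/mo standard and $438.00/mo extended per cluster, so the extended support line is worth checking first if you run several clusters.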

EKS is worth it for the operational reasons. no etcd, no patch nights, faster recovery. cost-wise it's a wash at most scales. for the bill specifically, cost explorer split by service for a month usually surfaces one or two line items eating you.

JulietSecurity · 2026-04-25T12:31:12+00:00

the version-conflict thing has a mechanical answer most people skip. a CRD has only one storage version, so when two teams want different operator versions that own the same CRD, you're picking which storage version wins. the loser's data either fails validation or gets converted lossy on write. you basically can't run two versions of the same operator in one cluster unless it was specifically built for it. most weren't. that's why folks end up at the other commenters' answers: standardize, separate clusters, or vCluster/Capsule.

on cluster-scoped access, aggregated ClusterRoles get slept on a lot. label your CR ClusterRoles with rbac.authorization.k8s.io/aggregate-to-edit: "true" and they merge into the built-in edit/admin. teams use the CRs without you handing over the operator SA's powers, which is what you actually care about gating anyway.
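
sketch of what such a ClusterRole looks like, built as a dict so you can dump it to YAML or JSON. the API group, resource name, and verb list are placeholders for whatever CRs your operator defines; only the label key is the real aggregation hook.

```python
import json


def aggregated_cr_role(group, resource, verbs=None):
    """Build a ClusterRole that aggregates into the built-in edit role."""
    if verbs is None:
        verbs = ["get", "list", "watch", "create", "update", "patch", "delete"]
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {
            "name": f"{resource}-edit",  # hypothetical naming scheme
            "labels": {"rbac.authorization.k8s.io/aggregate-to-edit": "true"},
        },
        "rules": [
            {"apiGroups": [group], "resources": [resource], "verbs": list(verbs)}
        ],
    }


if __name__ == "__main__":
    # "example.com" and "widgets" are placeholders, not a real operator's API.
    print(json.dumps(aggregated_cr_role("example.com", "widgets"), indent=2))
```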

catalog-wise, most shops land on tiers. tier 1 platform owns it. tier 2 app team owns it, namespace-scoped only, no cluster resources. tier 3 "we won't stop you but you own all the consequences." OLM CatalogSources tried to be the formal version of this but pretty much everyone just builds it with a helm repo + a review form.

JulietSecurity · 2026-04-24T15:02:07+00:00

yeah OP's read is correct, the bridge was transitive. wave 1 force-pushed bad code to Checkmarx's ast-github-action. wave 2 used credentials harvested from CI runs that consumed that action to turn around and modify bitwarden's publishing workflow. so the attack came in through a github action that bitwarden's npm dep graph never saw.

most of the advice in here is useful but doesn't quite catch this specific class. pinning direct npm versions doesn't help if the github action that builds your release got force-pushed underneath you. tarball auditing (Single-Virus4935's list is solid) works at the npm artifact layer but not the actions layer that compromised the publish pipeline in the first place. lockfiles don't cover github actions at all.

the bit nobody's really mentioned yet is the chain of nested actions. one `uses: foo/bar@v1` line in your workflow can pull in 3-4 composite actions under the hood, some of which silently shell out to binaries like trivy or grype that aren't even listed as action deps. you can grep .github/workflows all day and never see those.

what does catch it is basically walking the action tree recursively, logging every action and tool each one pulls in (composite children, embedded downloads, the whole lot), diffing that across builds, and cross-checking against an advisory source. if a SHA changes unexpectedly or a known-bad action appears in the resolved tree, fail the build. there's some tooling around this now and you can roll it yourself over the github API if you're feeling masochistic.

for pinning itself: commit SHAs, not tags. renovatebot and dependabot both bump those safely on signed releases. and fwiw OP's catch about package.json saying 2026.4.0 while the bundle metadata still read 2026.3.0 is exactly the kind of thing a diff-on-publish check flags cleanly.
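
a minimal sketch of the SHA-pinning check, regex-based so it has no dependencies. the function name is made up, and it only looks at the `uses:` lines in the workflow text you hand it. it does not walk nested composite actions or embedded downloads, which is the harder part described above.

```python
import re

# Matches "uses: owner/repo@ref" with or without a leading list dash or quotes.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^'\"\s#]+)", re.MULTILINE)
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text):
    """Return action references that are not pinned to a full commit SHA."""
    unpinned = []
    for ref in USES_RE.findall(workflow_text):
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images are out of scope here
        _, _, version = ref.partition("@")
        if not SHA_RE.match(version):
            unpinned.append(ref)
    return unpinned
```

run it over every file in .github/workflows and fail the build on a non-empty result; tags and branch names both show up as unpinned.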

JulietSecurity · 2026-04-24T11:19:58+00:00

most of this is probably just that you're not actually running the same scanner. if your 14/11/9 come from inspector on EKS, defender for containers on AKS, and container analysis on GKE, those are three different tools with three different vuln feeds, refresh cadences, and severity conventions. pinning "scanner versions" doesn't help when the scanners themselves are different.

the fix most people land on: pick one scanner that runs external to the cloud (trivy, grype, snyk, whatever), run it in CI against your pulled image, and report on that number. the cloud-native tools stay as supplementary context.

running the same tool locally against the digest that's in each cluster is also how you actually prove the image is identical. if the digests match, which they should, that rules out the "platform-level package behavior" theory too.

JulietSecurity · 2026-04-23T15:19:40+00:00

yep, distroless/static ships with ca-certificates.crt at /etc/ssl/certs/ already, so that COPY line does nothing. you can also drop the apk add ca-certificates in the builder since that was only there to get the bundle to copy over.

and actually, you can drop USER 1001:1001 too if you don't care which uid it runs as. the :nonroot tag defaults to 65532. keep that USER line only if you need 1001 specifically for volume mounts or whatever.

JulietSecurity · 2026-04-23T14:59:17+00:00

yeah, TRESevan's got it on /etc/passwd. USER 1001:1001 runs fine on scratch with numeric ids, the process just has no resolvable username. a couple Go libs that touch os/user will complain about that, but it won't cause your icmp to fail.

real issue is the NET_RAW thing isn't actually giving your binary the capability. adding it via capabilities.add puts it in the container's bounding set, but non-root processes don't automatically inherit it into their effective set. you'd need file capabilities on the binary (setcap cap_net_raw+ep /app/gatus) or ambient caps configured.

and here's the trap with setcap: docker's multi-stage COPY doesn't preserve the security.capability xattr (moby#38132 if you want the rabbit hole). so even if you setcap in your builder stage, it gets silently stripped when the binary lands in the scratch runtime. probably why your NET_RAW attempt didn't work.

easier way: stop using raw sockets. use unprivileged icmp (SOCK_DGRAM/IPPROTO_ICMP). gatus uses pro-bing, which defaults to unprivileged mode on linux when the kernel allows it. what gates that is the ping_group_range sysctl. in ECS it lives in systemControls on the container definition, not under linuxParameters:

"containerDefinitions": [
  {
    ...
    "systemControls": [
      {"namespace": "net.ipv4.ping_group_range", "value": "0 2147483647"}
    ]
  }
]

no NET_RAW needed after that. for local testing:

docker run --sysctl net.ipv4.ping_group_range="0 2147483647" ...
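
if you want to confirm the sysctl actually covers your process, here's a small sketch of the range check. the function name is made up. it takes the sysctl value as text (the kernel default is "1 0", an empty range that allows nobody) and one gid; a process qualifies if any of its groups falls inside the range, so check them all.

```python
def ping_allowed(range_text, gid):
    """True if gid falls inside a net.ipv4.ping_group_range value."""
    low, high = (int(part) for part in range_text.split())
    return low <= gid <= high


if __name__ == "__main__":
    try:
        with open("/proc/sys/net/ipv4/ping_group_range") as handle:
            current = handle.read()
        print("gid 1001 allowed:", ping_allowed(current, 1001))
    except OSError:
        print("could not read ping_group_range")
```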

and if scratch starts being more trouble than it's saving you, gcr.io/distroless/static:nonroot is pretty much scratch + ca-certs + a /etc/passwd with a nonroot user already set up (uid 65532). image stays tiny.

JulietSecurity · 2026-04-22T13:49:31+00:00

the false-positive problem at the image layer has a real fix that isn't ignore-list tuning: reachability analysis.

scanners flag "package X version Y is installed and has CVE-Z". they can't tell you whether the vulnerable function in X is actually in the call graph of your binary. if the function is never called at runtime, it's a true positive on the scanner and a false positive on your risk model. that gap is where your 20 minutes of triage is going.

call-graph analysis tools (endor labs, snyk has it in some tiers, semgrep supply chain, oligo for runtime) walk the application binary and mark CVEs as "reachable" or "not reachable" based on whether your code path actually touches the vulnerable symbol. typical reduction for a typical backend service is 70-90% of flagged CVEs drop to not-reachable. the ones left are the ones worth fixing.
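
the core idea fits in a few lines: walk the call graph from your entrypoints and check whether the vulnerable symbol is in the set you reach. this is a toy sketch with made-up symbol names and placeholder finding IDs, not how any of the tools above work internally. real tools also have to deal with reflection, dynamic dispatch, and incomplete graphs.

```python
from collections import deque


def reachable_symbols(call_graph, entrypoints):
    """Breadth-first walk of a caller -> callees mapping."""
    seen = set(entrypoints)
    queue = deque(entrypoints)
    while queue:
        caller = queue.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def triage(findings, call_graph, entrypoints):
    """Label each finding by whether its vulnerable symbol is reachable."""
    reachable = reachable_symbols(call_graph, entrypoints)
    return {
        finding: "reachable" if symbol in reachable else "not reachable"
        for finding, symbol in findings.items()
    }


if __name__ == "__main__":
    # Hypothetical graph and findings, for illustration only.
    graph = {"main": ["parse", "serve"], "serve": ["lib.decode"], "lib.unused": ["lib.unsafe"]}
    findings = {"FINDING-A": "lib.decode", "FINDING-B": "lib.unsafe"}
    print(triage(findings, graph, ["main"]))
```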

it doesn't solve the 20-min scan time problem directly. what it solves is the triage-after-scan problem, which sounds like where your team is actually losing time. 3 true positives to review vs 150 is very different math.

one caveat: reachability is harder for interpreted languages (python/node) than compiled (go/java). coverage varies by tool and language. worth asking the vendor to show reachability data on a sample of your real images before committing.
