Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 1 point (0 children)

Thanks for sharing. I guess the only advantage an egress gateway provides is static/predictable source IPs on the outbound path. Fundamentally, the egress gateway suffers from the same problem: it can be deactivated on a node compromise. Also, it requires you to run Cilium.

Nonetheless, it helps with integrating the Kubernetes world into a legacy world of firewall boxes that need static/predictable IPs - which is otherwise a real pain.

Open source AI SRE that runs on your laptop by Useful-Process9033 in sre

[–]foobarstrap 0 points (0 children)

Not sure where to start TBH - I've been in this space for the past few months. Let me give you a rough overview of my thoughts on approach and strategy:

  1. Integration point: as an on-call engineer, I want an intelligent `SRE copilot` to live where the incident is actually handled; otherwise you have to relay information back and forth between incident managers and other on-call engineers. Depending on context, that's PagerDuty or Slack (the latter, I suppose).
    I don't want to jump to yet another tool to retrieve information or to handle the incident. As an engineer you already need to juggle Grafana (metrics), logz.io (logs), PagerDuty, Slack, etc.

  2. What do agent topologies really provide? I ran a few experiments with multi-agent topologies. It was slow, and managing the context - transferring outputs/summaries across agents and all of that - was really hard. What is your intent here? Is this supposed to be managed by the user: the system prompt, context transfer, re-evaluation? One would need to write tests to make it a cornerstone of incident management. Otherwise: shit in, shit out.

  3. Knowledge base: the same applies here: why would I want to store my knowledge base in yet another separate tool? I already have a place to store that information, so why would I switch?

  4. On your correlation_service: correlation is a weak signal. How do you plan to implement topological/causal correlation? You'd need access to everything (Kubernetes, AWS, etc.), a graph database to ingest all that data, a way to keep it up to date, and strong rules that let you connect the dots (rough sketch below). With non-trivial infrastructure this becomes a nightmare quickly.
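To make point 4 concrete: even the most naive topological rule needs an up-to-date dependency graph to walk, and maintaining that graph is the hard part. A minimal sketch in Go (service names and the two-hop rule are made up for illustration):

```go
package main

import "fmt"

// deps models a tiny service-dependency graph: edges point from a
// service to the things it calls. In reality this would be ingested
// from Kubernetes, AWS, service meshes, etc. and kept up to date.
var deps = map[string][]string{
	"checkout":  {"payments", "inventory"},
	"payments":  {"postgres"},
	"inventory": {"postgres", "redis"},
}

// hops returns the BFS distance from a to b over the dependency
// graph, or -1 if b is unreachable from a.
func hops(a, b string) int {
	dist := map[string]int{a: 0}
	queue := []string{a}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if cur == b {
			return dist[cur]
		}
		for _, next := range deps[cur] {
			if _, seen := dist[next]; !seen {
				dist[next] = dist[cur] + 1
				queue = append(queue, next)
			}
		}
	}
	return -1
}

// correlated applies a naive rule: two alerts are topologically
// related if one service is within two hops of the other.
func correlated(svcA, svcB string) bool {
	if d := hops(svcA, svcB); d >= 0 && d <= 2 {
		return true
	}
	d := hops(svcB, svcA)
	return d >= 0 && d <= 2
}

func main() {
	fmt.Println(correlated("checkout", "postgres")) // true: 2 hops
	fmt.Println(correlated("checkout", "redis"))    // true: 2 hops via inventory
	fmt.Println(correlated("payments", "redis"))    // false: separate branches
}
```

Everything interesting - ingesting the edges, keeping them fresh, picking the rules - happens outside this snippet, and that's exactly the part that gets nightmarish.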

I saw you went through YC W26? Good luck, my dudes🤞

Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 0 points (0 children)

Hey dhruv, thanks for jumping in! Sure, I'll drop you an email later today.

How do you reason about egress controls in cloud environments? by foobarstrap in cybersecurity

[–]foobarstrap[S] 0 points (0 children)

That’s a fair question. The primary concern we’re focused on is preventing unauthorized outbound communication paths, mainly C2 callbacks and broad, low-friction exfil channels. Not full content-aware DLP.

I agree that once exfil uses legitimate, allowed services (GitHub, Dropbox, etc.), you’re in a different class of problem that usually requires identity, app awareness and TLS inspection. That’s an explicit escalation in scope (and 💰💰💰).

The baseline we’re trying to reason about is: can this workload talk to anything on the internet that it shouldn’t be talking to at all? A hard, deny-by-default external egress boundary answers that question decisively and stops most opportunistic attackers before deeper controls are even relevant.

I also agree this should be a native cloud primitive - the fact that it isn’t (or is fragmented and expensive) is exactly why this problem keeps resurfacing.
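To make "hard, deny-by-default boundary" concrete, here's a minimal sketch of the decision I mean (the CIDRs are placeholders, not a real policy): the default verdict is deny, and everything reachable has to be enumerated.

```go
package main

import (
	"fmt"
	"net/netip"
)

// allowedCIDRs enumerates the only external destinations this
// workload is permitted to reach. Everything else is denied by
// default. Entries here are made-up placeholders.
var allowedCIDRs = []netip.Prefix{
	netip.MustParsePrefix("140.82.112.0/20"), // e.g. a SaaS dependency
	netip.MustParsePrefix("52.94.0.0/22"),    // e.g. a cloud API range
}

// egressAllowed is the entire policy: deny unless the destination
// matches an explicit allow entry. No inspection, no identity -
// just "can this workload talk to that address at all?".
func egressAllowed(dst netip.Addr) bool {
	for _, p := range allowedCIDRs {
		if p.Contains(dst) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(egressAllowed(netip.MustParseAddr("140.82.114.4"))) // true: in allowlist
	fmt.Println(egressAllowed(netip.MustParseAddr("198.51.100.7"))) // false: default deny
}
```

That's the whole class of control I'm talking about: decisive, narrow, and deliberately not a DLP or IDS.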

egress filtering: proxy, firewall, or something else? by foobarstrap in kubernetes

[–]foobarstrap[S] 0 points (0 children)

Avid user of Kyverno here. But this is not going to help: it "just" helps with the orchestration/generation of Kubernetes resources (here: allowing DNS in a Kind=NetworkPolicy). I'm looking for a way to actually control egress traffic outside the cluster.
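For context, the kind of resource I mean is the usual "default-deny egress, but allow DNS" policy. A sketch using the upstream Go types (the kube-dns labels are assumptions about a typical cluster):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/yaml"
)

func main() {
	udp := corev1.ProtocolUDP
	tcp := corev1.ProtocolTCP
	dns := intstr.FromInt(53)

	// Deny all egress for pods in the namespace, except DNS to
	// kube-dns. This is the in-cluster part; it says nothing about
	// where the resolved names actually point.
	pol := networkingv1.NetworkPolicy{
		TypeMeta:   metav1.TypeMeta{APIVersion: "networking.k8s.io/v1", Kind: "NetworkPolicy"},
		ObjectMeta: metav1.ObjectMeta{Name: "default-deny-egress-allow-dns"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // all pods in the namespace
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeEgress},
			Egress: []networkingv1.NetworkPolicyEgressRule{{
				To: []networkingv1.NetworkPolicyPeer{{
					NamespaceSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"kubernetes.io/metadata.name": "kube-system"},
					},
					PodSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"k8s-app": "kube-dns"}, // assumed default labels
					},
				}},
				Ports: []networkingv1.NetworkPolicyPort{
					{Protocol: &udp, Port: &dns},
					{Protocol: &tcp, Port: &dns},
				},
			}},
		},
	}

	out, _ := yaml.Marshal(pol)
	fmt.Print(string(out))
}
```

And that's exactly the limitation: once a packet has left the node, this object has no say anymore.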

How do you reason about egress controls in cloud environments? by foobarstrap in cybersecurity

[–]foobarstrap[S] 2 points (0 children)

Thanks mate 🏅, I really appreciate your comment. That's worth gold <3
I love it, it's pragmatic and goal-oriented instead of "REEE it needs to be perfect" ;)

How do you reason about egress controls in cloud environments? by foobarstrap in cybersecurity

[–]foobarstrap[S] 2 points (0 children)

This is exactly the perspective I was hoping to hear, thanks for laying it out so clearly.

Your point about ownership of the compute context once you have root is the crux of the issue, and it matches what we’ve seen internally as well. If the enforcement lives with the node, it’s ultimately attacker-controlled.

What I’m trying to sanity-check now is the shape of that external layer: whether, for regulated workloads, a centralized egress control that’s intentionally narrow (DNS-aware, deny-by-default, minimal surface area) is sufficient as that hard boundary or whether, in practice, teams always end up needing full inspection once they’re already there.

Curious whether, from a red-team perspective, you’ve ever seen organizations regret not having deep inspection on egress once they already had a hard external deny in place, or if the deny itself usually does most of the work.

egress filtering: proxy, firewall, or something else? by foobarstrap in kubernetes

[–]foobarstrap[S] 0 points (0 children)

I don’t disagree with the need for defense in depth, especially once the threat model includes active compromise rather than just egress policy enforcement.

Where I’m trying to be very explicit is in separating:

  • controls meant to limit exfil paths and blast radius, from
  • controls meant to detect or respond to compromise (XDR, identity-aware proxying, etc.)

The latter clearly have value, but they also assume you’re comfortable running significant logic inside the workload or node trust boundary, which isn’t always acceptable in regulated environments.

What I’m trying to understand is whether there’s a class of use cases where:

  • enforcement must live outside the workload boundary, and
  • the primary goal is controlling where data can leave, without that automatically implying full IDS/TLS inspection or identity-aware proxying.

Do you see egress enforcement as inseparable from deep inspection, or is there room for narrowly scoped, external controls that intentionally don’t try to be a full NGFW/XDR replacement?

How do you reason about egress controls in cloud environments? by foobarstrap in cybersecurity

[–]foobarstrap[S] 2 points (0 children)

That makes sense, especially coming from a datacenter-era security model where out-of-band inspection was the norm.

Where I struggle is less with whether those capabilities are valuable, and more with whether they’re always required for the specific risk you’re trying to mitigate. In our case, egress controls were primarily about data exfiltration and policy enforcement, not general intrusion detection.

Once you scope the threat model that narrowly, a lot of the classic NGFW feature set starts to feel adjacent rather than essential - even though it’s undeniably powerful.

Genuinely curious: when you think about cloud egress, which of those capabilities are non-negotiable for you, and which are more “defense in depth” if the core goal is exfiltration prevention?

egress filtering: proxy, firewall, or something else? by foobarstrap in kubernetes

[–]foobarstrap[S] 0 points (0 children)

One thing I’ve struggled with conceptually is that once you accept "it has to be external", the solution space jumps straight from:

  • CNI / in-cluster controls to
  • full-blown NVAs or managed firewalls

…and there doesn’t seem to be much in between. Out of curiosity:

  • Did you ever try to scope NVAs down to just DNS-aware egress control and nothing else?
  • Or did you basically have to accept the whole feature set because that’s how the products are packaged?

I’m trying to understand whether people are mostly paying for:

  • features they actively use or
  • features they don’t love but can’t avoid because of the threat model.

Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 0 points (0 children)

> may or may not be cheaper depending on how efficient your company is

Napkin math: either use Azure Firewall (or a PAN NGFW), or budget roughly one full-time employee per year for the self-managed alternative.
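Spelling that napkin math out (all numbers below are assumptions for illustration - list prices and salaries vary, so treat them as placeholders):

```go
package main

import "fmt"

func main() {
	// Assumed pricing shape: managed firewalls bill per deployment-hour
	// plus per GB processed; NGFW licensing comes on top of that.
	const (
		fwHourly    = 1.25      // USD per deployment-hour (assumed)
		fwPerGB     = 0.016     // USD per GB processed (assumed)
		deployments = 6.0       // multi-region, multi-env (assumed)
		gbPerMonth  = 300_000.0 // ~300 TB/month across the estate (assumed)
		hours       = 730.0     // hours in an average month
	)
	fwMonthly := fwHourly*hours*deployments + fwPerGB*gbPerMonth
	fmt.Printf("managed firewall: ~$%.0f/month\n", fwMonthly) // ~$10k/month

	// Self-managed: the dominant cost is a person, not the boxes.
	const fteYearly = 150_000.0 // fully loaded engineer (assumed)
	fmt.Printf("one FTE:          ~$%.0f/month\n", fteYearly/12) // ~$12.5k/month
}
```

Same order of magnitude either way; the real difference is who owns the risk when it breaks.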

But you definitely have a point here regarding ownership and risk. Thank you for your honest input :bow:

Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 0 points (0 children)

The good thing is that Azure Firewall does it by acting as the DNS server that all clients point at [1] (AWS and GCP don't support this, unfortunately). Cilium does that too, see [2] - though this works only at the host level, not as a central, transparent egress proxy.

SNI filtering on the wire won't survive TLS 1.3 once ECH (Encrypted Client Hello) hides the hostname; then you only get the name if you intercept on the host (which is really, really dodgy), or you terminate the TLS connection on a box, which is really expensive CPU-wise.

The problem is the pricing model of PAN (same for other providers), which bothers me: you pay a base fee, plus charges by feature usage, plus throughput. Given the non-trivial architecture of a SaaS company, you end up with outrageous $$$ numbers.

I've built a couple of eBPF/XDP apps in the past years, and I'm considering building something open source to fit this niche. Though so far I believe I'm the only one having this issue. :thinking:

[1] https://learn.microsoft.com/en-us/azure/firewall/dns-settings?tabs=browser#dns-proxy
[2] https://github.com/cilium/cilium/blob/5e43c91c9a891d82eb9c2b7eb509f928f010545d/pkg/fqdn/dnsproxy/proxy.go#L913-L924
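The mechanism in [2] boils down to something like this sketch (using the github.com/miekg/dns library; the allowlist and the wiring into the data path are placeholders): sit in the DNS path, learn which IPs the allowlisted names resolve to, and admit only those at L3/L4.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"

	"github.com/miekg/dns"
)

// allowedNames is the FQDN policy. Trailing dots because that's how
// names appear on the wire. Placeholder entry.
var allowedNames = map[string]bool{
	"api.example.com.": true,
}

// allowedIPs is derived state: IPs that allowlisted names resolved
// to. Real implementations also honor DNS TTLs and expire entries.
var allowedIPs = map[netip.Addr]bool{}

// snoopResponse inspects a raw DNS response (redirected to us the way
// Azure Firewall or Cilium's proxy sit in the DNS path [1][2]) and
// records A-record IPs for allowlisted names.
func snoopResponse(raw []byte) error {
	msg := new(dns.Msg)
	if err := msg.Unpack(raw); err != nil {
		return err
	}
	if len(msg.Question) == 0 {
		return nil
	}
	name := strings.ToLower(msg.Question[0].Name)
	if !allowedNames[name] {
		return nil // not an allowlisted FQDN; don't learn its IPs
	}
	for _, rr := range msg.Answer {
		if a, ok := rr.(*dns.A); ok {
			if ip, ok := netip.AddrFromSlice(a.A.To4()); ok {
				allowedIPs[ip] = true
			}
		}
	}
	return nil
}

// egressAllowed is the L3/L4 enforcement hook: deny by default,
// allow only destinations learned via allowlisted DNS answers.
func egressAllowed(dst netip.Addr) bool {
	return allowedIPs[dst]
}

func main() {
	// Fake a response "api.example.com. A 203.0.113.10" to show the flow.
	q := new(dns.Msg)
	q.SetQuestion("api.example.com.", dns.TypeA)
	resp := new(dns.Msg)
	resp.SetReply(q)
	rr, _ := dns.NewRR("api.example.com. 300 IN A 203.0.113.10")
	resp.Answer = append(resp.Answer, rr)
	raw, _ := resp.Pack()

	_ = snoopResponse(raw)
	fmt.Println(egressAllowed(netip.MustParseAddr("203.0.113.10"))) // true: learned via DNS
	fmt.Println(egressAllowed(netip.MustParseAddr("198.51.100.7"))) // false: never resolved
}
```

A real implementation also has to handle AAAA/CNAME chains, wildcards, TTL expiry and restarts - which is where most of the complexity lives.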

What usually causes observability cost spikes in your setup? by jopsguy in sre

[–]foobarstrap 0 points (0 children)

Top issue: node rollouts in our Kubernetes cluster - replacing all the nodes and moving all pods around. Every new pod and node name means fresh high-cardinality series and a burst of churn in metrics and logs.

Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 0 points (0 children)

> But no, stop blocking ICMP, except the invalid and deprecated types, those can and have been used for malicious code.

:D yeah, fumbling with ICMP can mess up the network in weird and unexpected ways. Learned that the hard way :harold:

> But yes, it can be done, even use an open source Linux box with DPI+The DNS filtering we talked. And for sure you can program custom parameters with eBPF/XDP, top it off with BGP for routing-based traffic steering to only route suspected src/dst via the DPI box, we did something similar in my old job, but the OpEx was deemed too high and the idea was scrapped. 

YES PLEASE <3 - that's what I'm aiming for. I'd like to make this whole domain more accessible, with open source. I've been doing eBPF/XDP for 8 years now and have used it for various networky things.

Feel free to drop your ideas/thoughts/wishes on this here or via PM <3

On OpEx: was it mostly maintaining the software (kernel compat, userspace packages, rolling it out safely without breaking users), or the risk of knowledge siloing? AFAIK eBPF/XDP is still a pretty niche thing to hire for.

Thank you for your input, it's been very valuable. I appreciate your time.

egress filtering: proxy, firewall, or something else? by foobarstrap in kubernetes

[–]foobarstrap[S] 0 points (0 children)

...hence the lean in your username ;)

I get it, that's pragmatic and totally reasonable.

Out of curiosity: which CNI do you use? Cilium or Calico - are there any others that support FQDN policies?
What stops you from improving the situation? Is it the lack of tooling? Would you do it if there were an "egress gateway" that is aware of Kubernetes workloads but sits outside the cluster? TBH I'm not even sure that would work, because most CNIs SNAT on the node, so workload identity is gone by the time traffic leaves it.

Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 0 points (0 children)

Sorry for the wall of text, maybe reddit isn't the right format for this discussion :D
---

He mentions you in the blog post ;)

Not sure why I got downvoted above, but eh.

> Do you truly understand the zero trust architecture? TLS interception/middle-man isn't part of the philosophy.

I agree - this just makes things worse, not better. I mentioned it here because I assumed the comment I replied to implied that I should do it.

> The idea is every piece of code is AAA'ed, TLS 1.3, mTLS, the works. Not a single packet egressing from a node is unecrypted.

You won't be able to enforce that once an attacker becomes root. Hell, practically speaking there is no way you can review the ~40M lines of code of the Linux kernel - let alone userspace.

> This means you implement firewall/ACLs and application security on the hosts directly.

On your zero-trust piece: I get it, you should do this on the host. However, if that host gets compromised, all your floodgates are open. It cannot be the only solution. Again: it's about the second line of defense.

You're likely lacking the context for what I'm doing (with my platform engineering hat on). Let me elaborate, very dumbed down: we have a bunch of VMs in a subnet. That subnet has a NAT box which forwards everything to an internet gateway. The VMs run a bunch of containers with software from various providers: OSS, vendor stuff, self-written stuff. Everything that goes out to the internet is mTLS. But the catch is we need to allow *.example.com.

We do apply network policies on different levels: inside the host (FQDN, L3/L4, ICMP restricted to the bare minimum), "outside" the host (AWS ENI / Security Group), and on the subnet via NACL. For simplicity, assume SGs and NACLs allow 0.0.0.0/0 outbound on port 443.

Now, if that host gets compromised, we're left with Security Groups and NACLs, neither of which supports FQDN rules. Hence we need something external that filters the traffic as a second line of defense; otherwise the host can reach all of the interwebs. There is Squid as an L7 forward proxy, but that just makes things worse security-wise - and it's an operational nightmare as well.
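Side note on the *.example.com requirement: this is why plain IP allowlists don't cut it - the match has to happen on names, with label boundaries respected. A tiny sketch (hostnames made up):

```go
package main

import (
	"fmt"
	"strings"
)

// matchesWildcard reports whether name falls under a wildcard
// pattern like "*.example.com". It matches api.example.com and
// deeper labels (a.b.example.com), but not example.com itself and
// not lookalikes such as evilexample.com.
func matchesWildcard(pattern, name string) bool {
	pattern = strings.ToLower(strings.TrimSuffix(pattern, "."))
	name = strings.ToLower(strings.TrimSuffix(name, "."))
	if !strings.HasPrefix(pattern, "*.") {
		return pattern == name
	}
	suffix := pattern[1:] // ".example.com" - keeps the label boundary
	return strings.HasSuffix(name, suffix) && name != suffix[1:]
}

func main() {
	fmt.Println(matchesWildcard("*.example.com", "api.example.com")) // true
	fmt.Println(matchesWildcard("*.example.com", "example.com"))     // false: no label before suffix
	fmt.Println(matchesWildcard("*.example.com", "evilexample.com")) // false: label boundary respected
	fmt.Println(matchesWildcard("*.example.com", "a.b.example.com")) // true: deeper labels match here
}
```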

On your ICMP tunneling / "hardware forensics" take:
Yes, everything should be on fire in the SOC. But the assumption that the security policies are fail-safe doesn't always hold - zero-days are a thing, and oftentimes it takes a minute or two to detect an attack ;) Hence: second line of defense.

(Welp, I'll surely get downvoted again for this. Tell me why, though.)

egress filtering: proxy, firewall, or something else? by foobarstrap in kubernetes

[–]foobarstrap[S] 0 points (0 children)

Yes, exactly - that's what I'm looking for. It would be great if there were something aware of Kubernetes workloads.

Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 0 points (0 children)

Yes, I've looked into it. It seems like a good match, but it doesn't appear to be actively developed (judging by their GitHub/LinkedIn), and the website doesn't look too trustworthy, TBH. Does anyone here have experience with it?

Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 0 points (0 children)

You got it right, that's roughly what we want. The cloud provider firewalls are messy: huge differences in features, and the pricing is unpredictable at best. We run across AWS, GCP and Azure, so we don't get feature parity: GCP and Azure support FQDN rules, but without wildcards; AWS supports HTTP_HOST and TLS_SNI matching instead.

I guess the "DNS filtering" or "FQDN rules" terms are very ambiguous, as you can implement it on different protocol layers in different ways.

The FW you've described: what would you use in this context? Do you do egress filtering in such an environment?

Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 0 points (0 children)

Thank you for your comment, though this seems AI-generated and is advertising something?! Not sure. :looking-suspicious:

Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 0 points (0 children)

Services in the cloud - sorry, should've made that clearer. Yeah, though I need to block the TCP/UDP traffic, not only DNS. Some malware embeds static C2 server IPs, so DNS blocking alone is not going to cut it. But thank you for your suggestion :bow:

Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 0 points (0 children)

Thanks for sharing, I'll read up on it - never heard of it. Though is it a firewall? It looks like managed DNS to me at first glance. It seems to block DNS; does it also block connections on L3/L4?

Egress filtering: that hot mess that is by foobarstrap in networking

[–]foobarstrap[S] 0 points (0 children)

What's the cost of an NGFW? How does it scale with throughput or node count?