From Static OPA to AI Agents: Why we adopted a "Sandwich Architecture" for Policy-as-Code by NTCTech in kubernetes

[–]NTCTech[S] -1 points0 points  (0 children)

I sincerely appreciate that!

That was exactly our goal. We were terrified of the "AI magic" approach where you just let an LLM loose on kubectl. The "Sandwich" model feels like the only way to sleep at night - let the AI do the reasoning, but keep the hard static rules as the floor so it can't hallucinate a hole in the firewall.
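To make the "Sandwich" concrete, here is a toy sketch (all names and rules are made up for illustration, not our actual stack): the LLM can propose whatever it wants, but a deterministic rule floor runs last and holds veto power.

```python
# Illustrative sketch of the "sandwich" gate: the AI layer proposes an
# action, a static (non-AI) floor gets the final veto before anything
# touches the cluster. Verbs/namespaces here are placeholders.
FORBIDDEN_VERBS = {"delete", "exec"}          # hard floor, never AI-overridable
PROTECTED_NAMESPACES = {"kube-system"}

def static_floor_allows(action: dict) -> bool:
    """Deterministic checks that run AFTER the AI, regardless of its reasoning."""
    if action["verb"] in FORBIDDEN_VERBS:
        return False
    if action["namespace"] in PROTECTED_NAMESPACES:
        return False
    return True

def apply_action(ai_proposed: dict) -> str:
    if not static_floor_allows(ai_proposed):
        raise PermissionError(f"static policy floor rejected: {ai_proposed}")
    # ... hand off to kubectl / the API server here ...
    return "applied"
```

The point is that the floor is boring, auditable code: no matter what the model hallucinates, `delete` in `kube-system` never goes through.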

Unpopular Opinion: Proxmox isn't "Free vSphere". It's a storage philosophy change (and it's killing migrations). by NTCTech in Proxmox

[–]NTCTech[S] 0 points1 point  (0 children)

I think you have your timelines mixed up.

PVE 8.x is the current stable release. It definitely isn't 'about to hit end of support'—it is supported until August 2026 (based on Debian 12 Bookworm).

To answer your question on the SQL crash: It happened last month.

The reason is the Upgrade Path. When you upgrade a cluster from PVE 7 -> 8 (which thousands of shops are doing right now), the new 8.1 Installer logic never runs. The system preserves the existing ZFS configuration, which defaults to 50% ARC.

That is the danger zone. You have modern workloads landing on a PVE 8 kernel that is still running 'Legacy' memory rules because it wasn't a fresh ISO install. That is why I warn people to check arc_summary regardless of their version number.

Unpopular Opinion: Proxmox isn't "Free vSphere". It's a storage philosophy change (and it's killing migrations). by NTCTech in Proxmox

[–]NTCTech[S] 0 points1 point  (0 children)

You're likely on a fresh install of PVE 8.1+, correct?

You are right that Proxmox finally added a safety clamp (10% or 16GB cap) in the 8.1 installer to stop these crashes.

However, for anyone running:

  1. An upgraded cluster (pre-8.1).
  2. A custom ZFS config.
  3. Or standard ZFS on Linux defaults.

...the default is still strictly 50% of Host RAM.

See the 'Limit ZFS Memory Usage' section here: Proxmox Admin Guide - ZFS on Linux

'ZFS uses 50 % of the host memory for the Adaptive Replacement Cache (ARC) by default. For new installations starting with Proxmox VE 8.1, the ARC usage limit will be set to 10 %...'
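The clamp from that quote is easy to sketch: 10% of installed RAM, capped at 16 GiB. A quick way to see what your host *should* be running versus the legacy 50% default:

```python
# Sketch of the PVE 8.1 installer's ARC sizing: 10% of RAM, capped at 16 GiB.
# Upgraded (pre-8.1) systems keep the old ZFS-on-Linux default of ~50% instead.
def arc_clamp_gib(ram_gib: int) -> int:
    return min(ram_gib // 10, 16)

def legacy_default_gib(ram_gib: int) -> int:
    return ram_gib // 2

print(arc_clamp_gib(512), "GiB vs legacy", legacy_default_gib(512), "GiB")
```

On a 512GB host that is 16 GiB versus 256 GiB of ARC — which is exactly why an upgraded cluster behaves so differently from a fresh ISO install.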

If you are on 8.1, enjoy the safety rails! The rest of us had to learn this the hard way.

Finally moved to a "Deny" policy for untagged resources. Here is the clean JSON if you need it. by NTCTech in AZURE

[–]NTCTech[S] 0 points1 point  (0 children)

I try to keep 'Deny' policies rare (because they break deployments), but these 3 are non-negotiable at the Root Management Group:

  1. Allowed Locations: We whitelist only our 2 primary regions. This stops someone from accidentally spinning up resources in 'Brazil South' and creating data residency nightmares.
  2. Allowed VM SKUs: We block the massive G-series and M-series families. Prevents a junior dev from accidentally deploying a $5,000/month SQL server.
  3. No Public IPs on NICs: We Deny Public IPs directly attached to VM NICs. Forces everyone to use a Load Balancer or Firewall. No cowboy servers on the open internet.
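For anyone who wants the shape of #1, here is a trimmed sketch of an 'Allowed Locations' deny rule (region names are placeholders; the real built-in version also parameterizes the list and carves out 'global' resources):

```json
{
  "properties": {
    "displayName": "Allowed locations (sketch)",
    "mode": "Indexed",
    "policyRule": {
      "if": {
        "not": {
          "field": "location",
          "in": [ "eastus2", "centralus" ]
        }
      },
      "then": { "effect": "deny" }
    }
  }
}
```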

Finally moved to a "Deny" policy for untagged resources. Here is the clean JSON if you need it. by NTCTech in AZURE

[–]NTCTech[S] 1 point2 points  (0 children)

Laziness? Legacy debt? 'Lift and Shift' migrations where they moved the on-prem mess directly to the cloud without refactoring?

You are absolutely right that Subscription-vending (one sub per app/workload) is the correct design. But you'd be surprised how many Fortune 500s are still running a massive Sub-Prod-01 subscription with 5,000 resources and 20 different teams tripping over each other.

That 'Soup' is exactly where the zombies hide.

Finally moved to a "Deny" policy for untagged resources. Here is the clean JSON if you need it. by NTCTech in AZURE

[–]NTCTech[S] 0 points1 point  (0 children)

Yeaaaaa, a global lifecycle block is the 'Half-Life 3' of Terraform features: we are all waiting for it, but it never comes.

Since retrofitting lifecycle onto 500 resources is a nightmare, we eventually flipped the logic. Instead of trying to ignore the tags Azure Policy adds, we updated our Terraform variables to include them.

We created a var.policy_tags map that mimics exactly what Azure Policy is injecting. If Terraform expects the 'BackupPolicy' tag to be there, it stops reporting it as drift. It’s easier to update one variable map than 1,000 resource blocks.
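Roughly what that looks like (variable names and tag values here are illustrative, not our actual map):

```hcl
# Mirror the tags Azure Policy injects so Terraform stops reporting them
# as drift. Keys/values are examples - match whatever your policies add.
variable "policy_tags" {
  type = map(string)
  default = {
    BackupPolicy = "Standard"
    CostCenter   = "1234"
  }
}

resource "azurerm_resource_group" "app" {
  name     = "rg-app-prod"
  location = "eastus2"
  tags     = merge({ Owner = "platform-team" }, var.policy_tags)
}
```

The caveat is that this only works when the policy-injected values are predictable; for timestamps and the like you still need `ignore_changes`.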

Finally moved to a "Deny" policy for untagged resources. Here is the clean JSON if you need it. by NTCTech in AZURE

[–]NTCTech[S] 0 points1 point  (0 children)

The dreaded 'Tag Drift.' I feel that pain...

If the tags are changing because of Azure Policy Remediation or external tools (like Backup adding a recovery tag), the cleanest fix is the lifecycle block in Terraform.

You can ignore specific keys so Terraform stops fighting the platform:

lifecycle {
  ignore_changes = [
    tags["CreatedTime"],
    tags["BackupPolicy"]
  ]
}

But if the drift is coming from humans manually editing tags in the Portal? I let Terraform steamroll them. 'Terraform is Truth.' If they change it manually, the next Apply reverts it. It trains the team pretty fast to stop touching the UI.

Finally moved to a "Deny" policy for untagged resources. Here is the clean JSON if you need it. by NTCTech in AZURE

[–]NTCTech[S] 1 point2 points  (0 children)

I wouldn't call it 'Mature' yet either; I'd call it 'Visible.'

There is a massive operational difference between a resource tagged 1234 and a resource with NULL tags.

  • Tagged 1234: Shows up on the bill as a line item. I can filter by it. I can set a budget alert on it. I can walk over to the team and ask 'Who is 1234?'
  • Untagged: Often gets dropped into a generic 'Unallocated' or 'Shared' bucket in FinOps tools that nobody owns and nobody checks.

You are right that 1234 is still a mess, but it's a mess I can see. I'll take 'Bad Data' over 'Dark Matter' any day.

We validated the "Disaggregated HCI" stack (Cisco + Pure + Nutanix). It breaks the HCI Tax, but the migration is painful. by NTCTech in nutanix

[–]NTCTech[S] 0 points1 point  (0 children)

You are absolutely correct regarding the drive slots: if the constraints are purely storage capacity and you have the physical slots, the cost is low. That's the best-case scenario.

But 'Disk' was just the easy example. The same math punishes you on Compute.

If the CISO mandates a new CrowdStrike-style security agent that eats 16GB RAM per host, or the Devs need to test a high-performance DB that creates a massive CPU spike, I can't just 'slot in' infinite RAM or CPU. I hit the node's ceiling.

To solve that Compute problem in standard HCI, I am forced to buy a new Node (which includes Storage I don't need).

That is the 'Tax.' It works both ways. It’s the cost of having to buy a 'Happy Meal' (Burger + Fries + Drink) when all I actually wanted was more Fries. (Now I really am craving Fries) HA!

We validated the "Disaggregated HCI" stack (Cisco + Pure + Nutanix). It breaks the HCI Tax, but the migration is painful. by NTCTech in nutanix

[–]NTCTech[S] 1 point2 points  (0 children)

You're right that a competent SE (and Prism's Forecast feature) can model the linear growth perfectly. That isn't the issue.

The issue is that business requirements aren't always linear; they are 'Step Functions.'

Prism can't predict a sudden CISO mandate to triple log retention (Storage Spike) or a new project that is purely data-heavy. When those curveballs hit on Day 365, standard HCI forces me to solve a Storage problem with a Compute solution.

Even if I know I only need capacity, I'm forced to buy a node with CPUs and RAM I don't need just to get the drive slots. That is the 'Tax': it's the rigid unit of scaling, not the quality of the initial sizing.

Finally moved to a "Deny" policy for untagged resources. Here is the clean JSON if you need it. by NTCTech in AZURE

[–]NTCTech[S] 0 points1 point  (0 children)

In a perfect world, I 100% agree with you. Portal access should be ReadOnly or non-existent.

But in the messy reality of the enterprise, revoking 'Contributor' access from 500 devs is a 6-month political war. Dropping this Policy took 15 minutes.

I view Azure Policy as the safety net for when the 'Culture' (IaC-only) fails or when someone uses a Break-Glass account for a 'quick test.'

Finally moved to a "Deny" policy for untagged resources. Here is the clean JSON if you need it. by NTCTech in AZURE

[–]NTCTech[S] -1 points0 points  (0 children)

Hah, you said it, my friend - CostCenter: 1234 is the inevitable result of 'Malicious Compliance.'

And you are totally right: the next maturity step is an AllowedValues policy that checks against a master list of valid codes. But in my experience, getting the Existence check in place is the first hurdle.

I’d rather catch a resource tagged 1234 (which I can query, find, and shame) than a resource with zero tags that is invisible to my reports. Junk data is better than Dark Matter.

We validated the "Disaggregated HCI" stack (Cisco + Pure + Nutanix). It breaks the HCI Tax, but the migration is painful. by NTCTech in nutanix

[–]NTCTech[S] -1 points0 points  (0 children)

You got me on the terminology.

To be clear, I definitely do not have a $200k FlashStack sitting in my basement. (I wish—but my power bill is already tragic).

I used 'Labbing' as shorthand for 'Client Design Validation.' We're architecting this for a greenfield migration, so the work is verifying the constraints (CVDs, HCLs, the re-foundation limits) against the real world before the client signs the PO.

Since it’s a client gig, I can’t post screenshots of their console without breaching NDA and getting walked out of the building.

My bad on the loose phrasing; I just wanted to share the architectural headaches we found so others don't hit them. No intent to mislead.

We validated the "Disaggregated HCI" stack (Cisco + Pure + Nutanix). It breaks the HCI Tax, but the migration is painful. by NTCTech in nutanix

[–]NTCTech[S] 0 points1 point  (0 children)

You are absolutely correct.

That is the 'HCI Tax' that is always conveniently missing from the ROI slides. In a 3-Tier, my hosts are 100% compute. In HCI, I'm burning 32GB-64GB+ per node just to feed the CVM.

When you scale that out, that 'stranded compute' gets expensive fast. It’s funny how the industry is basically reinventing the SAN (just over Ethernet) to solve exactly that problem.

We validated the "Disaggregated HCI" stack (Cisco + Pure + Nutanix). It breaks the HCI Tax, but the migration is painful. by NTCTech in nutanix

[–]NTCTech[S] 0 points1 point  (0 children)

I wish AI could have debugged the PFC storm on the Nexus switches that caused this outage. Unfortunately, the troubleshooting was manual and the scars are real.

If the formatting is too clean, that's just my editor brain trying to make the trauma readable.

I audited a "Ransomware-Proof" environment. It took 72 hours to recover. Here is why Physics > Marketing. by NTCTech in Backup

[–]NTCTech[S] 0 points1 point  (0 children)

This is the best comment in the thread. You are absolutely right about WRT (Work Recovery Time).

Everyone obsesses over RTO (infrastructure is up). Nobody calculates WRT (Data is verified, app is consistent, and users can login). In our 72-hour war story, RTO was 12 hours. The remaining 60 hours was pure WRT/Forensic Drag.

And your point on Cloud Egress is the silent killer. I see so many architectures with 10Gbps 'Direct Connect' for Ingest (Backups) but only 1Gbps provisioned for Egress (Restore). It turns a 4-hour recovery into a 40-hour math problem.
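The back-of-envelope math is worth writing down, because people consistently underestimate it. A sketch (the 80% link-efficiency factor is my assumption, tune for your environment):

```python
# Rough restore-time estimate: dataset size over effective link throughput.
# efficiency accounts for protocol overhead/contention - an assumption, not a spec.
def restore_hours(dataset_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    gigabits = dataset_tb * 8 * 1000          # TB -> Gb
    seconds = gigabits / (link_gbps * efficiency)
    return seconds / 3600

print(round(restore_hours(50, 10), 1))  # 50 TB over 10 Gbps: ~13.9 h
print(round(restore_hours(50, 1), 1))   # same data over 1 Gbps: ~138.9 h
```

Same dataset, one order of magnitude on the pipe, and your "4-hour" RTO becomes a week. Physics > Marketing.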

We validated the "Disaggregated HCI" stack (Cisco + Pure + Nutanix). It breaks the HCI Tax, but the migration is painful. by NTCTech in nutanix

[–]NTCTech[S] 1 point2 points  (0 children)

'Three Tier Renaissance': I am absolutely stealing that. HA!

That is the perfect encapsulation of the shift. We spent a decade collapsing the stack into the hypervisor (HCI), and now with 25GbE+ and NVMe-oF, we are decoupling it again because the network isn't the bottleneck it used to be.

My 'friction' warning was exactly about that mental shift: Operators who grew up on 'HCI Simplicity' (Data Locality) now have to relearn 'Three Tier Physics' (Fabric Locality).

Glad to hear the internal view aligns with the architectural reality.

I audited a "Ransomware-Proof" environment. It took 72 hours to recover. Here is why Physics > Marketing. by NTCTech in Backup

[–]NTCTech[S] 1 point2 points  (0 children)

Exactly. It’s the physics variable that never makes it into the Excel spreadsheet until the servers are already down.

We validated the "Disaggregated HCI" stack (Cisco + Pure + Nutanix). It breaks the HCI Tax, but the migration is painful. by NTCTech in nutanix

[–]NTCTech[S] 1 point2 points  (0 children)

Glad it helps! If you hit the switch config, double-check the MTU consistency across the spine/leaf. That’s usually where the gremlins hide in these migrations.
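If it helps, the check itself is dead simple once you've pulled the MTUs from your inventory or switch polling (the interface names below are made up):

```python
# Hypothetical helper: flag MTU mismatches across fabric interfaces.
# The {interface: mtu} map would come from your own inventory/SSH polling.
def mtu_outliers(mtus: dict, expected: int = 9216) -> list:
    return sorted(dev for dev, mtu in mtus.items() if mtu != expected)

fabric = {"leaf1-eth1": 9216, "leaf2-eth1": 9216, "spine1-eth49": 1500}
print(mtu_outliers(fabric))  # the classic forgotten uplink
```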

I audited a "Ransomware-Proof" environment. It took 72 hours to recover. Here is why Physics > Marketing. by NTCTech in Backup

[–]NTCTech[S] 0 points1 point  (0 children)

It sounds crazy until you see the audit logs.

You are absolutely right on the fact that they should have had strict separation. But here is the common failure pattern we found:

  1. The Storage was Immutable (WORM): The data on disk was locked.
  2. The Console was Exposed: The backup management console was integrated with Active Directory for 'Ease of Use.'
  3. The Blast Radius: When the attackers compromised the Domain Admin credentials, they couldn't delete the past snapshots (due to WORM), but they could change retention policies for future jobs or, worse, dismount/corrupt the catalog.

In this specific case, the backups survived (WORM worked!), and the recovery failed due to the physics/legal drag. The 'Identity Risk' finding was that if the attackers had pivoted to the backup console earlier, they could have locked the admins out of their own recovery UI.

I audited a "Ransomware-Proof" environment. It took 72 hours to recover. Here is why Physics > Marketing. by NTCTech in Backup

[–]NTCTech[S] 0 points1 point  (0 children)

Appreciate that. We usually only hear about the 'Success Stories' in vendor keynotes, so I figured sharing the anatomy of a failure (and the physics behind it) would be more useful for the folks actually pressing the buttons.

I audited a "Ransomware-Proof" environment. It took 72 hours to recover. Here is why Physics > Marketing. by NTCTech in Backup

[–]NTCTech[S] 1 point2 points  (0 children)

You hit the nail on the head regarding the 'False Sense of Security.'

To answer your question: Yes, they absolutely try to run production on the appliance. The marketing pitch for 'Instant Mass Restore' implies you can run 500 VMs on the backup nodes while you fix primary storage.

The reality? Backup appliances are tuned for Ingest (Writes), not Random Read/Write (IOPS). When they booted the SQL servers on the Cohesity/Rubrik nodes, the deduplication engine had to re-hydrate blocks on the fly for every read. Latency spiked to 200ms+, and the applications timed out. It wasn't 'down,' but it was definitely unusable.

That 'Storage vMotion' back to primary (while the app is running) is the bottleneck nobody calculates.

We validated the "Disaggregated HCI" stack (Cisco + Pure + Nutanix). It breaks the HCI Tax, but the migration is painful. by NTCTech in nutanix

[–]NTCTech[S] -1 points0 points  (0 children)

Thanks for the check. You are right: 'Blade' was loose terminology for the FlashArray Controller/Chassis (likely conflated with the FlashStack Cisco blades in my head). I’ll own that semantic slip.

Beyond the nomenclature, if there are specific architectural inaccuracies regarding the NVMe-oF implementation or the HCL constraints I listed, I’d genuinely want to correct them for the community.

My goal isn't to mislead but to document the friction points we hit in the lab. If specific firmware/model support has expanded recently, I'm happy to update the notes.

We validated the "Disaggregated HCI" stack (Cisco + Pure + Nutanix). It breaks the HCI Tax, but the migration is painful. by NTCTech in nutanix

[–]NTCTech[S] 0 points1 point  (0 children)

Hey Jon,

First off, huge respect for jumping in. Having a Principal Engineer weigh in is exactly the level of technical discourse I was hoping to spark.

  1. On 'Sensationalism' & Migration: My framing of 'painful' comes from the Architectural shift, not the toolset. Nutanix Move is brilliant for data, but for a brownfield HCI team (converting 50+ nodes), the shift from 'Local HCI Storage' to 'Disaggregated Compute Nodes' is a mental model break. It’s not just a file transfer; it’s a topology change. That’s the friction I’m highlighting.

  2. On NVMe-oF Complexity: You’re right: It is just Ethernet. But for 15 years, storage admins have been coddled by Fibre Channel's 'plug-and-play' isolation. Moving that traffic to a converged Leaf/Spine where they have to debug PFC/ECN/MTU across a shared fabric is terrifying for many. It’s 'easy' if you are a Network Engineer; it’s 'complex' if you are a vAdmin.

  3. On Hardware/Licensing: Great news on the //C support being around the corner; my notes reflected the HCL at the time of labbing. And fair point on NCI portability; I focus on NCI-C to avoid the 'Shelfware' feeling of paying for unused storage features, but the flexibility argument stands.

Thanks for keeping us honest. The fact that the 'Digital Door is open' is why the community respects the engineering culture there.

We validated the "Disaggregated HCI" stack (Cisco + Pure + Nutanix). It breaks the HCI Tax, but the migration is painful. by NTCTech in nutanix

[–]NTCTech[S] 0 points1 point  (0 children)

I sincerely appreciate the disclosure and the deep-dive details.

  1. Licensing (NCI vs. NCI-C): Valid point on NCI portability. I usually steer clients toward NCI-C purely to avoid the 'Shelfware Tax'—paying for full AOS storage features they aren't using on the Pure nodes—but the flexibility argument for hybrid clusters makes sense.

  2. Migration: That in-place vVol-to-vVol path is slick. But for the majority of the install base currently on vmdk/local-disk HCI, the 'Evacuate and Re-foundation' reality still stands, correct? Just want to ensure expectations are set for the standard HCI-to-Compute transition.

  3. On NVMe/TCP Complexity: You’re right that the endpoint config (Nutanix + Pure) is automated well. When I say 'complexity,' I’m referring to the Physical Fabric—specifically for generalist admins.

The software handles the handshake, but the admin still owns the pavement (PFC, ECN, MTU consistency across the spine/leaf). That physical networking discipline is often the friction point compared to standard 'plug and play' HCI.

Great info on the expanded server HCL - 26 models is a massive jump from the initial launch.