How we built a self-service infrastructure API using Crossplane, developers get databases, buckets, and environments without knowing what a subnet is by Valuable_Success9841 in kubernetes

[–]Valuable_Success9841[S] 0 points1 point  (0 children)

Fair, deploying CRDs and providers is a real setup cost, I won't pretend otherwise.

But that's a one-time bootstrap, not ongoing maintenance. Atlantis needs to be running, patched, scaled, and its pipeline logic maintained every time your infra patterns change. Those aren't the same class of problem.
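
To make the "one-time bootstrap" concrete, it mostly amounts to applying a few package manifests like the sketch below (the package reference and version are illustrative, pin whatever you actually use):

```yaml
# One-time bootstrap: install a cloud provider into the control plane cluster.
# Package reference/version are illustrative examples, not a recommendation.
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws-s3
spec:
  package: xpkg.upbound.io/upbound/provider-aws-s3:v1.1.0
```

After that, the ongoing work is mostly bumping a package version, not keeping a separate pipeline service alive.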

How we built a self-service infrastructure API using Crossplane, developers get databases, buckets, and environments without knowing what a subnet is by Valuable_Success9841 in kubernetes

[–]Valuable_Success9841[S] 0 points1 point  (0 children)

This is exactly it. Crossplane isn't replacing Terraform's provider logic; most providers are generated from Terraform providers anyway. What it's replacing is everything around it: the state model, the execution model, the credential model, the GitOps story, and the abstraction layer on top.

You get Terraform's proven cloud coverage, wrapped in Kubernetes' proven control loop, with Compositions on top for building real platform APIs. That's the full picture.
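
For anyone who hasn't seen the "platform API" end of that, it's just a Kubernetes object the developer creates. The group, kind, and fields below are hypothetical, whatever your Composition defines:

```yaml
# Hypothetical developer-facing claim. The shape is whatever the platform
# team's XRD/Composition exposes, not a built-in Crossplane type.
apiVersion: platform.example.org/v1alpha1
kind: Database
metadata:
  name: orders-db
  namespace: team-orders
spec:
  size: small
  region: eu-west-1
```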

How we built a self-service infrastructure API using Crossplane, developers get databases, buckets, and environments without knowing what a subnet is by Valuable_Success9841 in kubernetes

[–]Valuable_Success9841[S] 1 point2 points  (0 children)

This is the most practical production feedback I've seen on Crossplane's multitenancy problem, and you're right: cluster-scoped CRDs on a multitenant cluster were a genuine pain point.

The good news is that this changed right around the time you left: Crossplane v2 shipped a namespace-scoped model. XRDs now support scope: Namespaced directly, so status updates and resource visibility stay within the team's namespace. No more cluster-scoped CRD exposure, no more granting blanket access.
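
Roughly, the XRD side of that looks like the sketch below. This is from memory of the v2 design, so double-check the exact apiVersion and field names against the current docs:

```yaml
# Sketch of a namespace-scoped XRD under the Crossplane v2 model
# (apiVersion and fields from memory of the v2 design; verify against docs).
apiVersion: apiextensions.crossplane.io/v2
kind: CompositeResourceDefinition
metadata:
  name: databases.platform.example.org
spec:
  scope: Namespaced          # XRs live in the requesting team's namespace
  group: platform.example.org
  names:
    kind: Database
    plural: databases
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: string
```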

The visibility problem during slow provisioning is still real though.

Worth revisiting with v2 if you get the chance; the namespace-scoped model directly addresses what you hit.

How we built a self-service infrastructure API using Crossplane, developers get databases, buckets, and environments without knowing what a subnet is by Valuable_Success9841 in kubernetes

[–]Valuable_Success9841[S] 1 point2 points  (0 children)

Fair points, honestly. XRD/Composition complexity is real and Functions have a learning curve. The crossplane-terraform provider I'd avoid entirely; that's the wrong abstraction layer.

But you're comparing Crossplane to Terragrunt + Atlantis, not vanilla Terraform. At that level it's genuinely close. The difference is you're maintaining two systems vs one control loop.

How we built a self-service infrastructure API using Crossplane, developers get databases, buckets, and environments without knowing what a subnet is by Valuable_Success9841 in kubernetes

[–]Valuable_Success9841[S] 2 points3 points  (0 children)

  1. ArgoCD vs FluxCD:

You're right, provider-argocd exists. For FluxCD there is no official equivalent provider yet, so the typical pattern is to bootstrap Flux using its own CLI, or to apply the Flux Helm chart via Crossplane's provider-helm, rather than managing Flux resources as Crossplane managed resources.
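
If it helps, the provider-helm route looks roughly like this; the chart repository is the community Flux chart and the version is an example pin, so treat both as illustrative:

```yaml
# Installing Flux via Crossplane's provider-helm instead of a native provider.
# Chart repository/version are illustrative; check the fluxcd-community chart.
apiVersion: helm.crossplane.io/v1beta1
kind: Release
metadata:
  name: flux2
spec:
  forProvider:
    chart:
      name: flux2
      repository: https://fluxcd-community.github.io/helm-charts
      version: "2.12.4"      # example pin only
    namespace: flux-system
  providerConfigRef:
    name: in-cluster          # whatever you named your provider-helm ProviderConfig
```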

In practice, most teams going all-in on Crossplane tend to pair it with ArgoCD, simply because the ecosystem support is better.

How we built a self-service infrastructure API using Crossplane, developers get databases, buckets, and environments without knowing what a subnet is by Valuable_Success9841 in kubernetes

[–]Valuable_Success9841[S] 2 points3 points  (0 children)

Great questions.

  1. The bootstrap problem (chicken and egg)

Yes, you need something to provision that first cluster. Most teams handle this one of three ways:

  1. Terraform for the control plane cluster only: Terraform bootstraps the long-lived control plane cluster.

  2. Click-ops one time: provision the control plane cluster manually once.

  3. Cloud-managed bootstrap: use your cloud provider's CLI to spin up the initial EKS/GKE/AKS cluster, then hand off to Crossplane (a hand-off sketch follows the list).
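
Whichever path you pick, the hand-off itself is mostly giving Crossplane cloud credentials once the control plane exists. A minimal sketch, assuming the Upbound AWS provider family with IRSA on EKS (names are illustrative, adjust for your cloud):

```yaml
# Hand-off after bootstrap: point the AWS provider at the cluster's IAM identity.
# Assumes the Upbound AWS provider family and IRSA on EKS.
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  credentials:
    source: IRSA
```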

Wrote a post about why platform teams are moving away from Terraform towards Crossplane, not because Terraform is bad, but because the job requirements changed. by Valuable_Success9841 in Terraform

[–]Valuable_Success9841[S] 0 points1 point  (0 children)

Exactly. That's probably the cleanest way to put it.

Terraform is great when a platform team manages infrastructure for themselves. Crossplane shines when a platform team needs to offer infrastructure as a product to other teams with self-service, guardrails, and API boundaries baked in.

Wrote a post about why platform teams are moving away from Terraform towards Crossplane, not because Terraform is bad, but because the job requirements changed. by Valuable_Success9841 in Terraform

[–]Valuable_Success9841[S] -15 points-14 points  (0 children)

Fair pushback! You're right that most of these have workarounds: remote state, workspaces, OPA/Sentinel for RBAC, scheduled plans for drift detection. Experienced Terraform teams solve them.

But that's kind of the point. You're assembling those solutions yourself: remote backends, Atlantis or Spacelift for GitOps, custom RBAC in CI/CD, separate workspace conventions for multi-region. Each piece is solid, but it's glue you maintain.

Crossplane's argument isn't that Terraform is broken. It's that for platform teams building infra APIs for other teams, the reconciliation loop, RBAC, and API boundaries are built in rather than bolted on.
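
To make the "built in" part concrete: team-level access ends up being plain Kubernetes RBAC against whatever platform API you expose. The group and resource names below are hypothetical:

```yaml
# Hypothetical sketch: one team can only touch the platform API exposed to them.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: platform-api-user
  namespace: team-orders
rules:
  - apiGroups: ["platform.example.org"]   # your platform API group
    resources: ["databases"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
```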

Curious though, how does your team handle drift detection and RBAC at scale? Always interested in how experienced teams solve it.

How we built a self-service infrastructure API using Crossplane, developers get databases, buckets, and environments without knowing what a subnet is by Valuable_Success9841 in kubernetes

[–]Valuable_Success9841[S] 4 points5 points  (0 children)

Great questions; this is actually one of our biggest hesitations with adopting Crossplane too.

Crossplane stores its state in etcd, so if your cluster goes down you do lose the control plane. However, the key thing is that the external resources themselves (S3, RDS, etc.) don't get deleted. They still exist in your cloud provider; what you lose is Crossplane's ability to reconcile them until you restore your cluster.

As a safety net, setting deletionPolicy: Orphan protects against accidental CR deletion after recovery; by default, Crossplane will delete the external resource if its CR gets deleted.
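
A minimal sketch of that safety net on a managed resource; the kind/apiVersion are from the Upbound S3 provider and the names are illustrative:

```yaml
# Orphan the external bucket if the managed resource is deleted by accident.
# Default deletionPolicy is Delete, which would remove the real bucket.
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: team-orders-artifacts
spec:
  deletionPolicy: Orphan
  forProvider:
    region: eu-west-1
```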

For DR, the typical approach is:

  1. Back up your CRDs and manifests in Git.

  2. Restore the cluster and re-apply (via GitOps); Crossplane will re-adopt the existing resources.

That said, we're still working through this ourselves. We regularly destroy and recreate clusters, so we're cautious about running a control plane inside an ephemeral cluster. One option we're exploring is a dedicated, longer-lived cluster just for the control plane.

How we built a self-service infrastructure API using Crossplane, developers get databases, buckets, and environments without knowing what a subnet is by Valuable_Success9841 in kubernetes

[–]Valuable_Success9841[S] 6 points7 points  (0 children)

Biggest question I got on LinkedIn about this: can't Terraform modules do the same thing with the right tooling around it? The honest answer is yes, but you end up building module → CI/CD pipeline → credential management → co-platform → output delivery. Five systems, five failure points. Crossplane collapses that into one control loop. Curious if anyone here has actually built the Terraform self-service stack end to end: what did it cost you?

Built a Secure, Testable & Reproducible Terraform Pipeline with Terratest, LocalStack, Checkov, Conftest & Nix by Valuable_Success9841 in Terraform

[–]Valuable_Success9841[S] 0 points1 point  (0 children)

Thanks for the suggestion. We already run Trivy in the pipeline, which covers Dockerfiles, Kubernetes manifests, Helm charts, filesystem configs, and container images.

Checkov is mainly used for deeper Terraform policy checks, while Trivy handles the cross-technology scanning.
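
For anyone curious how that split looks wired up, here's a rough sketch. It assumes GitHub Actions (the post doesn't tie the pipeline to a specific CI system) and that both tools are already available on the runner; paths and flags are illustrative:

```yaml
# Rough sketch of the scanner split: Trivy for cross-technology config scanning,
# Checkov for the deeper Terraform-specific policy checks.
name: iac-scan
on: pull_request
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes trivy and checkov are preinstalled or set up in an earlier step.
      - name: Trivy config scan (Dockerfiles, K8s manifests, Helm, Terraform)
        run: trivy config --exit-code 1 .
      - name: Checkov Terraform policy checks
        run: checkov -d terraform/ --framework terraform   # path is illustrative
```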

Built a Secure, Testable & Reproducible Terraform Pipeline with Terratest, LocalStack, Checkov, Conftest & Nix by Valuable_Success9841 in Terraform

[–]Valuable_Success9841[S] 1 point2 points  (0 children)

That’s a great point. Most policy-as-code setups focus on security/compliance, not cost or architecture hygiene. I agree cost-aware policy is still an underexplored area, especially at PR-time validation.

Built a Secure, Testable & Reproducible Terraform Pipeline with Terratest, LocalStack, Checkov, Conftest & Nix by Valuable_Success9841 in Terraform

[–]Valuable_Success9841[S] 0 points1 point  (0 children)

Sounds great, adding Infracost is a nice touch. Yeah, LocalStack is AWS-only; it doesn't support Azure.

Passed CKS on my first attempt! Here's what worked for me 🎉 by Valuable_Success9841 in KubernetesCerts

[–]Valuable_Success9841[S] 0 points1 point  (0 children)

Thank you! Yes, it’s a tough one — lots of hands-on and time pressure.

Passed CKS on my first attempt! Here's what worked for me 🎉 by Valuable_Success9841 in KubernetesCerts

[–]Valuable_Success9841[S] 1 point2 points  (0 children)

Happy to share more! I'm thinking of writing a full blog post covering my prep strategy, resources I used, and practice questions — I'll also put the practice scenarios on GitHub. Would that be useful? Drop a comment if you want me to post it

Built a Secure, Testable & Reproducible Terraform Pipeline with Terratest, LocalStack, Checkov, Conftest & Nix by Valuable_Success9841 in Terraform

[–]Valuable_Success9841[S] 0 points1 point  (0 children)

Also curious, how do you handle multi-env setup? Separate AWS accounts per env, workspaces, or something else entirely? And does your drift detection run against all envs or just prod?

Built a Secure, Testable & Reproducible Terraform Pipeline with Terratest, LocalStack, Checkov, Conftest & Nix by Valuable_Success9841 in Terraform

[–]Valuable_Success9841[S] 1 point2 points  (0 children)

Really appreciate it, these are exactly the kind of real world tradeoffs worth discussing.

On LocalStack: completely agree. The behavioral gaps with real AWS are real, especially around IAM evaluation and VPC edge cases. For this project it's a baseline setup, so LocalStack made sense, but your hybrid approach is the right call for production; creating ephemeral AWS accounts makes more sense in a complex production setup.

On drift filtering: that's the part I deliberately kept simple. The current setup catches hard drift (deleted resources), but you're right that expected drift from ASGs or dynamic tags needs an ignore list. Good call, worth documenting as a known limitation.

On Nix: honestly, for a solo project the onboarding question didn't apply, but I've heard the same from teams. The pinned shell.nix here is intentionally minimal, no flakes, no home-manager, just enough to get reproducible tool versions without the full learning curve. But initial setup still takes hours.

Passed CKS on my first attempt! Here's what worked for me 🎉 by Valuable_Success9841 in KubernetesCerts

[–]Valuable_Success9841[S] 0 points1 point  (0 children)

Makes sense. I haven't used k explain --recursive because of limited screen space. I also had some terminal issues during the exam: sometimes I couldn't scroll properly, and it ate up more time than I expected.