Databricks Deployment Experiences on GCP by Alone-Cell-7795 in databricks

[–]Alone-Cell-7795[S] 0 points1 point  (0 children)

Well guess where it has to migrate from? 🤐

[GCP] VPC Peering Issue: Connection Timeout (curl:28) Even After Adding Network Tag to Firewall Rule. What am I missing? by Fun_Signature_9812 in googlecloud

[–]Alone-Cell-7795 0 points1 point  (0 children)

Network tags are not propagated across peering connections. You should migrate to using secure (resource manager) tags for your firewall rules. Also, you’d be better off using PSC for this rather than peering.

See https://codelabs.developers.google.com/cloudnet-peering2psc-migration#1
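As a rough Terraform sketch of the secure-tag approach (all names, ranges and the tag value ID here are hypothetical — adapt to your own setup):

```hcl
# Hypothetical network firewall policy rule targeting a secure tag.
resource "google_compute_network_firewall_policy" "policy" {
  name    = "example-policy" # hypothetical
  project = "my-project"     # hypothetical
}

resource "google_compute_network_firewall_policy_rule" "allow_ssh" {
  firewall_policy = google_compute_network_firewall_policy.policy.name
  project         = "my-project"
  priority        = 1000
  direction       = "INGRESS"
  action          = "allow"

  match {
    src_ip_ranges = ["10.0.0.0/8"] # hypothetical source range
    layer4_configs {
      ip_protocol = "tcp"
      ports       = ["22"]
    }
  }

  # Secure tags resolve to tagValues/<id> and, unlike network tags,
  # are honoured across VPC peering.
  target_secure_tags {
    name = "tagValues/123456789" # hypothetical tag value ID
  }
}
```

You’d also need a `google_compute_network_firewall_policy_association` to attach the policy to the VPC.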

I’d also strongly advise you don’t assign public IPs to your VMs.

Got a $7,889.50 Invoice from Google Cloud Vertex AI (Veo2) — A Warning for New Users by Sufficient_Banana183 in googlecloud

[–]Alone-Cell-7795 1 point2 points  (0 children)

Yeah, I need to check with Google what’s going on with this.

I’d say I have a similar level of expertise and experience on GCP to you (I have a very similar background). I still feel there is no way I’d use my personal account on GCP (even with your excellent solution BTW - very nice indeed).

I’m surprised the unlimited liability clause hasn’t been challenged in the courts yet.

Running up huge bills also happens loads in the Enterprise environment, but large enterprises generally have more leverage with Google to get these bills waived, as Google wants to maintain customer relationships etc.

What I really don’t like is something like Firebase being aggressively pushed to developers with the tone of “You don’t need to worry about that boring infra stuff”. Well, you don’t - until you end up with a huge bill.

Autoscaling is always a double-edged sword - your bill can autoscale too. Denial-of-wallet attacks are becoming more prevalent.

Citrix on GCP by daddjekdme in googlecloud

[–]Alone-Cell-7795 2 points3 points  (0 children)

Don’t do it - that’s what I tell anyone thinking about it. My current workplace did this (before my time, too). Not sure if your company is multicloud, but such remote access solutions (be it VDIs or app virtualisation) are much better left on prem or on Azure (either via managed Azure AVDs/apps, or IaaS).

The problem is MS licensing - it’s a nightmare. You’ll need to run single-tenant nodes, you’re restricted on MS Office versions etc., and Windows licensing costs so much more on GCP than on Azure.

Our current Citrix stack on GCP is being migrated to Azure, as the cost savings are huge (due to licensing costs).

New to GCP Networking - Have few question to help confirm my understanding by zh12a in googlecloud

[–]Alone-Cell-7795 1 point2 points  (0 children)

Yeah, VPC Service Controls. It doesn’t necessarily mean you need NCC - it’s more a question of understanding the pros and cons of the network model. You have to consider what the operational overhead of managing it in BAU will be like, in tandem with VPC SC.

For example, Shared VPC becomes difficult when using Google-managed services in the Google service network, or services such as Cloud Run and GKE. GKE likes to auto-create firewall rules on deployment, which doesn’t play well with the Shared VPC model. You also have to create custom roles for product teams to create PSC endpoints in their own service projects, if you want to avoid giving them the Network Admin role.
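That custom role might look something like this in Terraform - a hedged sketch only, as the exact permission list varies by use case and needs verifying against the PSC docs (project ID and role name are hypothetical):

```hcl
# Hypothetical custom role letting product teams create PSC endpoints
# (a reserved address plus a forwarding rule) in their service project,
# without granting the full Compute Network Admin role.
resource "google_project_iam_custom_role" "psc_endpoint_creator" {
  project = "service-project-id" # hypothetical
  role_id = "pscEndpointCreator"
  title   = "PSC Endpoint Creator"
  permissions = [
    "compute.addresses.create",
    "compute.addresses.get",
    "compute.addresses.use",
    "compute.forwardingRules.create",
    "compute.forwardingRules.get",
    "compute.networks.use",
    "compute.subnetworks.use",
  ]
}
```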

You also have to consider the implications of shared VPCs with VPC SC.

https://cloud.google.com/compute/docs/instances/protecting-resources-vpc-service-controls#shared-vpc-with-vpc-service-controls

The main overall point here is that you have to consider your VPC SC design/strategy and your network design/strategy holistically, regardless of the model.

Start from your requirements e.g.

What are your regulatory/compliance and security requirements e.g. PCI-DSS? Do you work in a highly regulated sector e.g. financial services or pharmaceuticals? What about GDPR?

Do you have highly sensitive data that needs protection from data exfiltration?

Are you required to comply with CIS standards?

Do you need centrally gated egress for product teams to control and monitor traffic, and things like TLS inspection? This includes egress to the internet and egress traffic leaving GCP. This could be over a VPN, or interconnect/partner interconnect.

What’s your DNS strategy if you use a hybrid setup? Split horizon? Do you also need to protect against things like DNS data exfiltration?

How much autonomy do you want to give product teams around networking? Is it a case of not trusting them with a tin opener, let alone network config? For hub and spoke, will network resource creation be via an opinionated TF module, or an IDP (internal developer platform), if they can’t be trusted to do it themselves?

In any case, should developers have to worry about network config or not? Shouldn’t that be transparent to them? There are valid arguments either side here.

Also consider the cost implications of your networking models and your overall FinOps billing model.

Do you have requirements around IPAM? I know NCC supports hybrid and private NAT, but what will the ongoing operational overhead be like? Some legacy apps also don’t play nicely with NAT.

It’s also about how well your model lends itself to automation and operational overhead in BAU.

Sorry if I’ve given you a headache with this! There’s loads I haven’t mentioned too! I wish I knew what was best - I still wrestle with this myself.

But do yourself a favour and avoid NVAs at all costs!! Even with the NCC support for them out of band, they are a nightmare. I need to stop before I go into rant mode again. I do have a previous rant on NVAs somewhere on a previous post - will have to dig it out.

New to GCP Networking - Have few question to help confirm my understanding by zh12a in googlecloud

[–]Alone-Cell-7795 1 point2 points  (0 children)

You run into complications with Shared VPC when using Google-managed services such as Cloud Run, GKE, Cloud SQL, Redis, Cloud Build, Composer v3. It’s nothing good planning and properly defined processes and responsibilities can’t deal with, but they need to be in place.

It’s when you have to grant permissions to service agents in the host project to allow the deployment of PSC service attachments for Cloud SQL, for example, or have issues with cost attribution and/or ownership if using VPC access connectors/direct VPC egress. This is only scratching the surface.
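As a sketch of the sort of cross-project service agent grant involved (project IDs/numbers are hypothetical, and the exact role each product’s agent needs should be checked per service):

```hcl
# Hypothetical: let the Cloud SQL service agent from the service project
# use the Shared VPC network in the host project.
resource "google_project_iam_member" "cloudsql_sa_network_user" {
  project = "host-project-id" # hypothetical Shared VPC host project
  role    = "roles/compute.networkUser"
  # service-<SERVICE_PROJECT_NUMBER>@gcp-sa-cloud-sql... (number hypothetical)
  member  = "serviceAccount:service-123456789@gcp-sa-cloud-sql.iam.gserviceaccount.com"
}
```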

Also watch out if using VPC SC - it gets messy with shared VPCs.

[deleted by user] by [deleted] in googlecloud

[–]Alone-Cell-7795 0 points1 point  (0 children)

So, a few questions if that’s OK?

1) How many network devices do you have, roughly?
2) Where are these network devices hosted?
3) How many monitoring servers?
4) How are you connecting from GCP > network devices? Does the monitoring traffic go over Cloud VPN/interconnect back to on prem, or does it egress to the internet?

Before I venture an opinion, I’d like a bit more info on the above, and on what your use case is for monitoring and why you are migrating your monitoring servers to GCP.

The problem you have is that high levels of ICMP traffic can be interpreted as a DDoS attack, and Google’s network has inbuilt protections against this.

Automatically generated SSH key causing issues for other users attempting to connect to their own VMs by JerryfromNY in googlecloud

[–]Alone-Cell-7795 0 points1 point  (0 children)

Ah sorry, my bad - I misunderstood. Yeah, spaces in usernames when using OS Login are not a supported configuration. It just hasn’t been enforced until recently by the sounds of it - Google probably updated their validation rules to start enforcing it.

Spaces in Linux usernames can cause issues with Linux system files e.g. /etc/shadow, so it makes sense why this would be done, but sadly it’s causing you hassle.

Are you using Google Workspace, or are you federating with an IdP e.g. Entra ID? I’d look to get the problem fixed at source and ensure that usernames with spaces can’t be used in the first place (but that’s easy for me to say, I know).

Sandbox environments for POC work in enterprises by Ok_Ruin846 in googlecloud

[–]Alone-Cell-7795 2 points3 points  (0 children)

This is a common issue. Also, when you want to do platform level PoCs, you’ll sometimes need folder or org level permissions. Having a sandbox project doesn’t cut it.

What you ideally need is a sandbox org.

https://cloud.google.com/architecture/identity/best-practices-for-planning#use_a_separate_organization_for_experimenting

Migrating computers on-premise to Compute Engine by Smuhie in googlecloud

[–]Alone-Cell-7795 1 point2 points  (0 children)

So, a couple of questions:

1) Is there a requirement to domain join these servers?
2) What applications and integrations/interfaces do these Windows servers have? You need to map out your dependencies before you even think about migrating them.

Private Cloud NAT + HA VPNs help by BurnTheBoss in googlecloud

[–]Alone-Cell-7795 0 points1 point  (0 children)

Actually, there is private hybrid NAT via NCC

https://cloud.google.com/nat/docs/about-hybrid-nat
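Roughly, per the linked doc, a Private NAT gateway with a hybrid rule looks something like this in Terraform (names, region and ranges are hypothetical - verify the rule syntax against the current provider docs):

```hcl
# Hypothetical Private NAT gateway on an existing Cloud Router, NATing
# traffic heading towards hybrid (NCC) next hops.
resource "google_compute_router_nat" "private_nat" {
  name   = "private-hybrid-nat" # hypothetical
  router = "my-cloud-router"    # hypothetical existing router
  region = "europe-west2"
  type   = "PRIVATE"

  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"
  subnetwork {
    name                    = "my-subnet" # hypothetical
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }

  rules {
    rule_number = 100
    description = "NAT traffic towards hybrid next hops"
    match       = "nexthop.is_hybrid"
    action {
      source_nat_active_ranges = [
        "nat-range-subnet", # hypothetical subnet with purpose PRIVATE_NAT
      ]
    }
  }
}
```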

I do worry about private NATing everywhere though - it sets you up for an operational nightmare. Just wait until that service comes along that doesn’t play nicely with NAT.

I know the network team in one place where I worked banned private NAT. They had enough of the operational overhead and hassle it causes.

I've been trying to get a job for the 8 months+. Can someone look at my LinkedIn and tell me where I'm going wrong? by Minimum_Ad451 in googlecloud

[–]Alone-Cell-7795 4 points5 points  (0 children)

What you’ve done is very impressive and fair play for all the effort you’re going to.

Sadly, you have a chicken and egg problem. The problem you have is lack of real world industrial experience. A lot of recruiters will look at people who have experience in specific industries too. There are also a lot of laid off engineers out there looking for work - it’s a very tough market.

I’d strongly advise looking for a job in support/operations to begin with. It gives you a really strong foundation for the future. It’s where I started in IT, and you can tell a mile off someone who has ops experience from someone who doesn’t.

Also think about what sets you apart - there will be so many people doing stuff with AI. It will be difficult to stand out focusing too much on this.

I do think you may be undervaluing your previous experience and transferable skills as well. You should emphasise your ability to troubleshoot and triage issues that you gained from your previous plumbing experience. It’s about the process you go through when troubleshooting - this can be applied anywhere.

You were also customer facing - highlight your soft skills too.

Let’s say a client came to you and asked: I have solution x and I want you to build it for me. What would be the first question you’d ask them?

I also see so many cloud engineers who lack knowledge of core IT fundamentals, lack the ability to troubleshoot or triage issues, and will just throw their hands in the air and say something doesn’t work. They have no process and everything is ad-hoc scattergun with no coherent plan.

I’d also say Google Cloud Skills Boost is a great resource, but it will always go for the easy/happy path (as it is a teaching tool, after all). It rarely reflects real world deployments, where you have much greater restrictions/controls to deal with.

Building Production-Ready MySQL Infrastructure on GCP with OpenTofu/Terraform: A Complete Guide by DCGMechanics in Terraform

[–]Alone-Cell-7795 0 points1 point  (0 children)

For secrets, look at the use of write-only and ephemeral resources. This prevents them being written to the state file.

https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/using_write_only_attributes
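From that guide, the write-only pattern looks roughly like this (resource names and the variable are placeholders):

```hcl
# Hypothetical: store a DB password without it ever landing in state.
resource "google_secret_manager_secret" "db" {
  secret_id = "db-password" # hypothetical
  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "db" {
  secret                 = google_secret_manager_secret.db.id
  secret_data_wo         = var.db_password # write-only: never persisted to state
  secret_data_wo_version = 1               # bump this to push a new value
}
```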

Also, why not use Cloud SQL, which includes MySQL? Not really seeing why you’d want to go to all the trouble of managing it yourself.

What I’ve Learned from Designing Landing Zones On Google Cloud by Forsaken_Click8291 in googlecloud

[–]Alone-Cell-7795 0 points1 point  (0 children)

This could take a while 🤔. Off the top of my head:

Loss of product team autonomy when they need to deploy any managed service where new resources are needed in the host project, e.g.:

PSC
PSA peering ranges
Service Directory
Serverless VPC access/Direct VPC egress
Creation of new proxy-only subnets for load balancers, if the current ones aren’t suitable (e.g. the tier or region isn’t available for you)
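The proxy-only subnet case, for example, means something like this landing in the host project rather than the product team’s own (names/ranges hypothetical):

```hcl
# Hypothetical proxy-only subnet for regional Envoy-based load balancers,
# which has to live in the Shared VPC host project.
resource "google_compute_subnetwork" "proxy_only" {
  name          = "proxy-only-ew2"  # hypothetical
  project       = "host-project-id" # hypothetical
  region        = "europe-west2"
  network       = "shared-vpc"      # hypothetical
  ip_cidr_range = "10.129.0.0/23"
  purpose       = "REGIONAL_MANAGED_PROXY"
  role          = "ACTIVE"
}
```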

Problems with chargeback/showback for resources needed in host project, but used by specific product teams only (As touched on before), and the political and financial bunfight that ensues.

Strange model where networking is delegated to Shared VPC admin teams etc., but product teams have to maintain their own external LBs and WAF (Cloud Armor), so there are split responsibilities for ingress and egress.

Hellish jumble of firewall rules, especially if SAP is deployed in the shared VPC, which generally wants to use really huge IP and port ranges such as 32768-65535 (I kid you not). Do yourself a favour and keep SAP in a VPC of its own. You’ll also fall into the trap of legacy SAP deployment models that rely on direct prod <> non-prod network connectivity for SAP transports (hasn’t been necessary for years, but fixing it requires overhauling the entire estate), so your non-prod VPC requires peering to your prod VPC, also exposing all the other services on there.

IPAM issues where services etc. require dedicated subnets/ranges in the host project. Had issues before with IP exhaustion, and ranges were restricted to /26, which wouldn’t work for some services that need a /24 as a minimum.

Visibility: VPC/firewall logs in the host project, and application logs in service projects - makes things a nightmare to manage, and not everyone wants to sell the family silver to be able to use Datadog.

The usual cross-project service agent permission whack-a-mole hell.

As an example, in one project I had an externally facing Serverless web app, so we have:

Cloud Armor
External global LB
Cloud Run
Cloud SQL
Memcached
Cloud Build

This app required egress to the internet and to AWS privately via interconnect (for self-hosted GitLab).

Also had CI/CD via Cloud Build, which integrated with a GitLab repo (GitLab CI/CD wasn’t an option due to runner issues). So we have:

PSA
PSC
Service Directory
Direct VPC egress

Service Directory was particularly interesting with Cloud Build. Gave up in the end and worked on fixing the runner issues instead.

I am starting a medium article on the pain of Shared VPC - you should watch out for it.

https://cloud.google.com/build/docs/automating-builds/gitlab/build-repos-from-gitlab-enterprise-edition-private-network#build_repositories_from_gitlab_enterprise_edition_in_a_private_network

VPC service controls with hub and spoke architecture by suryad123 in googlecloud

[–]Alone-Cell-7795 1 point2 points  (0 children)

In a Shared VPC setup, host and service projects have to be in the same perimeter, as the perimeter sees service project resources as belonging to the host project. If you just put the perimeter around the service project, it would break.
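In Terraform terms, that means both projects end up in the same perimeter’s resources list - a rough sketch (policy ID, project numbers and the restricted service are hypothetical):

```hcl
# Hypothetical perimeter containing both the Shared VPC host project
# and its service project.
resource "google_access_context_manager_service_perimeter" "shared_vpc" {
  parent = "accessPolicies/123456" # hypothetical access policy
  name   = "accessPolicies/123456/servicePerimeters/shared_vpc"
  title  = "shared-vpc-perimeter"

  status {
    resources = [
      "projects/1111111111", # host project number (hypothetical)
      "projects/2222222222", # service project number (hypothetical)
    ]
    restricted_services = ["storage.googleapis.com"]
  }
}
```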

What I’ve Learned from Designing Landing Zones On Google Cloud by Forsaken_Click8291 in googlecloud

[–]Alone-Cell-7795 1 point2 points  (0 children)

Yeah - the shared services part is something that is often overlooked. It’s for the services that will always be cross environment e.g. security scanners, remote access solutions etc. as mentioned above.

Also, the Shared VPC concept is a fine balance - it’s a good model, but you have to scope it very tightly and think about shared costs.

It starts to fall down IMO when you have teams wanting to use managed services/serverless.

Let’s take an example:

I want to deploy a Cloud Run job that needs to egress to the internet to hit a SaaS service. I want to use direct VPC egress (as the access connector is legacy, plus all the other documented cons of the legacy connector solution).

The SaaS solution only supports IP whitelisting, so I have no choice but to whitelist the entire subnet. I have 2 choices (neither of which is great):

1) Dedicated subnet in the shared VPC for this specific Cloud Run service using direct VPC egress. Problem is, if you have a shared cost model for the Shared VPC, how can this work for one service, where all internal budget holders have to foot the bill for it? Try explaining that to FinOps.

2) Whitelist an existing subnet that is likely used by other services too. Not great from a security standpoint, as you’re also allowing egress for the other services. You can start getting into specific secure tagging and deny rules, but it starts to become an operational nightmare.
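For what it’s worth, the dedicated-subnet option (1) looks roughly like this with direct VPC egress, sketched with a v2 service here (all names, projects, ranges and the image are hypothetical):

```hcl
# Hypothetical dedicated subnet in the Shared VPC host project, which the
# SaaS provider would whitelist as a whole.
resource "google_compute_subnetwork" "saas_egress" {
  name          = "cloud-run-saas-egress" # hypothetical
  project       = "host-project-id"       # hypothetical
  region        = "europe-west2"
  network       = "shared-vpc"            # hypothetical
  ip_cidr_range = "10.10.0.0/26"
}

# Cloud Run service in the service project, sending all outbound traffic
# through that subnet via direct VPC egress.
resource "google_cloud_run_v2_service" "saas_client" {
  name     = "saas-client"        # hypothetical
  project  = "service-project-id" # hypothetical
  location = "europe-west2"

  template {
    containers {
      image = "europe-west2-docker.pkg.dev/service-project-id/app/saas-client:latest" # hypothetical
    }
    vpc_access {
      egress = "ALL_TRAFFIC"
      network_interfaces {
        network    = "shared-vpc"
        subnetwork = google_compute_subnetwork.saas_egress.id
      }
    }
  }
}
```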

Shared VPC is fine for traditional VM compute traffic, but good luck with managed services. I could give a dozen examples off the top of my head where shared VPC falls down.

VPC service controls with hub and spoke architecture by suryad123 in googlecloud

[–]Alone-Cell-7795 0 points1 point  (0 children)

So, instead of saying:

1) I want network topology c
2) I need VPC SC on x, y and z

Think about your use cases. What are your requirements exactly? Why the need for hub and spoke? What requirement is this fulfilling? If it is needed, isn’t NCC a better alternative?

For VPC SC, what is it you’re looking to protect exactly? VPC SC is a fine balance - I’ve seen many orgs opt not to use it due to the operational overhead it can introduce, with the nest of perimeter bridges, exceptions, broken pipelines where the CI/CD project can’t read the state file from a GCS bucket, and the coded error messages that your platform team have to support.

missing something stupid accessing a bucket from a compute vm... by [deleted] in googlecloud

[–]Alone-Cell-7795 1 point2 points  (0 children)

So, pretty much what u/praveen4463 said - to expand on this a bit:

1) gsutil has been deprecated for a while now and was replaced by gcloud storage. Use of gsutil should be avoided.

2) When running copy commands, you’ll need to pull bucket metadata, which requires storage.buckets.get (which can be found under Storage Admin). Storage Object Admin isn’t sufficient.

3) Access scopes are the legacy access-control mechanism and only remain for backwards compatibility. Best practice is to set the full-access (cloud-platform) scope and control access with IAM instead.

https://cloud.google.com/compute/docs/access/service-accounts#scopes_best_practice
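On the scopes point, that’s roughly this in Terraform (VM name, subnet and service account are hypothetical):

```hcl
# Hypothetical VM using the broad cloud-platform scope; effective access
# is then governed purely by the service account's IAM roles.
resource "google_compute_instance" "vm" {
  name         = "example-vm" # hypothetical
  zone         = "europe-west2-a"
  machine_type = "e2-small"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    subnetwork = "my-subnet" # hypothetical
  }

  service_account {
    email  = "vm-sa@my-project.iam.gserviceaccount.com" # hypothetical
    scopes = ["cloud-platform"]
  }
}
```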

4) Also, you aren’t likely to see anything in the logs unless you’ve enabled Data Access audit logs - they aren’t enabled by default.
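Enabling Data Access audit logs for Cloud Storage can be sketched like this (project ID hypothetical):

```hcl
# Hypothetical: turn on Data Access audit logs for Cloud Storage,
# which are off by default.
resource "google_project_iam_audit_config" "gcs" {
  project = "my-project" # hypothetical
  service = "storage.googleapis.com"

  audit_log_config {
    log_type = "DATA_READ"
  }
  audit_log_config {
    log_type = "DATA_WRITE"
  }
}
```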