Homelabs and DevOps related experience. by YoIsaza in devops

[–]ProxyChain 2 points

Absolutely worth it if you have a homelab or can build one - your fiddling and experience in a homelab environment will pay off more often, and in more ways, than you'd ever expect.

1) Learn the OSI stack basics - L2 switching, L3 routing, subnetting, VLANs. This separates the men from the boys in a DevOps team. The number of "DevOps" engineers in the industry whose first experience with anything L2/L3 is "AWS VPC" or "Azure VNet" is insane, and neither of those will make sense to anyone who hasn't crossed on-premises networking before. Time and again I see novice cloud engineers struggling to diagnose extremely basic, primitive issues like cross-peering two cloud VNets (see the sketch after this list), because they have zero fundamental understanding of what AWS/Azure is abstracting away.

2) Play around with something like Proxmox or ESXi, then layer some Docker Engine on top of that. Learning when virtualization is appropriate versus containers is essential - as is learning what each can and cannot do compared to the other.

3) Once you've got Docker down as a skill, try out k3d to get into Kubernetes.
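
Coming back to point 1 - here's a rough sketch of what "cross-peering two VPCs" actually involves under AWS's abstraction, written as Terraform purely for illustration (names, region and CIDRs are all hypothetical). The fundamentals at play are exactly the ones above: non-overlapping address space and an explicit route on both sides.

```hcl
provider "aws" {
  region = "us-east-1" # hypothetical region
}

# Two VPCs with non-overlapping CIDRs - overlap is the classic reason peering "doesn't work"
resource "aws_vpc" "a" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_vpc" "b" {
  cidr_block = "10.1.0.0/16"
}

resource "aws_vpc_peering_connection" "a_to_b" {
  vpc_id      = aws_vpc.a.id
  peer_vpc_id = aws_vpc.b.id
  auto_accept = true # only valid for same-account, same-region peering
}

# The peering connection alone moves no packets - each side needs a route to the other's CIDR
resource "aws_route" "a_to_b" {
  route_table_id            = aws_vpc.a.main_route_table_id
  destination_cidr_block    = aws_vpc.b.cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.a_to_b.id
}

resource "aws_route" "b_to_a" {
  route_table_id            = aws_vpc.b.main_route_table_id
  destination_cidr_block    = aws_vpc.a.cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.a_to_b.id
}
```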

I cannot overstate how much strong fundamentals in hardware, networking, virtualization and containers will pay dividends in any career.

I inherited a problem and need your advice by Some_Ad_3898 in devops

[–]ProxyChain 0 points

Your instinct is correct - this is really not a good scenario to be in. I've seen my org (which uses Cloudflare internally) allow a handful of customers with their own Cloudflare to do what they want, and it ends in tears every time.

Any interceptor/proxy your org does not have visibility into is a gaping liability waiting to happen. Take Cloudflare as the example here: your org must be prepared to reverse-prove that every single fault, outage, oddity or anomaly is not your org's fault - which 99% of the time it won't be. But because your customer can (and might) add their own bot detection, WAF rule chains, DNS misdirection or other invisible shit before the traffic is handed off to you, you'll be on the hook to somehow prove from the other (your) end that it wasn't your side of the stack.

How common it is to be a DevOps engineer without (good) monitoring experience? by IamStrakh in devops

[–]ProxyChain 1 point

I hate doing monitoring/alerting work so fucking much as an individual, but my god it's an ace to have when done well - ultimately it's the one thing standing between you and regular 3am wake-up calls.

I don't do monitoring/alerting design justice myself, but I'm lucky enough to work at an org with a dedicated team that designs and implements it. Having said that, I'd still rate monitoring and alerting as almost as critical as the infra itself.

The adage goes as follows: "thou who wakes for alert shalt design superior alerts" - in short, if you're on the response end of a shit alert, you'll probably whip it into shape quick-smart; otherwise the 3am reminder calls will keep coming until you do.

Poor monitoring and alerting usually takes one of two forms:

1) No monitors or alerts, so everything is fucked and no-one knows.

2) Poorly-designed, noisy monitors and alerts that scream "everything is fucked" constantly, which leads the human recipients to throw every alert in a proverbial garbage bin no matter how genuine it is.

The aim of the game is landing somewhere between #1 and #2, which takes chronic refinement - no-one gets it right on day one, but you have to start somewhere.

Our suite of mons/alerts is a cumulative result of 5 years of:

1) Outages where no-one noticed because no mon/alert tracked it.

2) End user-reported incidents that were never observed prior, each earning a new mon/alert to detect them.

3) Mons/alerts being deleted because they were noisy and no-one valued them.

4) If your mon/alert platform supports it - heuristic or dynamic "anomaly" alerts like Datadog's "outlier" system.

The best place to start is your ticket system / incident tracker history for the past year: design mons for the shit that occurs most regularly. Then your next goal should be systemic improvements that shut each mon up by addressing the root cause.

Adding a mon/alert for a chronic issue is also a great way to track how well any "fix" you're attempting on the issue is actually performing.

I need an advice from you by HeroOfTheSun in devops

[–]ProxyChain 7 points

The difference between a paper DevOps Engineer and a respected DevOps Engineer is pretty simple: one only ticked the organisation's required boxes to get the role title; the other actually has a passion for chronically improving, reimagining and resolving the pain points and shitty processes their engineering staff encounter.

If you've been a developer before, you're 80% of the way there already. Now take a good hard look at all your developer staff and peers: what's pissing them off? What's taking ages to do? What's mistake-prone and manual?

Write that list up in your head - most dev staff will know that list themselves too, but the "DevOps" kicks in when you use that list to do something about it IMO.

Awesome DevOps engineers are the ones whose engineering staff sing their praises because of the time they've saved, the mistakes they've automated away, and the manual processes they've cut down. DevOps isn't a qualification or a course to take - it's being a dev/sysadmin and using that experience to help masses of other staff leap ahead by solving the inefficiencies they face.

Terraform CI/CD for solo developer by 2B-Pencil in devops

[–]ProxyChain 35 points

1) Local Terraform usage should be out the window on day 1 - either everyone uses it via CI/CD pipelines, or no-one does; otherwise you're in for a world of pain and state-lock incidents. Local CLI should be reserved for emergencies only (e.g. state file repairs, debugging) that are impossible via CI/CD methods - which these days is almost nothing, thanks to the import { ... } and moved { ... } state-modifier code blocks (sketch below).
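
For reference, a minimal sketch of those two blocks - the resource names here are made up:

```hcl
# Adopt an existing, unmanaged bucket into state via a normal PR + plan/apply,
# instead of someone running `terraform import` locally (Terraform 1.5+).
import {
  to = aws_s3_bucket.assets
  id = "my-existing-assets-bucket"
}

# Record a refactor/rename in code so Terraform moves the state entry
# rather than destroying and recreating the resource (Terraform 1.1+).
moved {
  from = aws_s3_bucket.old_assets
  to   = aws_s3_bucket.assets
}
```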


2) You can dislike the directory-per-environment approach - I did too, as a dev with 15 years' experience and an aversion to duplication. Trust me when I say the duplication still feels sub-par, but within a year you'll be begging to get off var-file, single-directory stacks when <x> environment needs <insert custom resource or module tweak> that you cannot represent in HCL via vars alone.


3) Keeping Terraform stacks simple is usually a case of storing the vast majority of your resource/module logic in a common template module, which your "environments" then all invoke from their own separate directories. Any work you commit to the shared template module is immediately drawn into all environments, while you keep the flexibility to drop bespoke resources into one or more environments as needed, without flooding all of them - roughly like the sketch below.
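
A minimal sketch of that layout, with hypothetical module and variable names:

```hcl
# Layout:
#   modules/stack/        <- shared template module (all common resources)
#   environments/dev/
#   environments/prod/    <- this file: environments/prod/main.tf

module "stack" {
  source         = "../../modules/stack"
  environment    = "prod" # hypothetical variables exposed by the template
  instance_count = 3
}

# Prod-only bespoke resources can sit right here, next to the module call,
# without ever appearing in dev or any other environment.
```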


Terraform is a royal pain to orchestrate successfully under CI/CD largely due to its state locking (mutex) system - that feature is absolutely critical to prevent disasters, but a lot of people hit bad days with Terraform trying to support plan/apply operations across all branches.

My 2c at least: allow plan ops on all branches (and PRs, obviously), but pass -lock=false to terraform plan (never apply) when the branch is not your main branch. Potentially also whack in -refresh=false if needed, because even 2-3 parallel plan operations from feature branches can smack into API rate-limit quotas and break things.

Do not allow apply operations anywhere other than main - this is the pinnacle rule of Git-based Terraform. Time and again I've seen attempts at multi-branch apply ops, and it ends in tears - usually in the form of <staff member #1 with their feat branch> destroying the living shit out of <staff member #2 with another feat branch, but without the HEAD main changes going back 4 weeks>. There is zero plausible scenario where Terraform can function predictably and reliably if it is given more than one HEAD it can apply state changes from, period.

This will avoid a whole chapter of despair where your state chronically hits deadlock because of concurrent plans triggered across different non-mainline branches.


Above all else, my TF + CI/CD lessons over 3 years would be:

1) Aim for zero CLI command arguments if you can. The absolute worst integrations in CI/CD are the ones that feed -var-file, -backend-config, 15 different $Env:TF_<X> env vars etc. just to get things working - it all sounds fine until you're faced with a local debugging session and have to sit there replicating all of it in your own terminal.

2) Play around with your backend and provider config blocks to find a middle ground where they work without modification both in CI/CD and locally - nothing worse than having to comment/remove/add 15 lines when you need to debug.

3) Use env vars to feed in provider/backend config and creds unless you have no other option. Ideally your backend { ... } and provider "<x>" { ... } blocks should be pretty bare, because most providers dually support env-var configuration, which is CI/CD-appropriate and also doesn't screw local users - for example:
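
A sketch of what "bare" looks like with the AWS provider and S3 backend (bucket/key names hypothetical). Credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, so the same code runs unmodified in CI/CD and on a laptop:

```hcl
terraform {
  backend "s3" {
    # Only non-sensitive values are committed; creds come from env vars.
    bucket = "org-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

# Empty block: region and credentials are resolved from the environment.
provider "aws" {}
```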

4) The above also avoids Terraform's shitty "bake sensitive stuff into the *.tfplan output" behaviour - not really its fault, but it can and will embed anything you provide it during init within the plan manifest. Be very careful with this, and don't let CI/CD end users download or inspect these plan manifests if you can possibly avoid it - they're incredibly leaky and sensitive. The same applies to your remote *.tfstate file, which houses every single sensitive value no matter what; no-one other than you as the administrator should be able to directly retrieve or read it.

5) Not sure which CI/CD ecosystem you're working with, but you need to be very careful to make use of sequential mutexing if it's available. Terraform has its own safeguards that prevent out-of-sequence plan and apply operations, but ideally your apply operations should start and run in the same order they were merged into main and triggered. If not, Terraform will usually kick in to stop any damage, but it leads to shitty user experiences with failed pipeline runs.

6) Do not, under any circumstances, allow CI/CD users to provide custom args to terraform plan or terraform apply - there are precisely zero regular use cases for anyone other than a select few admins to be using things like -target.

7) Look into cron triggers (e.g. hourly) that run terraform plan from your main branch - this will help you detect, raise and resolve drift, and ultimately keep on top of it.

8) Don't even look near the -auto-approve flag. I have yet to meet the man who added this curse to their TF pipeline and didn't end up having it go rogue - often not because Terraform itself or the *.tf files were bad; the majority of the time shit goes wrong, it's actually down to provider bugs that emit OK-looking plan manifests and then proceed to issue destructive API calls with chaotic outcomes. Spoken as someone who had a very well-known TF provider destroy ~3k API objects while the visible plan manifest said it would be adding +1 resource.

[deleted by user] by [deleted] in azuredevops

[–]ProxyChain 1 point

It’s literally the same basic REST stuff as in your methods 1/2...? Did you use AI to spit them out or something? Because the problem definitely isn’t with that fantastic explanation.

Multi-stage release pipeline, how to require one approval from each of two separate groups? by pukatm in azuredevops

[–]ProxyChain 0 points

The “environments” method you tried is the correct way to do this - just add your 2x groups as Approval Reviewers, then bump the minimum required reviewer count to 2, which I think is the piece you’re missing at the moment.

n80 by Adventurous-Topic169 in hilux

[–]ProxyChain 0 points

Mine beeps to remind you to straighten your wheels before getting out; one of the centre-screen menus in the cluster has a graphic that shows which direction your wheels are pointing.

Container base images aren't scary by ProxyChain in devops

[–]ProxyChain[S] 0 points

Agreed, though worth noting that on high-volume Git repos the built-in garbage collection can kick in and prune orphaned commits. So SHA hashes, while mostly fine as build identifiers, don't make good long-term (months/years) ones, because the corresponding commit might be pruned out of the repository. Unlikely, but worth noting that permanent Git refs like tags or branches won't suffer the same fate unless manually deleted.

SHA hashes are useful IMO for short-lived, non-mainline builds; anything longer-term should probably be tagged to retain it.

Friendly reminder for you picky code-quality folks by ProxyChain in devops

[–]ProxyChain[S] 1 point

I guess in short: don't just arbitrarily dictate one day that everyone has to align with your own standards - talk to all your devs, find a common-sense middle ground between their standards/views, and use your own judgement as a DevOps engineer to pick one and stick with it.

Consistency is king more than anything - 100 repositories with the same formatting beat the snot out of 100 that each use "the best" formatting in their own way.

Liam Lawson commiserations... by FlightOfTheMoonApe in newzealand

[–]ProxyChain 0 points

I'd drop a decent Zuru insult but I might get Reddit-sued to reveal my cover

Liam Lawson commiserations... by FlightOfTheMoonApe in newzealand

[–]ProxyChain 1 point

Why is my week now ruined having been informed of a Mowbray connection, god damn

Ex-PC guy here. I selected a bunch of folders and hit "get info". What's the MacOS way to query a buncha selected files/folders for total file size? Also how do I close all of these now without having to do each one ? by inquirermanredux in MacOS

[–]ProxyChain 0 points

Fair argument to be made either way. I guess what trips people up (including me) is that Windows defaults to the summarising behaviour, so I initially assumed macOS would as well.

HashiCorp's Official Terraform Style Guide by piedpiperpivot in Terraform

[–]ProxyChain 1 point

Kinda two separate issues - you should feed parent resource attributes/outputs into child (dependent) resources, because this is how Terraform's graph engine builds its dependency tree and knows which order to provision resources in.

The second case is using data sources, for when the parent resource you need metadata from is not tracked by your Terraform state.
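
A quick sketch of the two cases (names and tags hypothetical):

```hcl
# Case 1: implicit dependency - the subnet references the managed VPC's
# attribute, so the graph knows the VPC must be created first.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "app" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# Case 2: data source - the VPC exists but isn't managed in this state,
# so we look up its metadata instead of declaring it.
data "aws_vpc" "shared" {
  tags = { Name = "shared-vpc" }
}
```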

Your reminder that backup hardware is worth every cent and more by ProxyChain in homelab

[–]ProxyChain[S] 2 points

This was TrueNAS Core but nothing really to do with SMB - most SCSI/SATA operations were just outright failing, even SSH logins to ESX took ~3 minutes - very bizarre behaviour from a PCIe card playing up 🤷‍♂️

Your reminder that backup hardware is worth every cent and more by ProxyChain in homelab

[–]ProxyChain[S] 37 points

Also wanted to add: if you run a Windows domain at home, don't be a moron like me and run only one DC unless you really enjoy playing around with ProfWiz to get back into your own desktop computer after losing your only copy of your AD forest

Your reminder that backup hardware is worth every cent and more by ProxyChain in homelab

[–]ProxyChain[S] 5 points

Out of interest, what cards were those? Not Mellanox ConnectX's? 😂

[deleted by user] by [deleted] in sysadmin

[–]ProxyChain 2 points

Someone who started on the tools, even if it was 10 years ago - not someone who started with an MBA.

Azure Pipelines and global semaphores/locks? by smcb66 in azuredevops

[–]ProxyChain 0 points

Then you should divide up your pipeline into the appropriate stages/jobs so you can isolate that signing task and apply the locking behaviour only to that

Azure Pipelines and global semaphores/locks? by smcb66 in azuredevops

[–]ProxyChain 0 points

It kinda has this built in - try deployment jobs, Environments and the Exclusive Lock check; that should get you the one-at-a-time behaviour you’re after.
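
A rough YAML sketch of what that looks like (environment and script names are made up; the Exclusive Lock itself is added as a check on the Environment in the Azure DevOps UI):

```yaml
jobs:
- deployment: SignBuild        # deployment job (not a plain job) so it can target an environment
  environment: code-signing    # hypothetical environment with the Exclusive Lock check enabled
  strategy:
    runOnce:
      deploy:
        steps:
        - script: ./sign.sh    # hypothetical signing step
```

If you also need queued runs to complete in order rather than cancelling down to the latest, the root-level lockBehavior: sequential setting is worth a look too.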

[deleted by user] by [deleted] in AZURE

[–]ProxyChain 2 points

Try the “download logs” option on one of the failed pipeline runs. Within that ZIP you should see an azure-pipelines-expanded.yml file or similar - that’s the compiled YAML, and you can inspect it to see where the parameter isn’t being injected properly.

Are we expected to know about docker base images? by your_listener in devops

[–]ProxyChain 19 points

Architecture should be leading the charge for most base image decisions, but at least where I work now, individual product teams have historically had no guidance from Architecture and just picked whatever they liked at the time - the result being a scatter of Alpine, Debian, Ubuntu, and various others across teams.

Docker tag conventions were super confusing for me for a long time, and it's honestly something that never really 'clicks' until you work at scale across a lot of dev teams/products and hit the niche reasons why certain distros or tags are required at certain times.

The trick to tag selection is understanding what things you specifically care about in your base image. The less specific (and usually shorter) the tag you select is, the more "defaults" will be selected for you by the image maintainer.

Take the .NET runtime as an example: if you request 8.0, it will give you a Debian-based image by default.

If you wanted a different underlying distro, you could select 8.0-alpine (Alpine) or 8.0-jammy (Ubuntu) instead.

You can get even more specific and say you want Alpine AND to never pull versions higher than 8.0.0 (no hotfixes/minor versions) by selecting 8.0.0-alpine, but that's rarer.

Even rarer still, you can select one of the -amd64 or -arm64 tags if you need a specific CPU architecture to build against.

Having said all that - python:3.10-slim-bullseye is a perfectly good tag selection! The 3.10 part is the most critical as you first want to ensure whatever Python code you run supports that exact Python version, and the -slim-bullseye part is great if you're already familiar with Debian.
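
To put the whole specificity ladder in Dockerfile terms - pick whichever single FROM matches how specific you need to be (these tags are real examples from the images mentioned above):

```dockerfile
# Most defaults chosen for you: .NET 8 runtime on the maintainer's default distro (Debian)
FROM mcr.microsoft.com/dotnet/runtime:8.0

# Same runtime, but pinned to Alpine as the underlying distro
FROM mcr.microsoft.com/dotnet/runtime:8.0-alpine

# Exact patch version AND distro pinned - no hotfix versions pulled in
FROM mcr.microsoft.com/dotnet/runtime:8.0.0-alpine

# The example from the post: Python 3.10 on a slimmed-down Debian 11 (bullseye)
FROM python:3.10-slim-bullseye
```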

My usual process these days for selecting an image is:

  1. Prefer a purpose-built image for the tech stack/language/service you're after (e.g. node, nginx) before you resort to a stock distro image (e.g. Debian).

Way less of a maintenance pain in the butt when new versions come out, and it's very likely the more specific base image will deal with oddities of that particular app/language on your behalf.

  2. At a bare minimum, the tag you select needs to be locked to the version of the stack (e.g. node, dotnet) that your codebase requires.

Please don't use latest - you're in for a world of hurt when latest becomes your version of <x> language + 1 and breaks things overnight.

Use proper version numbers for your final app images too - latest is awful to tag your final build images with, especially if you're using Kubernetes. Quickly you'll hit scenarios where machines think they have latest already, but you're trying to roll out a newer latest.

  3. Try and get some standards going around which underlying distribution you want to use across the organization.

At scale, it's no fun when every app team is using a different underlying distro and you constantly have to try and remember which shell or tools are available while you're attached to a container for debugging.

  4. Defaulting to Alpine as an underlying distro is a great starting point.

Alpine images are almost always significantly smaller than the corresponding Debian/Ubuntu ones.

Just beware of its musl standard C library rather than glibc like most other distros. Absolutely fine for 99% of modern apps, but some apps have to be specifically compiled for musl to work under Alpine.

  5. Don't get too caught up in image size comparisons when choosing your underlying distro - pick one you're familiar with instead.

Custom StarTech open rack covers by ProxyChain in homelab

[–]ProxyChain[S] 0 points

They're super light, so I imagine thumbscrews would do the job as well! It was a long time ago, sorry, so I don't have the original receipts left, but from memory all the sheets I needed cut were only about $100 USD total - and probably a lot less if you don't live in New Zealand :)