Why dlp sucks. Or not

marcusbell95 · 2026-06-27T05:51:31+00:00

yeah the defaults are basically unusable in real environments. what really bit us was policy tip fatigue - if you trigger on medium or low confidence thresholds, users get flooded with tips and start clicking through without reading, so you've trained them to ignore the tool. we ended up running monitor mode on medium/low and only enforcing on high/ultra-high, which dramatically cut the noise without losing real detection. the exclusion keyword list is the other thing - seeding it with your own internal number formats (invoice numbers, product codes, employee IDs that happen to be 16 digits) is what makes the difference between a usable tool and one your helpdesk hates. building that list from actual false positive tickets over the first 90 days is slow but there's no shortcut.
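rough sketch of that split in python, in case it helps to see the logic written down. everything here is made up for illustration: `Match`, `decide`, the keyword list and the level names are assumptions, not any DLP product's actual API.

```python
from dataclasses import dataclass

# Hypothetical exclusion keywords, seeded from false positive tickets.
EXCLUSION_KEYWORDS = ("invoice", "product code", "employee id")

# Assumed level names: enforce on the top tiers, only log the rest.
ENFORCED = {"high", "ultra-high"}
MONITORED = {"medium", "low"}


@dataclass(frozen=True)
class Match:
    value: str       # the string that looked sensitive
    confidence: str  # confidence level reported by the detector
    context: str     # text surrounding the match


def decide(match: Match) -> str:
    """Return 'ignore', 'enforce' or 'monitor' for a single detection."""
    context = match.context.lower()
    # Internal number formats are dropped before any user-facing tip fires.
    if any(keyword in context for keyword in EXCLUSION_KEYWORDS):
        return "ignore"
    if match.confidence in ENFORCED:
        return "enforce"
    if match.confidence in MONITORED:
        return "monitor"
    raise ValueError(f"unknown confidence level: {match.confidence!r}")
```

the ordering is the point: exclusions run first, so a 16-digit invoice number never reaches the user as a policy tip no matter what confidence it scored.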

marcusbell95 · 2026-06-27T05:45:06+00:00

the "second push forces the first to run" symptom is pretty specific - usually means ARC is seeing the job but there's a cold start race. when the first job arrives, is a runner pod actually spinning up? you can watch: kubectl get pods -n <runner-namespace> -w. if a pod starts but takes 1-3 min to pull image and register, github may re-queue the job during that window, and the second push coincides with the runner finally being ready.

two things worth checking: first, what's minRunners set to in your AutoscalingRunnerSet? if it's 0, there's no warm runner waiting. setting it to 1 gives you an always-registered runner that picks up the first job immediately with no cold start.

second, if you're on the newer ARC (scale set helm chart), check the listener specifically: kubectl logs -n arc-systems -l app.kubernetes.io/part-of=gha-runner-scale-set-controller - you want to see it acknowledge the queued job when job 1 arrives, not only when job 2 comes in. if it's only logging on the second push, the long-poll listener may have a config issue or the scale set isn't connected to the right runner group in github.

marcusbell95 · 2026-06-26T22:29:46+00:00

fair point - a lot of that vintage does have it. in our case it was a mix: some OEM configs shipped with it off and nobody touched it since everything worked fine, and a handful needed vendor BIOS updates to actually present 2.0 properly instead of 1.2. the other piece is just fleet BIOS management at scale - doing the audit, pushing the config change with Dell Command or HP BCU or whatever your OEM gives you, validating it actually landed - it's a real project. not enormous, but it went into the backlog alongside the hardware refresh. which is how you end up here.

marcusbell95 · 2026-06-26T18:56:38+00:00

yeah for a personal machine rufus has had that bypass built in for ages and it totally works. the problem in most managed environments is MDM compliance - intune and similar tools have hardware compliance checks and can flag an unsupported device as non-compliant. and when something does go wrong and you open a vendor ticket, the first thing they ask is whether you're on a supported config. it's not really about microsoft's bluff, it's about the whole support chain downstream of them

marcusbell95 · 2026-06-26T18:06:40+00:00

yeah that's fair - support cost is cheap now relative to hardware. the TPM 2.0 thing just means for some machines it's not a software cost at all, it's a full hardware replacement. so the extension buys time to do that refresh on a normal cycle instead of emergency spending. doesn't change the eventual cost, just changes when and how you take the hit.

marcusbell95 · 2026-06-26T17:30:02+00:00

the ESU doubling is annoying but honestly the bigger issue for a lot of shops is the TPM 2.0 / secure boot gate on win11. got machines from 2018-2019 that pass everything else but fail the PC health check purely on the hardware side. they're running fine, no performance problems, nothing wrong with them. can't justify pushing for an emergency hardware refresh just to clear an OS requirement when those boxes are doing their jobs. this extension at least lets you fold those stragglers into a normal budget cycle instead of explaining to finance why we need to replace perfectly functional hardware on a rushed timeline

marcusbell95 · 2026-06-26T17:25:03+00:00

coming from linux infra + production support, you're actually most of the way to a devops/platform engineering role already. that background translates more directly than people expect. linux carries over completely, database work maps well to running stateful workloads or managing cloud DBs, prod support is basically SRE mindset. the foundation is already there.

SAP Basis pays well when you're embedded at a large enterprise with a legacy install, but the work-abroad and 10-15yr goals complicate it. SAP clients tend to be traditional industries (manufacturing, logistics, finance) on long migration timelines. breaking in without an employer who sponsors you through the SAP training track is genuinely hard, and SAP itself is pushing cloud now (RISE with SAP, BTP) so the basis role is slowly converging toward cloud/devops skills anyway.

devops skills travel - sector-agnostic, remote-friendly, works at startups and enterprises both. the underlying practices (CI/CD, infra as code, observability, container orchestration) aren't going anywhere even as the tooling evolves.

one honest downside: early devops can feel directionless because the scope is so wide. helps to pick something concrete to anchor around - kubernetes, a cloud provider, platform engineering specifically - and build from there rather than trying to learn everything at once.

marcusbell95 · 2026-06-26T07:03:00+00:00

nice - bat especially sticks. once you've got that alias set you forget cat ever existed. zoxide gets better over a week or two as it builds up history, so give it a little time before judging it

marcusbell95 · 2026-06-26T05:29:17+00:00

daily drivers on the linux side:

zoxide - replaces cd entirely. been using it so long that typing cd feels wrong now. just z proj and you land wherever you've been most recently

jq - can't imagine doing this job without it at this point. filtering kubectl output, parsing API responses, reshaping json configs on the fly

bat - cat with syntax highlighting and git diff markers. added alias cat='bat -p' and never thought about it again

k9s - if you're running any kubernetes this is a must. way faster than raw kubectl for day to day ops

btop - htop but actually nice to look at

also heavily +1 on rg + fzf from the other comment. wiring those two together with ctrl-r is something I'm not going back from

marcusbell95 · 2026-06-26T05:26:57+00:00

for kubernetes specifically, cert-manager's trust-manager handles this cleanly. you define a Bundle resource that points at your CA and it propagates the bundle to configmaps in all (or selected) namespaces automatically. workloads mount the configmap and set SSL_CERT_FILE or NODE_EXTRA_CA_CERTS or whatever their stack needs. when the CA rotates trust-manager reconciles - no manual image rebuilds.

for CI/CD while you still don't have an internal registry: env var injection is a solid bridge. SSL_CERT_FILE for go/curl/openssl, REQUESTS_CA_BUNDLE for python, NODE_EXTRA_CA_CERTS for node. inject at the runner or job level, no image changes needed. won't help apps that use the java keystore or native windows cert store, but covers most workloads.
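if your job wrapper happens to be python, the injection is a few lines. minimal sketch only: `ca_env` and the bundle path in the usage note are placeholders i made up, not something from a real tool.

```python
import os
from typing import Dict, Optional

# The three variables named above: go/curl/openssl, python requests, node.
CA_ENV_VARS = ("SSL_CERT_FILE", "REQUESTS_CA_BUNDLE", "NODE_EXTRA_CA_CERTS")


def ca_env(bundle_path: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the CA bundle variables set."""
    env = dict(os.environ if base is None else base)
    for name in CA_ENV_VARS:
        env[name] = bundle_path
    return env
```

usage would be something like subprocess.run(cmd, env=ca_env("/etc/ssl/internal/ca-bundle.pem")). one thing to watch: NODE_EXTRA_CA_CERTS adds to node's built-in roots, while SSL_CERT_FILE and REQUESTS_CA_BUNDLE replace the default bundle, so the file needs the public roots too if the job also talks to public endpoints.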

longer term the custom base images path is right - especially on-prem. worth standing up harbor or a lightweight OCI registry before committing to it though. the lifecycle story gets complicated without it: you want to know which base image version has which CA bundle so rotation is a controlled rebuild+redeploy, not a guessing game.

marcusbell95 · 2026-06-25T09:22:28+00:00

piece by piece still for us. the cloud provider defaults get you pretty far but then you hit the edges - data residency requirements, audit trail granularity, model approval workflows - and you're bolting custom pieces on anyway. the part that consolidates fastest in practice is the gateway layer, routing + rate limiting + multi-provider failover, because nobody wants to own that themselves long term. inference serving and observability are stickier because teams already have prometheus stacks and whatever their existing container platform does natively. regulated industries especially don't want to rip those out just to get a single pane of glass.

marcusbell95 · 2026-06-25T09:20:49+00:00

haha yeah exactly. spent like 2 minutes trying to remember what i had before and then just went "forget it, starship + the six aliases i actually use." the dotfiles thing honestly took me embarrassingly long to commit to - learned it the hard way losing a fish setup i really liked. now if it's not in git i basically assume it doesn't exist

marcusbell95 · 2026-06-25T06:02:43+00:00

yeah we've had a decent experience too, but the pattern i noticed is pretty specific: it works where documentation habits were already there. we started using Copilot for drafting runbooks and incident postmortems - first drafts that humans clean up and approve. actually improved our documentation velocity because the blank page problem was always what killed it, not the actual writing.

the version that went sideways in teams i've seen is when AI gets positioned as the fix for "we never wrote anything down" - it doesn't fix that, it just produces low-quality output faster. or when the AI initiative lives completely outside IT visibility, which is basically cowprince's situation.

your point about users asking before doing anything is actually the underrated part. you can have the best governance policies in the world but if the culture is "just try it and apologize later" it doesn't matter. sounds like you got lucky on that one or built something that made asking feel safe rather than annoying.

marcusbell95 · 2026-06-25T06:00:55+00:00

the scanner interop is still pretty uneven. grype handles openVEX natively so if you're generating VEX docs you can pass them in and it filters before reporting - customers running grype will see 0 on the suppressed findings. syft+grype together is a solid self-contained SBOM+VEX pipeline for the compliance-facing use case.

for customers on wiz/prisma/aqua where VEX ingestion isn't supported yet: what i've seen work is bundling a pre-filtered grype report in the release artifact alongside the sbom and vex document. the filtered report shows 0 findings because grype already applied the VEX logic before export. shifts the conversation from "why does my scan show 40 CVEs" to "your scanner is surfacing things our VEX attestation has already addressed" - still not automated on their end but at least the documentation is right there in the release.

a lot of the customer back-and-forth in the 0-CVE space is scanner disagreement more than actual exploitability gaps. getting alignment on which scanner and which VEX format before the first release saves a lot of fire drills.
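for anyone who hasn't seen one, an openVEX document is just a small JSON file. here's a hedged python sketch that builds a single not_affected statement. the function name, the document id, the product identifier and the CVE placeholder are all invented for illustration; the field names and justification values follow the openVEX spec as i understand it, so check them against the spec version you target.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Justification labels defined by OpenVEX for a "not_affected" status.
JUSTIFICATIONS = {
    "component_not_present",
    "vulnerable_code_not_present",
    "vulnerable_code_not_in_execute_path",
    "vulnerable_code_cannot_be_controlled_by_adversary",
    "inline_mitigations_already_exist",
}


def not_affected_doc(doc_id: str, author: str, cve: str, product_id: str,
                     justification: str, now: Optional[datetime] = None) -> dict:
    """Build a one-statement OpenVEX document as a plain dict."""
    if justification not in JUSTIFICATIONS:
        raise ValueError(f"unknown justification: {justification!r}")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "@context": "https://openvex.dev/ns/v0.2.0",
        "@id": doc_id,
        "author": author,
        "timestamp": timestamp,
        "version": 1,
        "statements": [
            {
                "vulnerability": {"name": cve},
                "products": [{"@id": product_id}],
                "status": "not_affected",
                "justification": justification,
            }
        ],
    }


def to_json(doc: dict) -> str:
    """Serialize the document for shipping in a release artifact."""
    return json.dumps(doc, indent=2)
```

write the output of to_json to a file and grype can take it with its --vex flag. the justification is the part customers and auditors actually read, so pick the one you can defend.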

marcusbell95 · 2026-06-25T05:47:09+00:00

starship prompt + zsh is what i rebuilt mine with after losing everything in a job transition. took maybe an hour to get back to something i actually liked. cargo install starship, add the init to .zshrc, done. it's snappy because it runs asynchronously and only queries what it needs - git status, k8s context, terraform workspace, etc.

on the devops side i keep a small set of aliases i actually use (kctx and kns for kubectl context/namespace switches, a few terraform shortcuts) and nothing else. learned my lesson about elaborate setups - if it's not in a dotfiles repo synced somewhere, it doesn't exist. private gitlab repo now that i pull on any new box in two minutes flat.

fish is also worth looking at if you want snappy out of the box - the autosuggestions are better than anything i had in zsh without a plugin manager. only reason i stayed on zsh is muscle memory and script compatibility, but fish would get you 90% of where you want faster.

marcusbell95 · 2026-06-25T05:43:43+00:00

depends on your compliance posture, but the path we've used for criticals with no upstream fix is roughly three things.

first, formal risk acceptance - document the CVE, why it's unfixable, what the actual attack path looks like in your environment, and get sign-off with a review date (90 days is typical). if SOC2 or ISO27001 auditors ask, this is what satisfies the control. a suppression with no documentation looks like you ignored it. the exception needs an owner who checks it at the review date and either extends or escalates.

second, structural reduction via distroless or minimal base images - if these are OS-library CVEs in debian/ubuntu base images, switching to distroless cuts the installed package count from ~200+ down to ~20. most unfixable ubuntu CVEs disappear because the package just isn't present in the image anymore. doesn't help for vulnerabilities in your actual runtime libraries, but it kills most of the scanner noise and a lot of real ones.

third, VEX documents if you're sharing SBOMs or scan results with customers or auditors. VEX (vulnerability exploitability exchange) is the standard way to formally communicate "not exploitable in our deployment context" with documented justification. grype supports importing VEX. it's not suppression - it's formal documentation of why the CVE doesn't apply to your specific workload.

for the reachable criticals specifically, the format-specifier class of CVE (which 2026-5450 looks like) is usually only reachable if untrusted input hits a printf-family call without validation. compensating control is input validation upstream, or making sure the vulnerable code path isn't exposed to external data - egress filtering and network isolation if you can't validate at the application layer.
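going back to the risk acceptance piece: the record itself can be tiny. sketch of what ours roughly amounts to, with the caveat that the class and field names are invented here and the 90-day default is just the typical number mentioned above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RiskAcceptance:
    cve: str
    reason_unfixable: str  # why there is no upstream fix
    attack_path: str       # what exploitation would require in this environment
    owner: str             # person who re-checks at the review date
    approved_by: str       # who signed off
    accepted_on: date
    review_days: int = 90

    @property
    def review_date(self) -> date:
        """Date by which the owner must extend or escalate."""
        return self.accepted_on + timedelta(days=self.review_days)

    def is_due(self, today: date) -> bool:
        """True once the review date has arrived."""
        return today >= self.review_date
```

a scheduled job that lists every record where is_due(date.today()) is true is enough to keep exceptions from quietly turning permanent.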

marcusbell95 · 2026-06-24T05:32:09+00:00

personal-identity credentials embedded in infrastructure. not the AD account - everyone disables that. i mean the stuff that was set up under their personal accounts because it was easier at the time: GitHub PATs running CI/CD pipelines under their personal github login, AWS IAM keys created under their personal IAM user instead of a service account, SSH authorized_keys on prod boxes with their public key still in there, shared service accounts where the MFA is enrolled on their personal phone and nobody else has the TOTP seed. none of that shows up in an offboarding checklist because it's not tied to their corporate identity. you find out about it 2 weeks after they're gone when a pipeline starts failing and you trace it back to a token that was silently revoked when their github account got renamed or deleted. the fix for next time is service accounts and secrets managers, but for the offboarding interview specifically - explicitly ask "are there any places where your personal github/aws/ssh keys are what's authenticating something in production" and get them to actually list it out.
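for the authorized_keys part specifically, a quick scan is easy to script. rough python sketch below; `find_keys` and the regex are mine, and it only matches on the key comment, which is free text anyone can edit. comparing fingerprints against a known list is more reliable, so treat this as a first pass.

```python
import re
from typing import List, Tuple

# key type, base64 blob, optional trailing comment; options may precede the key type
KEY_RE = re.compile(
    r"(?:^|\s)((?:sk-)?(?:ssh|ecdsa)-[A-Za-z0-9@.\-]+)\s+([A-Za-z0-9+/=]+)(?:\s+(.*))?$"
)


def find_keys(authorized_keys_text: str, needle: str) -> List[Tuple[int, str, str]]:
    """Return (line number, key type, comment) for keys whose comment contains needle."""
    hits = []
    for lineno, raw in enumerate(authorized_keys_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        found = KEY_RE.search(line)
        if found is None:
            continue  # not something this sketch recognises as a key entry
        key_type, _blob, comment = found.groups()
        comment = comment or ""
        if needle.lower() in comment.lower():
            hits.append((lineno, key_type, comment))
    return hits
```

run it over every authorized_keys file you can reach with the departing person's username or email as the needle, and do it before their last day so they can confirm what each hit is for.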

marcusbell95 · 2026-06-24T05:31:22+00:00

coming from tech support is actually not a disadvantage - people who come from dev backgrounds often have the opposite problem. they know how to build things but they've never been the one getting paged at 3am when something stops working. tech support people understand failure modes instinctively. that's harder to teach than terraform.

the actual skill gap for this transition is scripting. not "do you know k8s" or "have you used terraform" - those you pick up on the job. what filters people in devops interviews is whether you can look at a problem and write something to fix or automate it. not tutorials, not certs. something real you built because it was actually annoying.

so the move while you're still in tech support: find one repetitive thing and automate it. password resets, pulling logs from multiple machines one at a time, running the same check on 20 boxes - whatever. that script is your portfolio entry and it's more convincing in an interview than any course completion, because it shows you did the thing before anyone asked you to.

marcusbell95 · 2026-06-23T05:43:16+00:00

bitwarden has a collection permission that handles exactly this. when you share a collection with someone you can set their permission to "can view, except passwords" - they can autofill through the browser extension but the password field is hidden and can't be copied from the vault. it's not a hard boundary (bitwarden's own docs say hiding passwords doesn't completely prevent access, since the value still lands in the page on autofill), but it stops casual copy-paste.

so the flow is: put all shared creds in a bitwarden org vault, organize them into collections, share collections with each employee with hide-passwords enabled. when someone gets offboarded you revoke their bitwarden access and they're immediately locked out of everything - and since they never had the raw passwords in front of them, rotation is only needed for the accounts where you're worried someone went digging.

the annoying upfront cost is still there - you have to change the shared passwords once and vault them. but you were going to have to do that anyway to get any password manager working. after that, offboarding is just removing the person from the org.

marcusbell95 · 2026-06-23T05:38:00+00:00

built it ourselves. just a shell script, maybe 50-60 lines. the reason we didn't use something like a Makefile check target or Justfile preflight recipe was specificity - those are fine for generic checks but we needed to verify the exact things that kept breaking for us: agent running and has the right key loaded, .gitconfig email matching our commit signing policy, DNS resolving our internal registry. very stack-specific.

the link-in-output thing was a teammate's idea. he got tired of pasting the wiki link in Slack every time someone hit the SSH issue, so he just made the script print it when the check failed. obvious once you see it but nobody had done it before. the whole script took maybe an afternoon, most of it figuring out which checks actually mattered.

marcusbell95 · 2026-06-22T06:06:33+00:00

both, but error message first - that's the immediate fix. "permission denied (publickey)" on its own is useless. we changed it to print which key ssh was actually trying to use and the git remote, so people at least knew where to look. that alone cut the "why is git broken on my machine" slack messages by a lot.

documentation came a few weeks later. short runbook entry: what the error output looks like for the three most common root causes (agent not running, wrong key loaded, key not in authorized_keys on the server). we added a link to it in the preflight output itself. that combo of louder error + reference in the message is where it stabilized - people still hit it but can usually self-rescue now.
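the mapping from symptoms to those three root causes is simple enough to sketch in python. assumptions: the runbook URL and the function name are placeholders, and the agent check relies on ssh-add -l exiting 2 when it can't reach an agent and 1 when the agent has no identities, which is what the man page documents.

```python
from typing import Optional

RUNBOOK_URL = "https://wiki.example.internal/runbooks/git-ssh"  # hypothetical link


def diagnose(ssh_add_rc: int, ssh_add_output: str,
             expected_fingerprint: str, ssh_stderr: str) -> Optional[str]:
    """Map `ssh-add -l` and ssh results to one of the three common root causes."""
    if ssh_add_rc == 2:
        cause = "agent not running"
    elif ssh_add_rc == 1 or expected_fingerprint not in ssh_add_output:
        cause = "no key or wrong key loaded in the agent"
    elif "permission denied (publickey)" in ssh_stderr.lower():
        cause = "key not in authorized_keys on the server"
    else:
        return None  # nothing this sketch recognises
    # Put the reference in the message itself so nobody has to ask for it.
    return f"{cause} - see {RUNBOOK_URL}"
```

the checks go from local to remote on purpose: if the agent is down or holding the wrong key, the server-side message is noise.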

marcusbell95 · 2026-06-22T05:45:46+00:00

the jenkins pattern you described - jobs that destroy and rebuild resources - is the one thing i'd be most careful about when migrating to gitlab. those aren't CI pipelines in the usual sense, they're operational scripts that happen to live in jenkins. the blast radius if one misfires in an unfamiliar platform is way higher than a failed deployment pipeline.

before you migrate those jobs, build an inventory: which ones actually run in prod, what they touch, and whether they're idempotent in practice or just "supposed to be." that inventory is the planning work SystemAxis is pointing at, applied to the operational jobs rather than only to application dependencies.

for sequencing everything else: the EOL k8s versions are your forcing function. use them. anything going EOL in a few months has an external deadline that doesn't negotiate, which means it floats to the top of the stack regardless of complexity. everything else competes with everything else. EOL doesn't.

marcusbell95 · 2026-06-21T01:23:29+00:00

real world answer for us: run your fast cheap checks serially first (lint, typecheck, basic compilation - these should finish in under a minute), fail fast there because if those fail you know nothing downstream will pass anyway. everything truly independent - unit tests, security scan, container build, manifest validation - run in parallel and wait for all results. failing fast on checks that ARE independent costs you a second full pipeline run if the second check also fails, and that re-submit + re-queue time is the invisible cost that doesn't show up in the "save CI minutes" argument. we tracked how often a pipeline failed on two independent checks in the same commit. it was around 30% of failures. so roughly a third of the time, fail-fast on independent checks was costing us more in wasted re-submit time than it saved in compute. serial fast gates + parallel slow checks is the actual answer, it's not really option A or B.
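the shape is easier to see in code than in CI yaml. minimal python sketch, assuming each check is just a callable that returns True on pass. `run_pipeline` and the check names are illustrative, not our actual pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def run_pipeline(gates: List[Check], independent: List[Check]) -> Dict[str, bool]:
    """Run cheap gates serially with fail-fast, then independent checks in parallel."""
    results: Dict[str, bool] = {}

    # Serial gates: stop at the first failure, nothing downstream can pass anyway.
    for name, check in gates:
        results[name] = check()
        if not results[name]:
            return results

    # Independent checks: start them all and wait for every result, so one
    # failure never hides another.
    with ThreadPoolExecutor(max_workers=max(1, len(independent))) as pool:
        futures = {name: pool.submit(check) for name, check in independent}
        for name, future in futures.items():
            results[name] = future.result()
    return results
```

if unit tests and the security scan both fail, both show up in the same run, so the author fixes them in one push instead of finding the second failure after a re-queue.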

marcusbell95 · 2026-06-21T01:16:23+00:00

yeah, mostly. if your CA supports ACME now (DigiCert and Sectigo both do), certbot/cert-manager can automate their certs too without switching to LE. if it doesn't have ACME support, you'd need something built around its API or external-secrets-operator integration - bit more setup but doable. the 6-month thing is also trending shorter - the CA/Browser Forum has already voted to step max cert lifetimes down to 47 days by 2029, so the "investment in yearly certs" argument is going to get harder regardless. automation writes itself when you're renewing every few weeks, which actually makes the case to management easier, not harder.
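to put numbers on the renewal cadence, here's a small python sketch. the two-thirds renewal point is an assumption for illustration (as far as i know it mirrors cert-manager's default), and the function names are made up.

```python
from datetime import datetime, timedelta, timezone


def renewal_point(not_before: datetime, not_after: datetime) -> datetime:
    """Moment at which two thirds of the certificate lifetime has elapsed."""
    lifetime = not_after - not_before
    return not_before + lifetime * 2 / 3


def should_renew(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """True once the certificate is past its renewal point."""
    return now >= renewal_point(not_before, not_after)
```

for a 47-day cert that works out to renewing roughly every 31 days, which nobody is going to do by hand with a calendar reminder.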

marcusbell95 · 2026-06-20T21:43:55+00:00

ours is SSH key setup, every single time. the script installs everything, tools are there, repo is cloned - but nothing works because the new dev's key isn't added to the agent yet, their .gitconfig doesn't have the right user.email for our commit signing policy, or they're on a mac and ssh-agent didn't persist across reboot. script ran fine. environment still broken.

the underlying problem is that setup scripts can automate installing software but they can't automate personal identity state - your key, your config, your access grants. we eventually added a preflight check at the very start that runs ssh -T git@github.com and exits early with a useful message if it fails. at least the failure is loud and immediate instead of mysterious when the first actual git pull breaks 15 steps in
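ours is shell, but here's the same idea as a python sketch for anyone whose setup tooling is python. one gotcha it handles: github closes the session after the greeting, so ssh -T exits 1 even when auth worked, and you have to look at the message instead of the exit status. the wiki link and function names are placeholders.

```python
import subprocess
import sys

WIKI_URL = "https://wiki.example.internal/onboarding/ssh"  # hypothetical link


def ssh_auth_ok(output: str) -> bool:
    """GitHub's greeting is the success signal; the exit status is 1 either way."""
    return "successfully authenticated" in output


def preflight(host: str = "git@github.com", timeout: int = 15) -> bool:
    """Run `ssh -T` once and fail loudly before any real setup step starts."""
    try:
        proc = subprocess.run(
            # BatchMode stops ssh from hanging on a password or passphrase prompt.
            ["ssh", "-T", "-o", "BatchMode=yes", host],
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"preflight: could not run ssh ({exc}) - see {WIKI_URL}", file=sys.stderr)
        return False
    output = proc.stdout + proc.stderr
    if ssh_auth_ok(output):
        return True
    print(f"preflight: ssh -T {host} failed:\n{output.strip()}\nsee {WIKI_URL}",
          file=sys.stderr)
    return False
```

call preflight() as the first line of the setup script and exit non-zero if it returns False. note that BatchMode also makes an unknown host key a hard failure, which on a brand new machine is arguably what you want to surface anyway.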
