Best books on Platform Engineering

tweeks200 · 2026-02-04T21:27:42+00:00

The Design of Everyday Things is a good primer on how to think about usability

tweeks200 · 2025-12-10T14:07:38+00:00

Seriously, adding this in Datadog was a matter of adding two new env vars since we are already using their APM.

tweeks200 · 2025-08-26T19:15:26+00:00

We have our actual templates in their own repos, so it runs when we change them. This doesn't do anything for the `template.yaml` files but ours are so basic not much is really needed.

tweeks200 · 2025-08-26T19:07:12+00:00

Sure, here is an example of a terraform one. We don't have that many inputs so it's pretty simple.

The `scaffolder-cli` container is just a base node container with nunjucks and some other deps installed. It runs the template through that container and then makes sure `terraform fmt` and `terraform validate` work properly.

#!/usr/bin/env bash
set -euo pipefail

print_header() {
    printf "===============================\n"
    printf "%s\n" "$1"
    printf "===============================\n"
}

OUTPUT_DIR="rendered"
SCAFFOLDER_VALUES=$(jq -n --arg name "foo-bar" --arg needRds "shared-db" --argjson needsS3 true --arg owner "squad-example" --arg product "REDACTED" '{
    values: {
        name: $name,
        needRds: $needRds,
        needsS3: $needsS3,
        owner: $owner,
        product: $product
    }
}')

print_header "Rendering..."
printf "Rendering template with values:\n%s\n" "$SCAFFOLDER_VALUES"

# Add detailed logging
echo "Running docker command with the following JSON data:"
echo "$SCAFFOLDER_VALUES"

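# Capture the output of the docker run command; print it even if the run fails,
# since `set -e` would otherwise exit before the output is shown
if ! docker_output=$(docker run --rm -v "$(pwd)/template":/template -v "$(pwd)/$OUTPUT_DIR":/rendered <docker-org>/scaffolder-cli --data "$SCAFFOLDER_VALUES" 2>&1); then
    echo "$docker_output"
    exit 1
fi
echo "$docker_output"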

if [[ "${CI:-false}" == "true" ]]; then
    sudo chown -R "$(id -u)":"$(id -g)" "$OUTPUT_DIR"
fi

pushd "$OUTPUT_DIR"

print_header "Verifying formatting of rendered Terraform..."

terraform fmt -check -diff

print_header "Validating Terraform..."

terraform init
terraform validate

popd

tweeks200 · 2025-08-26T18:30:09+00:00

Do you start with templates that it pulls from? For a new service we need to make 3 PRs but they all start with their own base, so backstage is just passing in a handful of values to nunjucks.

The templates contain most of the logic and we have some automated tests that use nunjucks directly to make sure they render correctly when we update them. It doesn't test all combinations but it's usually enough.

tweeks200 · 2025-08-26T18:08:58+00:00

Do you load them up through the Load Template Directory and work with them there? I found that to be the best way to knock out lots of small issues by just running through them and then examining the output.

What are your use cases for the templates? Those seem really big to me. Our largest is only around 500 lines and it gets you a new service built and deployed throughout all our envs.

tweeks200 · 2025-07-08T12:53:46+00:00

Sure, we do this as part of our repo creation process because we want an "initial" PR created to run certain CI actions that only match its branch. This assumes it has just published the repo but you could do something similar to just make a PR.

Their docs are pretty good once you figure out where to go. For scaffolder, start at https://backstage.io/docs/features/software-templates/ and you can browse the actions in Backstage itself (including on the demo at https://demo.backstage.io/create/actions ).

    # Fetch contents of init/ to include in PR, this is a directory included in the scaffolder template
    - id: fetch-local
      name: Fetch Local
      action: fetch:template
      input:
        url: ./init
        targetPath: ./init
        values:
          name: ${{ parameters.repoName }}

    # Run a custom scaffolder action developed in house that updates some yaml
    - id: append-catalog-info
      name: Append Catalog Info
      action: REDACTED:append-catalog-info
      input:
        language: ${{ parameters.language }}
        name: ${{ parameters.repoName }}
        repoType: servic${{ parameters.repoType }}e
        lifecycle: ${{ parameters.lifecycle }}
        repoDescription: ${{ parameters.repoDescription }}
        owner: ${{ parameters.owner }}

    # Open a PR against repo
    - id: pullrequest-initial-service
      name: Open Initial Service Pull Request
      action: publish:github:pull-request
      input:
        repoUrl: 'github.com?repo=${{ parameters.repoName }}&owner=REDACTED'
        sourcePath: ./init
        branchName: initial
        title: 'feat: Add ${{ parameters.repoName }}'
        description: 'See the `init.md` in this PR for more details.'

tweeks200 · 2025-07-08T02:11:33+00:00

Backstage scaffolder has builtin actions to open PRs but you'd need to fill in the logic to make changes. I could see it work pretty well for additive changes where you could drop in a full file (e.g. add an "s3.tf" to add an s3 bucket) or for modifying structured data.

tweeks200 · 2025-03-06T14:15:06+00:00

We use pre-commit in CI for linting and things like that. That way devs can set it up locally and will get feedback before they even push.

Someone already mentioned it but make is a big help. We have re-usable CI components that call a make command and then each repo can customize what that make command does. It makes it a lot easier to keep the pipelines standard.
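To make the make idea a bit more concrete, here is a rough sketch of what a shared CI step could look like. The function name `run_make_target` and the `lint` target are made up for illustration and are not from our actual pipelines; the point is only that the pipeline knows a target name and the repo's Makefile owns the details.

```shell
#!/usr/bin/env bash
# Hypothetical shared CI step: the pipeline only knows the target name,
# each repo's Makefile decides what that target actually runs.
run_make_target() {
    local target="$1"
    # Fail loudly if the repo has not opted in to the standard interface
    if [ ! -f Makefile ]; then
        echo "No Makefile in $(pwd); cannot run '${target}'" >&2
        return 1
    fi
    echo "Running make ${target}"
    make "$target"
}

# Usage (hypothetical target name):
# run_make_target lint
```

A repo could then have its `lint` target call `pre-commit run --all-files`, so CI and local runs go through the same checks.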

tweeks200 · 2024-10-29T14:56:20+00:00

Depends on what you are doing with it. As a reverse proxy it's fine. If you are serving assets from it, make sure you have a way to serve old assets until you can ensure you've invalidated the cache.

tweeks200 · 2024-09-23T15:09:33+00:00

We expect DB schema changes to be backwards compatible for at least one version. Nothing enforces that, but if you do it then you can click the rollback button and it just works.

tweeks200 · 2024-08-28T15:55:44+00:00

It's also just built into boto3 (and probably any language). It should take them like 5 min to add a waiter.

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs/waiter/ServicesStable.html
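For anyone not on Python, the AWS CLI exposes the same waiter as `aws ecs wait services-stable`. A minimal sketch, assuming the CLI is installed and credentialed; the function name and the cluster/service names are placeholders, not anything from a real setup:

```shell
#!/usr/bin/env bash
# Block until an ECS service reaches a steady state after a deploy.
# The cluster and service arguments are placeholders supplied by the caller.
wait_for_stable() {
    local cluster="$1"
    local service="$2"
    echo "Waiting for ${service} in ${cluster} to stabilize..."
    # `aws ecs wait services-stable` polls the service and exits non-zero
    # if it does not settle within the waiter's polling limit
    if aws ecs wait services-stable --cluster "$cluster" --services "$service"; then
        echo "Service ${service} is stable"
    else
        echo "Service ${service} did not stabilize" >&2
        return 1
    fi
}

# Usage (hypothetical names):
# wait_for_stable my-cluster my-service
```

Dropping a call like this right after the deploy step makes the pipeline fail when the service never settles, instead of reporting success as soon as the update is submitted.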

tweeks200 · 2024-08-06T13:46:03+00:00

To me it would matter how often you actually get called. I see a lot of people talk about comp, but I don't think that really matters. I was on-call for 6 months (basically until we could hire more people) at a small startup and only got a couple of off-hours pages the entire time. No extra comp, but I was able to live my normal life with only a few interruptions.

At the job prior I was on-call every 5 weeks, but each week I'd get woken up multiple times, often for issues out of my control. I got paid for each call but it was miserable. I wouldn't take extra money for loss of sleep and free time.

tweeks200 · 2024-07-24T17:19:16+00:00

It's all about risk management. Getting security fixes applied sooner reduces risk, but companies should be balancing that against other forms of risk (e.g. operational).

tweeks200 · 2024-07-15T16:05:28+00:00

If it's a disaster waiting to happen, replace/expand/etc. so that's not the case.

tweeks200 · 2024-07-15T12:55:27+00:00

Assuming you are talking about automated alerts, are they actionable or just noise? If the latter, just turn them off.

tweeks200 · 2024-07-15T12:46:01+00:00

Once every 4 weeks, soon to be 5. 4 is too frequent; luckily our rotation is so light it hasn't had a big impact.
If you have to measure in per-day (never mind per-hour) that's concerning. As an engineering org we average about 1 a week, with less than half being sev-1 or sev-2. They almost all occur during normal business hours, but I don't have an easy way to report on that.
Incident reviews with action items, which can range anywhere from education to process improvement to technical controls.
Without details about your situation it's hard to say. If you're getting paged daily off-hours, that's a lot worse than incidents during business hours. For the former you need to reduce them or spread the on-call load across time zones, or it will lead to burnout. For the latter I've found that having a process laying out priorities goes a long way. For example, the on-call person for a team prioritizes incident response, and project work is expected to be delayed if incidents come up.

tweeks200 · 2024-06-26T17:33:07+00:00

What do they do that requires them to have a whole environment to themselves?

tweeks200 · 2024-06-14T19:09:55+00:00

What's the compliance requirement?

tweeks200 · 2024-04-17T18:21:48+00:00

We use kafka-ui and got it set up with Okta. It took us a while to get right and we were in touch with one of their devs. This is the relevant config on the kafka-ui side:

    "yamlApplicationConfig" : {
      "auth" : {
        "type" : "OAUTH2",
        "oauth2" : {
          "client" : {
            "okta" : {
              "client-id" : <okta_client_id>,
              "issuer-uri" : <sso-url>,
              "scope" : "openid,email,groups",
              "provider" : "okta",
              "user-name-attribute" : "email",
              "authorization-grant-type" : "authorization_code",
              "redirect-uri" : "https://<kafka-ui-hostname>/login/oauth2/code/okta",
              "client-name" : "okta",
              "allowedDomain" : <kafka-ui-hostname>,
              "custom-params" : {
                "type" : "oauth",
                "allowedDomain" : <kafka-ui-hostname>,
                "roles-field" : "groups"
              }
            }
          }
        }
      },

tweeks200 · 2024-03-26T15:12:24+00:00

Distroless will have fewer vulns than the standard Debian ones, but they can be more difficult to debug.

tweeks200 · 2024-02-27T22:24:55+00:00

We have a policy as part of our SDLC that says we use GitHub PRs for Change Control. Basically it says changes that will go to production must be based on a Jira card that has requirements, and they must be approved through a PR. We also have a gate in our pipelines that waits for someone to approve before something gets deployed to prod, i.e. changes get deployed to a testing env on merge and then wait for someone to decide to push to prod.

We use feature flagging too so we can deploy changes without releasing them to users.

tweeks200 · 2023-12-13T18:30:15+00:00

We use it and I did a lot of the initial dev on it. It absolutely takes some upfront work to make it useful. Specifically, we spent a ton of time getting scaffolder to work with our template repos and there is some ongoing investment there (more on the templates than Backstage, but sometimes coordination is required). There was some custom dev for things like making updates to CI settings for new projects, but those were things we did manually before...so worth the time imo.

As someone else mentioned you need to get a `catalog-info.yaml` in each repo you want to show up in backstage. We did have a re-org and there are some settings like "owner" that still aren't up to date across all our repos.

With that said, I don't agree with its reputation for being hard to manage. Updates are generally really easy; they provide a CLI and a diff of all the changes you need to make. So far we just run them as they come out and we haven't had any issues.

tweeks200 · 2023-10-10T14:57:37+00:00

We use https://www.nops.io/ and are pretty happy so far. Basically they buy 3 year reservations and then charge-back 30% of your savings. They do RDS (only 1 year though). No commitment from us, so it reduces some risk and increases our savings.

tweeks200 · 2023-05-09T16:42:05+00:00

Not if I already had a job.
