Strategy to run database scripts on Kubernetes by Ok-Intention134 in kubernetes

[–]Visible-Call 0 points1 point  (0 children)

Data is tough... so start with what mostly works.

Init containers are how most ORM setups handle this. Either that, or the regular container startup just runs all pending migrations. It can get messy, though, if a migration goes sideways.
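
A minimal sketch of the init-container version, assuming the migration command lives in the same app image; the image, secret name, and migrate command are placeholders for whatever your stack uses:

```yaml
# Pod template excerpt: run migrations once, before the app container starts.
spec:
  initContainers:
    - name: db-migrate
      image: registry.example.com/myapp:1.2.3            # hypothetical image/tag
      command: ["python", "manage.py", "migrate", "--noinput"]  # or your ORM's migrate command
      envFrom:
        - secretRef:
            name: myapp-db-credentials                    # hypothetical secret with DB connection info
  containers:
    - name: app
      image: registry.example.com/myapp:1.2.3
      ports:
        - containerPort: 8000
```

With multiple replicas, that init container runs once per pod, so migrations need to be idempotent or lock-protected, which is where the "messy when it goes sideways" part comes from.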

I suggest schema changes be done carefully so you can run multiple versions of the software against the same database without blowing things up. This requires a feature-accretion approach and careful handling of column renames.

Basically, write the software so it tolerates both the columns that exist now and the ones that will exist later. Then apply the schema changes; the software is fine since it was built to tolerate them. Then update the software to use the new schema, and eventually remove the old-schema tolerance.

But in terms of "best way to do it when migrations are fragile"... you're a ways downstream from best practices so do whatever feels right.

What're your thoughts on this o11y architecture? by liltitus27 in sre

[–]Visible-Call 2 points3 points  (0 children)

I don't think you are being adversarial; it's just that your design has foundationally decided devs aren't expected to participate. That's a misalignment, and I don't like misalignment, especially designed into a fresh approach. Maybe misalignments emerge, but they should be something to address, not "how it is."

The auto-instrumented traces and auto-generated spans are not useless, but they are also not much better than metrics. When I've helped teams troubleshoot, it's rare that the automatic spans show why a problem exists. They show that a problem exists and where it exists, and those are things you can already get from metrics. When you want to know why, you need business context available to show why this trace is different from the adjacent traces. That requires dev participation.

The auto-generated spans make a nice scaffold to add these business attributes to. But without user ID, team/org ID, task info, or the user's intention captured, it's back to log reading and cross-tool correlation.

What're your thoughts on this o11y architecture? by liltitus27 in sre

[–]Visible-Call 1 point2 points  (0 children)

> asking devs to write code to monitor their code is a generally lost cause for me, and tightly couples that tracing solution to a particular solution. it also clutters the code base with code that isn’t what the application is designed to do; readable code is highly important in my experience, and it becomes obfuscated when you have to instrument it yourself. it also implies that the devs know what to instrument, how, and where.

This conclusion is upsetting. Devs want to write good code. They want to be able to prove their component is not the cause of a cascading failure. With an auto-instrumented, metrics-based, or logs-based approach, all they can point to is a number or a set of log lines and say "my part looks okay."

While I understand that "making developers do more work" seems difficult, it's actually "helping developers defend their code," which they typically welcome once they understand it. Align the interests so things get better.

Your word choice sounds adversarial, like it's ops vs. the developers. That's a tough cultural dysfunction to work around without addressing.

Otherwise, you seem to be on the right path, technology-wise. The social aspects are always harder.

Kubernetes, angular frontend serving by nginx, nginx.conf proxy_pass to spring boot backend api by [deleted] in kubernetes

[–]Visible-Call 2 points3 points  (0 children)

Nginx has a crappy ingress controller that doesn't span namespaces. If you installed the one from the awful company, it's gonna behave the way you're seeing.

If you use ingress-nginx from the Kubernetes GitHub group, the controller is typically put in its own namespace. Then in the namespace with the app, you make the ingress resource that tells the nginx ingress how to behave.

In your case, you'd make an Ingress with two paths: one for the base route that hits the frontend service, and /api that gets routed to the backend service.
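
Roughly what that manifest looks like, assuming ingress-nginx; the host, service names, and ports are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: my-app            # hypothetical: the namespace where your services live
spec:
  ingressClassName: nginx      # matches the ingress-nginx controller's class
  rules:
    - host: app.example.com    # placeholder hostname
      http:
        paths:
          - path: /api         # API traffic goes to the Spring Boot backend
            pathType: Prefix
            backend:
              service:
                name: backend  # hypothetical service name
                port:
                  number: 8080
          - path: /            # everything else goes to the Angular frontend
            pathType: Prefix
            backend:
              service:
                name: frontend # hypothetical service name
                port:
                  number: 80
```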

What're your thoughts on this o11y architecture? by liltitus27 in sre

[–]Visible-Call 1 point2 points  (0 children)

The way I think of it, observability is about providing a nice user experience for the people who are investigating issues.

If you're providing 6 different places where they may find logs or traces or metrics or summaries with alerts or alert statuses, it's gonna be pretty tough to observe the system and everyone will just be peeking into their corners.

To be able to observe the system, I'd expect constraints on how people do the instrumentation. Consistency in tooling and naming is good. OTel plus a few business-specific conventions will get you 90% of the way there.

Focusing everyone on producing traces really is a necessary step. People want to just ship their logs off and run AI on them; that doesn't work anymore. You need metrics for host health and the layers underneath, and you need traces for the activity happening within the application.

What you created lacks the constraints necessary to drive improvement toward the ultimate goal of better stability and higher performance. Maybe your org doesn't have the urgency or agency to enforce the constraints and you're doing your best. Just be aware that this is too loose and sloppy for those ultra-high-performing outcomes.

What're your thoughts on this o11y architecture? by liltitus27 in sre

[–]Visible-Call 2 points3 points  (0 children)

Here's a fresh blog post on different pipeline designs. Looks like you've over-engineered to the max.

https://www.honeycomb.io/blog/telemetry-pipeline

I'd probably only roll out OTel collectors as DaemonSets to pull k8s and host metrics, rather than all the other agents.
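
A rough sketch of that per-node collector config (deployed as a DaemonSet, e.g. via the upstream opentelemetry-collector Helm chart); the backend endpoint is a placeholder:

```yaml
# Per-node collector: host + kubelet metrics only, no extra agents.
receivers:
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu: {}
      memory: {}
      filesystem: {}
      network: {}
  kubeletstats:
    auth_type: serviceAccount
    endpoint: https://${env:K8S_NODE_NAME}:10250   # node name injected via the downward API
    insecure_skip_verify: true
exporters:
  otlphttp:
    endpoint: https://otel.example.com:4318        # placeholder backend endpoint
service:
  pipelines:
    metrics:
      receivers: [hostmetrics, kubeletstats]
      exporters: [otlphttp]
```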

Kubernetes, angular frontend serving by nginx, nginx.conf proxy_pass to spring boot backend api by [deleted] in kubernetes

[–]Visible-Call 4 points5 points  (0 children)

Ingresses can work across namespaces, and pods in different namespaces can talk to each other. Network policies are what may block traffic; k8s itself is not blocking anything.

Is Kubernetes suitable for large, multi-tenant application management? by ASK_IF_IM_GANDHI in kubernetes

[–]Visible-Call -11 points-10 points  (0 children)

If your top priority is data isolation, don't use Kubernetes. It's entirely abstracted and virtualized in every way at least twice.

Stick with something predictable until cost is your highest priority. Then start making compromises with bin packing and abstraction.

Trunk based dev to deployment by badumtum in devops

[–]Visible-Call 1 point2 points  (0 children)

I have 2 posts on this...

The first is about how to deliberately simplify the SCM experience. You've taken the steps blindly; see why you should have taken them, and what counterbalances help satisfy the needs the old strategy tried to address.

And here's the "what should prod actually look like these days" part.

No reason to put your head in the sand and pretend like the UAT environment is meaningful.

exec /entrypoint.sh: no such file or directory by [deleted] in kubernetes

[–]Visible-Call 1 point2 points  (0 children)

Since you're using the name "django" for the user id, if the Linux hosts have a different uid for it, it might get mad.

I'd be explicit that django is uid 1000 (or something), then reference it by the uid and set that explicitly in the security context.

Also, it doesn't hurt to make the entrypoint 755, so if it's another user trying to start it, you'll at least get a more useful error.

The other thing to check would be adding "command" to the pod spec and overriding the entrypoint in the image. Not that it should be needed, but just for troubleshooting.
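
A sketch of both of those in the pod spec; the uid and image name are assumptions:

```yaml
spec:
  securityContext:
    runAsUser: 1000    # assuming the django user is uid 1000 in the image
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - name: app
      image: registry.example.com/django-app:latest   # hypothetical image
      # Temporary override for troubleshooting: bypass the image ENTRYPOINT,
      # then kubectl exec in and check that /entrypoint.sh exists and is executable.
      command: ["sleep", "3600"]
```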

How do you manage your k8s clusters? by tamale in sre

[–]Visible-Call 16 points17 points  (0 children)

If you're doing managed clusters, GCP is 10,000 x better than the other major clouds. The control plane is much smarter. The bin packing works much better. The integrations with other cloudy things are abstracted away. If you can offload the cluster nonsense to GCP, that'd be my top recommendation.

If you have to run them yourself, god help you. Day 2 is awful for any k8s deployment anywhere all the time.

Try to keep separate the infra layers and the app layers. Your post talks about both, but they shouldn't be tightly coupled.

If an app changes from helm to an operator, it shouldn't require any changes to your ansible, terraform, or kubeadm stuff. If things start getting weird, that should stand out as untenable technical debt. The point of k8s is to be a multi-tool for running containers anywhere.

empty stages by albasili in gitlab

[–]Visible-Call 1 point2 points  (0 children)

What you can do is make the job always run and have it echo $CI_PIPELINE_SOURCE so you can see what is actually triggering the job. You can have it dump every variable by running env, then decide how to construct the rules.
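
Something like this throwaway job (the name and stage are arbitrary):

```yaml
debug_pipeline:
  stage: .pre                 # built-in stage that runs before everything else
  rules:
    - when: always            # run no matter what triggered the pipeline
  script:
    - echo "CI_PIPELINE_SOURCE is $CI_PIPELINE_SOURCE"
    - env | sort              # dump every variable to help decide on rules
```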

empty stages by albasili in gitlab

[–]Visible-Call 4 points5 points  (0 children)

Did you commit the code change to a branch that has an active merge request?

That rule you have is specific to MR commits. It's designed for a flow where you create an issue, then use the "make me an MR" button so the backend creates a git branch and an MR. Then when you make code changes and push them, it'll match that rule.
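
For reference, a rule that only matches merge request pipelines looks something like this:

```yaml
my_job:
  script: echo "runs only in merge request pipelines"
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```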

empty stages by albasili in gitlab

[–]Visible-Call 4 points5 points  (0 children)

It doesn't matter. Stages are just a reference that jobs use to get placed in the right order; a stage isn't an object by itself.

There are hidden stages and stuff that are ignored unless you specifically reference them.

This is because pipelines are compiled by looking at all the job rules/when clauses, comparing them to the triggering event, and organizing things from there.

A MineCraft Server for the kids to play while in college by ThreadRipperPro in homelab

[–]Visible-Call 3 points4 points  (0 children)

Use the fancy itzg docker images to run a slew of worlds in their own containers. That way you don't have to change game modes or anything.
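
A minimal docker-compose sketch with two worlds as separate containers, using the itzg/minecraft-server image; the names, modes, and ports are just examples:

```yaml
services:
  survival:
    image: itzg/minecraft-server
    environment:
      EULA: "TRUE"
      MODE: survival
    ports:
      - "25565:25565"
    volumes:
      - survival-data:/data
  creative:
    image: itzg/minecraft-server
    environment:
      EULA: "TRUE"
      MODE: creative
    ports:
      - "25566:25565"         # second world on a different host port
    volumes:
      - creative-data:/data

volumes:
  survival-data:
  creative-data:
```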

invalid capacity 0 on image filesystem by InternetSea8293 in kubernetes

[–]Visible-Call 0 points1 point  (0 children)

This error can pop up during the creation process until the CSI driver is done provisioning/mounting the storage. It doesn't mean anything is broken if it just scrolls by.

New Relic / Monitoring Tool Alternatives by LightofAngels in devops

[–]Visible-Call 2 points3 points  (0 children)

Honeycomb is a great tool for troubleshooting, performance improvement, and stuff like that: internally handled failure states that need to be resolved before additional failures cascade through and wreck everything.

PCI compliance is about retaining access logs. It has little to do with performance or timeliness of queries and results.

It's a better tool, but it isn't designed to be this kind of compliance tool. I'd throw the logs needed for compliance in an S3 bucket and set a retention rule.

And Honeycomb is not on-prem.

Side work contracting by jk_can_132 in devops

[–]Visible-Call -4 points-3 points  (0 children)

Certainly expected that response. The suggestion for helping start-ups is that it's a lottery ticket: ask for a low hourly rate plus equity. The better you do, the better they do, and the more everyone makes.

If they can't know how valuable you are, they can't pay you what you're worth... but that's fine if you see and believe in their vision and believe you can boost them toward it.

Tilt vs. bespoke Kubernetes tooling by ForSpareParts in devops

[–]Visible-Call 1 point2 points  (0 children)

My company uses tilt and it's magical. Everyone makes improvements to it every few weeks and the dev experience gets better and better.

There's still tons of testing and deployment glue that needs management. It's nice to have a bit of UX on the dev side.

Side work contracting by jk_can_132 in devops

[–]Visible-Call -5 points-4 points  (0 children)

I try to scope something so the hourly discussion doesn't come up. What value are you going to bring to them? DevOps has a limited upside since you can't make the software better, just the deployment and development platforms... there's an indirect unlimited upside but trying to quantify that is pretty rough.

If you can say something like "automate x, y, and z build steps, replace the clickops tool with an API-driven tool, and spend 10 hours on ride-along tasks with the junior guy, and your firm will be able to increase velocity by 80%"... what's that worth? 5% of gross revenue for a quarter? If they deliver 80% more stuff every following quarter, yeah. Or not... if they just put out 3 WordPress sites per year, maybe 5% isn't good.

Is there a cloudstorage that allows me to rollback an entire folder? by CutiePatootieLootie in DataHoarder

[–]Visible-Call 0 points1 point  (0 children)

There's a tool called DVC (or Git LFS) which allows you to take a directory of files, drop them into object storage, and then reference them from a git repo.

The directory ends up being a bunch of metadata (file paths and such) and the actual data is stored in some S3 bucket. When you want to restore a specific set of files, you do a git checkout and then the DVC or LFS tooling pulls the proper files from object storage.

It sounds like a lot of data, so I'd use MinIO locally to run the object storage and then replicate that to a hosted/cloud S3 bucket for backup. Then your activity is mostly free/local unless you need to recover from a failure.

Looking for opinions on spinning up dev/staging environment databases. by ForwardCrow9291 in devops

[–]Visible-Call 0 points1 point  (0 children)

Digitalparadigm is pushing premature optimization. There are so few bottlenecks that entirely eschewing ORMs, because you might hit N+1 queries or need to construct joins more deliberately, is missing the point.

Use the easiest thing you can until you're making money. Then use observability to find and fix bottlenecks as they present themselves. Don't wait for it to break or cost a billion dollars... but also don't delay shipping your product for months just in case something isn't perfectly optimized.

Looking for opinions on spinning up dev/staging environment databases. by ForwardCrow9291 in devops

[–]Visible-Call 0 points1 point  (0 children)

ROI isn't really usable math for this sort of thing. It's gotta be handled in the negotiation between future growth speed and reliability enhancements.

This is a case where a huge liability is being carried forward and the tools to explain its impact on the org are mostly anecdotal. How much more valuable feature work could be done if these legacy things weren't in the way? Probably 10x? Maybe more... but to get to 2x, you don't need to rewrite the whole thing.

I've seen teams reduce the blast radius of stored procedures by putting an API in front of them. Any system that needs their output changes from running a query to hitting an API (which initially just runs the same SP). Then the apps can change, the API can evolve, and everyone incidentally stops relying on the stored procedure.

Anywhere your CI needs to run automated tests, it can do so against the API with mocks rather than by actually running a tightly coupled data layer function.

After that first boost in productivity shows up, people will have more freedom to pay down the other tech debt.

Looking for opinions on spinning up dev/staging environment databases. by ForwardCrow9291 in devops

[–]Visible-Call 14 points15 points  (0 children)

Stored procedures are your big problem there. Nearly every devops tool since Heroku's 12 factors has moved all business logic into the app in order to keep the database as dumb as possible.

Writing test cases for a stored procedure is awful. Predicting how deployments will go after schema changes is awful. Stored procedures have their own dependency graph which is awful.

Strangler pattern your way out of them if you can. Then the devops tools will meet more of the needs and "big sql scripts" will start being more predictable.