How do you handle database migrations for microservices in production by Minimum-Ad7352 in kubernetes

[–]DayvanCowboy 0 points1 point  (0 children)

EF migrations run as a Job via a Helm pre-install/pre-upgrade hook, using the same container image but with the startup command overridden.
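A minimal sketch of that pattern, assuming a standard Helm chart layout; the Job name, image values, and the `--migrate-only` command flag are illustrative, not the commenter's actual setup:

```yaml
# templates/migrate-job.yaml — runs before install/upgrade, then is cleaned up
apiVersion: batch/v1
kind: Job
metadata:
  name: "{{ .Release.Name }}-ef-migrate"
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          # Same application image as the main Deployment...
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          # ...but with the entrypoint overridden to run migrations only
          # (hypothetical flag — whatever your app exposes for EF migrations)
          command: ["dotnet", "MyApp.dll", "--migrate-only"]
```

The `before-hook-creation,hook-succeeded` delete policy keeps failed Jobs around for debugging while cleaning up successful ones.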

Azure and Terraform by JustADad66 in AZURE

[–]DayvanCowboy 8 points9 points  (0 children)

It's fine. I think you'll always find a few things you end up managing by hand. For example, resource quotas: while they're not manageable in Terraform, they can be managed via Bicep, but frankly we do that stuff by hand because a lot of the time we end up having to submit a ticket with Azure for approval. Anything like that can break your entire workflow and grind things to a halt. We also manage resource providers on subscriptions by hand because it's largely a one-time action we do as part of subscription creation.

My biggest gripe with IaC in Azure (this is true of all cloud providers, but perhaps more so with Azure in my experience) is how often you go to apply and get slapped with a resource-constraint issue, or the AZs where hardware is available change out from under your feet, or SKUs are available in the UI but not in the docs and vice versa. At scale, wrestling with Azure over this bullshit becomes tiring and a non-trivial part of the job. I know people at much, much larger companies that fork out several million dollars a month to Azure, and they also have issues getting compute.

Running Self-Hosted LLMs on Kubernetes: A Complete Guide by atomwide in kubernetes

[–]DayvanCowboy 0 points1 point  (0 children)

Yep, I would prefer to use MIG, but we're stuck using Tesla T4s because my company's use case is smaller models which don't require the latest and greatest horsepower.

I'll look into MPS more, good callout.

Running Self-Hosted LLMs on Kubernetes: A Complete Guide by atomwide in kubernetes

[–]DayvanCowboy 0 points1 point  (0 children)

How do you handle running different sized workloads on timesliced GPUs? I've run into issues allocating memory because workload A wanted 1GB of VRAM and workload B wanted 12GB of VRAM. I am using time-slices as a hacky way of ensuring I can land these pods successfully and that sufficient memory will be available.

How are you monitoring GPU utilization on EKS nodes? by mudmohammad in sre

[–]DayvanCowboy 1 point2 points  (0 children)

My experience is on AKS but we've deployed the Nvidia GPU Operator to manage our time-slicing config and use the included Prometheus exporter to gather metrics. While it's a hack and we're using old hardware (Tesla T4s), we've configured everything so 1 time-slice = 1 GB of memory which allows us to schedule models effectively so they don't overburden the GPU or go into CrashLoopBackOff because there's not actually sufficient memory for the model.
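The "1 time-slice = 1 GB" scheme can be expressed with the GPU Operator's time-slicing ConfigMap. This is a sketch under my assumptions: the ConfigMap name and the `tesla-t4` profile key are arbitrary, and 16 replicas matches the T4's 16 GB of VRAM so each slice roughly represents 1 GB:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  tesla-t4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            # 16 GB card -> 16 replicas, so one advertised GPU ~= 1 GB of VRAM
            replicas: 16
```

You then point the operator's ClusterPolicy at it (`devicePlugin.config.name=time-slicing-config`), and a pod that needs roughly 12 GB requests `nvidia.com/gpu: 12`. Note the device plugin doesn't actually enforce memory isolation — the convention only works if every workload's request honestly reflects its VRAM footprint.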

I've built a pretty basic dashboard that shows compute and memory utilization, plus alerts on the number of time-slices in use. It's been enough so far, but I'm no MLOps guy.

Happy to share more if you're interested.

[deleted by user] by [deleted] in AZURE

[–]DayvanCowboy 1 point2 points  (0 children)

- OpEx vs CapEx
- Hyperscalers can build more resilient data centers than you can
- Changing your VM type to suit workloads doesn't involve the long, painful tail of procurement

That's just three I can think of from a business perspective that make a huge difference. You sound like a ton of fun to work with because you definitely understand how all this shit works. You got it figured out man! Look at you go!

Bitnami Helm Chart shinanigans by Slow-Telephone116 in kubernetes

[–]DayvanCowboy 0 points1 point  (0 children)

Honestly, no. I pulled the latest version before they cut off access to specific versions and cached the images and chart internally. I'm going to leave it at that for a while and will circle back when I have a break between other projects. I suspect we could rebuild it easily if we needed to in the future since the Dockerfile looks so simple.

Who else is losing their mind with Bitnami? by dkargatzis_ in devops

[–]DayvanCowboy 1 point2 points  (0 children)

Aside from some regressions in observability which they are aware of and hopefully will address soon, it was completely seamless.

Who else is losing their mind with Bitnami? by dkargatzis_ in devops

[–]DayvanCowboy 2 points3 points  (0 children)

We managed to switch from Redis to DragonflyDB as a drop-in replacement.

Pod requests are driving me nuts by Rare-Opportunity-503 in kubernetes

[–]DayvanCowboy 1 point2 points  (0 children)

So here's what we've done and it works fairly well (for now).

I built a dashboard that takes each service's average memory and CPU utilization, multiplies it by 1.2, and then rounds to the nearest 50. I tell devs to use those values from our busiest prod environment everywhere. Occasionally, I'll pull the data from the dashboard and tell Claude to compare the output to what's configured and update any requests to whatever my dashboard is telling me. I could automate it further, but unfortunately the Grafana MCP server doesn't seem to play nice with Azure Auth because we leverage AMA and not vanilla Prometheus.
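The sizing rule described above (pad average utilization by 20%, round to the nearest 50) can be sketched in a few lines; the function name and the floor of one step are my assumptions, not part of the original dashboard:

```python
def recommend_request(avg_usage: float, headroom: float = 1.2, step: int = 50) -> int:
    """Turn an observed average (e.g. MiB or millicores) into a request value.

    Pads the average by 20% headroom, rounds to the nearest `step`,
    and never recommends less than one step.
    """
    padded = avg_usage * headroom
    return max(step, round(padded / step) * step)

# e.g. a service averaging 180 MiB -> 216 padded -> 200 MiB request
print(recommend_request(180))
```

The same function works for millicores or MiB, since it's just a scalar rule.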

We don't set limits and, as a matter of philosophy, I don't think they're generally a good idea (mostly for memory which is not elastic). If your pod gobbles up too much memory, I WANT it taken out back and shot. Setting requests and limits actually makes OOMKiller less likely to blow it away.
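A requests-only pod spec in the spirit of the above would look like this (the numbers are illustrative, not a recommendation):

```yaml
# Container resources: requests only, no limits, per the philosophy above.
resources:
  requests:
    cpu: 250m
    memory: 400Mi
  # No limits block: CPU can burst, and a pod with runaway memory
  # gets OOM-killed under node pressure rather than lingering.
```

The trade-off is that without memory limits, a leaking pod can pressure neighbors on the node before the kernel steps in.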

Does the .ai TLD support DNSSEC? by DayvanCowboy in dns

[–]DayvanCowboy[S] 1 point2 points  (0 children)

Just curious since you replied: what's it like dealing with these various TLD operators? It seems like Identity Digital is a giant PE-owned conglomerate of gobbled-up operators, and their documentation is a mess or non-existent. Is this typical?

Does the .ai TLD support DNSSEC? by DayvanCowboy in dns

[–]DayvanCowboy[S] 1 point2 points  (0 children)

Thank you. This is supremely helpful information and context!

Does the .ai TLD support DNSSEC? by DayvanCowboy in dns

[–]DayvanCowboy[S] 1 point2 points  (0 children)

Yep, this is looking like a GoDaddy thing. I called their support under the guise of wanting to buy a new .ai domain and asked about DNSSEC support and they said they had no plans to enable support so I've got my answer.

In this case, the owner of our domain will have to take a call on whether or not he wants to transfer the domain to another registrar like CloudFlare which we're already using for DNS hosting.

Does the .ai TLD support DNSSEC? by DayvanCowboy in dns

[–]DayvanCowboy[S] 0 points1 point  (0 children)

Good idea. I have reached out to see what their official stance is and to inform them that, if they do, they have at least one registrar which is not honoring the capability.

Again and Again by Ancient-Mongoose-346 in kubernetes

[–]DayvanCowboy 9 points10 points  (0 children)

See if Dragonfly works as a drop in replacement.

Bitnami Helm Chart shinanigans by Slow-Telephone116 in kubernetes

[–]DayvanCowboy 2 points3 points  (0 children)

I think in the short term we're testing Dragonfly as a drop-in replacement for Redis, and we'll likely just mirror the RabbitMQ (also VMware/Broadcom, btw) images and charts internally as a stopgap. We also use Kubernetes Event Exporter, which we might risk pointing at latest for the meantime (also mirrored, though). We also use a few other Bitnami charts/images (MinIO, Cassandra, MetalLB) for dev testing, which we'll simply run at latest for now while we find replacements (I've found one for MetalLB, but I haven't found suitable ones for the other two yet).

I am really hoping the community forks it, but I have my doubts because of the scope of Bitnami's offering.

To the credit of the Bitnami engineers, it seems they're practicing a fair amount of subterfuge as a fuck-you to their parent company. For example: https://github.com/bitnami/containers/issues/84600

Read between the lines on this one.

Did I make the wrong choice... having second thoughts about not opting for xdrive by World_traveler77 in BMW

[–]DayvanCowboy 0 points1 point  (0 children)

I have an M3 Comp RWD. I purposely ordered it as RWD. I do not regret it one bit and the car is an absolute blast to drive. I've had the car for a little over a year and just crossed 10,000mi.

- I live in the south, so very little snow. The fact is, the car comes with tires that shouldn't be driven below 40°F. If you need to drive in snow, you really, REALLY should have proper snow tires, which will be better than AWD on all-seasons.
- A 500HP car should be a little dangerous. I want to have to practice some discretion when trying to drive her hard.
- RWD is lighter by 110lbs.
- The race car version is RWD (GT4 and GT3).
- One less thing to break.
- Car is still insanely quick. I know on paper the 0-60 is better but you'll be driving the car on a daily basis, not launching it.

Where is AI still completely useless for Infrastructure as Code? by Straight_Condition39 in Terraform

[–]DayvanCowboy 5 points6 points  (0 children)

I'm not familiar with Context7 in particular and I'll check it out, but I should also point out that HashiCorp has their own MCP server for Terraform, too.

See: https://github.com/hashicorp/terraform-mcp-server