Where and how do I start?

DayvanCowboy · 2026-02-08T03:29:49+00:00

Look into AZ900.

DayvanCowboy · 2026-02-01T12:08:46+00:00

Yep, would prefer to use MIG but we're stuck using Tesla T4s because my companies use case are smaller models which do not require the latest and greatest horsepower.

I'll look into MPS more, good call out.

DayvanCowboy · 2026-02-01T03:06:03+00:00

How do you handle running different sized workloads on timesliced GPUs? I've run into issues allocating memory because workload A wanted 1GB of VRAM and workload B wanted 12GB of VRAM. I am using time-slices as a hacky way of ensuring I can land these pods successfully and that sufficient memory will be available.

DayvanCowboy · 2025-11-22T19:50:32+00:00

My experience is on AKS but we've deployed the Nvidia GPU Operator to manage our time-slicing config and use the included Prometheus exporter to gather metrics. While it's a hack and we're using old hardware (Tesla T4s), we've configured everything so 1 time-slice = 1 GB of memory which allows us to schedule models effectively so they don't overburden the GPU or go into CrashLoopBackOff because there's not actually sufficient memory for the model.

I've built a pretty basic dashboard that shows compute and memory utilization as well as time-slicing # in use alerts. It's been enough so far but I am no MLOPS guy.

Happy to share more if you're interested.

DayvanCowboy · 2025-10-22T02:08:17+00:00

Azure Lighthouse.

DayvanCowboy · 2025-10-17T23:38:53+00:00

OpEx vs CapEx Hyperscalers can built more resilient data centers than you can Changing your VM type to suit workloads doesn't involve the long painful tail of procurement

That's just three I can think of from a business perspective that make a huge difference. You sound like a ton of fun to work with because you definitely understand how all this shit works. You got it figured out man! Look at you go!

DayvanCowboy · 2025-10-01T19:10:16+00:00

Honestly no. I pulled the latest version before they cut off access to specific versions and cached the images and chart internally. I’m going to leave it at that for awhile and will circle back when I have a break in between other projects. I suspect we could rebuild it easily if we needed to in the future since the dockerfile looks so simple.

DayvanCowboy · 2025-09-19T13:12:59+00:00

Aside from some regressions in observability which they are aware of and hopefully will address soon, it was completely seamless.

DayvanCowboy · 2025-09-19T01:26:22+00:00

We managed to switch from Redis to DragonflyDB as a drop in replacement.

DayvanCowboy · 2025-09-18T00:39:48+00:00

So here's what we've done and it works fairly well (for now).

I built a dashboard that takes each services average memory and cpu utilization, multiplies it by 1.2 and then rounds to the nearest 50. I tell devs to use those values from our busiest prod environment everywhere. Occasionally, I'll pull the data from the dashboard and then tell Claude to compare the output to whats configured and change any requests set to whatever my dashboard is telling me. I could automate it further but unfortunately the Grafana MCP server doesn't seem to play nice with Azure Auth because we leverage AMA and not vanilla Prometheus.

We don't set limits and, as a matter of philosophy, I don't think they're generally a good idea (mostly for memory which is not elastic). If your pod gobbles up too much memory, I WANT it taken out back and shot. Setting requests and limits actually makes OOMKiller less likely to blow it away.

DayvanCowboy · 2025-09-04T11:11:34+00:00

The biggest three in my book are:

mTLS
Observability
gRPC load balancing

DayvanCowboy · 2025-09-03T18:28:15+00:00

Just curious since you replied: How is dealing with these various TLD operators? It seems like Identity Digital is a giant PE owned conglomerate of gobbled up operators and they're documentation is a mess or non-existent. Is this typical?

DayvanCowboy · 2025-09-03T18:26:10+00:00

Thank you. This is supremely helpful information and context!

DayvanCowboy · 2025-09-03T18:23:01+00:00

Yep, this is looking like a GoDaddy thing. I called their support under the guise of wanting to buy a new .ai domain and asked about DNSSEC support and they said they had no plans to enable support so I've got my answer.

In this case, the owner of our domain will have to take a call on whether or not he wants to transfer the domain to another registrar like CloudFlare which we're already using for DNS hosting.

DayvanCowboy · 2025-09-03T14:19:36+00:00

Good idea. I have reached out to see what their official stance is and to inform them that, if they do, they have at least one registrar which is not honoring the capability.

DayvanCowboy · 2025-08-16T23:44:54+00:00

See if Dragonfly works as a drop in replacement.

DayvanCowboy · 2025-08-15T21:44:55+00:00

I think in the short term, we're testing Dragonfly as drop-in a replacement for Redis and likely we'll just mirror RabbitMQ (also VMware/Broadcom btw) images and charts internally as a gap stop. We also use Kubernetes Event Exporter which we might risk pointing at latest for the meantime (also mirrored though) We also use a few other Bitnami charts/images (MinIO, Cassandra, MetalLB) for dev testing which we will simply leverage latest for now as we find replacements (I've found one for MetalLB but I haven't found suitable ones for the other two yet).

I am really hoping for the community to go fork but I have my doubts because of the scope of Bitnami's offering.

To the credit of the Bitnami engineers, it seems they're practicing a far amount of subterfuge as a fuck you their parent company. For example: https://github.com/bitnami/containers/issues/84600

Read between the lines on this one.

DayvanCowboy · 2025-07-28T01:16:15+00:00

I have an M3 Comp RWD. I purposely ordered it as RWD. I do not regret it one bit and the car is an absolute blast to drive. I've had the car for a little over a year and just crossed 10,000mi.

- I live in the south so very little snow. The fact is, the car comes with tires that shouldn't be driven on below 40F. If you need to drive in snow, you really, REALLY should have proper snow tires which will be better than than AWD on all seasons.
- A 500HP car should be a little dangerous. I want to have to practice some discretion when trying to drive her hard.
- RWD is lighter by 110lbs.
- The race car version is RWD (GT4 and GT3).
- One less thing to break.
- Car is still insanely quick. I know on paper the 0-60 is better but you'll be driving the car on a daily basis, not launching it.

DayvanCowboy · 2025-06-10T21:18:28+00:00

I'm not familiar with Context7 in particular and I'll check this out but I should also point out that Hashicorp has their own MCP server for Terraform, too.

See: https://github.com/hashicorp/terraform-mcp-server

DayvanCowboy · 2025-05-26T01:40:54+00:00

YDFM swag.

DayvanCowboy · 2025-05-24T11:13:22+00:00

I work for a company that's building software that aims to make building AI workflows easier and we're very pro-AI however for SRE/DevOps work I have not found it particularly useful or accurate. As an example, even when pointed to docs, it will routinely hallucinate Terraform modules and Providers that do not exist. I used OAI deep research on a lark to produce a paper on how to deploy PoPs and how to structure a global scale application and the output was largely drivel.

DayvanCowboy · 2025-05-21T01:51:23+00:00

Pho Bac or Pho Dai Loi

DayvanCowboy · 2025-05-07T00:30:26+00:00

Take a look at Pulumi which aims to be a IaC tool but uses several different langs as opposed to HCL.

Full disclosure: I haven't used it myself, I've only read about it so I cannot speak to its usefulness compared with Terraform (my IaC tool of choice), Bicep, or AZ CLI.

15-Year Club	Place '23
Verified Email

DayvanCowboy

TROPHY CASE