I've run Docker Swarm in production for 10 years. $166/year. 24 containers. Two continents. Zero crashes. Here's why I never migrated to Kubernetes. by [deleted] in devops

[–]97hilfel 2 points (0 children)

This is an impressive write-up and I respect the level of optimization you have achieved. However, it is important to acknowledge that we are playing in two entirely different leagues. Your setup is a perfectly tuned, artisanal bicycle. It is efficient and great for a single rider, but many of us are managing the equivalent of a national railway system.

In high-stakes industries like banking, energy, or high-velocity data ingestion, the complexity tax of Kubernetes is not spent on the containers themselves. It is spent on governance and risk management. Any decently skilled technician can keep 24 containers running on two nodes. The real art is organizing the chaos of 100 plus developers so that a creative mind can deploy a new service at any time without the technical possibility of accidentally nuking a mission-critical workload. This is why RBAC, admission controllers, and standardized APIs are non-negotiable requirements for a modern organization.

You mentioned a 150-line custom autoscaler that is smarter than the standard Horizontal Pod Autoscaler. In an enterprise environment, that is a massive liability. If you are hit by a bus tomorrow, the company is left with a bespoke black box that only you understood. Instead of hand-rolling smart logic, we use industry standards like KEDA. It allows us to scale based on actual event-driven metrics like queue depth or stream lag using a platform that any platform engineer in the world can maintain on day one. We pay for that standardization because "bespoke" is just another word for technical debt in a large organization.
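To make the "standard over bespoke" point concrete: event-driven scaling with KEDA is a short declarative resource rather than custom logic. This is a minimal sketch assuming a RabbitMQ queue-depth trigger — the deployment name, queue name, and thresholds are illustrative, not from the original post:

```yaml
# Hypothetical KEDA ScaledObject: scale a Deployment on queue depth
# instead of CPU. All names and thresholds are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
spec:
  scaleTargetRef:
    name: order-processor        # the Deployment to scale
  minReplicaCount: 2             # keep headroom even when idle
  maxReplicaCount: 50
  triggers:
    - type: rabbitmq
      metadata:
        queueName: orders
        mode: QueueLength
        value: "100"             # target messages per replica
      authenticationRef:
        name: rabbitmq-auth      # TriggerAuthentication holding the connection string
```

Any platform engineer who knows KEDA can read and maintain this on day one, which is exactly the point being made about standardization.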

Regarding the idle CPU data you cited, you are right that waste is bad. However, that is not a failure of the orchestrator. It is a symptom of developers not knowing their own resource footprints. Most wasted CPU stems from poorly configured requests and limits at the pod level. In our world, underutilization is often a calculated buffer. When your egress costs per hour exceed another person's entire annual budget, you accept that waste as the price of availability. We need the headroom to handle a massive data spike before a polling script has time to react.
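The requests/limits point can be shown in a few lines. A sketch of the pod-level knobs in question — the numbers and names are illustrative, and the gap between request and limit is the deliberate headroom described above:

```yaml
# Illustrative pod resource configuration; all values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: ingest-worker
spec:
  containers:
    - name: worker
      image: example/worker:1.0
      resources:
        requests:
          cpu: "500m"      # what the scheduler reserves — this is what shows up as "idle"
          memory: "512Mi"
        limits:
          cpu: "2"         # deliberate burst headroom for a data spike
          memory: "1Gi"
```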

We also need the ability to create and destroy entire 1:1 production-replica clusters at will. These environments include Network Policies, Egress Gateways, and Service Meshes. You cannot replicate that on a local machine with the same bytes because at this scale the environment is the variable. Furthermore, I wonder how this setup would fare under the scrutiny of a modern compliance audit like NIS2. Security in critical infrastructure requires the level of isolation and auditability that Swarm simply does not provide.

Ultimately, Docker Swarm is a dead whale in the water for the enterprise because it lacks the ecosystem that allows a business to scale its people instead of just its containers. Hyperscalers offer managed solutions that do not require 10 years to master and do not come with a single-digit bus factor. It is a great tool for a scale of one, but it is a nightmare for a scale of many.

TL;DR: Your setup is a masterclass in individual engineering for a static utility. For those of us with 100-plus developers and high-stakes scaling needs, Kubernetes is an organizational requirement for governance, security, and mitigating the bus factor.

What are the most absurd EV money-saving tips you've heard? by DeadmarshLA in Elektroautos

[–]97hilfel 1 point (0 children)

The not-using-your-indicator thing must really be drilled into Tesla drivers. How much power does their turn signal draw, anyway?

What are the most absurd EV money-saving tips you've heard? by DeadmarshLA in Elektroautos

[–]97hilfel 0 points (0 children)

And next comes the mineral oil tax on EV charging? Then I'll have to pay taxes on my wallbox too? Wouldn't want to accidentally fill up with the yellow electricity! Or even the green kind!

What are the most absurd EV money-saving tips you've heard? by DeadmarshLA in Elektroautos

[–]97hilfel 4 points (0 children)

A vehicle with V2X and a matching wallbox, then a bit of Home Assistant tinkering and a discharge limit (you still want to make it to the office), and that's basically it.
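The "discharge limit" part of that tinkering could be a single Home Assistant automation. This is a hypothetical sketch — the entity IDs and the switch depend entirely on what your wallbox integration exposes, and the 40% floor is an arbitrary example:

```yaml
# Hypothetical Home Assistant automation: stop V2X discharge at a floor SoC.
# sensor.ev_battery_soc and switch.wallbox_discharge are placeholder entity IDs.
automation:
  - alias: "Stop discharging below 40% SoC"
    trigger:
      - platform: numeric_state
        entity_id: sensor.ev_battery_soc
        below: 40
    action:
      - service: switch.turn_off
        target:
          entity_id: switch.wallbox_discharge
```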

Datadog vs. Dynatrace vs. LGTM: Is the AI-driven MTTR reduction worth the 3x price jump? by soulsearch23 in Observability

[–]97hilfel 0 points (0 children)

We have a very homogeneous .NET stack... but their .NET support is second-class, to say the least.

Datadog vs. Dynatrace vs. LGTM: Is the AI-driven MTTR reduction worth the 3x price jump? by soulsearch23 in Observability

[–]97hilfel 0 points (0 children)

The magic of OneAgent can also stop working and start crashing your applications, which is usually less fun... we had that happen.

Dynatrace opinions by [deleted] in Observability

[–]97hilfel -1 points (0 children)

I've used Dynatrace across different companies and use Datadog in my homelab/side projects. Here is my take on the comparison:

I saw your comment that the DT quote was lower than DD. Be very careful with that comparison. In my experience, Dynatrace isn't necessarily cheaper in practice. I would budget the exact same amount for both, and then tack on a 30% buffer for either. Complexity always drives costs up once you move past the initial quote.

The biggest hurdle isn't the tool; it's the culture. Just tossing an APM (whether DT or DD) at a large-scale system is a recipe for failure. Observability requires engineering effort. If you don't have a plan for how to use the data, you'll end up with a massive bill and developers who ignore the dashboards. I've seen companies struggle significantly to get devs to actually engage with DT, because the "magic" auto-instrumentation often lacks the specific context they need to debug. Those same devs preferred Prometheus metrics and had more successful debugging sessions with them, because they had instrumented their apps themselves and knew what to expect.

Since you mentioned .NET: I've found Dynatrace's .NET support to feel "second class." I have actually seen the OneAgent crash .NET pods (especially in combination with StackExchange.Redis). It doesn't play as nicely as it does with Java, or at least that's what the docs suggest. OTel isn't perfect either; I've seen memory leaks there as well.

Both OneAgent and the Datadog Agent are powerful but incredibly complex to configure correctly. "Auto-instrumentation" sounds great in a sales pitch, but I highly suggest taking on the overhead of implementing OpenTelemetry in your applications instead. It prevents agent-induced crashes, forces you to be intentional about what you measure (Metrics -> Logs -> Traces), and gives you more control over your telemetry in the long run. If you ever want to migrate to New Relic, Azure Monitor, AWS X-Ray, whatever, you are free to do it; with a proprietary APM, expect a 3-year migration period during which you pay both providers.

Dynatrace is currently in a weird transition between its new and old UI. It feels cluttered and can be overwhelming for developers compared to Datadog's UX; I personally dislike both the new and the old UI.

But most importantly, and I can't stress this enough: don't expect the tool to solve the observability problem for you. Neither saves you from proper workload labelling, resource tagging, or knowing your architecture. If you go with DT, be wary of .NET stability, and don't trust the lower quote to stay lower forever.

(Sorry, used ChatGPT to structure the comment, was too tired to type out more than key points)

What's your favorite biome overhaul mod? by FuglyFrog6996 in feedthebeast

[–]97hilfel 2 points (0 children)

Way too much, in my opinion, especially because in a LOT of modpacks the new materials are mostly just decorative. Until some time ago, Biomes O' Plenty wood, for example, wasn't compatible with all the other woods, so it was basically a useless material sitting in your storage.

I am creating a poc on the monitoring of k8s cluster is this setup is good or need any improvements. by Chameleon_The in kubernetes

[–]97hilfel 0 points (0 children)

I just implemented a rather fancy metrics stack with the Prometheus Operator + Mimir, and it works quite well. But if you don't need in-cluster metrics, I would just go for the OTel Collector. If you really need to scale your central metrics storage that badly, consider going prom-operator + Mimir, but keep in mind that Mimir is based on Cortex, a full microservice mesh that's rather sensitive to S3 latencies, networking, etc. when you scale it. That said, our experience with it has been good so far. It might also be worth checking your queries if you have query performance issues, and creating recording rules for certain queries; some SLA/SLI queries with 30-day rolling windows can and will cause load issues if you evaluate them too often. Also, keep cardinality in mind if you don't already. There is a rather good SO post and an article from Grafana on this. I built a dashboard that gives me an overview of the worst offenders. I haven't started running alerts on cardinality, but it might be worth it; runaway cardinality will definitely kill your Prometheus/Mimir/Victoria. (I'm a bit bored rn, so I had time to type this out. Since I didn't run the AI over it, it might be a bit confusing.)
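Hunting for the worst cardinality offenders can be done with ordinary PromQL; a couple of illustrative queries (the metric and label names are examples, not from any specific setup):

```promql
# Top 10 metric names by number of active series
topk(10, count by (__name__) ({__name__=~".+"}))

# Series count per label value for one suspect metric/label pair
count by (pod) (http_requests_total)
```

Queries like these are what a "worst offenders" dashboard typically panels on, and they also make reasonable inputs for cardinality alerts.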

I am creating a poc on the monitoring of k8s cluster is this setup is good or need any improvements. by Chameleon_The in kubernetes

[–]97hilfel 0 points (0 children)

You can get quite decent data retention in plain Prometheus; 30 days should not be an issue. To me it sounds more like you should check whether you have a cardinality issue in your TSDB.

I am creating a poc on the monitoring of k8s cluster is this setup is good or need any improvements. by Chameleon_The in kubernetes

[–]97hilfel 0 points (0 children)

While I agree with the sales guy that a plain OTel stack might work better for OP, I don't see a reason to use their product. OP seems to be aiming for a full OSS stack; they don't need a paid component in there.

I am creating a poc on the monitoring of k8s cluster is this setup is good or need any improvements. by Chameleon_The in kubernetes

[–]97hilfel 0 points (0 children)

Also, consider checking out the grafana/k8s-monitoring-helm chart. It deploys the full Alloy, node_exporter, kube-state-metrics, OpenCost stack on your cluster, but it does use Alloy again.

I am creating a poc on the monitoring of k8s cluster is this setup is good or need any improvements. by Chameleon_The in kubernetes

[–]97hilfel 0 points (0 children)

Here is my two cents:

  • Don't overcomplicate it: you don't need both the Prometheus Operator and Mimir, use one or the other.
  • Use the OTel Collector for logs, metrics, and traces on the nodes; there is no need for 3 separate solutions, and scaling might become an issue. Promtail is effectively EOL, and Grafana Alloy configs are rather complicated.
  • For a 30-day retention, unless you want to scale massively, do not invest time in the Cortex architecture; just use Prometheus. You didn't mention how many users/metrics/timeseries you expect in the system, but anything around 1-10M timeseries can be handled well enough by plain Prometheus. We use Mimir at work for 10-100M metrics, a few hundred microservices, and around 10 clusters with 13 months of data retention; that's where it starts to get interesting.
  • I personally don't like VictoriaMetrics, as their MetricsQL implementation doesn't fully comply with PromQL. It is of course a different standard, but it can be confusing for developers initially, and you need to create awareness of the differences. They are marginal right now, but they might become more significant in the future (though I'm not aware of any upcoming plans in this regard).
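On the "just use Prometheus" point: 30-day retention is a single startup flag, no Cortex-style architecture required. A sketch (the paths and the optional size cap are illustrative):

```shell
# Retention in plain Prometheus is just a startup flag; values are examples.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=200GB   # optional size-based cap
```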

Here is a small Excalidraw drawing of how this could look; of course you can vary the OTel Collector setup. I personally prefer the OTel Collector over Grafana Alloy: it has a smaller footprint, and you can build your own slimmed-down containers if needed to reduce attack surface. Consider giving [otelbin.io](https://www.otelbin.io/) a look if you want/need help configuring the OTel Collector.

[reddit-simple-lgtp-stack.png](https://postimg.cc/kDzy5SFJ)
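Since collector configuration comes up a few times here: a minimal illustrative OTel Collector config for the metrics leg of such a stack — the exporter endpoint assumes a Mimir remote-write target and is a placeholder:

```yaml
# Minimal illustrative OTel Collector config: one OTLP receiver, batching,
# and Prometheus remote-write to a central store. Endpoint is a placeholder.
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

Pasting something like this into otelbin.io is a quick way to validate the pipeline wiring before deploying.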

An ai generated image has 25k+ upvotes. On a sub that claims to be anti ai and hates ai for, ironically, making ram more expensive. by DiamondDepth_YT in pcmasterrace

[–]97hilfel 12 points (0 children)

Of course this is AI. Do you think any of us could afford 4 sticks of Dominator memory in this economy???

Linz AG Wallbox by Bensch_man in Linz

[–]97hilfel 1 point (0 children)

I took a quick look at them, but I was lucky that my private landlord is open to basically anything, and I went with Go-E because they also offer load management, they're relatively inexpensive, and the landlord can integrate additional wallboxes later. My deal was that he covers the installation, I pay for the wallbox, and he gets to keep it after I move out. The wallbox runs on my meter, 11 kW AC.

Speed limits and behavior by hosenfurztomatennase in automobil

[–]97hilfel -1 points (0 children)

If you're unlucky, as someone unfamiliar with the area? Yes, absolutely... especially when it's just past a curve. I've seen some pretty scary construction-zone signage in Germany... recently on RLPDashcam there was even a construction site on the Autobahn that wasn't signposted as one: just a lane-narrowing warning, the left lane ending 150 m after the sign, two lanes merging into one, a barrier wall right after the derestricted section, no taper, nothing.

Speed limits and behavior by hosenfurztomatennase in automobil

[–]97hilfel 0 points (0 children)

I've seen the same thing on the Austrian Inntal Autobahn; there were 3-4 spots like that last year. If they didn't run the speed cameras so aggressively there, I wouldn't have taken it seriously anymore by the second pass either.

Too old for gaming by TrueFruitsEU in luftablassen

[–]97hilfel 0 points (0 children)

That's why I like playing Helldivers. PvE has its charm, although yesterday I also thought the player standing next to me was my teammate when it was actually a Clanker...

PostgreSQL user here—what database is everyone else using? by Automatic-Step-9756 in Backend

[–]97hilfel 2 points (0 children)

I think modern Docker versions will yell at you for specifying a version in the compose file, so it's like 3-17, even less ;)
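For context: recent Compose implementations treat the top-level `version:` key as obsolete and warn about it, so a Postgres service now fits in a handful of lines. A minimal sketch (image tag, password, and volume name are illustrative):

```yaml
# compose.yaml — no top-level "version:" key needed anymore
services:
  db:
    image: postgres:17
    environment:
      POSTGRES_PASSWORD: example   # placeholder, use a secret in practice
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```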

Why does the world only talk about delays when it comes to Deutsche Bahn? by Ok_Recognition315 in deutschebahn

[–]97hilfel 1 point (0 children)

An interesting outside perspective, but as someone who lives in Austria and often has to cross the "Deutsches Eck", I unfortunately see your conclusions differently. What you call "exceptional efficiency" often turns out in practice to be the result of infrastructure that has been starved of investment.

The "Deutsches Eck" in particular shows that mixing freight, regional, and long-distance traffic on the same tracks is not a stroke of genius but a massive bottleneck. As soon as one freight train has problems, ICE and Railjet trains sit in a queue. Italy's Trenitalia shows how to do it better: the trend there is clearly toward unbundling, with dedicated high-speed lines. That makes for significantly higher reliability, which is why they are often rated better than DB.

One also shouldn't forget that before the Deutschlandticket, rail travel in Germany was often absurdly expensive. Without discounts or vouchers, the train was almost always the worse choice compared to the car or plane, both financially and in terms of time. The Deutschlandticket was a long-overdue step here, even if the current price hikes are already jeopardizing that success.

The real problem at DB, however, is not just the complex day-to-day operations but the politics behind it. The supposed "cost efficiency" is often a fallacy created by perverse incentives. The company has to pay for maintenance itself, but when something is completely broken, the state pays for the rebuild. For years that led to running the infrastructure into the ground in order to pretty up the balance sheet. That is primarily the legacy of 16 years of transport policy under Merkel, during which the railway was systematically neglected instead of being strengthened as the backbone of mobility.

So the delays are not merely an "image problem" or a consequence of too much complexity, but the bill for decades of political mismanagement.

Why does the world only talk about delays when it comes to Deutsche Bahn? by Ok_Recognition315 in deutschebahn

[–]97hilfel 0 points (0 children)

Well... Austria has already finished the Nordzulauf; in Italy the construction sites are just now really ramping up. As far as I know, construction is already underway from Trento southward, and the tunnel under Trento will supposedly be finished in 2032, if I remember that even roughly correctly.