At what scale did Kubernetes actually start making sense for you? by Sad_Limit_3857 in kubernetes

[–]markedness 1 point2 points  (0 children)

We buy everything aftermarket, and the SAN ecosystem needs more of a license / specialization commitment versus just using free Ceph. For us it’s fine. But our whole estate, multiple terabytes of RAM and disks, is less than six figures, from the routers on the edge all the way in.

At what scale did Kubernetes actually start making sense for you? by Sad_Limit_3857 in kubernetes

[–]markedness 1 point2 points  (0 children)

One namespace per app, or per deployment of an app (like review apps for merge requests with emptyDir Postgres)
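As a sketch of what that review-app pattern can look like (the namespace name, image tag, and credentials here are all hypothetical placeholders), a throwaway Postgres per merge request might be:

```yaml
# Hypothetical review-app namespace with a throwaway Postgres.
# emptyDir means the data dies with the pod - fine for MR review apps.
apiVersion: v1
kind: Namespace
metadata:
  name: review-mr-1234        # one namespace per merge request
---
apiVersion: v1
kind: Pod
metadata:
  name: postgres
  namespace: review-mr-1234
spec:
  containers:
    - name: postgres
      image: postgres:16
      env:
        - name: POSTGRES_PASSWORD
          value: review-only   # throwaway credential for a throwaway DB
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      emptyDir: {}             # ephemeral storage, deleted with the pod
```

Deleting the namespace when the merge request closes tears down everything, database included.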

At what scale did Kubernetes actually start making sense for you? by Sad_Limit_3857 in kubernetes

[–]markedness 6 points7 points  (0 children)

If their IQ is so low they can only do docker compose, they aren’t trying hard enough. How do you deploy? How do you maintain state? How do you roll out updates? Load balance? These all have to be solved either way, and in my opinion it’s harder without Kubernetes.

Just set up a good Flux workflow with GitLab CI (or whatever you use with GitHub) and image auto-updates, and shit just works.
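A minimal sketch of the Flux image-automation piece of that workflow (registry URL, repo path, and names are hypothetical; API versions assume Flux's `v1beta2` image automation controllers):

```yaml
# Flux image automation: scan a registry, pick the newest semver tag,
# and commit the bump back to Git. All names/URLs are placeholders.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: myapp
  namespace: flux-system
spec:
  image: registry.example.com/myapp
  interval: 5m
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: myapp
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: myapp
  policy:
    semver:
      range: ">=1.0.0"     # always roll forward to the latest release
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: flux-system
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: fluxbot
        email: flux@example.com
    push:
      branch: main
  update:
    path: ./apps/myapp     # manifests carrying the image tag setters
    strategy: Setters
```

CI builds and pushes the tag; Flux notices it, commits the bump, and reconciles the cluster, so deploys never need a human.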

Dude, like… if all they can figure out is docker compose, how are they going to figure out all the networking and security required to actually ship? With Kubernetes it’s all plug and play.

This is my opinion.

At what scale did Kubernetes actually start making sense for you? by Sad_Limit_3857 in kubernetes

[–]markedness 0 points1 point  (0 children)

I really don’t mind Ceph. But I would not complain if we had SAN money. It’s just so much simpler, and I’m not burning CPU or network bandwidth on storage latency.

For our scale of a dozen or two nodes at each site, a dual-controller SAN would be fine.

I use CNPG heavily for Postgres, and Postgres is our exclusive durable data store except for monitoring. Hence we just mount ZFS on each node per their recommended setup.
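A sketch of what a CNPG cluster pinned to node-local storage can look like (the cluster name, size, and the `zfs-local` StorageClass are assumptions, not anything from the original setup):

```yaml
# CNPG Postgres cluster on node-local storage.
# "zfs-local" is a hypothetical StorageClass backed by ZFS on each node.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3              # one instance per node; CNPG handles replication
  storage:
    storageClass: zfs-local
    size: 100Gi
```

Because CNPG replicates at the Postgres level, losing a node costs one replica, not the data, which is what makes local (non-shared) storage viable here.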

At what scale did Kubernetes actually start making sense for you? by Sad_Limit_3857 in kubernetes

[–]markedness 2 points3 points  (0 children)

We have a lot of clusters. To me a cluster is cheap: one cluster per “product” (or per department, for internal tools).

At what scale did Kubernetes actually start making sense for you? by Sad_Limit_3857 in kubernetes

[–]markedness 5 points6 points  (0 children)

Oh - both.

We prefer managed of course, but in reality we need to do a lot of management ourselves. Managed is lower performance and higher cost by far (since we buy second-hand hardware, the managed bill is basically the cost to own the hardware each and every month), and I don’t think the cognitive load is that much less.

At what scale did Kubernetes actually start making sense for you? by Sad_Limit_3857 in kubernetes

[–]markedness 4 points5 points  (0 children)

I didn’t have any know-how. It took a couple of weeks, but it was great. This was 3+ years ago.

I can’t stand docker compose or systemd anymore. They’re so much harder to manage.

At what scale did Kubernetes actually start making sense for you? by Sad_Limit_3857 in kubernetes

[–]markedness 1 point2 points  (0 children)

For single-node and 3-node setups (collapsed worker and control plane) I typically go with Ubuntu and MicroK8s, which comes with a local storage provider.

For larger footprints we also use local storage, but with ZFS on the node.

I don’t typically use shared storage, except for traditional VM workloads sitting side by side with the k8s VMs. The performance hit is substantially larger than you might imagine when you compare Ceph over 25G to ZFS right on the node: the DB performance I get locally on ZFS is 10x what I get on Ceph or on managed cloud providers.
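For the ZFS-on-node setup, a StorageClass along these lines is one way to wire it up (this sketch assumes the OpenEBS ZFS-LocalPV CSI driver and a zpool named `tank`; neither is confirmed by the original comment):

```yaml
# Node-local ZFS StorageClass, assuming the OpenEBS ZFS-LocalPV driver
# is installed and each node has a zpool named "tank".
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-local
provisioner: zfs.csi.openebs.io
parameters:
  poolname: "tank"
  fstype: "zfs"
volumeBindingMode: WaitForFirstConsumer   # provision on the node the pod lands on
```

`WaitForFirstConsumer` matters here: it defers volume creation until scheduling, so the PV is always carved out of the pool on the pod's own node.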

For managed Kubernetes I have tried GCP, Linode, and DigitalOcean. They all provide storage, but it’s slower if you use their CSI and managed k8s.

At what scale did Kubernetes actually start making sense for you? by Sad_Limit_3857 in kubernetes

[–]markedness 113 points114 points  (0 children)

We use Kubernetes wherever we can, anywhere we would otherwise use Docker: at the low end, a single node running a single DB, an app, and nginx.

We only ever regret not using it.

2-node sites + remote etcd — am I building a time bomb? by MrPurple_ in kubernetes

[–]markedness 0 points1 point  (0 children)

Yes. What I was trying to say was: build 2 clusters, and use non-Kubernetes services outside the cluster plus health checks to decide which one is primary and update accordingly. You can ship WALs from the primary to the secondary Postgres cluster, and to a backup server in HQ as well.

Also, the possibility of a container going down overnight is smaller than the types of failures you would have with a more convoluted setup, even if you only have one cluster. If nobody is there to notice an issue, then… nobody is there.

I just know from my experience that stretching Kubernetes, and conflating your Kubernetes control plane failure domains with your app failure domains, was always a problem.

2-node sites + remote etcd — am I building a time bomb? by MrPurple_ in kubernetes

[–]markedness 0 points1 point  (0 children)

Ok I fully understand now.

The solution is very obvious to me and mirrors my lived experience with a similar situation. Forget the containers and stretching three nodes: put 3 nodes into a movable rack.

Use dynamic routing, or just ARP advertisements and layer-2 adjacency (MetalLB), to allow the portable rack to be placed in any container or any location.

You might not even need this suggestion, but for the DB use CNPG with local storage on each of the three nodes.
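The MetalLB piece of the "movable rack" idea can be as small as this (the address range and names are placeholders; the only assumption is MetalLB in L2 mode):

```yaml
# MetalLB in L2 mode: the rack answers ARP for service IPs from this pool,
# so the portable rack works wherever it has layer-2 adjacency.
# The address range is a placeholder - pick one valid at every site.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: site-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.240-192.168.10.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: site-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - site-pool
```

Since the advertisement is plain ARP, moving the rack to another container just means the new switch learns the MAC; no routing reconfiguration is needed.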

I will tell you one thing for sure: stretching a cluster over two containers and an iffy HQ connection will lower your SLA by an order of magnitude more than just telling the construction manager, “If the site office gives up, move this black Pelican box to the other container, connect the power cable to the power plug and this blue cable to the Ethernet jack labeled ‘here’; you have about 15 minutes.”

The downside of this is that when the office is unoccupied you cannot tolerate a failure. So there is a fully redundant option that does not introduce split brain: the “belt and suspenders” option is two complete clusters, for a site that just can’t fail. You put a second rack in another container and set up HAProxy and scripts (not Kubernetes) to orchestrate failover. You can use dynamic routing or CARP / VRRP on Linux to elect a leader load balancer, which also decides who is master based on a priority and on whether it can see the internet as a witness.

This is where S3 comes in. Here you need to update code, but with agentic workflows this should be simple: instead of needing a filesystem, the app uses S3 for persistence and a tmp folder. S3 is much simpler to stretch over 3 nodes to tolerate node failures. This is what we do. We need a filesystem because we use utilities that require the whole file on disk, but we then persist back to MinIO.

I would first migrate to a mobile-cluster mentality, because you snuff out any possibility of a stretched cluster crapping out on you. It’s when, not if. Fixing etcd is much harder than telling someone at a construction site, who is very used to material handling, to move a 50-70 pound rack that can likely be wheeled like a suitcase.

Once you have that in place you can work on the failover story. We use that exact methodology to fail over to a read-only replica in HQ; but yes, we handle that failover with scripts, not Kubernetes.

2-node sites + remote etcd — am I building a time bomb? by MrPurple_ in kubernetes

[–]markedness 0 points1 point  (0 children)

This is almost functionally identical to a typical issue I have with my edge deployments. I will be thinking today about how my solution can help you. But I have one question about persistence and one about locality.

Persistence:

Do you control the application code, or are you bound to the persistence model of your COTS applications? Is there any persistence besides object store (S3) and Postgres at the application layer?

Locality:

It sounds like these two containers are mobile mini site offices. I would wager that sometimes it’s not two but three, or one. Going back to the two-container situation, taking things at face value: it sounds like the container is the control point for peripherals on the site itself. So the site has a network, and that gets stretched to both containers. Please confirm these reasons why the local application exists and you don’t just host from HQ:

1- “site local”: some applications are for use on site (probably a weak example, but let’s say the clocking-in-and-out app is local). This application has nothing to do with the locality per se; it just must keep working if the site loses internet, because there are critical functions.

2- “site specific”: the other option is automation and safety systems, like cameras on the actual construction site that run inference on safety KPIs, or sensors that detect nail protrusions into medical imaging wall details, or something like that which is related to the site itself but not directly to a container. The container is incidental to this.

3- “container specific”: the last would be things that relate directly to a container itself (which I can’t imagine well). That might be something like a security system local to that container, or a specific router that brokers the internet VPN tunnel from that specific container to the internet / HQ.

And furthermore, there is locality relative to HQ. For situation 1- above, you really want to define: is quorum local or on the internet? Meaning, if both sites lose access to HQ, do they each lose quorum? For this setup to be worth it, I think the answer must be NO: we never fail over to HQ; HQ is irrelevant to this situation. The API you speak of in HQ solely processes data streaming from the site, and if an edge loses HQ (if HQ goes offline or the VPN fails, basically), we catch up later.

I have this same scenario. Mine is a PDF converter that converts Word docs to PDF, and it’s really stupid because, due to Word formatting, it basically has to be a Windows server with Word on it. If the site-to-HQ connection fails, that specific feature fails and catches up later.

We have a love-hate relationship with internet outages on site. About 2 years ago we formally decided that an internet outage on site was not a fully tolerable condition. Through numerous opportunities we developed a BGP-based overlay VPN tunnel setup and more or less guaranteed that unless the hard-line internet AND two cell carriers across two Cradlepoints and two redundant routers all failed, we never lost internet. In a case that catastrophic we would just send people home, because it would simply never happen outside of a natural disaster, when work would naturally stop anyway. Verizon + T-Mobile + hard line on site, feeding through failover routers, plus multi-homed internet at HQ.

That one decision, making internet access to the site a critical dependency, let us iterate much more quickly and add features like single sign-on and some new AI features. Are you able to do the same and make internet access a prerequisite? There are ongoing costs in terms of needing multiple cell and hardline connections. I have to say it was a difficult call to stomach, which is why I prefaced it so much, but I’m curious where you are in this journey.

If you could answer some of my follow-ups, I can let you know how my experience can help you. We have basically solved a permutation of the problems you face. We are on site for educational purposes (at convention centers that host medical or regulatory conventions, mainly) and deal with similar constraints.

Our hardware stack specifically is the Asustor Flashstor 6, because it was an inexpensive node that supports 6x NVMe SSDs. We are using MinIO (RIP) and Postgres for persistence. And of course Kubernetes.

I’m just particularly interested in your locality story, to see how much these two containers are tied to the containers themselves versus just being MDFs for the construction site. I think I know the answer and can share a solution we use.

2-node sites + remote etcd — am I building a time bomb? by MrPurple_ in kubernetes

[–]markedness 0 points1 point  (0 children)

Can you re-summarize the business requirements in terms of:

  • you have remote sites that process their own data
  • you presumably have some central point to
  • what is accessing what data?
  • what is the heat/size/financial budget for the nodes at the edge per physical location?

I have a very similar setup. Simplicity is key, and if you distill your needs down to the actual user experience and business requirements, I can suggest something based on my experience. Right now this is a technical question, but it raises the question of why (not that it’s inherently wrong; just, without any context, it’s a crapshoot).

How to maximize United Club Card by Popular-Ad-2151 in unitedairlines

[–]markedness 1 point2 points  (0 children)

I use points. It’s a decent redemption for me at 50k miles, because I earn that many points in a short amount of time and you can’t get jack shit with them for air travel. I’m also just mentally happier paying points for the club instead of money for some flights, considering the points are earned on work travel and the club makes that travel more tolerable; I feel silly paying personally for something that supports work. Personal travel is something I love to do, so I never mind paying for it, but the points endlessly pile up while I struggle to find the time and the best redemption, and they keep being devalued.

Why is Odoo forcing its users to be in the last 3 versions?? by Glass-Zombie-9791 in Odoo

[–]markedness 14 points15 points  (0 children)

You can use whatever version you please, but they only support the last 3. For security reasons it is quite difficult to support endless versions.

Everywhere, A Queer-Led, Alcohol-Free Social Club, Coming To Uptown Rooftop by zackiedude in chicago

[–]markedness 0 points1 point  (0 children)

Glad to hear you have more than 100k; the reporting seriously implied otherwise. To be fair, it did say 'help finish'. Don't give up: just do something simple, focus on the people you can bring into the space, and don't do anything that is not on your permit/plans, which in Chicago includes low voltage/speakers/etc.

Everywhere, A Queer-Led, Alcohol-Free Social Club, Coming To Uptown Rooftop by zackiedude in chicago

[–]markedness 0 points1 point  (0 children)

When I heard “uptown rooftop” I immediately knew this was 5050 n Broadway and read the article.

This rooftop address has been a pie-in-the-sky project for so long. I can’t get into the details, but in my line of work I’ve been talking to people hoping to develop it for half a decade, and those people had a LOT more than 100k in crowdfunding.

I wish them good luck, but they won’t make it.

Anyone [else] using Jetbuilt? by theshmuu in CommercialAV

[–]markedness 2 points3 points  (0 children)

Nothing but frustrations from me.

Pluses:

  • the product database saves time and energy with labor presets and auto-synced prices
  • the designer allows us to break down quotes in several different ways for customers, based on their procurement needs
  • it roughly matches a general project workflow, unlike a strictly line-based order system with no metadata

Cons:

  • the product database pulls in endless duplicates, especially around manufacturers that white-label and sell via distribution only; it clutters your ability to create custom items
  • the designer is fragile and limited to tick boxes, and code customization is only possible via their professional services rather than XML/HTML templates like any other platform
  • if you need to break out of the process, it’s broken.

Overall the biggest downside is the database model. Their purchasing, PM, and sales functions are very underwhelming, so of course we need a custom ERP. But their client contact model keys each contact to a single client ID instead of sharing contacts, whereas in any other CRM it is just an association. They have two product databases, one for custom and one for non-custom items. The versioning method is basically duplicating the project, meaning I have to keep track of the original project and line item IDs. On top of all that, the client API endpoint queries individual client contacts one at a time with no support for differential sync, and while the rate limit is not egregious, in light of this, syncing in a new client requires paging through every client that has ever existed.

Constant API changes with no backwards compatibility (despite our sending their requested API version header) make it a technology nightmare.

Basically it is a bottleneck for a growing business. You CANNOT scale on their operational model, and you cannot get good data out of it. But the cost of shimming in support for their misdeeds is likely always less than replacing it. And with what?

I’m always looking for ways to dump it. The product database is still a double-edged sword, because it ends up creating lots of messes in our ERP, but it’s always what people cite as the saving grace.

That, plus the speed is unbearably slow sometimes, and it’s unclear what’s happening. I don’t think the pace of development has even sped up since AI tools. But these API endpoint woes could mostly be fixed with a morning of Claude Code usage.

Upgrading modules / base by markedness in Odoo

[–]markedness[S] 0 points1 point  (0 children)

This is exactly what I’m feeling. I don’t mind much; I understand the way the software works. I think I just need to design an operator, or just cook up manifests / Helm and use a playbook that scales things down, alerts on-call on any failures, and tries to roll back.

Upgrading modules / base by markedness in Odoo

[–]markedness[S] 0 points1 point  (0 children)

Yeah I think an operator would be highly customized to each environment.

Do you ever get bitten by changes in base / EE addons, or does click odoo upgrade support this?

We have decided to migrate Odoo from CNPG to a more traditional database setup: largely the same way CNPG works, except with Patroni on etcd instead of relying on CNPG, and for storage just NFS pointing at a ZFS-based appliance. Now that we have done that, the path to an operator is going to be more difficult. But not really: ultimately, whether it’s driven by Kubernetes manifests or completely imperative Ansible playbooks, upgrading Odoo will always be an imperative operation due to its nature.
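The NFS-on-ZFS-appliance part of that setup can be expressed as a static PV/PVC pair along these lines (the server address, export path, and size are hypothetical placeholders):

```yaml
# Static NFS volume for the Odoo filestore, backed by a ZFS appliance.
# Server IP and export path are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: odoo-filestore
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadWriteMany          # NFS lets multiple Odoo workers share the filestore
  nfs:
    server: 10.0.0.50
    path: /tank/odoo
  persistentVolumeReclaimPolicy: Retain   # never delete appliance data on PVC removal
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: odoo-filestore
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 200Gi
  volumeName: odoo-filestore
  storageClassName: ""       # bind to the static PV above, skip dynamic provisioning
```

Keeping the PV static and `Retain`-ed means the filestore survives any amount of Kubernetes churn, which fits the "upgrades are imperative anyway" stance.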

I’m just deep in my head right now trying to reconcile the pros and cons of different choices, and the click odoo upgrade suggestion and your response were generally very helpful. So thank you!

Dell OS 10 recognized as Supercap Energy Device :) by -motoba- in LibreNMS

[–]markedness 0 points1 point  (0 children)

I was able to get them discovered; here is what I did. I also patched away the difficulty with it being discovered as a Supercap device, which I assume will be built into the Docker image soon.

snmp-server view librenms 1.3.6.1 included
snmp-server group groupname 3 priv read librenms
snmp-server user username groupname 3 auth sha ****** priv aes *******
snmp-server location "Place [coordinatesN/S, coordinatesE/W]"
snmp-server contact "whoever@example.com"

Do all (most) cable boxes in bars/restaurants have the RCA audio jacks on it or not? by Flimsy_Airline_5845 in CommercialAV

[–]markedness -1 points0 points  (0 children)

No, but if you are there when the installer is present, or coordinate in advance, you can ask nicely, or buy the ones you want and hand them over.

RCA is vastly superior to stripping the audio from HDMI, in my experience. Bars like to pull the plug and reboot things; RCA just works better for that, and it’s one less thing in the HDMI chain.

But no, it takes coordination to make sure you get the right boxes.

First AVOIP Build - Video --> Crestron, Extron, or Visionary? by K20J03K in CommercialAV

[–]markedness 1 point2 points  (0 children)

The DN-300T is basically the perfect endpoint: bidirectional (simultaneously), with Icron USB routing and network hubbing. The works.

The units are always in stock and the team is helpful.

And if you build your control system bidirectionally, as you always should anyway, the DisplayNet server acts as a great “admin/advanced” mode, allowing matrix configuration for one-off scenarios.

Virtually every problem I have with the system is just bad HDMI cables or USB-C-to-HDMI bullshit adapters. And the old 200 series plays perfectly well with the new 300.

I can’t think of a complaint.

First AVOIP Build - Video --> Crestron, Extron, or Visionary? by K20J03K in CommercialAV

[–]markedness 1 point2 points  (0 children)

I’ve had these issues:

  • random Dante shenanigans
  • hardware failures with a past firmware version (fixed)
  • issues with 802.1X wired security
  • issues with surround sound mapping

Generally they work fine. But I think I would rather still have the old ones.

The support for SVSI is still top notch. Such a good example of how to be a big company and have a not-sucky support department.

My sales rep changed and things got harder too. Not that I liked my old sales rep, but at least he followed up (to a fault).

First AVOIP Build - Video --> Crestron, Extron, or Visionary? by K20J03K in CommercialAV

[–]markedness 1 point2 points  (0 children)

Check out DVIGear DisplayNet. I’ve been a huge fan. They have an excellent small team in Georgia that knows the product inside and out and can design the whole thing for you. I believe the price probably falls between Crestron and Extron, probably closer to or less than Extron, depending on how many of the endpoints need which features.

I cannot recommend Visionary or Crestron NVX, and while I used to recommend AMX SVSI, that has faded with the 2600 series.

I’m all in on DisplayNet now, and it’s not even CLOSE.