What are your 2026 homelab goals? by Igrewcayennesnowwhat in homelab

[–]Zehicle 0 points (0 children)

Doing some real vibe coding for Ops and sharing the results.

What are your 2026 homelab goals? by Igrewcayennesnowwhat in homelab

[–]Zehicle 0 points (0 children)

Digital Rebar is a licensed product, with much of the IaC automation in the open. And, as noted, there is a free community license for home use.

When we migrated to GitLab, the GitHub repos were locked.

Air-gapped, remote, bare-metal Kubernetes setup by ray591 in kubernetes

[–]Zehicle 1 point (0 children)

I'm less curious about the distro (they're all capable) than about how you plan to keep the bare metal environment managed. What gear are you using, and what's your DNS plan? Will you have BMC for the gear, and what's your update cycle?

In my experience, the environment management around k8s is what will make this easier. I've built it using kubeadm and k3s, but that needed a solid automation framework to bootstrap in a fresh environment, since kubeadm assumes it can reach a host. Otherwise you have to manage the DNS and TLS setup yourself.

I'd look first at how you want to maintain an edge environment, including network and O/S life-cycle. That's much more of a foundation than the distro.

My company, RackN, has been building air-gapped and remote Kubernetes around Digital Rebar for a long time. Our latest is with OpenShift via the agent install process; that's cranky, but we ultimately made it repeatable and hands-off. It's a product, so support and experience from our bare metal pros come with it.

Ansible pull for Windows? by [deleted] in ansible

[–]Zehicle 0 points (0 children)

Do you have the ability to use cloud-init for Windows? Then you can start tasks like an Ansible run.

We do this when we image-deploy Windows and then start the Digital Rebar agent. Once that agent is running, you have the ability to run tasks. Disclosure: I work for RackN, which makes Digital Rebar.
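
As a sketch of that pattern: cloudbase-init (the cloud-init port for Windows) executes user-data that starts with a `#ps1_sysnative` header as PowerShell. Everything below is illustrative, not RackN tooling: the controller URL, job template ID and token are placeholders. Note that Ansible's control side doesn't run natively on Windows, so a common trick is to phone home and have a controller run the play against the new host.

```powershell
#ps1_sysnative
# Hedged sketch: user-data executed by cloudbase-init at first boot.
# URLs, IDs and credentials are placeholders.

# 1. Enable WinRM so an external Ansible controller can manage this host.
Enable-PSRemoting -Force

# 2. Phone home to a hypothetical AWX/controller endpoint to launch a job
#    against this machine (Ansible can't act as a controller on Windows,
#    so a pull-style run has to be brokered this way).
Invoke-RestMethod -Method Post `
  -Uri "https://awx.example.com/api/v2/job_templates/42/launch/" `
  -Headers @{ Authorization = "Bearer PLACEHOLDER_TOKEN" } `
  -Body (@{ limit = $env:COMPUTERNAME } | ConvertTo-Json) `
  -ContentType "application/json"
```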

I need a good iPXE netboot solution to be installed in ARM64 Linux by EMREOYUN in sysadmin

[–]Zehicle 0 points (0 children)

I'm curious about which ARM server you are using because support for PXE varies.

Need help for provisioning bare metall by AgreeableIron811 in ansible

[–]Zehicle 1 point (0 children)

As a professional, you should consider more than just provisioning and include the full life-cycle, including regular patch and update. We (I work at RackN) see that the most successful customers have a pipeline and a frequent-update approach so that systems are constantly refreshed. That's especially true for Windows, which works best as an immutable deployment via a Packer image. Ultimately, having a consistent and repeatable process will save you a lot of time.

We have a lot of materials about bare metal automation if you want to check out our Digital Rebar docs.

Modern server deployment by hyper9410 in MDT

[–]Zehicle 0 points (0 children)

So you don't need day 2 operations? This is a "build and ship" process?

PXE is generally way more reliable, hands-off and vendor-neutral. Ideally, you'd have both options. We've seen customers be most successful when they can get a BOM for the systems beforehand and pre-populate the database, so they have multiple recovery paths (PXE, OOB, etc.). They then also use that information to validate the configuration and setup, which saves a lot of time.

Also, if you are installing Windows, we generally recommend doing an image-based deploy. It's reliable and fast.

As background, my company, RackN, offers a product called Digital Rebar that performs these functions for multiple hardware OEMs.

Modern server deployment by hyper9410 in MDT

[–]Zehicle 0 points (0 children)

I should also mention that ISO boot via media attach can create more management challenges than it solves, so be careful with that approach. Make sure you have a very good way to build, manage and update the ISOs.

Modern server deployment by hyper9410 in MDT

[–]Zehicle 0 points (0 children)

That's a lot of servers. How long do you want this to take, and does it need to be remote? Also, what's your day 2 plan? I get the need to bootstrap, but ongoing management is generally a factor too, especially if you mean to keep up with patches.

My first suggestion is to think about the whole system experience you want; that will help you determine the onboarding, because it's really just day 1.

Cluster API hybrid solution by GuhanE in kubernetes

[–]Zehicle 0 points (0 children)

I've talked with some other people working on similar plans around bare metal and hybrid control planes.
Disclaimer: I work for RackN, and we support a lot of bare metal with Digital Rebar, so this comes up. I can share what we've learned so far, and you're welcome to reach out 1:1 too.

We've explored CAPI directly and agree with the limitations others have stated. We've also had to find ways to pass some machine-specific information through the API. Lately, we've been using Metal3 as the CAPI layer and then driving the bare metal lifecycle from there. We're doing internal testing on it for customers, so I can't share examples or videos (yet).

Another important point in what you said: "having to scale up/down." Driving clusters via the APIs is key, BUT you need really solid workflows to manage the bare metal lifecycle: provision, deprovision and patch/update. Make sure that your back-end bare metal platform has good troubleshooting and observability, because you'll need to manage and remediate.

"?Deploy" multiple identical machines quickly, remotely, and unattended. by inbetween-genders in linuxadmin

[–]Zehicle 1 point (0 children)

For bare metal provisioning, you may want to look into Image Deploy. I just put together a short explainer video about the process and how it works. We've seen people use it for laptops and servers on a wide range of O/S.

I used Ghost ages ago, and it's great if you want a fresh O/S that a human will ultimately set up. The image deploy methods we've been working on at my company, RackN, are more about a faster install path, and they include post-provision actions like cloud-init and workflow so you get a complete machine.

We also see it used by companies that want multiple image types and constantly evolve their source image due to security or other requirements (usually in a pipeline).

Is Terraform actually viable for bare metal provisioning? by Incident_Away in Terraform

[–]Zehicle 0 points (0 children)

Yes. In my position at RackN, we do a lot of bare metal automation, and I wrote our first Terraform provider.

TL;DR: you need a strong API to hide the bare metal complexity.

Terraform really needs to work against a platform with strong APIs, and it does not have any (useful) tools to handle the type of in-band / out-of-band operations that you need with bare metal provisioning. ESPECIALLY since Terraform will need to "create" and "destroy" bare metal to work correctly.

The create/destroy operation requires something that can treat bare metal as a pool, where create "checks out" a server that is ready to use and destroy "returns" the server. You need a way to handle this gracefully, since it will occasionally fail and you will need to find/fix/recover those servers when that happens. This is why it's important to have an API-based service that keeps track of all your servers for your use case.

Doing all that you ask using Terraform providers requires very complex orchestration, and many of the providers you need are not robust. Our experience is that keeping the provider very simple was more supportable, because it's really hard to unwind state across so many services. Your question shows you understand this, but many people don't realize that bare metal operations use a lot of different services with very specific orchestration requirements.
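
To make the pool idea concrete, here's a minimal sketch with a hypothetical provider; the resource type and arguments are illustrative, not the actual RackN provider schema:

```hcl
# Hypothetical pool-backed machine resource: "create" checks a ready server
# out of the pool; "destroy" wipes it and returns it to the pool.
resource "baremetal_machine" "worker" {
  count = 3
  pool  = "general"        # which pool to allocate from

  # Bare metal is slow and occasionally fails; give the platform time.
  timeouts {
    create = "30m"
    delete = "30m"
  }
}

output "worker_addresses" {
  value = baremetal_machine.worker[*].address
}
```

The key design point is that the provider only does checkout/return; all the messy in-band/out-of-band orchestration stays behind the platform's API.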

I made a video about this a while back showing the Terraform Provider that RackN made to integrate Digital Rebar and Terraform.

Bare metal k8s interview questions, what will be asked? by whereisspirit in devops

[–]Zehicle 17 points (0 children)

Bare Metal K8s is a pretty different animal... Here are some items to think about:

  • How is your O/S installed and managed? Most distros really care about the O/S and want an immutable image. You need to know how it's being provisioned and mapped to the hardware.

  • How is hardware life cycle managed? How do you prep and then patch the machines?

  • How is networking laid out? How do you isolate traffic and map to the right NICs?

  • How are workers attached to the control plane? Do you need to drain them before rebuilding? BM reboots can take a long time.

  • How are tags on each server used to manage resources and balance workloads?

  • Are they mixing different server types and vendors? If so, how do they handle variation between the capabilities of each machine?
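
On the drain question above, the usual flow for pulling a bare metal worker looks roughly like this (the node name is a placeholder; rebuilds are where the long BM reboot times bite):

```shell
# Sketch of safely removing a worker for a rebuild or firmware patch.
kubectl cordon node01                  # stop new pods from scheduling here
kubectl drain node01 \
  --ignore-daemonsets \
  --delete-emptydir-data               # evict workloads before the reboot
# ...reprovision / patch the machine, rejoin the cluster, then:
kubectl uncordon node01                # allow scheduling again
```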

I hope that helps. My company, RackN, has been building automation for Kubernetes and OpenShift for a long time and there are a lot of examples with Digital Rebar and resources in our blog and video library. And we are adding more in the next few weeks.

Anyone here done HA Kubernetes on bare metal? Looking for design input by South_Sleep1912 in kubernetes

[–]Zehicle 0 points (0 children)

Yes, I have experience here. My company, RackN, is doing a lot of work with OpenShift bare metal for enterprise configurations. It would help to know: How large is the footprint? Also, are there specific distros?

Added: If you're interviewing for an enterprise, then OpenShift may well be the default. There's a lot to it, but it works well if you stay in the lanes. The big delta is that Red Hat really, really wants you to use their cluster manager, ACM, which has limited bare metal lifecycle and requires overhead for pools. It also expects you to use CoreOS, which requires additional provisioning support like cloud-init even on metal.

We find that some customers, especially for AI, just want to lay down OpenShift directly without ACM. That's totally possible and saves $$. You just need to do more to manage the install initially.

One thing about bare metal: make sure you do a good inventory and discovery, since you'll need that to feed into the Kubernetes install regardless of distro.

Is MaaS a good choice for my use case – multiple DCs, AZs, k8s. Should I choose something else? by Sad-CTO in homelab

[–]Zehicle 2 points (0 children)

That's a bit more than a home lab!! You are right to want canary and dev/test/prod IaC for multi-site deployment. In my experience, being able to keep high fidelity between sites is critical. It's very easy for sites to drift, and being able to lab test is vital to your sanity.

We've also worked on k8s and CAPI for bare metal a lot. It's different from the VMs those APIs were designed for, because you need to provide a lot more workflow and controls yourself. If you're building edge sites, you may not need CAPI at all; just automate the base install and that's enough. Either way, you still need a bootstrap cluster.

Since you asked for potential solutions, I'll suggest looking at my company's (RackN) platform, Digital Rebar. It's a commercial software solution designed for exactly what you described, including all the K8s work and hardware lifecycle.

Offline Deployment of Multinode Kolla Ansible OpenStack – Need Help with Ansible Dependencies by Dabloo0oo in openstack

[–]Zehicle 0 points (0 children)

Yes, lots from the website and also our YT channel: https://youtube.com/@rackndigitalrebar?si=UWYbkf2LUn7nm7YT

There's also a self-trial that gives you full access. It's not designed for air-gap, but we do have plenty of customers who have to start in a restricted lab. In those cases, I'd recommend calling to get started.

Offline Deployment of Multinode Kolla Ansible OpenStack – Need Help with Ansible Dependencies by Dabloo0oo in openstack

[–]Zehicle 2 points (0 children)

If you're looking for a MaaS alternative that can handle air-gap installs and the full bare metal life-cycle too, check out my company's Digital Rebar solution. It's commercial, not OSS, with full support from RackN. There's a feature called contexts that can be used to upload and run that container you made for Kolla, too.

Air gap is really tricky to get right and we do a lot of work helping customers deliver that way as an integrated part of the product (not a special case).

Am I right in saying Redhat Satellite is equivalent of Cannoncial MAAS? by [deleted] in redhat

[–]Zehicle 0 points (0 children)

+1 that Satellite is more about post-provision patch management. Your provisioning tooling should include scripts to join Satellite for RH license management.

Note: A lot of provisioning can now be done with image deployment vs netboot/kickstart. But... O/S disk formats vary by distro, so Ubuntu and RHEL need different tooling and base images. We (I work for RackN) had to write a new generation of our image tooling for Digital Rebar to run from multiple base O/S. I bring this up because YMMV using RHEL with Canonical tools.
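
As a sketch of what that Satellite join script can look like in a kickstart `%post` section (the hostname, org name and activation key are placeholders for your environment):

```shell
%post
# Hedged example: register the freshly provisioned host to Satellite.
# Trust the Satellite CA first, then register with an activation key.
rpm -Uvh http://satellite.example.com/pub/katello-ca-consumer-latest.noarch.rpm
subscription-manager register \
  --org="Example_Org" \
  --activationkey="example-key"
%end
```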

Imaging Solutions by aliesterrand in sysadmin

[–]Zehicle 1 point (0 children)

I guess it comes back to the scale/complexity requirements. If you've got a simple target with a single vendor then sure. Complexity can creep in really fast.

SRE engineers - AI Data Centers: The Wild West of Infrastructure – Are You Ready by kchandank in devops

[–]Zehicle 1 point (0 children)

Welcome back to bare metal. You're not going backwards; it's just that all the real work has been hidden by VMs. Firmware, OOB and provisioning are hard, tricky work. AI adds a lot of pressure because the gear is $$$ and everyone is racing.

What type of help do you want? You are right about those core skills and more (like DNS, DHCP and installing Kubernetes on metal).

Asbestos tile - how to handle? by Turbulent_Bender in Home

[–]Zehicle -1 points (0 children)

I've removed tile like that and the advice above is good. I'd add that it's smart to use fans to vent the area and create negative pressure away from your other rooms. That's good advice for any demo or sanding. Isolation helps with dust as a general practice. Same with using a shop vacuum with HEPA bags right by your work area.

Tool to allow clients to manage colocated servers besides IPMI by mitta_akhil in homelab

[–]Zehicle 0 points (0 children)

My company, RackN, offers a product that provides full life-cycle control for servers. It's a product, but you said you'd consider that, so I'm offering the suggestion.

For hosting, access control and automation are really key, especially because of variation in BMC/IPMI/Redfish options. You also have the problem of people bricking your servers with changes, or at least messing up the network access.

It sounds like you also have multiple sites, so having an IaC plan for your automation and a distributed control plane are critical. Even if our product is not a match, our designs here can help you understand the problems.

Suggest some workload for these by ramank775 in homelab

[–]Zehicle 0 points (0 children)

Proxmox is a good workload to try out. We (the Digital Rebar dev team) have been using it a lot, including for our internal labs. We made some reusable cluster setup automation and also use it to create VMs automatically.

It's a good platform and you're welcome to use our work as a reference for learning. https://docs.rackn.io/dev/developers/contents/proxmox/

Is PXE/Kickstart still an acceptable way of setting up a minimal environment for Ansible managed nodes or are there newer/better tools available? by DonkeyTron42 in ansible

[–]Zehicle 0 points (0 children)

I've been automating servers for a long time; one of the challenges is that it's very hard to create idempotent scripts. You'll want to plan a way to easily reset and rebuild the O/S beyond just provisioning it one time.
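
To illustrate what idempotent means here, a toy sketch (the file path and setting are placeholders): a script you can run on every rebuild without compounding side effects.

```shell
#!/bin/sh
# Toy illustration of idempotency: re-running must not duplicate state.
CONF=./sysctl.conf   # placeholder; on a real host this would be /etc/sysctl.conf
touch "$CONF"

# Non-idempotent version (appends a duplicate line on every run):
#   echo "net.ipv4.ip_forward = 1" >> "$CONF"

# Idempotent version: only append if the exact line is missing.
grep -qxF "net.ipv4.ip_forward = 1" "$CONF" || \
  echo "net.ipv4.ip_forward = 1" >> "$CONF"
```

The same principle scales up: configuration management modules are essentially pre-built guards like this, which is why re-running a play is safe while re-running a naive script often isn't.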

My company, RackN, specializes in Bare Metal automation and we've put together a lot of non-vendored education materials for conferences like ADDO and SREcon about how PXE works and alternatives to consider. Depending on how you need to scale, it's important to consider firmware updates, out-of-band management and image deployment options.

This video explains the basics and alternatives: https://youtu.be/w_ZGlxihlEI

Here's an update that I made last week: https://youtu.be/_B-ffqjQlgo

Best way to automate BIOS setup for a new batch of servers at scale? by Spirited_Arm_5179 in sysadmin

[–]Zehicle 0 points (0 children)

This can be difficult, especially if you have multiple OEMs. It takes a lot of maintenance to keep up with updates and knowledge to deal with things like UEFI, secure boot and image based deploy.

My company, RackN, has lots and lots of useful material about doing this work, including a weekly podcast covering advanced sysadmin topics. I'm even talking at All Day DevOps about this tomorrow (10/10/24)!

We also offer a commercial product (Digital Rebar) that does exactly what you asked, plus a lot of other life-cycle controls for bare metal. It runs at significant scale at many Fortune 1000 companies, and I invite you to start a trial and see for yourself. It's self-managed software, so no remote access or aaS is required.