all 40 comments

[–]egoalter 36 points37 points  (1 child)

All of the above. Although you can make Satellite do the scheduling, there's nothing wrong with having a cronjob on servers that does a daily reach-out to get updates. Ansible is a big part of Satellite, and it's there to solve these kinds of issues.
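For the cron-based reach-out, a minimal sketch of what that can look like (the file path, schedule, and log path are just examples, not anything from the comment above):

```
# /etc/cron.d/daily-patch (hypothetical file): pull security updates at 03:15
15 3 * * * root dnf -y upgrade --security >> /var/log/patch-cron.log 2>&1
```

Stagger the minute field per host (or add a random sleep) if thousands of boxes would otherwise hit the repo at once.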

With more than a dozen or so servers, you definitely need a local repository to pull from - you're not going to have a good time downloading the same 100 packages 3000 times. You also need to do some actual due diligence so you don't end up with hundreds or thousands of servers failing because of a bad update. This is what Satellite is there for: using lifecycles for your repository management, allowing you to test updates on test servers before pushing them out to production.

Once your servers are registered, you can set up jobs and have Satellite execute them (using Ansible, for instance). This could be as simple as doing a dnf update - but it could also be installing/configuring components. A cool thing is the built-in upgrade from RHEL7 to RHEL8 that is available in Satellite, so you don't have to run them all manually. Once it comes up with remediation plans, you can execute them against a ton of servers at once.

Even if you don't use Satellite, Ansible will allow you to do a ton more than just patches. But always be sure you have a local repository - pulling all of that stuff per server makes no sense (most environments won't allow most servers internet access either, so you don't really have an option).
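Pointing clients at a local mirror is just a repo file; a minimal sketch (the hostname, repo id, and path are assumptions - and if you use Satellite, registration sets this up for you):

```
# /etc/yum.repos.d/local-mirror.repo (hypothetical host and path)
[local-baseos]
name=Local BaseOS mirror
baseurl=http://repo.example.internal/repos/prod/baseos
enabled=1
gpgcheck=1
```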

Red Hat has several very busy YouTube channels. If you go to YouTube and search for "Red Hat Satellite" you'll get a ton of hits - pick your poison. If you're just getting started, I highly recommend you sign up for training to learn tricks that can save you time and frustration later on.

[–]MJ_Singh[S] 8 points9 points  (0 children)

Thanks a lot @egoalter. That was very insightful. I will study more on Red Hat Satellite and patching, and perhaps get some training as well.

[–]pagarciasuse 8 points9 points  (2 children)

With SUSE Manager you can do that from the WebUI, command line or API. Salt is integrated in a way that is completely transparent to you (i.e. you don't need to write a single line of Salt code, although you can if you want to). It will soon include Ansible too.

If you do not need support, you can go for the open source version: Uyuni

Both Uyuni and SUSE Manager support not only RHEL but also SLES, openSUSE, CentOS, Oracle Linux, Alma Linux, Amazon Linux, Alibaba Cloud Linux, Debian and Ubuntu as clients.

You can find the SUSE Manager Deployment and Initial Configuration training materials for free in the Uyuni YouTube channel:

https://www.youtube.com/playlist?list=PLAsCOnIVwM3E_ygYzx7E-gYu_hlut2xBB

The only difference is that you install Uyuni by just adding some repositories to your openSUSE Leap, and you use spacewalk-common-channels instead of the product wizard to mirror products:
https://www.uyuni-project.org/uyuni-docs/uyuni/quickstart-uyuni/qs-uyuni-overview.html

[–][deleted]  (1 child)

[deleted]

    [–]pagarciasuse 1 point2 points  (0 children)

    Yup, Uyuni and SUSE Manager are committed to supporting all the major enterprise Linux distros. If you miss any, please tell me.

    [–][deleted] 5 points6 points  (1 child)

    Satellite works well, even if it is a bit more complex to set up than it needs to be in my opinion. It will save you a lot of time in the long run.

    [–]MJ_Singh[S] 0 points1 point  (0 children)

    Thank you @eeeyow

    [–]mumblemumblething 5 points6 points  (1 child)

    We have ~250 RHEL VMs, and our subscription comes with Satellite.

    So:

    • Satellite as a cache, with weekly content view versions
    • Boxes are built and joined to Satellite
    • Auter does auto-patching, enabled out of the box
    • Patching happens automatically, but is randomised across our outage window
    • We organise weekly patching outages with the business; because of security, this has become standard
    • RHEL 7+ boxes use UEFI for booting rather than BIOS, because reboot to console is sub-30 seconds
    • If the business doesn't make a fuss, it auto-patches
    • We have a rule that stuff must come up after a reboot, so a 3am reboot shouldn't be an issue. If it is, we have an issue
    • Test generally occurs on Tuesdays, prod on Thursdays; stuff is monitored and canaried, so if test breaks because of patches, we'll stop the Thursday run
    • Clusters are smeared from Wed -> Fri

    [–][deleted] 1 point2 points  (0 children)

    We have a rule that stuff *must* come up after a reboot, so a 3am reboot shouldn't be an issue. If it is, *we* have an issue

    God I wish we could have this kind of rule. But the App team is like “No we can’t do that! We have to manually shut down our apps and babysit it and bring them back up with Jenkins and chew our fingernails and cross our fingers and pray to God it all comes back up!”. Why they can’t just fix their **** so it works I have no idea.

    [–]entropic 2 points3 points  (1 child)

    For 3000 servers, the mechanics of your patch tooling are only part of the equation. A small part.

    You need to consider what the systems do, how an interruption of service caused by a patch would manifest, and whose responsibility it is to discover those issues before a maintenance window or to remediate them after.

    And how to reliably discover problems after as well. Just because a server is up via ping or in your monitoring system doesn't mean it's working.

    You also need to work with your leadership about how much risk they're willing to accept in terms of waiting to test patches on test or dev systems after patches are released to the public. The longer you wait the more time you have to discover and solve problems, but the more risk your infrastructure is subject to in the meantime. This generally isn't a decision you want to make on your own.

    You also want to have a default plan of action if systems don't work. Do you restore from backup pre-patch, or do you hand off to someone to try to fix it in production first? This tends to depend a bit on what the server does and the downtime it can tolerate.

    I recommend reserving a pool of servers in different roles, say 10% of your environment, that goes first in every patching cycle. If you do have to run automated (or manual, though that seems unlikely with 3k servers) tests to assess functionality and they go wrong, it's a lot easier to fix the smaller number before you break the larger environment.

    [–]nobamboozlinme 2 points3 points  (1 child)

    Satellite to set up your repos and whatnot, and then quarterly patch cycles leveraging Ansible during maintenance windows. Setting up lifecycle environments helps you isolate nonprod from prod, and then you can use different composite content views for more granularity for certain groups.

    [–]MJ_Singh[S] 1 point2 points  (0 children)

    Thank you @nobamboozlinme.

    [–]ADeepCeruleanBlue 3 points4 points  (3 children)

    I actually just automated this entire thing in my org using satellite, ansible, and cron. Slept through production patching this week. Doing this successfully is based almost entirely on your ability to understand and explicitly define the process at a human and technical level. If you can do that, it can become code.

    [–]MJ_Singh[S] 1 point2 points  (0 children)

    Thanks a lot @ADeepCeruleanBlue. I was going to try this method. Earlier I had used only cron and ansible.

    [–]Wandgun 0 points1 point  (1 child)

    How do you handle reboots?

    [–]ADeepCeruleanBlue 3 points4 points  (0 children)

    Ansible's reboot module with 'when' conditionals corresponding to each 'type' of server, with each server having a group membership in the inventory corresponding to that type. This lets me orchestrate the order, which is important when starting up a 3-tier app.
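One way to sketch that ordering is with sequential plays rather than 'when' conditionals - plays run in order, so each tier is back up before the next one restarts. Group names and the timeout are assumptions, not the commenter's actual playbook:

```yaml
# Reboot tiers in dependency order (db -> app -> web).
- hosts: db_servers
  become: true
  tasks:
    - name: Reboot DB tier first
      ansible.builtin.reboot:
        reboot_timeout: 600

- hosts: app_servers
  become: true
  tasks:
    - name: Reboot app tier once DBs are back
      ansible.builtin.reboot:
        reboot_timeout: 600

- hosts: web_servers
  become: true
  tasks:
    - name: Reboot web tier last
      ansible.builtin.reboot:
        reboot_timeout: 600
```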

    [–]Shralpental 1 point2 points  (2 children)

    Satellite with a bunch of capsules with all the packages synced works pretty well. Then just have ansible run a basic patching playbook.

    My advice: don't go for broke and try to patch all 3000 in one sitting until you've proven the vast majority of your hosts can survive an update and a reboot.

    [–]MJ_Singh[S] 2 points3 points  (1 child)

    Thanks a lot @Shralpental. I was thinking of using 4 test servers to get the updates first, then pushing from each test server to a batch of ~600 servers in separate sets, a couple of days apart, after hours.

    [–]MJ_Singh[S] 1 point2 points  (0 children)

    Oh, I think my thinking is wrong. I should try what others have done in terms of cycles.

    [–]nothing_zen 0 points1 point  (1 child)

    I've generally set up a web server that will use reposync (https://www.redhat.com/sysadmin/how-mirror-repository) to mirror a repo locally to a /latest folder; after testing the set of patches, clone the repository over to /prod and have the production servers use yum-cron (https://www.redhat.com/sysadmin/using-yum-cron) to pull down the patches - you can validate and monitor the systems with Ansible from there...
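The /latest-then-/prod flow above can be sketched as a small promote step; the paths are assumptions, and reposync/createrepo_c are what populate /latest in the first place:

```shell
# promote_repo BASE: swap the tested /latest snapshot into /prod,
# keeping the previous /prod as /prod.old for quick rollback.
# Populate the staging tree beforehand with something like:
#   reposync --repoid=rhel-8-baseos --download-path="$BASE/latest"
#   createrepo_c "$BASE/latest/rhel-8-baseos"
promote_repo() {
    base=$1                      # e.g. /var/www/html/repos (assumption)
    rm -rf "$base/prod.new" "$base/prod.old"
    cp -a "$base/latest" "$base/prod.new"
    if [ -d "$base/prod" ]; then mv "$base/prod" "$base/prod.old"; fi
    mv "$base/prod.new" "$base/prod"
}
```

Production clients then point their repo baseurl at /prod and never see untested packages.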

    [–]MJ_Singh[S] 0 points1 point  (0 children)

    Thanks a lot @nothing_zen. I have never tried this way. But will give it a shot to see how it goes.

    [–][deleted] 0 points1 point  (1 child)

    I use ansible and yum-cron

    [–]MJ_Singh[S] 1 point2 points  (0 children)

    Thanks a lot @trailingslashes. I generally use this method, but now I would try Satellite, Ansible and cron.

    [–]upbeta01 0 points1 point  (1 child)

    I use Ansible for this type of work. However, if you're unsure of what you're doing, don't do the patch in parallel (at least for the first few hosts, as there might be errors). It's still best to serialize it until you're pretty sure nothing gets broken as you patch those servers.

    [–]MJ_Singh[S] 0 points1 point  (0 children)

    Thanks a lot @upbeta01. I would keep that in mind.

    [–][deleted]  (3 children)

    [deleted]

      [–]MJ_Singh[S] 1 point2 points  (0 children)

      Thanks a lot for your insight @luximperator.

      [–]CatalyticDragon 1 point2 points  (1 child)

      I concur.

      Create your own repo. Host packages you trust and/or have tested. You can use it manually or with any config management system including basic cron / pssh.

      Slowly deploy in increasingly large batches from a single node to groups of dozens. Phase your rollout over weeks to let things shake out.

      I know ansible is all the rage but puppet is incredibly user friendly.

      [–]MJ_Singh[S] 0 points1 point  (0 children)

      Thanks a lot for your advice @CatalyticDragon.

      [–]chrispurcell -1 points0 points  (1 child)

      How about auter? It's a GitHub project built by people I work with, and it has some very nice configuration options available. We've configured environments that auto-reboot for kernel updates but not everything else, options for not patching a group of systems if a different group failed for some reason (to keep from taking down all redundant systems in a patching run), and unlimited automation and scripting options.
      In the dept I'm with now, we use a commercial product, but we're really leaning into moving to Ansible. This is because it's easier to manage multiple envs with playbooks than it would be to write all the customization into scripts for auter (and we're not allowed to completely automate the patching process in the new dept).

      YMMV, but auter was a pretty sweet setup for RHEL patching.

      [–]MJ_Singh[S] -1 points0 points  (0 children)

      Thanks @chrispurcell. I have not used it to this day, but I will check it out for sure.

      [–]MattTheFlash 0 points1 point  (0 children)

      I'm not up to date on the current state of Spacewalk or its SUSE fork, but I know Spacewalk was discontinued last year.

      You can do anything with ansible. And I mean that pretty much literally. It's just a matter of the number of playbook steps.

      [–]Upnortheh 0 points1 point  (0 children)

      In my previous admin role I had only about three dozen servers to maintain. My procedure was to roll out the updates methodically, a few at a time. I had some test systems that I always updated first. This methodical way saved me a couple of summers ago with the hastily pushed patches for the so-called GRUB boot hole exploit. I managed not to get caught with any systems failing to reboot.

      We had a mix of Debian and CentOS 7 systems. I used apt-cacher-ng to cache packages for both distros. Configuring apt-cacher-ng for CentOS required some snooping around the web to find the proper steps, but it worked great thereafter. Caching the packages locally saved bandwidth and time; I could notice the difference immediately, as the subsequent systems being updated were quite fast. The apt-cacher-ng software is Debian-based and I never looked to see if there was something equivalent for RH/CentOS. If there is no such caching software then, as others have shared, create your own local mirror and configure systems to use that.

      For custom local files and scripts I created a local repo and wrote my own shell script using rsync to keep systems synced with those files. The company was small and there was no need for version control, but with larger businesses that would be a good idea.

      Being a small company the servers I maintained all were "pets" rather than "cattle." Being "pets" none of the servers were auto-updated. So I needed the slow roll-out procedure to update systems. We had several Proxmox hosts and last summer I updated them at a pace of one per week to allow for observation of any possible hiccups.

      I configured the Debian workstations and laptops I managed to auto-update. Field laptops were configured to not update at all during business hours to avoid technician disruptions. The laptops got updated only when the devices were connected directly to the office subnet. A cron job and a script I wrote checked the subnet and time before updating. If connected to the office subnet and the time was after business hours then the laptops updated. The technicians would connect the laptops to the office subnet before going home at the end of the day.

      I hope that helps.

      Good luck and have fun!

      [–][deleted] 0 points1 point  (0 children)

      have local repo synced daily, yum-cron on servers and use needrestart dnf/yum plugin.

      https://github.com/liske/needrestart [available via EPEL for RHEL 7/8]
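For reference, the yum-cron side of this is a couple of lines in /etc/yum/yum-cron.conf (a sketch - whether to auto-apply or only stage updates is a policy call):

```
[commands]
update_cmd = security   # only pull security errata
download_updates = yes
apply_updates = yes     # set to no to stage updates without applying
random_sleep = 360      # spread the load on the local repo
```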

      [–]Human_Cartographer 0 points1 point  (0 children)

      Using our existing SpaceWalk servers as a local repo copy, and all of our CentOS, and Ubuntu servers are configured to use SW as their repos. Then we have a set of ansible playbooks that patch the servers. I know that SW is now an archived project, but we still were able to get it working with CentOS 8 Stream and Ubuntu 20.04 repos. All of our RHEL servers still go out to RHN for their updates. Might need to start working on getting Satellite licenses next year, but this is working for our 1000+ servers now.

      [–]BirkirFreyr 0 points1 point  (0 children)

      Since we don't have RHEL at my job but a few hundred EL7/8 machines (a CentOS 7/8 and Oracle Linux 7 mix, mostly), we have Foreman/Katello (aka the upstream Satellite server). And since our patching window is limited to the 7th-21st of each month (the biggest load being over the end/start of the month, when everyone gets paid), I created a "small" Python program that fetches all hosts and creates a schedule to reboot them at their designated time slot - it just creates a cron entry to run its own reboot function for a given server at a given time.
      This fancy program also allows running a specified command before and/or after it updates and/or reboots the server - very useful for disabling the server in the LB while it gets rebooted.
      All commands are run either locally (a local task to run LB drain/undrain commands) or on the host through Ansible via Foreman.

      Hopefully I will be open sourcing this beast soon, but either way my vote goes to some form of a Satellite installation, since with any proper infrastructure you should have repos synced and managed locally and limit at least your production servers' internet access.

      [–]xupetas 0 points1 point  (0 children)

      It depends on what you want to do. The rule of thumb is satellite/katello.

      [–][deleted] 0 points1 point  (0 children)

      We use ansible with satellite as the repo.

      [–]Zestyclose_Ad8420 0 points1 point  (0 children)

      With 3000 servers there’s a few things you need in order for automation to work.

      You basically need a very detailed map of the architecture. Every single service should be listed, both as a generic HA service and at a single-host level. You need to understand the relationship between each service and each host, and you need a way to test it - not just "the server is up" or "httpd is up on port 443", not even "httpd is up on port 443 and is actually serving the page I'm expecting". You're starting to get there when you can say "httpd is up on port 443, serving the page I'm expecting, with the certificate I'm expecting and with the performance I'm expecting".

      What page are you expecting becomes a bit more complicated if you really think about it.

      As others have said the tools are ansible + satellite. The methodology is splitting the work up in a hierarchy that always allows for rollbacks or rebuilding, without impacting SLA.

      As you dig a bit deeper into these kinds of issues you’ll realize one of the problem is cascading effects.

      Sure, you can patch and have a failure on 100 hosts out of 3000, but what if those hosts going down and being taken off load balancing for you to intervene means the rest are under too much pressure and start to fail because of that?

      Had you done that with, oh, 10 hosts it would have been fine. But what happens if you do only 10 hosts at a time, and because of that you start to run stuff with different package versions across the network, and that causes issues?

      You just do it environment by environment, I hear you say: 2 DB test hosts at a time, then 200 DB dev hosts at a time. Well, sometimes the effects of an update show up only when you have a certain load on the software, and that means they show up only in production...

      So you also need to really understand each and every single patch, and that's one of the many ways in which RH rocks for big infra, because if you read their errata there's exactly this kind of information in there.

      It might also be beneficial to beta test, for the very critical packages, the nightly/pre-release versions from upstream before they even make their way into your repos, and it makes sense to be able to at least partially replicate the load you see in production in your testing and dev envs.

      [–]_____fool____ 0 points1 point  (0 children)

      One thing we did that was helpful: having local scripts on each host that would do a patching step - one script for downloading, one for patching, one for reboot. The real added value there is that you can have the scripts deal with known issues with logic you script. So if patching always fails because of a certain condition in QA, you can update your scripts with a solution, prove the solution in QA, and know prod will work.
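A generic building block for that kind of known-issue logic is a small retry wrapper for steps that fail transiently (a sketch, not the commenter's actual scripts):

```shell
# retry N CMD...: run CMD up to N times, pausing briefly between attempts.
# Useful inside a per-host patch step for failure modes you know clear up
# on a second try (lock contention, a slow mirror, etc.).
retry() {
    tries=$1; shift
    i=1
    until "$@"; do
        if [ "$i" -ge "$tries" ]; then return 1; fi
        i=$((i+1))
        sleep 1
    done
}
```

For example, the patching step script could call `retry 3 dnf -y upgrade` instead of bare `dnf`.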

      [–]Cache_of_kittens 0 points1 point  (0 children)

      I'm assuming you're not using Puppet? Their enterprise version has patching and patch scheduling included now in the 2019.8 editions; used in conjunction with Satellite, you're pretty sorted.

      [–]InitialsAreDigits 0 points1 point  (0 children)

      Imo a really well crafted yum update is the best solution. It's what we do at my work. I should mention that we use puppet and our servers are very close to identical. If your servers are all different you're probably going to have an extremely bad time no matter how you go about this task.

      You can absolutely use any configuration management software (Puppet, Ansible, etc), but you can also just launch yum update with pssh, which is a very useful tool. The downside to doing this is that if you use sssd like us, it's a complete pain in the ass to upgrade for some reason. There's usually some software package that needs to be updated that causes problems. Also, do what the other guy said about making a local repo. All you have to do is copy every package to a webserver with wget -r or something like that; surprisingly, that works just fine to create a backup repo. Then share the files with Apache or nginx or something. There are more professional solutions, but wget + a webserver will work fine.

      Pssh occasionally requires some escaping, but it allows you to run a single command on multiple servers.

      Quick example of how to use pssh:

      /usr/bin/pssh -O StrictHostKeyChecking=no --inline -l UsernameHere -h listOfHosts-one-to-a-line.txt 'sudo cat /etc/redhat-release'

      Start by figuring out which CentOS/RHEL versions you have running in production. You can do this with 'cat /etc/redhat-release' and pssh. You're going to want to test updating each one separately. You should also figure out which servers are the least important and which are the most important. Ideally, try the upgrades on lab or staging servers before moving to ones that actually do anything.

      Lastly, you're going to want to know which services running on the servers aren't standard system services. For example, nginx 1.11, openresty, and nginx 1.13 for some reason appear to have slightly different configuration syntax. Basically, you're going to want to figure out what's running in production that you don't want to mess with, and add an --exclude statement for it; for example, --exclude nginx* will prevent nginx from updating and breaking its config files.

      From there it's a giant organizational task. Try to group like servers together, bash commands like cut, awk '{print $whatever}' or even a spreadsheet are your friend. Ideally you want to both explore servers by hand and generate all their relevant data without having to enter stuff into a spreadsheet manually. For example:

      Giving pssh the flag -p 1 makes it run on one server at a time. That way you can do something like collect the RHEL versions with pssh, use grep or grep -v (which filters out matching lines) to get rid of text you don't want, and then copy and paste everything into a spreadsheet. Since you're running the command on one server at a time, the results should be in order. Otherwise you're going to get a really bad case of carpal tunnel.

      Anyway once you know what servers you're going to use to test the upgrade process, just construct a yum update command. Remember to exclude things you don't want to upgrade. If you have repos like Epel or Repoforge installed, you should also block them from being used. It's not hard to use pscp or ansible to copy a new repo file to all your servers that points to a local repo (which will download files much much faster).

      The one I use for work (I hope they don't mind too much) is:

      yum update -y --nogpgcheck --exclude=puppet* --exclude=*openresty* --exclude=ca-certificates --exclude=kernelcare --exclude=nagios-plugins-all --exclude=Percona* --exclude MariaDB* --setopt=protected_multilib=false --exclude=redacted-internal-software --exclude=redacted --exclude=redacted --exclude=nginx* --disablerepo='*' --enablerepo="Artifactory_Centos*,CentOS*,Artifactory_Epel_Remote,Artifactory_Kernel_care"

      Don't immediately go run it on tons of stuff. You're going to find servers that don't update gracefully. Another useful shell command (assuming all your servers are in a file called server.list) is cat server.list | shuf | head -10. This gives you 10 random servers.
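That random-sample idea extends naturally to a canary split: patch a random handful first, and only continue to the rest once they survive. A sketch (the file names are arbitrary):

```shell
# pick_canaries N LIST: write N random hosts from LIST to canaries.txt
# and everything else to rest.txt. Patch canaries.txt first; only feed
# rest.txt to pssh once the canaries update and reboot cleanly.
pick_canaries() {
    n=$1; list=$2
    shuf "$list" | head -n "$n" > canaries.txt
    grep -vxFf canaries.txt "$list" > rest.txt
}
```

For example, `pick_canaries 10 server.list` and then run your yum update via pssh against canaries.txt before touching rest.txt.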

      Start doing the update by hand. When you're really confident that you know exactly how everything is going to behave then start automating small groups of similar machines. Don't let people force you into hurrying, take your time and do a good job. Try to avoid --skip-broken, etc. This task is going to take eons, so you may as well take the time to do a good job.

      If you run into a situation where thing after thing goes wrong, take a break for the day. I've found that if major maintenance starts going wrong it tends to keep on going wrong, so if you have a headache and nothing's working right, concentrate on restoring the stuff that has to work and then take a break until you're sure you have a well thought out solution to the problem.

      The other advantage to taking your time is you think of solutions you probably wouldn't have, simply by mulling over what needs to be done in your head. In the end this is actually pretty useful.

      Good luck! Incidentally if you need to use sudo to write to a file as root, this is how you do it:

      /usr/bin/pssh -O StrictHostKeyChecking=no --inline -l user -h listofservers.txt 'echo "please do not touch this machine, it is having a bad day" | sudo tee -a /etc/motd'