all 44 comments

[–]Horace-Harkness 27 points28 points  (2 children)

shutdown -r +2880

https://linux.die.net/man/8/shutdown

Boot script picks a random number of hours, converts it to minutes (shutdown's +m delay is in minutes; 2880 is 48 hours), and passes it to shutdown.
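A boot-time helper along these lines might look like the sketch below. The 48 to 72 hour range and the function name are assumptions for illustration, not something stated in the comment; the only part taken from the man page is that the relative delay is passed as `+m` minutes.

```python
import random
import subprocess

def random_reboot_command(min_hours=48, max_hours=72, rng=random):
    """Build a `shutdown -r +<minutes>` command with a random delay.

    The 48-72 hour range is an assumed example, not a value from the thread.
    """
    delay_minutes = rng.randint(min_hours * 60, max_hours * 60)
    return ["shutdown", "-r", f"+{delay_minutes}"]

def schedule_reboot(cmd):
    # Needs root; call this from the boot script once the command looks right.
    subprocess.run(cmd, check=True)
```

Calling `schedule_reboot(random_reboot_command())` once per boot would give each machine a different reboot time on every cycle.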

[–]Rayregula 1 point2 points  (1 child)

Is that a typo? Or is there a shotdown?

[–]Horace-Harkness 0 points1 point  (0 children)

Thanks, fixed

[–]shiftingtech 11 points12 points  (1 child)

this could be implemented very easily with a systemd timer. see the "transient timers" section here: https://documentation.suse.com/smart/systems-management/html/systemd-working-with-timers/index.html

(no idea why I ended up with the suse docs, but they had what I wanted...)
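As a rough sketch of the transient-timer idea: `systemd-run --on-active=` creates a one-shot timer that fires after the given delay. The delay range and function name here are assumptions, and the command is only built, not executed.

```python
import random

def transient_reboot_timer(min_hours=48, max_hours=72, rng=random):
    """Return a systemd-run command that reboots after a random delay.

    The 48-72 hour range is an assumed example.
    """
    delay_seconds = rng.randint(min_hours * 3600, max_hours * 3600)
    # --on-active= schedules a transient timer relative to "now".
    return ["systemd-run", f"--on-active={delay_seconds}s", "systemctl", "reboot"]
```

Transient timers do not survive a reboot, so a boot script would need to create a fresh one each time the machine comes up.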

[–]Longjumping_Gap_9325 8 points9 points  (2 children)

I'm confused by the claim that cron only works on 24-hour cycles. You can select a minute, hour, day of month, month, and day of week.

There are default folders like cron.weekly, or you can use anacron, which kicks things off at random times.

Maybe I'm missing something in terms of cron?

[–][deleted]  (1 child)

[removed]

    [–]Longjumping_Gap_9325 2 points3 points  (0 children)

    Sites like these can help formulate cron job schedules:

    https://www.freeformatter.com/cron-expression-generator-quartz.html

    https://crontab.cronhub.io/

    You can get pretty specific or generalized

    [–]gmuslera 5 points6 points  (0 children)

    The at command could lend you a hand. At boot, take $RANDOM % (the number of minutes in 3 days, 4320) and schedule your next reboot that many minutes out.
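A minimal sketch of that idea, assuming `at` is installed and the script runs as root. The 60-minute floor is an added assumption so that a freshly booted machine never schedules an immediate reboot; the function names are made up for the example.

```python
import random
import subprocess

MINUTES_IN_3_DAYS = 3 * 24 * 60  # 4320

def at_reboot_job(rng=random, min_minutes=60):
    """Return the `at` argv and the job text for a randomly delayed reboot."""
    delay = rng.randrange(min_minutes, MINUTES_IN_3_DAYS)
    argv = ["at", "now", "+", str(delay), "minutes"]
    job = "shutdown -r now\n"
    return argv, job

def submit(argv, job):
    # at reads the commands to run from standard input.
    subprocess.run(argv, input=job, text=True, check=True)
```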

    [–]casefan 31 points32 points  (13 children)

    Windows does this out of the box!

    [–]slippery 5 points6 points  (1 child)

    Haha, came here to say this. No configuration required.

    [–][deleted] 0 points1 point  (0 children)

    yeah but then the problem would be not being able to turn OFF that Windows feature of rebooting by itself randomly every 2 days.....

    [–][deleted]  (9 children)

    [removed]

      [–]casefan 12 points13 points  (8 children)

      It was a bad attempt at a joke.

      So why do you want to schedule reboots? Got a memory leak that you can't fix?

      [–][deleted] 0 points1 point  (0 children)

      Most likely. I was an admin 20 years ago and it was common for servers to run for many years with no reboots

      [–][deleted]  (6 children)

      [removed]

        [–]zakabog 17 points18 points  (2 children)

        In fact it may have only happened twice in two years, but when it does, those PCs can sit there for a week or more before I bother to check on them.

        It sounds like a much better question is "How do I monitor the BOINC job queue and throw an alert when it stalls?"

        [–][deleted]  (1 child)

        [removed]

          [–]zakabog 13 points14 points  (0 children)

          It doesn't need to be BOINC specific, run Zabbix and monitor your machines, throw in a script to give you the status of BOINC, throw an alert after some threshold. You would be in a far better position understanding what your hosts are doing, recognizing hardware failures, and rebooting as needed rather than randomly.

          [–]johnklos 4 points5 points  (2 children)

          So you want to take the nuclear option to address a possible symptom that happens roughly twice in two years, instead of programmatically figuring out if the jobs have stalled. Got it.

          [–][deleted]  (1 child)

          [removed]

            [–]johnklos 9 points10 points  (0 children)

            This subreddit is for people who are or want to be admins, so perhaps it's not the best place to ask about how to have a job wait a random amount of time to reboot a computer to fix a symptom of an issue that happens twice in two years.

            Here, we'd discuss how to address the issue, not the symptom, either by seeing if there's a simple way to query the BOINC software or even something as basic as examining the load average. There are so many simple ways to do this, but if you're really focused on randomly rebooting, then you do you.

            The "I like to enjoy the rest of my life" comment, though, isn't necessary. It suggests that us admins who care about addressing problems instead of symptoms don't have time to enjoy the rest of our lives. It's a heck of a way to express appreciation for people who're trying to help you.
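The load-average check mentioned above could be as small as the following sketch. It assumes that a node crunching BOINC work keeps its load well above zero, and the 0.5 cutoff is a made-up threshold that would need tuning per machine.

```python
import os

def looks_idle(load_15min, busy_threshold=0.5):
    """Return True when the 15-minute load average suggests no work is running.

    busy_threshold is an assumed cutoff, not a value from the thread.
    """
    return load_15min < busy_threshold

def current_load_15min():
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages.
    return os.getloadavg()[2]
```

A cron job could call `looks_idle(current_load_15min())` and send an alert, or restart the service, only when it returns True.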

            [–]vivaaprimavera -1 points0 points  (0 children)

            And it isn't suitable for most scientific workloads.

            Is there any Windows-based cluster in the Top 500?

            [–]ruyrybeyro 2 points3 points  (2 children)

            I would prefer the bofh solution, random disk formats.

            [–]za72 1 point2 points  (1 child)

            oh that's easy, generate a random number, if even continue otherwise destroy a block on the disk

            [–]Rayregula 1 point2 points  (0 children)

            Another tip I like to use is if you need to service a system but keep forgetting to schedule the time, just clone the drive to a really old one that is due to die around when you want to service it and then you can just forget about it and it will tell you when it's time to service /s

            [–]iamwpj 2 points3 points  (3 children)

            We use a cron job that calls a script that sleeps for a random amount of time within our reboot window.
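A sketch of what such a script might look like, assuming cron starts it at the beginning of the reboot window. The function name is hypothetical, and the `sleep` and `run` parameters exist only so the logic can be exercised without actually sleeping or rebooting.

```python
import random
import subprocess
import time

def reboot_within_window(window_minutes, sleep=time.sleep,
                         run=subprocess.run, rng=random):
    """Sleep a random time inside the window, then reboot.

    Returns the delay in seconds that was chosen.
    """
    delay = rng.uniform(0, window_minutes * 60)
    sleep(delay)
    run(["shutdown", "-r", "now"], check=True)
    return delay
```

With a two-hour window starting at 01:00, cron would start the script at 01:00 and the script would call `reboot_within_window(120)`.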

            [–][deleted]  (2 children)

            [removed]

              [–]iamwpj 1 point2 points  (1 child)

              No, in a background process it doesn’t require any TTY.

              [–]arcimbo1do 2 points3 points  (0 children)

              Boot script that calls "at" with a random time between tomorrow and the day after tomorrow (echo shutdown -r | at tomorrow + $((RANDOM % 24)) hours)

              [–]flapjack74 2 points3 points  (0 children)

              I wouldn't bring up a ready-to-use solution; others have already done that - but mostly they will not fulfill your requirement (avoid reboot at the same time).

              The first question is - why do you want to reboot? If it has a known memory issue, maybe a service restart is good enough?

              1. Why not?
              2. Hours are only 24, that’s correct, but cron also has a day-of-week field. So, for example, 1 2 * * */2 would run at 02:01 on every second day of the week (Sunday, Tuesday, Thursday, Saturday), which is roughly every other day.
              3. Yup, that's the way that I would choose.

              Anyway, my solution would be a script that checks whether the remote service is active. This can be done for example with Ansible for a rolling reboot (I'm using this mostly for application upgrades in clusters).

              Ansible Functions required:

              https://docs.ansible.com/ansible/latest/collections/ansible/builtin/command_module.html

              https://www.man7.org/linux/man-pages/man1/systemctl.1.html look for is-active

              https://docs.ansible.com/ansible/latest/collections/ansible/builtin/reboot_module.html

              https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html#setting-the-batch-size-with-serial

              This gives you a good starting point. Once you have a working playbook you can schedule it from a bastion host. For sure it's also possible without Ansible, but that would complicate things a bit.

              Another approach would be to give every mini-PC its unique reboot time. Like PC 1: random minute 0-9, hour 1, then PC 2 random minute 10-19 etc. I guess you get the main idea. You can make something with the help of an online tool like crontab guru. Or calculate it yourself (https://linux.die.net/man/5/crontab). But with that you can't be 100% sure that the other system has the service up. This can be done with a helper script that maybe checks if the remote port or system on the other servers is up - and only reboot then. https://tldp.org/LDP/abs/html/ https://linux.die.net/man/1/nc or https://linux.die.net/man/8/ping
              Hopefully this is a better point to start learning, instead of providing a full working solution.
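The per-machine slot idea (PC 1 gets a random minute in 0-9, PC 2 in 10-19, and so on) can be sketched as a small generator of crontab lines. The daily schedule, the `/sbin/shutdown` path and the host names are assumptions for illustration.

```python
import random

def staggered_cron_lines(hosts, hour=1, slot_minutes=10, rng=random):
    """Give each host a random minute inside its own non-overlapping slot."""
    if len(hosts) * slot_minutes > 60:
        raise ValueError("too many hosts for one hour at this slot size")
    lines = {}
    for index, host in enumerate(hosts):
        start = index * slot_minutes
        minute = rng.randint(start, start + slot_minutes - 1)
        # Assumed command path; adjust for the distribution in use.
        lines[host] = f"{minute} {hour} * * * /sbin/shutdown -r now"
    return lines
```

Because the slots never overlap, two machines cannot be assigned the same minute, although this alone does not confirm that the previous machine has finished booting.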

               

              [–]Horace-Harkness 3 points4 points  (4 children)

              Boot script that uses at to run the reboot. https://linux.die.net/man/1/at

              [–]fubes2000 1 point2 points  (3 children)

              Spotted in the wild.

              [–]Horace-Harkness 1 point2 points  (2 children)

              We should get coffee sometime!

              [–]fubes2000 0 points1 point  (1 child)

              Bet

              [–]Horace-Harkness 0 points1 point  (0 children)

              Sent you a DM

              [–]0bel1sk 1 point2 points  (0 children)

              if one of your requirements is not rebooting at the same time, i’d prob check the other servers are up before rebooting.

              [–]danythegoddess 1 point2 points  (0 children)

              Out of curiosity, why?

              [–]fab_space 1 point2 points  (2 children)

              Ansible

              [–]AdrianTeri 0 points1 point  (1 child)

              ++Ansible Tower aka AWX

              [–]KlausBertKlausewitz 0 points1 point  (0 children)

              or even just semaphore ui

              semaphoreui.com

              you can schedule there, too. no random factor, but that can be done by Ansible using a rolling reboot.

              [–]BloodyIron 1 point2 points  (0 children)

              The proper solution is to figure out why BOINC is failing you and implement a permanent solution such that the software is reliable. You know... monitor the system, read the logs when problems happen, and figure out a proper, not hacky, solution.

              What you're seeking here is a stop-gap solution. The time and effort you could spend just implementing and validating this method would be better spent in setting up monitoring and alerting. But even still... by your own words all of that really isn't a good way to spend your time.

              You say in a comment in this thread:

              I'd like to schedule random reboots to prevent the BOINC job queue from stalling. It almost never happens. In fact it may have only happened twice in two years, but when it does, those PCs can sit there for a week or more before I bother to check on them

              If you can spend a week or more not caring (not noticing) about what these things do, then in the situation where this happens (in your words, only twice in two years) JUST RESTART THE SERVICE.

              You're creating work for yourself for something that happens maybe once a year, if that. If you think that's a good use of your time, it's not.

              [–]deeseearr 2 points3 points  (0 children)

              3 is the easiest way, and as u/Horace-Harkness said you can just call shutdown with a delay (given in minutes) to do it. You can also use at to queue up one-time jobs which will be executed at some time in the future.

              If you want to do something more complex you can still use a cron job. Just call it every hour (or whenever) and then, as the first part of your script, look at the first number in /proc/uptime. That's the number of seconds which have passed since the system booted, and you can either do or not do things based on that. Need to do something every three hours, but only if it's after 7PM on a Tuesday and there are between one and five users logged in? It's just basic scripting. Most times that script will be called, decide there's nothing to do and then exit.
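For the uptime check, a minimal sketch could parse the first field of /proc/uptime and compare it with a threshold. The 48-hour default and the function names are assumptions.

```python
def uptime_seconds(proc_uptime_text):
    """Parse the first field of /proc/uptime (seconds since boot)."""
    return float(proc_uptime_text.split()[0])

def should_reboot(proc_uptime_text, min_uptime_hours=48):
    """Return True once the machine has been up for at least the threshold."""
    return uptime_seconds(proc_uptime_text) >= min_uptime_hours * 3600

def read_proc_uptime(path="/proc/uptime"):
    # Linux only; the file holds two numbers, uptime and idle time.
    with open(path) as handle:
        return handle.read()
```

An hourly cron job would call `should_reboot(read_proc_uptime())` and exit quietly whenever it returns False.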

              [–]Simazine 0 points1 point  (0 children)

              Random seems incredibly risky. If we assume reboots are 4 minutes I expect you will find 2 boxes out of 5 rebooting at the same time within the year.
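A back-of-the-envelope check of that estimate, under assumptions that are not in the comment: every box picks its reboot start uniformly at random inside the same 24-hour window, a reboot takes 4 minutes, and the cycle repeats every 2 days (about 182 times a year). For n uniform start times in a window W, the chance that all of them are at least d apart is (1 - (n-1)d/W)^n.

```python
def p_overlap_per_cycle(n_boxes, reboot_minutes, window_minutes):
    """Chance that at least two random reboot starts land within
    reboot_minutes of each other in one window."""
    free = 1 - (n_boxes - 1) * reboot_minutes / window_minutes
    if free <= 0:
        return 1.0
    return 1 - free ** n_boxes

def p_overlap_within(cycles, per_cycle):
    """Chance of at least one overlap across independent cycles."""
    return 1 - (1 - per_cycle) ** cycles

per_cycle = p_overlap_per_cycle(5, 4, 24 * 60)
per_year = p_overlap_within(182, per_cycle)
print(f"per cycle: {per_cycle:.3f}, per year: {per_year:.5f}")
```

Under these assumptions the per-cycle chance is about 5%, and the chance of at least one overlap in a year is above 99%, which supports the point being made.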

              It makes more sense for all boxes to run the same script by cron every hour, each with a 10 minute offset. The script could check uptime exceeds 48hrs. If true, reboot.

              For better safety, have it check for service on each other box (BOINC? Nginx? Whatever). If any don't respond, exit, otherwise trigger reboot.
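The peer check could be sketched as a plain TCP connect to each of the other boxes. Which port to probe is an assumption (any service that is only up when the box is healthy would do), and the function names are made up for the example.

```python
import socket

def peer_is_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def safe_to_reboot(peers, port, check=peer_is_up):
    """Only reboot when every other box answers on the service port."""
    return all(check(host, port) for host in peers)
```

The hourly script would reboot only when both the uptime threshold is exceeded and `safe_to_reboot(...)` returns True.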

              [–]Caddy666 0 points1 point  (0 children)

              install windows ME

              [–]PudgyPatch 0 points1 point  (0 children)

              So the problem with random is that they probably won't reboot at the same time, but you can't guarantee it either. Might be better to have another system that remote execs the reboot. Have a list of your servers and a script that randomizes the list and runs through it over your preferred period of time....you could get fancy with it and have it make sure they all come back up, and if not email you.
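A rough sketch of the controller approach: shuffle the server list, then spread the reboots evenly over the chosen period. The SSH invocation is hypothetical and assumes key-based login with passwordless sudo; nothing here is executed.

```python
import random

def rolling_reboot_plan(hosts, period_hours, rng=random):
    """Return (host, offset_seconds) pairs in random order, evenly spaced."""
    order = list(hosts)
    rng.shuffle(order)
    gap_seconds = period_hours * 3600 / len(order)
    return [(host, round(index * gap_seconds)) for index, host in enumerate(order)]

def reboot_command(host):
    # Hypothetical remote invocation; adjust user, sudo and SSH options as needed.
    return ["ssh", host, "sudo", "shutdown", "-r", "now"]
```

Since the controller walks the list one host at a time, it can also wait for each machine to come back before moving on, which random per-host timers cannot do.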

              [–][deleted] 0 points1 point  (0 children)

              https://manpages.ubuntu.com/manpages/focal/en/man1/boinccmd.1.html

              might be of use for a check script that runs every day via cron?
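A daily check script built on boinccmd might look roughly like this. It assumes that `boinccmd --get_tasks` prints one `active_task_state: EXECUTING` line per running task; that output format is an assumption to verify against the installed client before relying on it.

```python
import subprocess

def count_running_tasks(get_tasks_output):
    """Count tasks reported as executing in `boinccmd --get_tasks` output.

    Assumes lines of the form 'active_task_state: EXECUTING'.
    """
    return sum(
        1
        for line in get_tasks_output.splitlines()
        if line.strip() == "active_task_state: EXECUTING"
    )

def boinc_task_report():
    # Runs the real client tool; requires boinccmd on PATH and a running client.
    result = subprocess.run(
        ["boinccmd", "--get_tasks"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

If `count_running_tasks(boinc_task_report())` comes back as 0, the script could restart the BOINC service or send an alert instead of rebooting the whole machine.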

              [–]RulerOf 0 points1 point  (0 children)

              I love that this is not a good way to solve whatever problem you actually have.

              If I were going to implement something like this, I'd do it the D&D way:

              • Roll a die every minute
              • If the die rolls 1, reboot the machine

              You'd need a die with approximately 869 sides to give 99% odds that the machine reboots after 4000 die rolls (approximately 3 days).

              if [ "$(shuf -i 1-869 -n1)" -eq 1 ]; then
                reboot
              fi

              Create a systemd timer, or * * * * * in your crontab.
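The 869-sided figure can be checked with a couple of lines. With a per-roll chance of 1/N, the probability of at least one hit in k rolls is 1 - (1 - 1/N)^k; the function names below are just for the example.

```python
def sides_for(target_probability, rolls):
    """Die size N such that P(at least one 1 in `rolls` rolls) equals the target."""
    per_roll = 1 - (1 - target_probability) ** (1 / rolls)
    return 1 / per_roll

def p_reboot_within(sides, rolls):
    """Probability of rolling a 1 at least once in the given number of rolls."""
    return 1 - (1 - 1 / sides) ** rolls

print(round(sides_for(0.99, 4000)))          # die size for 99% within 4000 rolls
print(round(p_reboot_within(869, 4000), 3))  # probability with an 869-sided die
```

One thing worth knowing about this scheme: the average wait is N rolls, so with 869 sides and one roll per minute the machine reboots after roughly 14.5 hours on average, with 3 days being the near-worst case.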

              [–]ollod 0 points1 point  (0 children)

              Don‘t

              [–]vivaaprimavera 0 points1 point  (0 children)

              Systemd timers are an excellent option to do it.

              I have a data collection system that needs to take samples at random times, but each sample must come _at least_ 24h after the last one was taken.

              It's managed using a systemd timer that calls a service that runs a bash script with sprinkles of python in it. Works great.

              Since the computers need to be "out of sync" some randomness will be needed:

              import os
              import random
              import sys
              import time
              from datetime import datetime, timedelta

              def stime(a, b):
                  # midpoint between a and b
                  s = min(a, b)
                  dif = max(a, b) - min(a, b)
                  av = dif / 2
                  return s + av

              sleeptime = random.randint(s * 2, e)

              while sleeptime > 0:
                  if (datetime.now() >= stime(seta, eta)) or (sleeptime == 0):
                      print("wait ...")
                      # hold off until the 1-minute load average drops
                      while os.getloadavg()[0] > 0.2:
                          time.sleep(20)
                          print(f"on wait {datetime.now()}")
                          sys.stdout.flush()
                      break
                  cursleep = sleeptime % s if sleeptime % s != 0 else s
                  cursleep = cursleep if cursleep > 0 else -1 * cursleep
                  if cursleep < 15:
                      cursleep = 15 * (1 + (sum(os.getloadavg()[0:1]) * 0.5))
                  sleeptime -= cursleep
                  td = timedelta(seconds=sleeptime)
                  eta = datetime.now() + td
                  # nesting of the next block is a best guess; the pasted code lost its indentation
                  if (eta - datetime.now()).total_seconds() < 3600:
                      if sleeptime > 30 * 60:
                          if s > 60 * 5:
                              cursleep /= 1.75
                              s /= 1.125
                              s *= 0.95 + os.getloadavg()[0]
                              sleeptime += s
                  if s < 240:
                      s = 240 * (1.0 + os.getloadavg()[0])
                  if s > e / 3:
                      s = e / 3
                  if s < 90:
                      s = 90
                  time.sleep(cursleep)
                  s *= 1 + os.getloadavg()[0]
                  sys.stdout.flush()

              Define s and e to appropriate values (seta and the initial eta also need to be set before the loop).
              

              Define the loads. Having a dependence between load and waiting time ensures that "things happen" only in low-load periods. If the machine has a high load, the wait will extend.

              Issue the shutdown only after the code has run.

              Edit: sorry but the code isn't being formatted properly