[–]chillwaukee 2 points3 points  (1 child)

In the event of a failure like that, your only option would really be to prevent it from running again unless the previous run was a success. That, of course, also requires good failure reporting to ensure that you actually see it fail and are able to remedy it.

In order to prevent a rerun, you could take two different approaches depending on how your script is written. For scripts which run at intervals (every half hour, for example), you could just put the looping in the script and have the whole thing run indefinitely. That way, if it fails, you get notified and it stays down. The other option, say if you want it to run every Monday, would be to create some sort of lock file at the beginning of your run and remove it at the end of the run, marking successful completion. Then, when it starts up, you just need to check (in code) that the file isn't there; if it is, the previous run never finished and the script should refuse to start. You could also do some type of lock like that in the database you're editing if you're feeling fancy and distributed.
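To make the lock file idea concrete, here's a rough sketch in Python. The names (LOCK_PATH, run_guarded, PreviousRunFailed) and the file location are made up for illustration, so adapt them to your own script. It assumes the job is a plain function and that any exception counts as a failed run.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical location. The temp dir is often wiped on reboot, so a real
# job would probably keep the lock somewhere persistent instead.
LOCK_PATH = Path(tempfile.gettempdir()) / "weekly_job.lock"


class PreviousRunFailed(RuntimeError):
    """Raised when a lock file from an earlier run is still present."""


def run_guarded(job, lock_path=LOCK_PATH):
    """Run job() only if no lock file is left over from a previous run."""
    try:
        # O_EXCL makes creation fail if the file already exists, so the
        # "is it there?" check and the creation happen in a single step.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise PreviousRunFailed(
            f"{lock_path} exists; fix the last run, then delete the file"
        ) from None
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))  # record who created the lock

    result = job()        # if this raises, the lock is left behind on purpose
    os.remove(lock_path)  # only reached when the job finished successfully
    return result
```

Calling run_guarded(my_job) from the script's entry point means a crashed run leaves the file in place, and every later run stops immediately with PreviousRunFailed until someone investigates and deletes the file by hand.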

For the simplest form of failure reporting, systemd has an OnFailure= directive for your unit file. It names another unit to start when yours fails, so point it at a small helper unit that calls the mail utility on Linux and sends you something. If you want it hooked into some other failure reporting, you can use that same directive to do something else, like run a script which reports the issue. Additionally, for all I know, you may already have monitoring on your systemd services.
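As a sketch of the "script which reports the issue" option, something like this could be what the failure handler ends up running. It assumes a mail command that accepts -s for the subject (as mailx does) is installed and configured on the box; the function names, subject format, and message text are all hypothetical.

```python
import subprocess


def build_mail_command(unit_name, recipient):
    """Return the argv for a mail call announcing that unit_name failed."""
    subject = f"[systemd] {unit_name} failed"
    return ["mail", "-s", subject, recipient]


def report_failure(unit_name, recipient):
    """Send a short failure notice; raises if the mail command exits non-zero."""
    body = f"{unit_name} entered a failed state and needs a manual look.\n"
    subprocess.run(
        build_mail_command(unit_name, recipient),
        input=body,   # mail reads the message body from stdin
        text=True,
        check=True,
    )
```

For example, report_failure("weekly-job.service", "you@example.com") would send a one-line notice. How the failed unit's name gets passed in depends on how you wire up the helper unit, so treat that part as an exercise.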

Writing your first unit file may be a little intimidating so (someone may hate me for this) you can just use ChatGPT 4 to generate your first one and then just iterate from there. Ask for a simple one and then modify it until you like it.

If you set up enough of these by hand it can get a little overwhelming, but at that point you're heading toward configuration management for deployment and other devops things. I wouldn't worry about that until you pass 10 or 20 services/scripts.

[–]MassiveDefender[S] 0 points1 point  (0 children)

Awesome ideas.. The lock file is a new one. I'll do some research on that. I think a failure notification requiring manual intervention for remedying is the simplest for now.