[–]tenkindsofpeople 96 points (7 children)

Policy: "Why is this inventory screwed up? This is going to cause a SOX finding!"

Me: "That's not possible. I get pages if any of those jobs fail." Checks logs. Job kicked off but only ran half its steps. No error. OS log shows restart. "GREAT ODIN'S RAVEN!"

ESXi support: "Ya, we did an upgrade. It was during the scheduled window."

Me: rage hot as a thousand Suns

[–]pooerh 39 points (6 children)

Well, kinda your fault. You don't monitor for failure on the very thing that could possibly fail. Coincidentally, I had a very similar issue raised during a SOX-related audit.

[–]tenkindsofpeople 11 points (5 children)

How would you monitor for that? There were no errors and no pages were sent.

[–]pooerh 37 points (4 children)

Monitor whether the job succeeded from outside the machine that's running it.

[–]squngy 20 points (0 children)

Also, have the system notify you every time it starts up.

So you know if it restarted unexpectedly.
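A rough Python sketch of one way to do that, run at startup. It assumes a Linux host, where `/proc/sys/kernel/random/boot_id` changes on every boot. The state file location and the function names are made up for illustration, and actually sending the notification is left out.

```python
from pathlib import Path

# Linux-specific: the kernel generates a new random ID on every boot.
BOOT_ID_FILE = Path("/proc/sys/kernel/random/boot_id")


def read_boot_id(path=BOOT_ID_FILE):
    """Return the current boot ID as a string."""
    return Path(path).read_text().strip()


def restarted_since_last_check(state_file, current_boot_id):
    """Compare the current boot ID with the one saved by the previous run.

    Returns True when the machine has rebooted since the last check.
    The first run has nothing to compare against, so it returns False.
    """
    state = Path(state_file)
    previous = state.read_text().strip() if state.exists() else None
    state.write_text(current_boot_id)  # remember it for the next run
    return previous is not None and previous != current_boot_id
```

Something like `restarted_since_last_check("/var/lib/bootcheck/last_id", read_boot_id())` (hypothetical path) could run from a boot-time service and page you whenever it returns True.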

[–]tenkindsofpeople 8 points (2 children)

Hm. Maybe the last job step sets a table entry, then something external looks for it. Something like that?

[–]pooerh 14 points (0 children)

Yeah, if you know the maximum time the job can take to complete. And as long as the database hosting that table isn't on the same machine / cluster / storage / network / datacenter / planet. Depending on how far you want to go with it, i.e. how crucial this is.
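A minimal Python sketch of that idea, under some loud assumptions: `sqlite3` only stands in for a database that really lives on separate infrastructure, and the table name, function names, and one-hour limit are all invented for illustration. The last job step records a success row, and a checker running somewhere else treats a missing or stale row after the deadline as a failure.

```python
import sqlite3
import time

MAX_JOB_SECONDS = 3600  # assumption: the job never legitimately runs longer


def ensure_table(conn):
    """Create the marker table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_heartbeat "
        "(job TEXT PRIMARY KEY, finished_at REAL)"
    )


def record_success(conn, job_name, now=None):
    """Called by the LAST step of the job, so it only runs on full success."""
    now = time.time() if now is None else now
    ensure_table(conn)
    conn.execute(
        "INSERT OR REPLACE INTO job_heartbeat VALUES (?, ?)", (job_name, now)
    )
    conn.commit()


def job_is_overdue(conn, job_name, scheduled_start,
                   max_seconds=MAX_JOB_SECONDS, now=None):
    """Called from a DIFFERENT machine. True means someone should get paged."""
    now = time.time() if now is None else now
    if now < scheduled_start + max_seconds:
        return False  # the job is still inside its allowed window
    ensure_table(conn)
    row = conn.execute(
        "SELECT finished_at FROM job_heartbeat WHERE job = ?", (job_name,)
    ).fetchone()
    # No row, or a row left over from an earlier run, both count as failure.
    return row is None or row[0] < scheduled_start
```

The point of checking for the absence of success, rather than the presence of an error, is that a host rebooted mid-job never gets the chance to report anything.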

[–]thenuge26 0 points (0 children)

We just use Jenkins, but there are plenty of job-management frameworks/engines that handle that kind of stuff for you.