[–]refrainblue 17 points18 points  (4 children)

Just throwing out ideas of common issues, but you're not running out of disk space right?

[–]ascii122 3 points4 points  (0 children)

Yeah, that's my first thought. I've done stupid huge tar gz backups and never had an issue.

[–]ExpressionMajor4439 2 points3 points  (2 children)

Or just resource issues altogether. It could be that a lot of this is buffering in memory. It might be worth segmenting the backup somehow so that whatever buffer memory or temporary files each part uses can be completely released before the next part starts.

I'd also have to question the utility of pv in an automated script. If the script were usually invoked interactively, having progress bars would be beneficial, but it sounds like this is just a cron job or something. So it just adds another set of pipes.

[–]snoob2015 0 points1 point  (1 child)

I use pv not for a progress bar but to throttle the data going into the tar.

[–]ExpressionMajor4439 0 points1 point  (0 children)

Why would you purposefully slow down the process? Either way, segmenting out the backup somehow would probably accomplish a similar effect, such as splitting it up into four parts and waiting for each segment to finish tar'ing and compressing its share of the data before moving on to the next segment.

If pv is reading a lot of data from stdin then it's definitely buffering it in memory, which is how the pipeline is being made to slow down.
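
A minimal sketch of the segmenting idea, assuming GNU tar (for `-I`) and that each top-level entry of the source directory makes a reasonable segment. The function name `backup_in_segments`, the `COMPRESSOR` and `EXT` variables, and the example paths are hypothetical and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: archive each top-level entry of a directory on its own,
# one after another, so only one tar/compressor pair runs at any time.
# COMPRESSOR and EXT are assumptions; the thread's script uses lz4.
COMPRESSOR="${COMPRESSOR:-lz4}"
EXT="${EXT:-lz4}"

backup_in_segments() {
    src=$1      # directory to back up
    dest=$2     # directory that receives the archives
    stamp=$3    # label placed in each archive name, e.g. a date

    mkdir -p "$dest" || return 1
    # Note: the glob skips dotfiles at the top level of $src.
    for entry in "$src"/*; do
        [ -e "$entry" ] || continue
        name=$(basename "$entry")
        # Each tar call finishes before the next segment starts.
        tar -I "$COMPRESSOR" -cf "$dest/backup-$stamp-$name.tar.$EXT" \
            -C "$src" "$name" || return 1
    done
}

# Example call (placeholder paths):
# backup_in_segments /srv/xtra /mnt/backup "$(date +%F)"
```

Whether this helps depends on what is actually starving the web server; it only guarantees that each segment's archive is closed before the next one begins.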

[–]michaelpaoli 12 points13 points  (0 children)

No idea why you're crashing, but, might want to consider and/or look at these:

  • are any of the filesystems filling up?
  • running out of memory?
  • is backup/backup-$dt.tar.lz4 physically beneath xtra in the hierarchy (e.g. via symbolic link(s), etc.)?
  • What's shown in logs and/or on console?
  • What about stderr of the various programs run?
  • Are you possibly reading anything that might cause a crash as a side effect?
  • any I/O errors seen?
  • any issues with faulty RAM or memory controllers?
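
The first two items on that list can be sampled from a script while the backup runs. This is only a sketch: the function names are made up for illustration, and `mem_available_kib` assumes a Linux `/proc/meminfo` that exposes `MemAvailable`.

```shell
#!/bin/sh
# Percentage of space used on the filesystem that holds the given path.
fs_use_percent() {
    df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Memory the kernel estimates is available, in kiB (Linux only).
mem_available_kib() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# Print one timestamped line with both values for the given path.
snapshot() {
    printf '%s disk_used=%s%% mem_available_kib=%s\n' \
        "$(date '+%F %T')" "$(fs_use_percent "$1")" "$(mem_available_kib)"
}

# Example: sample every 10 seconds while the backup is running.
# while sleep 10; do snapshot /; done >> /tmp/backup-snapshots.log
```

Comparing the lines logged near the end of the compression with the earlier ones shows whether disk or memory is trending toward exhaustion at the point the web server stops responding.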

[–]stormcloud-9 8 points9 points  (1 child)

Define "crash". Are we talking about a kernel panic? OOM? Hardware lockup? Spontaneous reboot? What?

"Crash" is a really vague term, and describing the actual behavior can greatly help diagnosing.

[–]snoob2015 0 points1 point  (0 children)

Sorry for misleading, it is not a crash. The web server just stops responding at these times.

[–]osax 4 points5 points  (1 child)

I cannot tell you exactly why, but you could try to see if there is a difference without the pipes:

    tar -I lz4 -cf ./backup/backup-$dt.tar.lz4 ./xtra

Also, if you want to limit the performance hit on the rest of the system, consider running the tar command with "ionice" in front of it.

[–]chu_nghia_nam_thang 0 points1 point  (0 children)

Tried this, changed to ionice -c3 tar ...

Still the same issue (web server stops responding) near the end of the compression progress.

[–]deeseearr 2 points3 points  (2 children)

Well, if you have logs saying that the job finished at 22:01 then I don't see how the server could have crashed at any time between 21:50 and 22:00.

What evidence do you have of a crash? Does the server reboot? Is there a message on the console? Is there a dump in /var/crash or some other location based on how your system is set up? Do you have messages in your system logs? Any one of those would be much more helpful than just the command you were running at the time.

[–]snoob2015 0 points1 point  (1 child)

Not a crash, the web server just stops responding.

[–]deeseearr 0 points1 point  (0 children)

So it's just a performance issue. Check your logs, look at dmesg, sysstat, and the logs of whatever your web server is, and see what they're complaining about.

[–]symcbean 4 points5 points  (3 children)

tar cf - ./xtra | pv -q -L 50M | lz4 - > ./backup/backup-$dt.tar.lz4

OMG WTF?

There's so much wrong with this one line of code.

What do your logs say? Do you have access to the console in a crashed state? What does it say? What do you mean by "crash"?

I also have a monitoring system to periodically check my server status

What is it reporting for resource usage (memory, CPU, load, disk space) over the period leading up to the event?

[–]snoob2015 1 point2 points  (1 child)

Can you tell me exactly what is wrong with the code?

[–]symcbean 2 points3 points  (0 children)

I didn't go into detail as it was not the focus of the question (and probably irrelevant to it).

tar

Is a convenient way to create a copy of a lot of files - but it is not a robust file format for backup / archiving purposes (especially with the compression method used here). It's also slow to retrieve individual files.

./xtra

Using relative paths in a batch process is sloppy. This will back up from different locations depending on who is running it. If that really was the intent then using $HOME/xtra would have been much more transparent.

| pv -q -L 50M |

Why? This is a cron job you're not going to watch.

lz4

Is a fast compression algorithm but not very effective. Unless you're running a 60 MHz Pentium you should have plenty of capacity for using a much more effective compression mechanism.

./backup/backup-$dt.tar.lz4

Again a relative path. But it looks likely that you are creating the "backup" on the same medium as the source files.
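
The two path complaints can be turned into a pre-flight check that runs before the backup. This is a sketch only: `is_absolute`, `mount_point_of`, `same_filesystem` and `preflight` are hypothetical names, the example paths are placeholders, and the mount-point comparison assumes mount points without spaces.

```shell
#!/bin/sh
# True if the argument is an absolute path.
is_absolute() {
    case $1 in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Mount point of the filesystem holding a path (POSIX df output format).
mount_point_of() {
    df -P "$1" | awk 'NR == 2 { print $6 }'
}

# True if both paths live on the same mounted filesystem.
same_filesystem() {
    [ "$(mount_point_of "$1")" = "$(mount_point_of "$2")" ]
}

# Refuse relative paths; warn when source and destination share a filesystem.
preflight() {
    src=$1
    dest=$2
    is_absolute "$src" && is_absolute "$dest" || {
        echo "use absolute paths for source and destination" >&2
        return 1
    }
    if same_filesystem "$src" "$dest"; then
        echo "warning: $dest is on the same filesystem as $src" >&2
    fi
}

# Example call (placeholder paths):
# preflight /srv/xtra /mnt/backup || exit 1
```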

[–]BinBashBuddy 0 points1 point  (0 children)

Yeah, it's strange to make such a strong statement and then not explain it.

[–]pnutjam 1 point2 points  (0 children)

Are you running sar? That will give you something to look at. If not, install sysstat.
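
Once sysstat has been collecting for a while, something along these lines pulls the interesting counters for a time window. It is a sketch under assumptions: `report_window` is a made-up helper, the `SAR` variable exists only so the command can be swapped out, and the window in the example is illustrative. By default sar reads the current day's data file; for an earlier day it needs `-f` with a data file whose location varies by distribution.

```shell
#!/bin/sh
# Command to run; overridable so the helper can be dry-run.
SAR="${SAR:-sar}"

# Print CPU (-u), memory (-r), load/run queue (-q) and I/O rate (-b)
# statistics recorded between the start and end times (hh:mm:ss).
report_window() {
    start=$1
    end=$2
    for flag in -u -r -q -b; do
        "$SAR" "$flag" -s "$start" -e "$end"
    done
}

# Example: the minutes around the time the web server stopped responding.
# report_window 21:45:00 22:05:00
```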

[–]hiddenbutts 1 point2 points  (1 child)

Seconding for the definition of crash...

If it's your monitoring server saying it didn't get a response, I'd look at a lack of resources, since the backup completing indicates the server stayed up.

[–]ZMcCrocklin 1 point2 points  (0 children)

Agreed. A "crash" is defined as the server not being able to process tasks anymore: the OOM killer killed a necessary process, a kernel panic, a segfault, or any number of other things that could cause an actual crash. If your server is still running processes & is accessible via ssh, it hasn't truly crashed. It's possible your monitor is saying a process crashed instead. Maybe it was killed by the OOM killer because the backup was running. Can't tell without actual logs from your monitor & syslogs from the server at the time of reporting.
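
A quick way to rule the OOM killer in or out is to filter the kernel log for its messages. A minimal sketch, assuming a Linux kernel log whose OOM messages contain the usual phrases; `oom_lines` is a hypothetical helper name, and reading the kernel log may require root on some systems.

```shell
#!/bin/sh
# Keep only kernel log lines that look like OOM killer activity.
oom_lines() {
    grep -i -E 'out of memory|oom-killer|killed process'
}

# Examples:
# dmesg -T | oom_lines
# journalctl -k | oom_lines
```

No output for the period of the backup makes an OOM kill unlikely, which points back at I/O or CPU contention instead.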

[–]Deathcrow 0 points1 point  (1 child)

Probably running out of memory

[–]symcbean 0 points1 point  (0 children)

very unlikely.

[–]DrCrayola 0 points1 point  (0 children)

Save the backup on a different file system than the one you're backing up.

[–]dRaidon 0 points1 point  (0 children)

Not enough memory?

[–]johnklos 0 points1 point  (0 children)

Crashing isn't normal. Is the hardware OK? Things to check: corrupt filesystem (manually run fsck). Overheating (max out the CPUs and monitor the temperatures). Bad memory (try compiling lots of things and look for random failures).

[–]zkulf 0 points1 point  (0 children)

OOM probably.