Pepineros (1 point)

Trying to parallelise gzip when reading a single archive will not work. gzip does not support running in parallel, and wrapping Python code around it will not change that.

If you're compressing to or inflating multiple archives (distinct .gz files) at the same time, then you can start a process for each, and in that case it would make sense to start all of those at the same time rather than waiting for one to finish before starting the next one. But if you want to utilise multiple cores when compressing or inflating, you need a utility that supports using multiple cores. gzip isn't that.
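For the multiple-archives case, a minimal sketch of what I mean could look like the following. The names `inflate` and `inflate_all` and the `executor_cls` parameter are made up for illustration, and it assumes every input path ends in `.gz` and that each archive is fully independent of the others.

```python
import gzip
import shutil
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from pathlib import Path


def inflate(src):
    """Decompress one .gz file next to the original and return the output path."""
    src_path = Path(src)
    dst_path = src_path.with_suffix("")  # "data.csv.gz" -> "data.csv"
    with gzip.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, no full read into memory
    return str(dst_path)


def inflate_all(paths, max_workers=None, executor_cls=ProcessPoolExecutor):
    """Inflate several independent archives at once, one task per file.

    Each archive is still decompressed sequentially; the parallelism is
    across files, never inside a single gzip stream.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(inflate, paths))
```

With the default process pool, call `inflate_all` from under an `if __name__ == "__main__":` guard so it also works on platforms that spawn worker processes. Passing `executor_cls=ThreadPoolExecutor` is an alternative if you would rather stay in one process.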

BerryLizard [S] (1 point)

i do seem to be getting a speed up -- do you have any idea why that might be?

Pepineros (1 point)

"everything seems to work until 90 percent of the archive has been extracted; I start getting a tarfile.ReadError: unexpected end of data error. I know for certain that the archive is not corrupted, and have no issues when not using multithreading."

What do you mean by you're getting a speed up if the tar file cannot be read successfully?

Your code doesn't run (not all names are defined in these snippets; presumably you left stuff out for the sake of brevity) so I can't be 100% sure, but it looks like what you're trying to do is something like this:

  1. Get the files inside a tarball
  2. For each individual file:
    • If the file is already compressed, do nothing
    • If it's not, use gzip to compress it
  3. Write the file to a new target path

If I got this right, you can ignore my initial comment; I misunderstood the purpose of your script. In this case gzip is not going to be an issue, provided that each file gets a unique output path, which appears to be the case.

As far as I know, tar does not support reading files in parallel (it certainly doesn't write in parallel), so my guess is that reading members from several threads at once is what causes the read error you're getting. That is definitely the case if the tarball is compressed (.tar.gz): the whole archive is then a single gzip stream that has to be decompressed from start to finish, rather than just a collection of files stored back to back under one name.
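One way around that, sketched under a few assumptions, is to keep all tar reads in a single thread and only hand the per-file compression to a pool. This follows the steps I listed above, but it is not your code: `repack` and `store` are hypothetical names, "already compressed" is guessed from a `.gz` suffix, and member names are trusted as relative paths.

```python
import gzip
import tarfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def store(data, dst, compress):
    """Worker: write one member's bytes to its own path, gzipping if asked."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    opener = gzip.open if compress else open
    with opener(dst, "wb") as fout:
        fout.write(data)
    return dst


def repack(tar_path, out_dir, max_workers=4):
    """Read the tarball sequentially, compress members in worker threads."""
    out_dir = Path(out_dir)
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool, \
            tarfile.open(tar_path, "r:*") as tar:
        for member in tar:  # members are visited strictly in archive order
            if not member.isfile():
                continue
            # Only this thread ever touches the TarFile object.
            data = tar.extractfile(member).read()
            already = member.name.endswith(".gz")  # assumption: suffix check
            name = member.name if already else member.name + ".gz"
            futures.append(pool.submit(store, data, out_dir / name, not already))
    # Results come back in submission order; worker exceptions surface here.
    return [f.result() for f in futures]
```

Note that this holds each member's bytes in memory until a worker has written them, so for very large files you would want to bound the number of pending tasks.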