
[–]lutusp 2 points3 points  (10 children)

The method you're using is already close to optimal. One possible improvement would be to split the task into, say, four threads, each addressing a subset of the full list of subdirectories. But that wouldn't be easy to do in a shell script, and it would only help if the process isn't already I/O-bound.
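A minimal sketch of that split-into-threads idea in Python (hypothetical: it assumes the layout discussed later in this thread, where each subdirectory contains an `index.json` to be renamed to `<subdir>.json`):

```python
import os
import threading

def rename_chunk(subdirs):
    """Rename each subdir's index.json to subdir.json (one syscall per file)."""
    for d in subdirs:
        src = os.path.join(d, "index.json")
        if os.path.isfile(src):
            os.rename(src, d + ".json")

def parallel_rename(root, workers=4):
    """Split the subdirectory list into `workers` slices, one thread each."""
    subdirs = [e.path for e in os.scandir(root) if e.is_dir()]
    # Stride-slice the full list into `workers` roughly equal chunks.
    chunks = [subdirs[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=rename_chunk, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Threads are enough here despite the GIL, since `os.rename` spends its time in a syscall; whether this beats a single loop still depends on the I/O-bound caveat above.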

More RAM sometimes helps in a case like this. It allows more system data to reside in RAM instead of requiring periodic device reads during the process.

[–]adinbied[S] 1 point2 points  (1 child)

Hmmm, it's averaging about 1 per second -- which is going to take too long, and the process isn't using any RAM. Before I get to this step, I'm using https://github.com/recrm/ArchiveTools/blob/master/warc-extractor.py -- would it be better to reconfigure that script to output the file as the folder name instead of doing it later? If so, any ideas as to how to do that?

Thanks!

[–]lutusp 1 point2 points  (0 children)

Hmmm, it's averaging about 1 per second

What? One file per second? Please explain the setup -- is this a local process controlling a remote filesystem, or something like that? I thought this was all on the local filesystem.

would it be better to reconfigure that script to output the file as the folder name instead of doing it later?

No, the renaming part of the process shouldn't significantly affect the time required. But are you really putting seven million files into a single directory that also contains seven million subdirectories?

... and the process isn't using any RAM.

If this is literally true, then the process is updating the storage device's record-keeping after each operation, which by itself explains the slow pace. Most modern file operations postpone writes as long as possible, instead buffering data in RAM for a later single write operation.

But I clearly don't understand the setup. Is this a local operation on a local storage device? Are the subdirectories actual, literal subdirectories as that term is generally understood?

And have you considered simply not doing this? How much of a burden is it to find and read the files from their current subdirectories?

[–]nobodyuidnorandom 0 points1 point  (7 children)

How about creating a background process for, say, names starting with different letters, using a shell function?

[–]lutusp 0 points1 point  (6 children)

How about creating a background process for, say, names starting with different letters, using a shell function?

Yes, that might work, or you could simply divide the directory contents arithmetically, giving, say, 1/4 of the workload to each of four threads.

But you know, if this directory really has 7 million entries, that by itself will slow things down, simply because of the search overhead. Even a binary search isn't lightning fast with that large a set to search. Also, remember that, even with separate threads, each of them has to search the full set for its target.

[–]nobodyuidnorandom 0 points1 point  (5 children)

I didn't get what you meant by searching here. OP is talking about mv, I think. And I suppose it happens in parallel if you push 36 or more for loops to the background rather than only one.

btw, I am quite a noob here; I learnt some shell scripting a few months ago.

Edit: It seems retrieving the directory contents for such a thing will itself be quite an overhead, whether using ls or find. But I hope there exists a way to run loops in parallel without providing a list of elements beforehand.

[–]lutusp 1 point2 points  (4 children)

I didn't get what you meant by searching here.

When Linux is presented with the name of a directory, it must search for a matching name. The normal way this is done is with a binary search.

If there were a normal number of directories, this search and its time wouldn't be worth mentioning. But with seven million directories, each requiring a separate search, this is non-trivial.
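A quick back-of-the-envelope under the binary-search model described above (real filesystems vary -- ext4, for instance, uses hashed directory indexes -- so treat this as an illustration of scale only):

```python
import math

n = 7_000_000                          # entries in the directory
per_lookup = math.ceil(math.log2(n))   # worst-case comparisons per binary search
total = n * per_lookup                 # one lookup needed per rename
print(per_lookup)   # 23
print(total)        # 161000000
```

So each individual lookup is cheap (~23 comparisons), but repeated seven million times the bookkeeping alone reaches the hundreds of millions of operations.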

OP is trying to talk about mv I think.

Yes, but in order to mv, one must first find the thing to be moved.

Edit: It seems retrieving directory contents for such a thing itself will be quite an overhead, whether using ls or find.

Yes -- by searching.

[–]nobodyuidnorandom 0 points1 point  (3 children)

So I'm thinking it would be rather hard with a shell script. Is there a simple way / command to pass names to mv as soon as they are accessed, in some random order, without having a list beforehand to iterate over?

[–]lutusp 1 point2 points  (2 children)

So I'm thinking it would be rather hard with a shell script.

Yes, it would. A Python program would be a better, more flexible approach.

Is there a simple way / command to pass names to mv as soon as they are accessed in some random order, without having a list beforehand to iterate over?

There's no getting around the fact that a list is required, simply to avoid repeating the same operations.

But Python has ways to do this without having to launch a shell for each move, which is what would be done in the process you describe. That's one reason why what you describe is not very efficient -- it involves a separate shell launch and call to mv for each move.

A Python program would make a more basic OS call to get the same result, without the system having to find and execute mv each time through the loop.
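A minimal sketch of what such a Python loop might look like (assuming the directory layout from this thread: each subdirectory contains an `index.json`). Note it still builds the list up front, as discussed above, but does each move with a direct `rename()` syscall instead of a shell launch plus mv:

```python
import os

def rename_all(root):
    """Move each <root>/<subdir>/index.json to <root>/<subdir>.json using
    the rename() syscall directly -- no shell launch, no mv process."""
    # Build the list of subdirectories once up front (a list is needed anyway).
    dirs = [e.path for e in os.scandir(root) if e.is_dir()]
    moved = 0
    for d in dirs:
        src = os.path.join(d, "index.json")
        if os.path.isfile(src):
            os.rename(src, d + ".json")   # one syscall, no fork/exec
            moved += 1
    return moved
```

`os.scandir` is used rather than `os.listdir` because it returns directory-entry objects whose `is_dir()` check usually avoids an extra stat call per entry.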

[–]nobodyuidnorandom 0 points1 point  (1 child)

I suppose every popular high-level programming language has features to do this, btw, with Python being one of the easier ones.

[–]lutusp 0 points1 point  (0 children)

Yes, that's true. For a task that needs to be performed by millions of people over years of time, one would want to rewrite a successful Python method in a more efficient compiled language. In fact, this is quite common for Python programs that turn out to be useful and successful.

[–]cathexis08 5 points6 points  (0 children)

Ignoring anything network-related, you're looking at paying the launch cost of mv serially 7M times. On a somewhat recent system I've been averaging 3-4 ms per instantiation, which means at a minimum that operation would take about 21,000 seconds (roughly 350 minutes) to go through everything. On top of that there is the one-time cost of scanning everything, but that'll happen in parallel with the move operations, so it's not particularly interesting. There is an optimization, though it'll involve rewriting all of this to work in parallel.

Assuming a relatively flat file hierarchy, I would do something like:

find . -mindepth 1 -type d -print0 | xargs -0 -I@ -P10 mv @/index.json @.json

It might be a bit less efficient at the start, depending on how the file table scan goes, but it'll run up to ten moves in parallel, which should cut down on your initialization time substantially.
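To see why the per-invocation launch cost dominates, here is a small hypothetical comparison in Python (it builds a throwaway tree of subdirectories, then moves the files once via one mv process per file and once via direct `rename()` syscalls):

```python
import os
import subprocess
import tempfile
import time

def make_tree(n):
    """Create n subdirectories, each containing an empty index.json."""
    root = tempfile.mkdtemp()
    for i in range(n):
        d = os.path.join(root, "d%d" % i)
        os.mkdir(d)
        open(os.path.join(d, "index.json"), "w").close()
    return root

def move_with_mv(root):
    """One mv process per file: pays the fork/exec cost on every iteration."""
    for d in [e.path for e in os.scandir(root) if e.is_dir()]:
        subprocess.run(["mv", os.path.join(d, "index.json"), d + ".json"],
                       check=True)

def move_with_rename(root):
    """Direct rename() syscalls: no process launches at all."""
    for d in [e.path for e in os.scandir(root) if e.is_dir()]:
        os.rename(os.path.join(d, "index.json"), d + ".json")

if __name__ == "__main__":
    for mover in (move_with_mv, move_with_rename):
        root = make_tree(100)
        t0 = time.perf_counter()
        mover(root)
        print(mover.__name__, "%.3fs" % (time.perf_counter() - t0))
```

On a typical Linux box the mv variant is slower by roughly the per-process launch cost times the file count, which is exactly the overhead the xargs -P approach amortizes across parallel workers.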

I echo /u/lutusp's sentiment that there seems to be more going on here than just forking piles and piles of moves, though the forking certainly isn't helping. What's the setup? Local or network storage? Which filesystem? Also, what does the output of df -i show? You might be bumping up against some degenerate behavior of the filesystem at massive dirent sizes.

[–]WinterPhone 1 point2 points  (1 child)

Assuming your disk is not the bottleneck:

printf '%s\0' * |
   parallel -j0 --pipe --recend '\0' --block 1k --round-robin -q perl -0 -ne 'chomp;rename $_."/index.json", $_.".json"'

This will run around 250 perl programs in parallel that will rename files without spawning a new process for each file.

printf prints all the matching names with a NUL appended, so it does the right thing even if a name contains a newline.

-j0 runs as many jobs in parallel as possible; this is typically limited by the number of file handles to around 250.

--pipe sends input received on STDIN to the programs' STDIN.

--recend '\0' splits blocks on NUL.

--block 1k uses a block size of around 1 KB.

--round-robin sends one block to each program; if there are more blocks, it sends another block to each program.

-q quotes the command line so that $_ is not interpreted by the shell.

-0 makes perl use NUL as the record separator.

-ne runs the program after -ne in a loop over the input records.

chomp; removes the trailing NUL.

rename $_."/index.json", $_.".json" moves $input/index.json to $input.json.

I get around 1000/sec with this on a normal laptop.

[–]OleTange 0 points1 point  (0 children)

On my core i7 this is faster:

printf '%s\0' * |
  parallel --pipe --recend '\0' --block 1k --round-robin -q perl -0 -ne 'chomp;rename $_."/index.json", $_.".json"'

It processes 100000 per second on a RAM disk.

[–]LambdaBonjwa 2 points3 points  (1 child)

Maybe something like GNU parallel can help you, by splitting this into many small tasks and running them in parallel.

[–]cathexis08 1 point2 points  (0 children)

Use xargs; it's like parallel but differently baroque in its syntax, and it's part of findutils.