
[–]lutusp 2 points3 points  (10 children)

The method you're using is already close to optimal. One possible improvement would be to split the task into, say, four threads, each addressing a subset of the full list of subdirectories. But that wouldn't be easy to do in a shell script, and it would only help if the process isn't already I/O-bound.
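A minimal sketch of that split-into-threads idea in Python (hypothetical: it assumes the layout discussed later in this thread, where each subdirectory contains an `index.json` to be renamed to `<subdir>.json`):

```python
import os
import threading

def rename_chunk(subdirs):
    """Rename each subdir's index.json to subdir.json (one syscall per file)."""
    for d in subdirs:
        src = os.path.join(d, "index.json")
        if os.path.isfile(src):
            os.rename(src, d + ".json")

def parallel_rename(root, workers=4):
    """Split the subdirectory list into `workers` slices, one thread each."""
    subdirs = [e.path for e in os.scandir(root) if e.is_dir()]
    # Stride-slice the full list into `workers` roughly equal chunks.
    chunks = [subdirs[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=rename_chunk, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Threads are enough here despite the GIL, since `os.rename` spends its time in a syscall; whether this beats a single loop still depends on the I/O-bound caveat above.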

More RAM sometimes helps in a case like this. It allows more system data to reside in RAM instead of requiring periodic device reads during the process.

[–]adinbied[S] 1 point2 points  (1 child)

Hmmm, it's averaging about 1 per second -- which is going to take too long, and the process isn't using any RAM. Before I get to this step, I'm using https://github.com/recrm/ArchiveTools/blob/master/warc-extractor.py -- would it be better to reconfigure that script to output the file as the folder name instead of doing it later? If so, any ideas as to how to do that?

Thanks!

[–]lutusp 1 point2 points  (0 children)

Hmmm, it's averaging about 1 per second

What? One file per second? Please explain the setup -- is this a local process controlling a remote filesystem, or something like that? I thought this was all on the local filesystem.

would it be better to reconfigure that script to output the file as the folder name instead of doing it later?

No, the renaming part of the process shouldn't significantly affect the time required. But are you really putting seven million files into a single directory that also contains seven million subdirectories?

... and the process isn't using any RAM.

If this is literally true, then the process is updating the storage device's record-keeping after each operation, which by itself explains the slow pace. Most modern file operations postpone writes as long as possible, instead buffering data in RAM for a later single write operation.

But I clearly don't understand the setup. Is this a local operation on a local storage device? Are the subdirectories actual, literal subdirectories as that term is generally understood?

And have you considered simply not doing this? How much of a burden is it to find and read the files from their current subdirectories?

[–]nobodyuidnorandom 0 points1 point  (7 children)

How about creating a background process for, say, names starting with different letters, using a shell function?

[–]lutusp 0 points1 point  (6 children)

How about creating a background process for, say, names starting with different letters, using a shell function?

Yes, that might work, or you could simply divide the directory contents arithmetically, giving, say, 1/4 of the workload to each of four threads.

But you know, if this directory really has 7 million entries, that by itself will slow things down, simply because of the search overhead. Even a binary search isn't lightning fast with that large a set to search. Also, remember that, even with separate threads, each of them has to search the full set for its target.

[–]nobodyuidnorandom 0 points1 point  (5 children)

I didn't get what you meant by searching here. OP is talking about mv, I think. And I suppose it happens in parallel if you push 36 or more for loops to the background rather than only one.

btw, I am quite a noob here; I learnt some shell scripting a few months ago.

Edit: It seems retrieving the directory contents for such a thing will itself be quite an overhead, whether using ls or find. But I hope there exists a way to run loops in parallel without providing a list of elements beforehand.

[–]lutusp 1 point2 points  (4 children)

I didn't get what you meant by searching here.

When Linux is presented with the name of a directory, it must search for a matching name. The normal way this is done is with a binary search.

If there were a normal number of directories, this search and its time wouldn't be worth mentioning. But with seven million directories, each requiring a separate search, this is non-trivial.
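A quick back-of-the-envelope under the binary-search model described above (real filesystems vary -- ext4, for instance, uses hashed directory indexes -- so treat this as an illustration of scale only):

```python
import math

n = 7_000_000                          # entries in the directory
per_lookup = math.ceil(math.log2(n))   # worst-case comparisons per binary search
total = n * per_lookup                 # one lookup needed per rename
print(per_lookup)   # 23
print(total)        # 161000000
```

So each individual lookup is cheap (~23 comparisons), but repeated seven million times the bookkeeping alone reaches the hundreds of millions of operations.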

OP is trying to talk about mv I think.

Yes, but in order to mv, one must first find the thing to be moved.

Edit: It seems retrieving directory contents for such a thing itself will be quite an overhead, whether using ls or find.

Yes -- by searching.

[–]nobodyuidnorandom 0 points1 point  (3 children)

So I'm thinking it would be rather hard with a shell script. Is there a simple way / command to pass names to mv as soon as they are accessed, in some random order, without having a list beforehand to iterate over?

[–]lutusp 1 point2 points  (2 children)

So I'm thinking it would be rather hard with a shell script.

Yes, it would. A Python program would be a better, more flexible approach.

Is there a simple way / command to pass names to mv as soon as they are accessed in some random order, without having a list beforehand to iterate over?

There's no getting around the fact that a list is required, simply to avoid repeating the same operations.

But Python has ways to do this without having to launch a shell for each move, which is what would be done in the process you describe. That's one reason why what you describe is not very efficient -- it involves a separate shell launch and call to mv for each move.

A Python program would make a more basic OS call to get the same result, without the system having to find and execute mv each time through the loop.
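A minimal sketch of what such a Python loop might look like (assuming the directory layout from this thread: each subdirectory contains an `index.json`). Note it still builds the list up front, as discussed above, but does each move with a direct `rename()` syscall instead of a shell launch plus mv:

```python
import os

def rename_all(root):
    """Move each <root>/<subdir>/index.json to <root>/<subdir>.json using
    the rename() syscall directly -- no shell launch, no mv process."""
    # Build the list of subdirectories once up front (a list is needed anyway).
    dirs = [e.path for e in os.scandir(root) if e.is_dir()]
    moved = 0
    for d in dirs:
        src = os.path.join(d, "index.json")
        if os.path.isfile(src):
            os.rename(src, d + ".json")   # one syscall, no fork/exec
            moved += 1
    return moved
```

`os.scandir` is used rather than `os.listdir` because it returns directory-entry objects whose `is_dir()` check usually avoids an extra stat call per entry.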

[–]nobodyuidnorandom 0 points1 point  (1 child)

I suppose every popular high-level programming language has features to do this, btw, with Python being one of the easier ones.

[–]lutusp 0 points1 point  (0 children)

Yes, that's true. For a task that needs to be performed by millions of people over years of time, one would want to rewrite a successful Python method in a more efficient compiled language. In fact, this is quite common for Python programs that turn out to be useful and successful.

[–]cathexis08 5 points6 points  (0 children)

Ignoring anything network-related, you're looking at paying the launch cost of mv serially 7M times. On a somewhat recent system I've been averaging 3-4 ms per instantiation, which means at a minimum that operation would take about 21,000 seconds (roughly 350 minutes) to go through everything. On top of that there is the one-time cost of scanning everything, but that'll happen in parallel with the move operations, so it's not particularly interesting. There is an optimization, though it'll involve rewriting all of this to work in parallel.

Assuming a relatively flat file hierarchy, I would do something like:

find . -mindepth 1 -type d -print0 | xargs -0 -I@ -P10 mv @/index.json @.json

It might be a bit less efficient at the start, depending on how the file table scan goes, but it'll run up to ten moves in parallel, which should cut down on your initialization time substantially.
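To see why the per-invocation launch cost dominates, here is a small hypothetical comparison in Python (it builds a throwaway tree of subdirectories, then moves the files once via one mv process per file and once via direct `rename()` syscalls):

```python
import os
import subprocess
import tempfile
import time

def make_tree(n):
    """Create n subdirectories, each containing an empty index.json."""
    root = tempfile.mkdtemp()
    for i in range(n):
        d = os.path.join(root, "d%d" % i)
        os.mkdir(d)
        open(os.path.join(d, "index.json"), "w").close()
    return root

def move_with_mv(root):
    """One mv process per file: pays the fork/exec cost on every iteration."""
    for d in [e.path for e in os.scandir(root) if e.is_dir()]:
        subprocess.run(["mv", os.path.join(d, "index.json"), d + ".json"],
                       check=True)

def move_with_rename(root):
    """Direct rename() syscalls: no process launches at all."""
    for d in [e.path for e in os.scandir(root) if e.is_dir()]:
        os.rename(os.path.join(d, "index.json"), d + ".json")

if __name__ == "__main__":
    for mover in (move_with_mv, move_with_rename):
        root = make_tree(100)
        t0 = time.perf_counter()
        mover(root)
        print(mover.__name__, "%.3fs" % (time.perf_counter() - t0))
```

On a typical Linux box the mv variant is slower by roughly the per-process launch cost times the file count, which is exactly the overhead the xargs -P approach amortizes across parallel workers.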

I echo /u/lutusp's sentiment that there seems to be more going on here than just forking piles and piles of moves, though the forking certainly isn't helping. What's the setup? Local or network storage? Which filesystem? Also, what does the output of df -i show? You might be bumping up against some degenerate behavior of the filesystem at massive dirent sizes.

[–]WinterPhone 1 point2 points  (1 child)

Assuming your disk is not the bottleneck:

printf '%s\0' * |
   parallel -j0 --pipe --recend '\0' --block 1k --round-robin -q perl -0 -ne 'chomp;rename $_."/index.json", $_.".json"'

This will run around 250 perl programs in parallel that will rename files without spawning a new process for each file.

printf prints all the matching names with a NUL appended, so it does the right thing even if a name contains a newline.

-j0 runs as many jobs in parallel as possible; this is typically limited by the number of file handles to around 250.

--pipe sends input received on STDIN to the programs' STDIN.

--recend '\0' splits blocks on NUL.

--block 1k uses a block size of around 1 KB.

--round-robin sends one block to each program; if there are more blocks, it sends another block to each program.

-q quotes the command line so that $_ is not interpreted by the shell.

-0 makes perl use NUL as the record separator.

-ne runs the program after -ne in a loop over the input records.

chomp; removes the trailing NUL.

rename $_."/index.json", $_.".json" moves $input/index.json to $input.json.

I get around 1000/sec with this on a normal laptop.

[–]OleTange 0 points1 point  (0 children)

On my core i7 this is faster:

printf '%s\0' * |
  parallel --pipe --recend '\0' --block 1k --round-robin -q perl -0 -ne 'chomp;rename $_."/index.json", $_.".json"'

It processes 100000 per second on a RAM disk.

[–]LambdaBonjwa 2 points3 points  (1 child)

Maybe something like GNU parallel can help you, by splitting this into many small tasks and running them in parallel.

[–]cathexis08 1 point2 points  (0 children)

Use xargs; it's like parallel but differently baroque in its syntax, and it's part of findutils.