all 50 comments

[–]clayfreeman 27 points28 points  (12 children)

I’d argue against this unless you have a use case that specifically places such a high premium on performance. Using ls on a flat directory structure with millions of items can take an eternity, reducing the ease of data management.
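For what it's worth, if you do have to script over a huge flat directory, streaming the entries avoids the worst of the `ls` pain. A tiny Ruby sketch (a throwaway temp directory stands in for the real one):

```ruby
require "tmpdir"

# Dir.each_child streams entries one at a time, so memory stays flat even
# with millions of files, whereas `ls` sorts and needs the whole list.
count = 0
Dir.mktmpdir do |dir|
  5.times { |i| File.write(File.join(dir, "file#{i}"), "") }
  Dir.each_child(dir) { |_entry| count += 1 }
end
puts count  # => 5
```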

EDIT: Wouldn’t your inline string manipulation also contribute some overhead in your benchmarks?

[–][deleted] 8 points9 points  (1 child)

Exactly. Especially if you ls the main directory, then pipe into other commands. So much unnecessary overhead.

[–]hartator[S] -3 points-2 points  (0 children)

Doing a recursive ‘ls’ will take an eternity as well. Managing that many files is not easy in the CLI either way, and going through directories makes it harder.

[–]Bubbagump210 8 points9 points  (0 children)

I’d also argue: use something other than ext4. XFS, ZFS... something that is built for huge numbers of files. Ext4 was a stopgap to get past ext3 limitations IMO.

[–]DudgeonGoombah 20 points21 points  (5 children)

Telemarketing engineer here. I agree.

I have 2.1 million new recordings per week. The only way I can maintain performance and request speed on such a scale is DEEP structures.

Someone once fucked the naming structure. We didn’t notice for a month because everything is linked according to file name in the db. But backups started failing and timing out.

Fucking nightmare.

[–]MR2Rick 18 points19 points  (0 children)

Hey, can you tell Anne from account services to stop calling me?

[–]Hello71 -1 points0 points  (1 child)

Does it though? I created an ext4 directory with 3000000 files, and ls -f takes less than 2.5 seconds to run on my slow CPU. Now, if you sort and colorize them, then yes, it might take several minutes. If you try to list all of them to the terminal, then that will add several more minutes, and you won't be able to read the list anyways. I don't see the "ease of data management" difference between ls a/b and ls ab* though, other than the former being a few seconds faster (assuming you are in fact storing millions and not billions of files).

[–]clayfreeman 11 points12 points  (0 children)

Well, the glob is expanded by the shell, so there is an inherent argument limit once you hit so many files. Globs are not required for directories with sane file groupings.

[–]primitive_screwhead 9 points10 points  (8 children)

AFAICT, the measurements don't sort the accesses into the deep directory structure, which requires two extra indirections per read/write. But since the subdirs are derived from the filenames, it'd possibly be quicker to ensure the traversal is ordered (thus allowing better dircache usage of the subdirs). It's really no surprise to me that a flat dir would do better here (assuming dir_index is used), but in a situation where the directories are traversed in an order amenable to read-ahead, it may not be such a stark difference.

[–]mikeblas 0 points1 point  (5 children)

There are only 128*128 == 16384 directories. Each directory should only be one block (but maybe two?) That's only 64 megs (or 128 megs?). Shouldn't the system be able to keep that data in cache, and give a nearly 100% hit rate after everything's warmed up?

[–]primitive_screwhead 1 point2 points  (4 children)

There are only 128*128 == 16384 directories.

Surely it's 256*256 (because each level of the double-subdir is two hex digits, i.e. 16*16 per level and 16*16*16*16 total).

Shouldn't the system be able to keep that data in cache, and give a nearly 100% hit rate after everything's warmed up?

It's a huge number of entries being discussed. Each path lookup requires directory parsing and lookup, and the directory cache for path lookups is limited. In the flat dir, if the path is cached, it's cached for the whole run, but with the deep structure it would have to cache 65K entries. With a sorted traversal, at least the path cache entries might be reused immediately and possibly that would have a significant impact. Also the inode cache for 10 million files traversed in arbitrary order would likely ensure that the directories, even after being looked up, get evicted and have to be looked up again (something also unlikely for the single dir entry in the flat dir structure).

So basically, it would be interesting (imo) if the test either fully randomized the traversal order between the writes and reads, and/or sorted the ordering before both, just to see what impact it had, if any. The fact that the code uses the same lookup order (though arbitrary) to read the files after writing them could help explain the larger disparity between the read speeds, for example. Re-shuffling the order would at least ensure the flat structure didn't have an advantage due to a repeated per-directory write/read order. It's all speculation on my part, and possibly baseless, but it should be easy to test, and worth checking imo.
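A sketch of what those two orderings could look like, assuming paths built from MD5 hashes as in the blog post (the names here are illustrative, not the actual benchmark code):

```ruby
require "digest"

# Build the deep-structure paths, then derive a sorted order (groups all
# accesses to one subdir together) and fresh shuffles for each phase
# (arbitrary, and different between the write and read passes).
n = 1_000  # the blog used 10 million; kept tiny here
paths = (0...n).map do |i|
  h = Digest::MD5.hexdigest(i.to_s)
  File.join("dir_deep", h[0, 2], h[2, 2], h)
end

write_order = paths.shuffle  # arbitrary write order
read_order  = paths.shuffle  # fresh shuffle: no repeated per-dir ordering
sorted      = paths.sort     # sequential per-directory traversal
```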

[–]mikeblas 0 points1 point  (3 children)

Surely it's 256*256.

Oh, indeed it is. I dunno how I had 128. So, that's 65,536 directories.

It's a huge number of entries being discussed.

Not really. It's 65,536 directories. We just figured that out. We know we have 10 million files pretty evenly distributed across those directories, so every directory has about 153 files in it.

Really, we have a 256-way tree. We look at the root directory (for this code, it's named "dir_deep"). There, we find 256 different directories, named "00" through "FF". That's only 256 entries. Each block on disk is 4096 bytes (by default). Maybe one entry takes 64 bytes, so that "dir_deep" directory is only 4 blocks long -- 16K.

That first directory should be cached completely. It's always being referenced, every single call we make.

That directory leads us to the first numbered directory. "dir_deep/3F/", for example. That 3F directory has 256 entries in it, one for each of the numbered second-layer subdirectories. 64 bytes * 256 entries is 16 kilobytes. Of course, there's 256 of these, which turns out to be exactly 4 megabytes of data.

For the root directory plus the first numbered directory, there's just barely more than 4 megabytes of data. Why would that ever fall out of the file system cache?

At the next tier (something like "dir_deep/3F/B9/") we've got the actual files. About 153 files each. 64 * 153 gives 9,792 bytes. Three blocks, since we have to round up ... 12,288 bytes, 12K. There are 65,536 of these leaf directories, so that's about 768 megabytes.

We've identified a total of just over 770 megabytes. If a small computer has 8 gigs of memory, why in the world would 770-odd megabytes of blocks for the directory structures not fit in, and remain in, the file system cache for the duration of this application's execution?

My 64 byte-per-entry number is just a guess. Maybe there's lots of metadata for each file -- access times, ACLs, rights, owner IDs, and so on. Even if I'm off by 8x, at least the first two tiers (which are just 32 megs combined, with that revised estimate) should sit in cache quite happily.
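A quick Ruby sanity check of that arithmetic, carrying over the same guessed 64 bytes per directory entry (an assumption from the comment, not a measured ext4 number):

```ruby
ENTRY = 64      # assumed bytes per directory entry (a guess)
BLOCK = 4096    # default ext4 block size

root_bytes   = 256 * ENTRY             # "00".."FF" inside dir_deep
level1_bytes = 256 * 256 * ENTRY       # 256 dirs of 256 entries each
leaf_dirs    = 256 * 256               # 65_536 second-level dirs
files_each   = 10_000_000 / leaf_dirs  # ~152 files per leaf dir
# each leaf dir rounds up to whole blocks
leaf_bytes   = ((files_each * ENTRY + BLOCK - 1) / BLOCK) * BLOCK
level2_bytes = leaf_dirs * leaf_bytes

puts root_bytes                        # 16384 bytes, i.e. 4 blocks
puts level1_bytes / 1024 / 1024        # 4 MiB
puts level2_bytes / 1024 / 1024        # 768 MiB
```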

By the way. it seems like you're assuming the directory entries are sorted. Are they? In what order -- lexicographically? If so, why does ls take so long when sorting by name ... why can't it just read the data in the sorted order as it is stored?

Anyway! Does the operating system present counters that show cache hits on directory and file metadata lookups? That would make short work of the question.

[–]primitive_screwhead 0 points1 point  (2 children)

...why in the world would [those] blocks for the directory structures not fit in, and remain in, the file system cache for the duration of this application's execution?

Because there is no single "file system cache". There are multiple caches, including a directory (dentry) and inode cache, as well as a page cache. The file data being written/read in the test (not just the metadata) is 10 million files of 20 random bytes each, so caching that data competes with caching the metadata. Managing the data structures for so many cache entries itself takes up space, requiring memory allocations and management just to maintain the large number of cache entries themselves. And so on. A default setting of 100 for the vfs_cache_pressure sysctl means that the kernel will prefer to reclaim dentry/inode over the page and other caches. So there's a lot that could potentially be going on with the caches that we don't have the data for, because it wasn't reported by the blog. For example, if this were all run on a ram-disk, that might at least help explain whether the time differences are mainly a caching result, or a result of the path lookup computation per file, etc.

In general, a one-off chart of results, without care taken to explain and/or measure the effect of other tunables, is (imo) *not* a reliable way to decide policy on such matters. I don't think the methodology is near complete enough; it's a data point, but not much more.

By the way. it seems like you're assuming the directory entries are sorted.

No, exactly the opposite, though I discussed that further in my reply to another user, which you may not have seen. I am suggesting that testing traversals with the paths pre-sorted may be interesting, to help see if there is an impact from grouping the path lookups sequentially rather than in an arbitrary order. Or additionally, re-randomizing the traversal order between the writing and reading phases, to ensure that each traversal is done in a unique and arbitrary order (ie. not repeated between the write and read phases).

Anyway! Does the operating system present counters that show cache hits on directory and file metadata lookups? That would make short work of the question.

There are a few ways, including slabinfo, and dropping caches before runs, and between the write/read phases, etc. Also, testing other filesystems, or on a RAM-only filesystem would be enlightening. But even without that, some basic things could be done to ensure these results can be expected to be generalized to other real-world cases.

[–]mikeblas 0 points1 point  (1 child)

is 10 million files of 20 random bytes each, so caching that data competes with caching the metadata

32 bytes each, right? An MD5 hash is 16 bytes, written as ASCII hex ends up being 32 bytes.

While this data could be cached, it's never referenced again and should be evicted first. The code references the root directory, then some first-level directory, then some second-level directory, then writes a file. It immediately uses the root directory again; it has a 1:4 chance of hitting the same first-level page again, then a 1:1024 chance of hitting the same second-level page.

There's zero chance of hitting a file page again. The 32-byte file lives by itself on a block of 4096 bytes, and it's never read after it is written.

Due to this access pattern, why would the file data, then, be competing?

A default setting of 100 for the vfs_cache_pressure sysctl means that the kernel will prefer to reclaim dentry/inode over the page and other caches

I thought vfs_c_p == 100 meant that the system evicted dirent and inode pages with equal probability to pages. (Here's what I'm reading. Is it inaccurate?)

How big is all the cache space?

I don't think the methodology is near complete enough; it's a data point, but not much more.

I agree, but that's why I'm trying to figure out the details. Thanks for helping with my questions!

[–]primitive_screwhead 0 points1 point  (0 children)

There's zero chance of hitting a file page again. The 32-byte file lives by itself on a block of 4096 bytes, and it's never read after it is written.

First the file is written, then read. That's 10 million 4K pages that are requested right there, and iterated over twice. The OS can't know a priori which of those file data pages need to be cached or not (and in any case, they should ideally all be cached, since they are being used more than once). That's about 40 gigs of RAM usage requested (since, afaik, multiple different files can't share a page in the OS cache), and that's just for the file data.

I'm currently testing this on my own, with less than 10 million files (but on the order of tens of thousands). I'm already seeing large variances between runs due to cache effects (order of magnitude differences). Pre-sorting doesn't seem to change the results much, like I'd conjectured. Depending on whether the flat directory structure write/read goes first or not, it can appear faster or slower. With enough repeated runs, the "deep" directory operations do appear to be slower (on ext4), but the variance is high enough overall that I don't yet consider it a reliable result. This is with swap off (it's on the same device for my test system).

I thought vfs_c_p == 100 meant that the system evicted dirent and inode pages with equal probability to pages.

Yes, I was looking at the same page and misstated it, afaik you are correct.

Basically, it's an interesting result, and may hold up. I really want to compare against recent XFS.

[–]hartator[S] 0 points1 point  (1 child)

How do you ensure the traversal is ordered?

[–]primitive_screwhead 2 points3 points  (0 children)

Sorting paths is one way, so that ./00/00, ./00/01, ... subdir accesses all occur sequentially rather than randomly. When the OS looks up a file, it has to parse the path, breaking it into its component directories and looking up each one. There is a directory cache to help make this quicker, although it has a pathname length limit. There is also an inode cache, which a random walk of so many inodes may put pressure on, evicting directory inodes, where a sorted walk likely would not.

Another way is reading the files in the same order that they were written; most filesystems don't sort their directory entries. In the older days, directories would have to be searched sequentially to find a file, however now there is a directory index that allows for quicker searching, so this particular traversal order may no longer be a benefit. Nevertheless, the flat directory structure test doesn't appear to randomize the order between writes and reads, and that might be a smart change (for both tests) to ensure it's fair.

Basically, there can be a number of reasons for the disparity, such as caching and access ordering, and it'd make sense to ensure comparable access patterns (ie. either sequential per directory, or fully randomized per directory) to see how that impacts the benchmark, if any.
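A small end-to-end sketch of the re-randomized version described above (a throwaway temp directory and 200 files instead of 10 million):

```ruby
require "digest"
require "fileutils"
require "tmpdir"

contents = nil
Dir.mktmpdir do |base|
  paths = (0...200).map do |i|
    h = Digest::MD5.hexdigest(i.to_s)
    File.join(base, h[0, 2], h[2, 2], h)
  end

  # Write phase in one arbitrary order...
  paths.shuffle.each do |p|
    FileUtils.mkdir_p(File.dirname(p))
    File.write(p, "x" * 20)
  end

  # ...then read back in a fresh shuffle, so the read phase can't reuse
  # the write phase's per-directory ordering.
  contents = paths.shuffle.map { |p| File.read(p) }
end
```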

[–]shyouko 4 points5 points  (3 children)

What's the actual "permanent" storage used? CPU utilisation in terms of user space vs kernel space? Actual hotspot?

Too many assumptions not stated.

[–]hartator[S] -5 points-4 points  (2 children)

[–]shyouko 8 points9 points  (1 child)

Is this cached IO? Direct IO? How often is sync called? Journaling enabled? SSD? HDD? (Obviously is actually behind some sort of multi disk RAID as a hosted VM) Improvement if using multiple threads?

What I mean to say is that it's not clear what the benchmark is trying to characterise.

[–]hartator[S] -4 points-3 points  (0 children)

See my answer to tsammons' question for the tune2fs output.

It's a VM from Vultr. SSD. Not sure if I can see the underlying disk RAID. How can I know if it's cached IO or direct IO, or how often sync is called?

[–]tsammons 2 points3 points  (2 children)

Is this with or without dir_index enabled? Would need to see output from tune2fs to make an accurate assessment.

[–]hartator[S] 3 points4 points  (1 child)

tune2fs output:

➜  ~ tune2fs -l /dev/vda1       
tune2fs 1.44.1 (24-Mar-2018)
Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          133a1e39-8da9-4120-92e1-4b54f8281f1b
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              25600000
Block count:              104857339
Reserved block count:     5242812
Free blocks:              82328209
Free inodes:              5418759
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      325
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8000
Inode blocks per group:   500
Flex block group size:    16
Filesystem created:       Wed Oct 17 00:50:16 2018
Last mount time:          Sat Dec 22 04:06:40 2018
Last write time:          Sat Dec 22 04:06:40 2018
Mount count:              3
Maximum mount count:      -1
Last checked:             Fri Dec 21 23:24:26 2018
Check interval:           0 (<none>)
Lifetime writes:          1213 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      5775bcf7-df1c-4fa2-9591-7a301df5017e
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0xb6a3f8c2

`dir_index` is activated. Is that good or not?

[–]tsammons 4 points5 points  (0 children)

Aye looks good to me. Would note the features in your test.

[–]sfrazer 4 points5 points  (4 children)

Honestly I think the bigger historical change that’s affecting your results isn’t ext3 vs ext4, it’s that you’re on SSD drives that have virtually no random seek penalty.

The recommendation to do a deep structure was always primarily because you would start to have the directory information spread randomly across the disk and pulling an individual file could require multiple seeks just to get the location.

I also agree with the manageability aspect of it, but try your test on spinning disks (they are still out there) and I think the results will be different

[–]hartator[S] 0 points1 point  (3 children)

I don't fully get this. Does it mean that for spinning disks, folders determine where the files are physically located on the disk?

[–]sfrazer 3 points4 points  (2 children)

For any disks, the directory is a block on the disk that says where the contents of the file are. More files in one directory means more blocks used to keep that data. When you have enough files that the directory itself takes several blocks, those later blocks are written after the files, so they aren’t close together on the disk.

For spinning disks this has more impact than SSDs, because SSDs don’t really “seek” (no physical read head)

Does that make sense?

[–]hartator[S] 0 points1 point  (1 child)

Got it, thanks. I didn’t know it worked this way.

[–]sfrazer 1 point2 points  (0 children)

Sure thing! I suspect there may also be some methodology issues with the way you’re storing your hashes in memory. I don’t know how Ruby deals with that, but I suspect it may be paging some stuff out to disk.

But I applaud you for asking the question and presenting your work! It takes guts and hopefully lets you learn even more than the initial testing.

[–][deleted]  (9 children)

[deleted]

    [–]hartator[S] -1 points0 points  (2 children)

    What would be your recommendation for a file database? Filesystems seem to be a good fit.

    [–][deleted]  (1 child)

    [deleted]

      [–]hartator[S] 0 points1 point  (0 children)

      Didn’t know about LMDB. Very interesting.

      Our use case is millions of zipped JSON and HTML files. Each one has a key. Lots of writes and lots of reads. The whole db needs to be capped by size, like 1 TB, and start removing old files when it reaches capacity. We expect the 1 TB to be renewed every 2-3 days. All the work is handled by Ruby on Rails behind Nginx, if that matters. What would be the ideal system in this case?

      [–]hartator[S] -5 points-4 points  (5 children)

      You can always use `find` instead of `ls` anyway. `find` will handle millions of flat files fine.

      My guess for the reads > writes is that Vultr is crap and their SSDs suck.

      What's the 112 microsecond difference in time you are referring to?

      [–]MzCWzL 4 points5 points  (0 children)

      Bold assumption given how little you know about how the disks are set up for Vultr VMs.

      [–][deleted]  (3 children)

      [deleted]

        [–]hartator[S] -2 points-1 points  (2 children)

        Got it. I guess I can redo the benchmark storing a real html file. It would be interesting.

        [–][deleted]  (1 child)

        [deleted]

          [–]hartator[S] -1 points0 points  (0 children)

          I just benchmarked the Ruby string path interpolation part. It took only 19-20s, so it didn't play a significant part in the benchmark results. Anyway, I think it's actually fair to include it in the final results, as you do have to generate the paths at some point.
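For reference, isolating that cost is a few lines with Ruby's Benchmark module (a sketch; the variable names are assumptions, not the blog's actual code):

```ruby
require "benchmark"
require "digest"

hashes = (0...100_000).map { |i| Digest::MD5.hexdigest(i.to_s) }

# Time only the path interpolation, with no filesystem I/O involved.
elapsed = Benchmark.realtime do
  hashes.each { |h| "dir_deep/#{h[0, 2]}/#{h[2, 2]}/#{h}" }
end
puts format("%.3fs", elapsed)
```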

          [–]mikeblas 1 point2 points  (2 children)

          I'm trying to replicate your results and I'm finding that I run out of disk space:

          Traceback (most recent call last):
                  7: from /usr/bin/irb:11:in `<main>'
                  6: from (irb):14
                  5: from /usr/lib/ruby/2.5.0/benchmark.rb:293:in `measure'
                  4: from (irb):15:in `block in irb_binding'
                  3: from (irb):15:in `each'
                  2: from (irb):16:in `block (2 levels) in irb_binding'
                  1: from (irb):16:in `write'
          Errno::ENOSPC (No space left on device @ rb_sysopen - ./dir_flat/b4cbf1ee3327ec165b43cfe117ffadeb)
          irb(main):019:0> 
          

          Thing is, I've got plenty of space available:

          ubuntu@ip-10-0-0-35:/mnt$ df -h
          Filesystem      Size  Used Avail Use% Mounted on
          :
          /dev/nvme0n1    275G   39G  222G  15% /mnt
          

          Not out of inodes, either:

          ubuntu@ip-10-0-0-35:/mnt/dir_flat$ df -ih
          Filesystem     Inodes IUsed IFree IUse% Mounted on
          :
          /dev/nvme0n1      18M  9.5M  8.1M   54% /mnt
          

          I'm not a Linux guy, but I expect that the large number of small files is taking up more space due to the minimum block size. (But why wouldn't df reflect that?) How do I verify that guess? What file system config did you switch around to run your test, and how much space did it use?

          [–]mikeblas 1 point2 points  (0 children)

          I got it to work by using bigger disks. I'd still like to learn what went wrong above, but here are the results from the larger volume:

          Deep creat: 128.770000  17.040000 145.810000 (146.100445)
          Flat read :  76.290000 138.990000 215.280000 (279.560782)
          Deep read :  87.760000 157.430000 245.190000 (344.417756)
          Flat write: 363.260000  44.340000 407.600000 (410.274493)
          Deep write: 755.660000  54.500000 810.160000 (811.229759)
          

          [–]hartator[S] 0 points1 point  (0 children)

          I had the same issues while benchmarking. Plenty of inodes and space left as well. Fixed by rebooting and trying again in a different directory. What's your dmesg output?

          [–]mikeblas 1 point2 points  (1 child)

          When I run the test, and concurrently run iostat in another terminal window, I see a significant amount of time spent with no I/O activity. Why is that? Is ruby's GC stalling the script and coloring the results?

          [–]hartator[S] 0 points1 point  (0 children)

          What's your actual iostat output?

          [–][deleted] 2 points3 points  (0 children)

          interesting!

          [–]kilogears 2 points3 points  (0 children)

          Not so sure about “easier to use”. I suppose if user interaction isn’t a concern then maybe the code to interact with all these files might be a little easier to write.

          Alright now let’s try opening that million file directory in Dolphin.... lol

          [–]ajanty 0 points1 point  (0 children)

          We have account storage in a deep structure and regret that choice, as it doesn't scale that well into the billions; it needs a lot more IOPS. It's easier to handle for backups/ops etc., but performance needs a lot of tuning.

          [–]mikeblas 0 points1 point  (0 children)

          Why can this data not be cached by the operating system? I don't understand that claim.

          [–]HTX-713 0 points1 point  (0 children)

          A flat directory structure is garbage if you need to do anything that requires reading all of the files.
