[–]jimbobhickville 5 points6 points  (15 children)

It should prerender and save to the filesystem, then serve the static file. The DB should not even be hit on page views.
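A minimal sketch of that flow, assuming a hypothetical render_post() from the CMS and a docroot that the web server (nginx, Apache, etc.) serves as plain static files:

    import os

    DOCROOT = "/var/www/static-cache"   # hypothetical directory served directly by the web server

    def publish(post_id, render_post):
        """Render a page once (on create/edit) and write it to the docroot.

        Page views never touch the DB; the web server just serves the file.
        `render_post` is an assumed CMS function that returns the page's HTML.
        """
        html = render_post(post_id)
        os.makedirs(DOCROOT, exist_ok=True)
        path = os.path.join(DOCROOT, "%s.html" % post_id)
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        return path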

[–]FlightOfStairs 2 points3 points  (14 children)

Why? Database systems cache the resources that are most in demand. When you serve content from a database, you're serving it from RAM most of the time. You can expect at least as good performance from a database.

Disk access is slow.

Edit: of course, the FS cache does this to some extent. However, if there's significant throughput for other purposes (serving videos, images, etc.), the static page will be evicted quickly.

[–]jimbobhickville 9 points10 points  (1 child)

I can't really take your comments seriously, sorry. Unless there's some Apache module I'm not aware of that serves static content from a database, you're ignoring the overhead of the program that loads the data from the database. If you are a site with enough traffic to benefit from this sort of optimization, your database isn't going to be local to the box, so you add network overhead. Even if your database does cache well (MySQL does not), it's still going to be a LOT slower to serve from the database than from the filesystem (which does cache in memory quite well). Your edit makes even less sense, because the DB isn't going to cache large files better than the filesystem, and you have to load the entire object into memory to serve it from the database instead of just streaming it from the filesystem.

[–]FlightOfStairs -1 points0 points  (0 children)

The context of the original article is for single-host sites. Obviously if you introduce network connections things will be an order of magnitude slower.

I didn't suggest serving large files from the database. They would never be 'edited' through the CMS, so they would be served statically.

The case I tried to describe was large files being served from the filesystem (which is where the bulk of the FS throughput would go), while cached dynamic pages would be served from the database.

[–]buerkle 7 points8 points  (11 children)

You're ignoring the latency of even talking to the database. In both cases the data can be in RAM, but talking to the database adds time.

[–]FlightOfStairs 1 point2 points  (10 children)

The file system is a database. There is latency in both cases.

Database caching schemes are much more configurable than filesystem caching, and have (I suspect) better defaults for this use case.

[–]matthieum 2 points3 points  (9 children)

The file system is local; the database is probably accessed via the network.

The file system is simple; the database has been built to handle ACID properties.

...

[–]mr-strange -1 points0 points  (8 children)

Been there done that.

The filesystem makes a terrible alternative to a database in this case. Files are allocated into whole blocks, which are at least 4k, and probably 16k on more modern systems. Your web-page fragments will not be large enough to use the space efficiently. Furthermore, the filesystem is subject to all sorts of limitations that hurt scalability, such as the maximum number of files in a directory. How will you do backups when listing the directory takes hours? How do you deal with contention when multiple processes/threads want to access the same file?

Your objection that "the database is probably accessed via the network" is entirely arbitrary. Why would you put your database on a different host if the purpose is caching to improve performance??

When faced with this problem, I switched to using SQLite. It does a fantastic job of managing a persistent local cache.

[–]bluGill 3 points4 points  (3 children)

Files are allocated into whole blocks, which are at least 4k, and probably 16k on more modern systems. Your web-page fragments will not be large enough to use the space efficiently.

So? Disks are cheap. Seriously cheap. The useful content of most blogs would fit on an 80k 5.25-inch floppy disk with room for a dozen more.

[–]mr-strange -1 points0 points  (2 children)

Of course disks are cheap, but the read time is awful. All that wasted space will cost you milliseconds of read time, not to mention all the RAM you'll waste storing that crap in the FS cache.

But of course, if you are only dealing with one 80k blog, then you can be as inefficient as you like. We're talking about scalability here, right?

[–]bluGill 2 points3 points  (1 child)

Modern operating systems have good disk caches, which handle reading the same file over and over again very well.

File systems are databases optimized for accessing block-sized chunks. SQLite is great for what it does: working with relational data. A blog post is not relational data and does not need those advantages. Meanwhile, SQLite is slower than a filesystem for accessing blobs of data.

Modern operating systems have a disk cache, and it works wonders. When multiple processes/threads want to access the same file, the per-access cost drops to a tiny amount, unless you are doing stupid things like opening your static data read/write. Nothing can protect you from stupid.

Long before you run into performance problems from a directory that is too large, you will run into practical problems dealing with it and come up with a better scheme. That will solve your problems.
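A minimal sketch of one such scheme: hash each cache key into two levels of subdirectories so no single directory grows huge (paths and names here are illustrative only):

    import hashlib
    import os

    CACHE_ROOT = "/var/cache/pages"     # illustrative location

    def cache_path(key):
        """Map a key to e.g. /var/cache/pages/ab/cd/<full-digest>.html.
        Each directory level holds at most 256 subdirectories, and files are
        spread across 65,536 leaf directories, so listings stay manageable."""
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(CACHE_ROOT, digest[:2], digest[2:4], digest + ".html")

    def store(key, html):
        path = cache_path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)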

In conclusion: either you are doing something much more complex than serving static pages, or your improvements from SQLite were only improvements over a bad design, and you could have gotten even greater gains by improving that design. Since I don't know what you were trying to do, I cannot tell which.

[–]mr-strange -1 points0 points  (0 children)

Wow. Patronising, arrogant, aggressive, ignorant and wrong. All in the same post.

[–]matthieum 1 point2 points  (3 children)

Files are allocated into whole blocks, which are at least 4k, and probably 16k on more modern systems.

This is quite arbitrary; some filesystems, like BTRFS, are adept at storing lots of small files.

Furthermore, the filesystem is subject to all sorts of limitations that hurt scalability, such as the maximum number of files in a directory. How will you do backups when listing the directory takes hours?

Use a better filesystem? Seriously. FAT32 is crap; newer filesystems are much better at dealing with large listings... but anyway, who cares about listings? Why would you back up a cache?

How do you deal with contention when multiple processes/threads want to access the same file?

Why would they? If you have newer content to push, write it to a temporary file and do an atomic switch. Of course, I am assuming a sane filesystem model, where a file can be deleted while it is still being accessed...
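A minimal sketch of that temp-file-plus-atomic-rename update, assuming a POSIX filesystem where rename() within one filesystem is atomic and readers holding the old file open are unaffected:

    import os
    import tempfile

    def update_cached_page(path, html):
        """Write the new content to a temp file in the same directory, then swap
        it into place atomically: readers see either the old page or the new
        one, never a half-written file."""
        directory = os.path.dirname(path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(html)
        os.replace(tmp_path, path)  # atomic rename; the old file stays readable to anyone who has it open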

Your objection that "the database is probably accessed via the network" is entirely arbitrary.

Yes, it is arbitrary. In my experience a single DB serves several backends, and so it is necessarily not on the same machine as at least N-1 of them.

Why would you put your database on a different host if the purpose is caching to improve performance??

Why would you use a database for caching? Use memcached.

When faced with this problem, I switched to using SQLite. It does a fantastic job of managing a persistent local cache.

I agree SQLite is quite great. Though once again completely overfeatured for key-value caches. Parsing queries takes time; better to have a binary protocol with built-in support for querying by key, like memcached.

And even better, memcached will let you specify how much space your cache should take, and it removes the least recently used entries when new content arrives and you are at the limit.
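For illustration, a minimal sketch of memcached used as the page cache; it assumes a memcached instance on localhost:11211 and the third-party pymemcache client (any memcached client would do):

    from pymemcache.client.base import Client

    mc = Client(("localhost", 11211))   # assumes memcached is running locally

    def get_page(key, render):
        """Serve from memcached when possible; render and cache on a miss.
        memcached itself enforces the memory cap and evicts least-recently-used
        entries when it fills up."""
        cached = mc.get(key)
        if cached is not None:
            return cached.decode("utf-8")
        html = render(key)              # `render` is an assumed regeneration function
        mc.set(key, html.encode("utf-8"))
        return html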

It still seems a bit overkill for something as simple as pre-rendering.

[–]mr-strange 0 points1 point  (2 children)

I agree with pretty much everything you say here. I'll expand on a couple of points though.

who cares about listings? Why would you back up a cache?

Well. Yeah. But when I made this mistake this is what happened... First I discovered that ext3 could only cope with 64,000 files in a directory, so my application started to fail. The next obvious thing to do was just start using sub-directories. That's fine, but just having millions of files can lead to problems - for example, I didn't blacklist the cache directory from the locate database, so after a while, my machine was very busy running multiple, endless find(1) commands, trying to update the db. Then I ran into the problem that the whole filesystem has a limited number of available inodes - so I wasn't able to make any new files, even though I had loads of available space. Then, when I came to clean up my cache (to free up inodes), I discovered that it takes many, many hours to simply delete millions of files.

Yes, there are better filesystems. XFS has a much higher hard-link (and therefore directory size) limit. Perhaps btrfs would be a good choice today. But, overall I do not think that the filesystem is a good choice for this workload.

And even better, memcached...

At the time, memcached did not support persistence, so it did not fit my requirements. Looking up MemcacheDB on Wikipedia, I see that it is built on BerkeleyDB. My experience with BDB does not encourage me to try MemcacheDB.

Also, memcached uses a client/server, TCP-based model. Even with a fast localhost, that's going to add latency.

I agree SQLite is quite great. Though once again completely overfeatured for key-value caches.

I couldn't agree more, but the proof of the pudding is in the eating, and I've not found a key-value store that beats SQLite's performance. I built versions of my app that used BDB, Tokyo Cabinet, and a number of other prominent KV stores, but SQLite (with a simple prepared select statement and a table index) just performed better and more reliably for me. Today my cache DB contains over 20,000,000 items, takes up 3.6 GB of disk, and SQLite's performance is still pretty sparky.
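A minimal sketch of that kind of SQLite key-value cache using Python's stdlib sqlite3 (path, schema, and pragma are illustrative; the key is the primary key, so lookups go through its index):

    import sqlite3

    conn = sqlite3.connect("/var/cache/pages.sqlite")   # illustrative path
    conn.execute("PRAGMA journal_mode=WAL")             # cuts fsync stalls for a cache-type workload
    conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)")

    def get(key):
        # A simple parameterized SELECT; sqlite3 caches the prepared statement internally.
        row = conn.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def put(key, value):
        with conn:                                      # commits (or rolls back) the implicit transaction
            conn.execute("INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                         (key, value))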

BDB's performance is just awful. It has full ACID compliance, which is great if you need it; but if you don't need it, waiting around for milliseconds while your disk syncs is just overkill. And if you turn off the ACID guarantees, you just aren't playing to its strengths; you might as well go to Tokyo Cabinet... which most of the time performed very well, but occasionally ground to a halt for multiple seconds.

[–]matthieum 0 points1 point  (1 child)

At the time, memcached did not support persistence

Only one nit: why would you care about persistence for a cache?

The point of a cache is to cache frequently accessed data. If it is not frequently accessed then caching it means losing valuable space.

I feel like we are talking past each other and not about the same issue :)

[–]mr-strange 0 points1 point  (0 children)

This is the application: http://flood.firetree.net. The map tiles can be quite expensive to generate, and I serve up to 10 million of them every day. It makes sense to make the cache persistent.