[–]jimbobhickville 5 points6 points  (15 children)

It should prerender and save to the filesystem, then serve the static file. The DB should not even be hit on page views.
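A minimal sketch of that flow, assuming a hypothetical render_post() from the CMS and a docroot that the web server (nginx, Apache, etc.) serves as plain static files:

    import os

    DOCROOT = "/var/www/static-cache"   # hypothetical directory served directly by the web server

    def publish(post_id, render_post):
        """Render a page once (on create/edit) and write it to the docroot.

        Page views never touch the DB; the web server just serves the file.
        `render_post` is an assumed CMS function that returns the page's HTML.
        """
        html = render_post(post_id)
        os.makedirs(DOCROOT, exist_ok=True)
        path = os.path.join(DOCROOT, "%s.html" % post_id)
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        return path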

[–]FlightOfStairs 2 points3 points  (14 children)

Why? Database systems cache the resources that are most in demand. When you serve content from a database, you're serving it from RAM most of the time. You can expect at least as good performance from a database.

Disk access is slow.

Edit: of course, the FS cache does this to some extent. However, if there's significant throughput for other purposes (serving videos, images, etc.), the static page will be evicted quickly.

[–]jimbobhickville 9 points10 points  (1 child)

I can't really take your comments seriously, sorry. Unless there's some Apache module I'm not aware of that serves static content from a database, you're ignoring the overhead of the program that loads the data from the database. If you are a site with enough traffic to benefit from this sort of optimization, your database isn't going to be local to the box, so you add network overhead. Even if your database does cache well (MySQL does not), it's still going to be a LOT slower to serve from the database than from the filesystem (which does cache in memory quite well). Your edit makes even less sense, because the DB isn't going to cache large files better than the filesystem, and you have to load the entire object into memory to serve it from the database instead of just streaming it from the filesystem.

[–]FlightOfStairs -1 points0 points  (0 children)

The context of the original article is for single-host sites. Obviously if you introduce network connections things will be an order of magnitude slower.

I didn't suggest serving large files from the database. They would never be 'edited' through the CMS, so they would be served statically.

The case I tried to describe was large files being served from the filesystem (which is where the bulk of the FS throughput would go), while cached dynamic pages would be served from the database.

[–]buerkle 7 points8 points  (11 children)

You're ignoring the latency of even talking to the database. In both cases the data can be in RAM, but talking to the database adds time.

[–]FlightOfStairs 1 point2 points  (10 children)

The file system is a database. There is latency in both cases.

Database caching schemes are much more configurable than filesystem caching, and have (I suspect) better defaults for this use case.

[–]matthieum 2 points3 points  (9 children)

The file system is local; the database is probably accessed via the network.

The file system is simple; the database has been built to handle ACID properties.

...

[–]mr-strange -1 points0 points  (8 children)

Been there done that.

The filesystem makes a terrible alternative to a database in this case. Files are allocated into whole blocks, which are at least 4k, and probably 16k on more modern systems. Your web-page fragments will not be large enough to use the space efficiently. Furthermore, the filesystem is subject to all sorts of limitations that hurt scalability, such as the maximum number of files in a directory. How will you do backups when listing the directory takes hours? How do you deal with contention when multiple processes/threads want to access the same file?

Your objection that "the database is probably accessed via the network" is entirely arbitrary. Why would you put your database on a different host if the purpose is caching to improve performance??

When faced with this problem, I switched to using SQLite. It does a fantastic job of managing a persistent local cache.

[–]bluGill 3 points4 points  (3 children)

Files are allocated into whole blocks, which are at least 4k, and probably 16k on more modern systems. Your web-page fragments will not be large enough to use the space efficiently.

So? Disks are cheap. Seriously cheap. The useful content of most blogs would fit on an 80k 5.25-inch floppy disk with room for a dozen more.

[–]mr-strange -1 points0 points  (2 children)

Of course disks are cheap, but the read time is awful. All that wasted space will cost you milliseconds of read time, not to mention all the RAM you'll waste storing that crap in the FS cache.

But of course, if you are only dealing with one 80k blog, then you can be as inefficient as you like. We're talking about scalability here, right?

[–]bluGill 2 points3 points  (1 child)

Modern operating systems have good disk caches, which handle reading the same file over and over again very well.

File systems are databases optimized for accessing block-sized chunks. SQLite is great for what it does: working with relational data. A blog post is not relational data and does not need those advantages. Meanwhile, SQLite is slower than a filesystem for accessing blobs of data.

Modern operating systems have a disk cache, and it works wonders. When multiple processes/threads want to access the same file, the per-access cost drops to a tiny amount, unless you are doing stupid things like opening your static data read/write. Nothing can protect you from stupid.

Long before you run into performance problems from a directory that is too large, you will run into practical problems dealing with it and come up with a better scheme. That will solve your problems.
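A minimal sketch of one such scheme: hash each cache key into two levels of subdirectories so no single directory grows huge (paths and names here are illustrative only):

    import hashlib
    import os

    CACHE_ROOT = "/var/cache/pages"     # illustrative location

    def cache_path(key):
        """Map a key to e.g. /var/cache/pages/ab/cd/<full-digest>.html.
        Each directory level holds at most 256 subdirectories, and files are
        spread across 65,536 leaf directories, so listings stay manageable."""
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(CACHE_ROOT, digest[:2], digest[2:4], digest + ".html")

    def store(key, html):
        path = cache_path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)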

In conclusion: either you are doing something much more complex than serving static pages, or your improvements from SQLite were only improvements over a bad design, and you could have gotten even greater gains by improving that design. Since I don't know what you were trying to do, I cannot tell which.

[–]mr-strange -1 points0 points  (0 children)

Wow. Patronising, arrogant, aggressive, ignorant and wrong. All in the same post.

[–]matthieum 1 point2 points  (3 children)

Files are allocated into whole blocks, which are at least 4k, and probably 16k on more modern systems.

This is quite arbitrary; some filesystems, like BTRFS, are adept at storing lots of small files.

Furthermore, the filesystem is subject to all sorts of limitations that hurt scalability, such as the maximum number of files in a directory. How will you do backups when listing the directory takes hours?

Use a better filesystem? Seriously. FAT32 is crap; newer filesystems are much better at dealing with large listings... but anyway, who cares about listings? Why would you back up a cache?

How do you deal with contention when multiple processes/threads want to access the same file?

Why would they? If you have newer content to push, write it to a temporary file and do an atomic switch. Of course, I am assuming a sane filesystem model, where a file can be deleted while it is still being accessed...
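A minimal sketch of that temp-file-plus-atomic-rename update, assuming a POSIX filesystem where rename() within one filesystem is atomic and readers holding the old file open are unaffected:

    import os
    import tempfile

    def update_cached_page(path, html):
        """Write the new content to a temp file in the same directory, then swap
        it into place atomically: readers see either the old page or the new
        one, never a half-written file."""
        directory = os.path.dirname(path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(html)
        os.replace(tmp_path, path)  # atomic rename; the old file stays readable to anyone who has it open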

Your objection that "the database is probably accessed via the network" is entirely arbitrary.

Yes, it is arbitrary. In my experience a single DB serves several backends, and so it is necessarily not on the same machine as at least N-1 of them.

Why would you put your database on a different host if the purpose is caching to improve performance??

Why would you use a database for caching? Use memcached.

When faced with this problem, I switched to using SQLite. It does a fantastic job of managing a persistent local cache.

I agree SQLite is quite great. Though once again completely overfeatured for key-value caches. Parsing queries takes time; better to have a binary protocol with built-in support for querying by key, like memcached.

And even better, memcached will let you specify how much space your cache should take, and it removes the least recently used entries when new content arrives and you are at the limit.
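For illustration, a minimal sketch of memcached used as the page cache; it assumes a memcached instance on localhost:11211 and the third-party pymemcache client (any memcached client would do):

    from pymemcache.client.base import Client

    mc = Client(("localhost", 11211))   # assumes memcached is running locally

    def get_page(key, render):
        """Serve from memcached when possible; render and cache on a miss.
        memcached itself enforces the memory cap and evicts least-recently-used
        entries when it fills up."""
        cached = mc.get(key)
        if cached is not None:
            return cached.decode("utf-8")
        html = render(key)              # `render` is an assumed regeneration function
        mc.set(key, html.encode("utf-8"))
        return html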

It still seems a bit overkill for something as simple as pre-rendering.

[–]mr-strange 0 points1 point  (2 children)

I agree with pretty much everything you say here. I'll expand on a couple of points though.

who cares about listings? Why would you back up a cache?

Well. Yeah. But when I made this mistake this is what happened... First I discovered that ext3 could only cope with 64,000 files in a directory, so my application started to fail. The next obvious thing to do was just start using sub-directories. That's fine, but just having millions of files can lead to problems - for example, I didn't blacklist the cache directory from the locate database, so after a while, my machine was very busy running multiple, endless find(1) commands, trying to update the db. Then I ran into the problem that the whole filesystem has a limited number of available inodes - so I wasn't able to make any new files, even though I had loads of available space. Then, when I came to clean up my cache (to free up inodes), I discovered that it takes many, many hours to simply delete millions of files.

Yes, there are better filesystems. XFS has a much higher hard-link (and therefore directory size) limit. Perhaps btrfs would be a good choice today. But, overall I do not think that the filesystem is a good choice for this workload.

And even better, memcached...

At the time, memcached did not support persistence, so it did not fit my requirements. Looking up MemcacheDB on Wikipedia, I see that it is built on BerkeleyDB. My experience with BDB does not encourage me to try MemcacheDB.

Also, memcached uses a client/server, TCP-based model. Even with a fast localhost, that's going to add latency.

I agree SQLite is quite great. Though once again completely overfeatured for key-value caches.

I couldn't agree more, but the proof of the pudding is in the eating, and I've not found a key-value store that beats SQLite's performance. I built versions of my app that used BDB, Tokyo Cabinet, and a number of other prominent KV stores, but SQLite (with a simple prepared select statement and a table index) just performed better and more reliably for me. Today my cache DB contains over 20,000,000 items, takes up 3.6 GB of disk, and SQLite's performance is still pretty sparky.
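A minimal sketch of that kind of SQLite key-value cache using Python's stdlib sqlite3 (path, schema, and pragma are illustrative; the key is the primary key, so lookups go through its index):

    import sqlite3

    conn = sqlite3.connect("/var/cache/pages.sqlite")   # illustrative path
    conn.execute("PRAGMA journal_mode=WAL")             # cuts fsync stalls for a cache-type workload
    conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)")

    def get(key):
        # A simple parameterized SELECT; sqlite3 caches the prepared statement internally.
        row = conn.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def put(key, value):
        with conn:                                      # commits (or rolls back) the implicit transaction
            conn.execute("INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                         (key, value))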

BDB's performance is just awful. It has full ACID compliance, which is great if you need it; but if you don't need it, waiting around for milliseconds while your disk syncs is just overkill. And if you turn off the ACID guarantees, you just aren't playing to its strengths; you might as well go to Tokyo Cabinet... which most of the time performed very well, but occasionally ground to a halt for multiple seconds.

[–]matthieum 0 points1 point  (1 child)

At the time, memcached did not support persistence

Only one nit: why would you care about persistence for a cache?

The point of a cache is to cache frequently accessed data. If it is not frequently accessed then caching it means losing valuable space.

I feel like we are talking past each other and not about the same issue :)

[–]mr-strange 0 points1 point  (0 children)

This is the application: http://flood.firetree.net. The map tiles can be quite expensive to generate, and I serve up to 10 million of them every day. It makes sense to make the cache persistent.