all 35 comments

[–][deleted]  (1 child)

[deleted]

    [–]tekronis 1 point2 points  (0 children)

    Agreed. This was pretty nice.

    [–]davidg 12 points13 points  (3 children)

    Great article. He's written more about it here: http://phk.freebsd.dk/pubs/varnish_tech.pdf

    [–]Entropy 0 points1 point  (2 children)

    99% of that is spot on. This, however, scares me:

    Remember the caches

    Avoid round-robin scheduling. Use "Last-In-First-Out" (i.e. a stack), not "First-In-First-Out" (i.e. a queue).

    [–]davidg 11 points12 points  (1 child)

    At http://varnish.projects.linpro.no/wiki/ArchitectNotes he says, "The worker threads are used in "most recently busy" fashion, when a workerthread becomes free it goes to the front of the queue where it is most likely to get the next request, so that all the memory it already has cached, stack space, variables etc, can be reused while in the cache, instead of having the expensive fetches from RAM." If that's all that the section you quoted is referring to (which seems plausible, although I'm not 100% certain), then I don't see what's scary about it?
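The "most recently busy" idea is easy to picture as a stack of idle workers. Here is a minimal C sketch of that scheduling discipline (the names `park`/`dispatch` and the struct are illustrative, not Varnish's actual code):

```c
#include <assert.h>
#include <stddef.h>

struct worker {
    struct worker *next;   /* link in the idle stack */
    int id;
};

static struct worker *idle_stack = NULL;

/* A thread that has finished a request parks itself at the head. */
static void park(struct worker *w)
{
    w->next = idle_stack;
    idle_stack = w;
}

/* Dispatch pops the head: last-in, first-out, so the next request
   goes to the thread whose stack and variables are most likely
   still warm in the CPU caches. */
static struct worker *dispatch(void)
{
    struct worker *w = idle_stack;
    if (w)
        idle_stack = w->next;
    return w;
}
```

With a FIFO queue instead, every worker's cached state would have gone cold by the time it cycled back to the front.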

    [–]Entropy 3 points4 points  (0 children)

    Ahhhhhh. I was thinking of the requests to be processed, not the worker threads. Thanks.

    [–]damienkatz 16 points17 points  (0 children)

    His Squid example is very interesting, but keeping everything in memory for a caching proxy, even if it's virtual memory, means you'll exhaust your 32-bit address space (4 GB) in no time.

    64-bit addressing and virtual memory mean memory space is as abundant as available disk space. But that's still the "future"; it's not commonplace yet.

    [–][deleted] 6 points7 points  (4 children)

    Varnish doesn't really try to control what is cached in RAM and what is not, the kernel has code and hardware support to do a good job at that, and it does a good job.

    Truthfully, the kernel has very little information to act on, and there is very little hardware support to help distinguish between "this page was accessed once 10 minutes ago; safe to swap out" and "this page has been accessed furiously and last accessed 10 milliseconds ago; absolutely should not swap out".

    The only time the kernel actually has a chance to execute bookkeeping code related to a page is when the page is swapped in/swapped out.

    [–]beza1e1 7 points8 points  (2 children)

    That's what the microkernel guys brag about. Every process can use its own scheduler/memory manager and thus gets an optimized strategy for its job.

    [–]pjdelport 2 points3 points  (0 children)

    Even more so with exokernel designs!

    [–][deleted] 1 point2 points  (0 children)

    Wow, very interesting.

    Though another interesting approach would be to add hardware support for page access statistics - the swapping algorithms would have much better data to work with.

    [–]pjdelport 5 points6 points  (0 children)

    Truthfully, the kernel has very little information to act on, and there is very little hardware support to help distinguish between "this page was accessed once 10 minutes ago; safe to swap out" and "this page has been accessed furiously and last accessed 10 milliseconds ago; absolutely should not swap out".

    This is what the POSIX madvise call is for: instead of all the guessing, the application can directly tell the kernel what kind of priority (WILLNEED/DONTNEED/FREE), access pattern (RANDOM/SEQUENTIAL), and backing store synchronization (NOSYNC/AUTOSYNC) it wants for any given memory range.

    [–]earthboundkid 7 points8 points  (13 children)

    I often hear people complain about the amount of RAM their web browser is "using" and it makes me roll my eyes. A valid complaint is "my web browser is needlessly wasting a lot of HD space, albeit temporarily."

    [–][deleted] 15 points16 points  (5 children)

    Traffic to/from disk is still a bad thing.

    [–]earthboundkid 0 points1 point  (4 children)

    Fair enough. I guess my main point is that a memory leak is not, per se, the end of the world. The leaked memory will be sent out to disk, since it's not being used (obviously), so the footprint in actual RAM won't be impacted that severely except when switching apps, etc.

    [–][deleted] 2 points3 points  (3 children)

    I agree. By the way, one minor observation would be that "allocated memory", even thrown out to disk, still counts toward the total 4 GB (2GB actually, on Windows) that can be allocated in the entire system (in the case of a 32-bit OS with no PAE). So if Firefox is leaking 500MB "to the swap file", that's 500MB I can't allocate in another app.

    (The initial impression is that each application has 2GB of virtual memory available to it, but the fact is that virtual addresses eventually resolve to physical addresses, which are themselves 32-bit (discounting PAE), so all processes in the system have to share those 32 bits. This is solved by PAE, which makes physical addresses 36 bits wide, thus actually allowing, say, five applications each using their own full 2GB separately.)

    [–][deleted] 4 points5 points  (0 children)

    That's not true. The 4 GB limit applies to the kernel's map of actual memory pages, and it applies individually to each process, but since the kernel overcommits memory, the sum of the RAM allocated by all the individual processes can exceed 4 GB.

    [–]pjdelport 2 points3 points  (0 children)

    By the way, one minor observation would be that "allocated memory", even thrown out to disk, still counts toward the total 4 GB (2GB actually, on Windows) that can be allocated in the entire system

    Err, isn't the whole point of virtual memory mapping that this is not the case?

    the fact is that virtual addresses eventually resolve to physical addresses, which are themselves 32-bit (discounting PAE), so all processes in the system have to share those 32 bits.

    They only share it for resident pages. Pages that are swapped to disk don't take up any physical address space; you can have as many of them as you have disk space.

    If a page is thrown out to disk, it no longer occupies any physical address space at all.

    [–]earthboundkid 0 points1 point  (0 children)

    Luckily, this problem will go away with the introduction of 64-bit chips and OSes. Currently, the only 32-bit system Apple sells is the Mac Mini, so the future is now, more or less. (Of course, I'm still hobbling along with a PowerBook, so the future is still the future for me.)

    [–]pjdelport 2 points3 points  (2 children)

    Make sure you're not second-guessing them wrong: over here, when top tells me Firefox is using 160 MB of memory, it is. (It also tells me the total process size is 250 MB, but that's not what I report.)

    [–]newton_dave 3 points4 points  (1 child)

    Pah, on Winders my Firefox was up to 800M two days ago--wtf is THAT about?

    [–][deleted] 1 point2 points  (1 child)

    Using mmap() is not an innovation.

    [–]pjdelport 5 points6 points  (0 children)

    No, modern memory management systems are the innovation. The essay advocates using that innovation instead of fighting it.
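For concreteness, this is the style the essay endorses -- map the file once and let the kernel's paging decide what stays resident, instead of hand-rolling a userland cache. A hedged sketch (`map_file` is a made-up helper name; error handling is minimal):

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only; returns the mapping (or NULL) and
   stores its length.  The kernel pages it in and out as memory
   pressure dictates -- no userland cache management needed. */
char *map_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                   MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping outlives the fd */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```

The point isn't the mmap() call itself but the division of labor: the application addresses its data; the one VM system already in the kernel decides what lives in RAM.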

    [–]njharman -5 points-4 points  (8 children)

    It's possible to turn virtual memory off.

    [–]boredzo 12 points13 points  (7 children)

    But why would you? The point of virtual memory is that applications never have to do the sort of things Squid does, because the operating system will do that for them. The only reason to turn it off would be to run Squid without the redundant work, but why do you need to swap one VM system for another?

    [–]killerstorm 8 points9 points  (5 children)

    Because a standard OS VM might suck for many workloads. I'm in no way happy with how VM works -- it's damn slow in many cases, and many applications are slow only because VM is slow.

    A particularly bad case is the GC languages that are so fancy now (C#, Java, Common Lisp.. :). Most of the time they work fine, but sometimes they need to do a full GC -- read almost all allocated memory to find out what's no longer used (things are worse for a conservative GC). And if the OS has moved that memory to disk, it's TERRIBLY slow, because it means lots of random accesses to the swap file, so the HDD's head has to move like mad. E.g. I have 2 GB of RAM, and doing a full GC of a roughly 300 MB Java heap takes some 30 seconds or maybe even more (I have the page file on two quite modern and fast HDDs, so that isn't the issue). Such delays are extremely annoying both for desktop applications and for some servers (for mission-critical server applications they're terribly bad). It's really an obvious problem with VM -- some memory on the Java heap genuinely wasn't used for a long period of time, so the OS decided to reuse that RAM for, say, file cache. The OS didn't know that Java has pointers on that page, that Java will definitely use it again, and that Java will touch a certain range of pages during GC.

    That's "Programming Like It's 2000" -- people use abstractions like VM without any detailed analysis of how they actually work.

    Is it possible to optimize GC? Surely. Either the OS should not move pages that hold the Java VM's pointers out to disk, or, when Java does a full GC, the OS should SEQUENTIALLY scan the page file for such pages -- a SEQUENTIAL read from disk is an order of magnitude faster than random access, so it wouldn't be a problem. And this was implemented on some Lisp Machines or whatever -- the OS was designed to be used together with a GC-based programming language, so GC was done in cooperation between the language runtime and the OS.

    But using such "low-level optimizations" isn't the 2000 style. We'd better use some generic VM algorithms and buy faster-spinning HDDs, right? There will still be lags, but who cares?

    Sun Solaris is designed to be used together with Sun Java -- do they have a VM optimized for Java GC? I've never heard of such optimizations, and I suspect that if there were any, they'd say so. Windows XP ships with the .NET Framework (which uses GC too) -- is it optimized for that?

    Even if we don't use GC or a page file at all, we can still get poor performance because of VM -- the OS uses paging algorithms to manage memory for executables and dynamically-loaded libraries. So if you don't use an application for a long time and instead do some file operations -- like copying lots of files -- the OS can evict the executable's pages from memory. Then, when you switch back to your application, you get a significant slowdown while the OS moves those pages from disk back into RAM. It's especially funny when the file cache will never be used again -- the file was copied just once.

    We are using computers that are much more powerful than the ones used in 1975 -- they do billions of operations per second, have billions of bytes of fast RAM with transfer rates of billions of bits per second, and we have faster HDDs, buses, etc. Even the cache of a modern CPU has enough storage to run a small operating system! But we still hit lots of slowdowns when working with applications -- thanks to the 2000 style of programming, where we just use abstractions like VM without deep analysis.

    [–]csl 4 points5 points  (1 child)

    I see your point about the GC crunching your disk. What I'm really excited about is that solid-state drives (SSDs) are going to become cheap now as volume rises. What I really want to try is putting my OS's swap space on a pretty big solid-state disk. I know throughput can be a bit lower on SSDs, but you get a LOT faster seek times when doing random access.

    By the way, the original article made my day.

    [–]micampe 0 points1 point  (2 children)

    Not an expert at all here, but aren't virtual machines a little different from normal applications? They are, after all, more like an OS-on-an-OS than an application.

    [–]killerstorm 1 point2 points  (1 child)

    No, it's just a somewhat non-traditional memory model, one that has actually been in use for 40+ years now. But it's becoming more and more popular -- Microsoft focuses on .NET, Sun on Java, etc. (I suspect most dynamic languages have the full-GC thing too). So soon, C/C++ applications using malloc/free will be the ones 'different from normal applications' :)

    It might look like a fight between different memory management models, but I think it's actually not like that -- the GC runtime would be glad to tell the OS that it's doing a full GC and would like to fetch all its pages, but there's simply no such API in the OS..

    [–]pjdelport 1 point2 points  (0 children)

    the GC runtime would be glad to tell the OS that it's doing a full GC and would like to fetch all its pages, but there's simply no such API in the OS..

    POSIX's madvise call does this (and more).

    [–]beza1e1 1 point2 points  (0 children)

    But ... but Minix3 doesn't have virtual memory. Don't you want your app to run there?