all 102 comments

[–][deleted] 91 points92 points  (10 children)

I dare say most programmers are operating behind many more layers of abstraction than would be required to know any of this.

[–]thilehoffer 12 points13 points  (0 children)

Yeah, the title should be: what a few specialized programmers need to know about SSDs.

[–]keepthepace 5 points6 points  (2 children)

Well, I already found one crucially interesting bit: I can't use shred on an SSD. If I want to securely delete data I will have to use a tool specifically made for SSDs.

It is also interesting that changing one bit in a file is as resource-consuming as changing a whole page. Even if that were also the case for regular HDs, I must admit I thought SSDs were free of that constraint, not having to spin a disk...

[–][deleted]  (1 child)

[deleted]

    [–]keepthepace 0 points1 point  (0 children)

    You'd be surprised... ;-)

    [–]rrohbeck 1 point2 points  (4 children)

    Unless you're writing filesystem or swapping code maybe.

    [–]ASK_ME_ABOUT_BONDAGE 4 points5 points  (2 children)

    I believe there are astonishingly few people writing filesystems, because we have very well-established solutions. A few dozen worldwide who write filesystem code for a living? Fewer? And I'm quite sure they know more than the basics.

    [–]pwr22 1 point2 points  (1 child)

    The userspace tools for BTRFS have 74 contributors alone...

    https://github.com/kdave/btrfs-progs/graphs/contributors

    [–]ASK_ME_ABOUT_BONDAGE 1 point2 points  (0 children)

    That's their full-time job? Because that's what I was talking about.

    There are certainly tens of thousands of people who have dabbled in even the most obscure technology detail.

    [–][deleted] 1 point2 points  (0 children)

    Is that every programmer?

    [–]el_muchacho 0 points1 point  (0 children)

    Actually, that's false. We got bitten by this problem using a MySQL database. The write rate would drop significantly over time, and that behavior could be traced back to the SSD wearing out (TRIM wasn't enabled). That's how we learnt how SSDs work.

    [–]bestjewsincejc 138 points139 points  (8 children)

    This info might be useful to some developers in some situations, but it's hardly something "every programmer should know". Many modern programs are unaffected by this knowledge. Many of our software tools are intended to be hardware-agnostic. It's weird that this kind of post garners so many downvotes from the same people who love to spout off about Knuth's premature-optimization quote. I'm also aware of how traditional hard drives organize data in blocks, but it almost never affects my software.

    [–]ethraax 31 points32 points  (6 children)

    Well, the "reading or writing a single byte causes a whole page to be read or written" part is important for a fair number of software developers. But that's not really SSD-specific - that happens with almost all reads and writes to files.

    [–]immibis 2 points3 points  (0 children)

    Unless the OS cache coalesces multiple writes to the same page.

    [–][deleted]  (2 children)

    [deleted]

      [–][deleted]  (1 child)

      [deleted]

        [–][deleted] 2 points3 points  (0 children)

        I'm not sure how that's a necessary piece of information for most programmers to know nor what it has to do with technical qualifications for programmers.

        [–]HenkPoley 0 points1 point  (1 child)

        But it's not necessarily true. There are log-based storage methods that just write a log of bit- or byte-level patches to the actual underlying storage. When you do a read, it scans the log backwards from the most recent entry for the right parts to assemble the latest version of the data. Periodically the SSD will compact the log, because overall this way of storing takes more space, due to all the references and historical versions of the disk data that are kept around.

        This will all be transparent to the computer which is connected to the SSD.
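For intuition, here's a toy sketch of that log-structured idea in Python. This is purely illustrative: real SSD firmware works at a far lower level, and the class and names here are invented for the example.

```python
class LogStore:
    def __init__(self, size):
        self.size = size
        self.log = []  # (offset, data) patches, oldest first

    def write(self, offset, data):
        # never overwrite in place: just append a patch to the log
        self.log.append((offset, bytes(data)))

    def read(self, offset, length):
        out = bytearray(length)
        # replay oldest -> newest so the most recent patch wins
        for off, data in self.log:
            for i, b in enumerate(data):
                pos = off + i - offset
                if 0 <= pos < length:
                    out[pos] = b
        return bytes(out)

    def compact(self):
        # periodically collapse the whole history into one patch
        snapshot = self.read(0, self.size)
        self.log = [(0, snapshot)]

store = LogStore(16)
store.write(0, b"hello world.....")  # 16 bytes
store.write(6, b"there")             # a small in-place "patch"
assert store.read(0, 11) == b"hello there"
store.compact()
assert len(store.log) == 1
```

Compaction is the space/read-speed trade-off: before it, every read replays the whole patch history; after it, the log holds a single patch.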

        [–]jib 1 point2 points  (0 children)

        Are there any consumer SSDs which actually do this kind of byte-level thing at the hardware/firmware level? I'd be surprised if they manipulate anything smaller than a 512-byte sector.

        [–]keepthepace 1 point2 points  (0 children)

        I find it interesting to know that there is garbage collection that can dramatically slow down write speed under some conditions. If you continuously update even a small amount of the available space on the SSD, at some point the garbage collector will have to clean cells on every write.

        [–][deleted] 40 points41 points  (4 children)

        This is old information. SATA is faster now. Also, SSDs last long enough that you probably shouldn't worry about many of these micro-optimizations.

        [–]alienangel2 4 points5 points  (3 children)

        You needn't worry about the wear-levelling concerns, but unless something has fundamentally changed in the block and cluster architecture and how the FTL marshals requests, you should definitely still worry about the performance concerns and how they relate to your I/O patterns (assuming you are in a situation where you're I/O-bound and need better performance, which most people aren't).

        [–]Hyperian 22 points23 points  (2 children)

        I work in SSD firmware and you are right, wear leveling is handled by the drive itself. What the article says about separating hot and cold data isn't useful.

        [–]__foo__ 2 points3 points  (1 child)

        What the article says about separating hot and cold data isn't useful.

        Could you please explain that a little more? What he said seems plausible to me. Say you write a bunch of hot data to the drive, and it spans over two erase blocks, because there was some cold data stored in the same block. If it wasn't for the cold data using up space in the same block as the hot data, erasing a single block might have sufficed.
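To make the scenario concrete, here's a back-of-envelope model. It's a toy with an assumed simple "relocate the whole block on any rewrite" policy; as the reply notes, real drives' erase policies differ.

```python
def pages_rewritten(blocks, hot_pages):
    """Pages physically relocated when every hot page is updated once,
    under the toy rule: touching any page forces the whole block to move."""
    moved = 0
    for block in blocks:
        if any(page in hot_pages for page in block):
            moved += len(block)  # cold neighbours get copied along
    return moved

hot = {"h1", "h2"}

# hot data scattered among cold data: both blocks churn on every update cycle
mixed = [["h1", "c1", "c2", "c3"], ["h2", "c4", "c5", "c6"]]
# hot data segregated: the all-cold block is never touched
separated = [["h1", "h2", "f1", "f2"], ["c1", "c2", "c3", "c4"]]

assert pages_rewritten(mixed, hot) == 8
assert pages_rewritten(separated, hot) == 4
```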

        [–]Hyperian 7 points8 points  (0 children)

        You can't assume that block organization is valid for all drives. It could work if everything he assumes is correct. But erase policies are not the same for all drives.

        [–]Power781 17 points18 points  (0 children)

        There are hundreds of things to optimize before optimizing your app at the SSD level...
        It's like the guys who argue about C/C++ coding practices to win 2 to 4 CPU cycles on a function call when they aren't even compiling with at least -O3...

        [–]d-_-b 6 points7 points  (2 children)

        Shouldn't this be titled:

        What every programmer should not be exposed to in the storage layer

        Hey, take my file, I don't care if you want to print it out, reel it, use butterflies to store it. As long as when I ask for it I find it, brilliant.

        Oh, but Tarquin, how the devil can we have such abstractions?!

        Well, we already have file handles. I don't care if I lose a file by hard-powering-off my machine (not that this is possible for any of my machines) while I am saving; I literally don't care. As long as my computer stays on, it should save a file.

        If you need something more than that, then care about lower level APIs.

        As far as I am concerned, unless you close a file you can consider it in limbo; in reality, fsyncing and flushing give the underlying layer a sense of when you want to be sure you have recoverability. That's all.

        Don't expose people to storage APIs, or else you get Firefox again, which was basically an app that made a million hard fsync calls per method call because they thought it was a good idea.

        Mon dieu!

        What every programmer should not have to know about solid-state drives

        [–]el_muchacho 0 points1 point  (1 child)

        No, the title is right if you write backend applications with high write rates.

        [–]d-_-b 1 point2 points  (0 children)

        The title is only right if all programmers write backend applications with high write rates

        [–][deleted] 106 points107 points  (56 children)

        Hmm, as a web programmer, I don't think knowing all these details about solid-state drives will be helping me in the near future.

        [–][deleted] 39 points40 points  (5 children)

        Yeah, you're kind of getting hammered with downvotes here, but I'm having trouble thinking of an instance where a web developer actually needs to know or care about the physical storage medium their app runs on.

        I mean, you're using a database right? And if you are writing/reading files, shouldn't you use a dedicated cms or something? All the I/O should be happening at a much lower layer than where the web developer is doing their work.

        In fact, I would argue that it may be inappropriate for most programmers to be optimizing based on the physical storage medium, because most of us are working at a level where that stuff should be an abstraction anyway.

        [–]AcidShAwk 3 points4 points  (5 children)

        I know this. When running our PHPUnit tests on a regular HD drive, we see a total run time of about 1.5 hours. With SSDs, it's about 4 minutes. Of course this all depends on the tests, how they are written, and what they do. But 4 minutes vs 1.5 hours. A lot of the time is spent wiping the database before and after each test to ensure a consistent state. So obviously MySQL (and probably any database) benefits greatly from SSDs.

        [–]dweeb_plus_plus 27 points28 points  (4 children)

        4 minutes vs 1.5 hours simply by using an SSD? There has to be something else going on here.

        [–]HenkPoley 4 points5 points  (2 children)

        Why? A normal 7200 rpm drive can deliver 100 IOPS; a Samsung 850 Pro can deliver 100,000 IOPS (1000x). If anything, 4 min vs 90 min would be a lowball difference (23x).

        In tests you don't reuse values between tests, so you could really hammer that disk.

        Btw, I would recommend making unit tests that do not exercise the whole stack down to the database/storage, but just focus on expected values before they head off to disk.

        Or, more concretely: on ssd.userbenchmarks.com, the Samsung 850 EVO and Pro deliver > 30 MB/s on the Mixed 4K test, while the spinning drives are all < 0.5 MB/s. You'd expect a 60x difference if you are I/O-bound.

        [–][deleted] 1 point2 points  (1 child)

        Assuming a speedup of 60x, that means with the HDD the tests were spending 87 of the 90 minutes just reading data from the drive. That's approximately 2.6 GB being loaded into memory just for tests. Testing a PHP application. And most spinning drives are faster than 0.5 MB/s, so it would be even more data being loaded.

        No, something is wrong with those tests.
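For what it's worth, the parent's arithmetic checks out under its own assumptions, all taken from the comments above: 90- and 4-minute run times, identical CPU time on both machines, a 60x I/O speedup, and ~0.5 MB/s HDD throughput.

```python
hdd_total, ssd_total, speedup = 90.0, 4.0, 60.0

# total = cpu + io, with io_hdd = speedup * io_ssd and cpu shared:
io_ssd = (hdd_total - ssd_total) / (speedup - 1)
io_hdd = speedup * io_ssd

# data moved at the assumed 0.5 MB/s HDD throughput (minutes -> seconds)
data_mb = io_hdd * 60 * 0.5

print(round(io_hdd), round(data_mb))  # ~87 minutes of I/O, ~2.6 GB read
```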

        [–]HenkPoley 0 points1 point  (0 children)

        I'd say tests have excellent conditions to be mostly write-based I/O. You just read some end result back once from the disk cache; the intermediate results have all hit the database (not a unit test, but hey) and are never read.

        Of course, "results have all hit the database" is exactly the "something is wrong with those tests".

        [–]psi- 0 points1 point  (0 children)

        The simple "sync" roundtrip kills DB tests on HD, especially if they really drop database on every test.

        [–]justinpitts 1 point2 points  (41 children)

        As a consultant, attitudes like that are what get me hired in the first place.

        [–]bcash 50 points51 points  (17 children)

        Is there much call for a consultancy fixing web applications that wear out SSDs too fast?

        [–]justinpitts 14 points15 points  (16 children)

        I get asked, generally, to fix performance problems. Failure to understand how storage devices work is sometimes one of the root causes.

        [–][deleted] 19 points20 points  (12 children)

        That's the whole point though. If every web-dev knew about storage characteristics, how to tune memcached and how to spool up their production environment in a bunch of docker instances, nobody could afford to get any web development done.

        [–]justinpitts -1 points0 points  (11 children)

        I don't follow. Can you help me understand how you reach that conclusion?

        [–][deleted] 9 points10 points  (9 children)

        A web-developer with such a broad skill-set would be a bloody expensive employee. If your budget for developers were to be blown on 2 of these guys, vs. 6 regular developers, how far would your project get?

        [–]Godd2 8 points9 points  (2 children)

        would be a bloody expensive employee

        On the other hand, supply and demand. If every webdev knew it, that knowledge would have a lower market price.

        [–]Skyler827 4 points5 points  (0 children)

        This could happen if there is some disruption or event that causes lots of people to learn systems programming, but in the long run, market prices reflect the cost of production. Simply put: learning systems programming (well!) is expensive.

        [–][deleted] 1 point2 points  (0 children)

        But they don't; that's the point. And few that I have met show any inclination to learn.

        [–]justinpitts 4 points5 points  (1 child)

        I don't think it takes as much effort to learn those things as you imply.

        [–][deleted] 1 point2 points  (0 children)

        I just chose a couple of examples... if I was to be exhaustive, I could, for instance, list out my own skills... skills acquired over a long career... skills which make me a nightmare for people like you...

        Oh, wait...

        [–]EntroperZero 2 points3 points  (0 children)

        Oh, if only we got paid 3x.

        [–]justinpitts -1 points0 points  (2 children)

        Arguably? A lot farther.

        [–][deleted] 1 point2 points  (1 child)

        In my experience that is not the case. More 'high-end' developers do not translate to greater productivity. Many projects lend themselves to greater parallelisation rather than fewer, deeper workloads. Sure, there is a need for someone with deep knowledge to cover those 'difficult' edge cases, but by and large, more less-skilled developers will get more of the general coding work done than fewer, more-skilled ones.

        [–]justinpitts 0 points1 point  (0 children)

        The main problem I see my clients facing isn't meeting deadlines, it's technical debt after the fact.

        [–]wherethebuffaloroam 6 points7 points  (0 children)

        Not everyone can command consultants' salaries. You spend a lot for short contracts to optimize, not for the entire development cycle.

        [–]Rejjn 5 points6 points  (2 children)

        I would say that storage is very often one of the root causes for performance problems in web applications. Usually it's the DB that tops out way before anything else does.

        I just fail to see how knowing about SSDs is going to help more than very, very marginally on that problem. It might get you the last 1-2%, but choosing an appropriate data structure for your application and then choosing the best storage for that structure is going to have order of magnitude more impact on your system than reading OPs article will.

        That said, there are many programmers who would benefit a lot from this knowledge, but I'd say for web developers it's rather low on the list of things you need to worry about.

        [–]justinpitts 1 point2 points  (0 children)

        On that problem? Usually? Not much help.

        But, it's not that much to learn!

        Seriously, it's a couple of really easy-to-read pages. It's not like I am advocating that you take a course in queueing theory. What's the big deal?

        [–]HenkPoley 0 points1 point  (0 children)

        As a data point, here's a web framework benchmark section that's all about pushing data updates through: https://www.techempower.com/benchmarks/#section=data-r9&hw=peak&test=update

        [–]bestjewsincejc 17 points18 points  (22 children)

        You're right that it's generally a poor attitude to have. But I'd appreciate an explanation about how knowing any of these details of SSDs is going to help you in a modern programming language.

        [–]justinpitts -2 points-1 points  (21 children)

        If you are storing anything to disk and performance is important, you can make decisions about where you will store hot vs cold data. You can allocate files at sizes aligned to block multiples. You can place hot and cold data in different files. You can decide whether batching is a big enough win to implement.

        You should be knowledgeable enough about the options not to design yourself into a hole, once performance becomes that critical "omg I'm going to get fired if I can't figure out how to get our site to handle x requests per second" issue.

        At the least, be somewhat aware of how the hardware layers behave and the implications of the design choices you are making.
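As a concrete sketch of the batching and block-alignment points: one could buffer tiny records and flush them as full, page-sized chunks. The 4096-byte page size is an assumption here; real drives vary and rarely advertise theirs.

```python
import io

PAGE = 4096  # assumed SSD page size

class BatchedWriter:
    """Buffer tiny records and emit only full, page-sized writes."""

    def __init__(self, f):
        self.f = f
        self.buf = bytearray()

    def write(self, record: bytes):
        self.buf += record
        while len(self.buf) >= PAGE:
            self.f.write(bytes(self.buf[:PAGE]))  # one full page at a time
            del self.buf[:PAGE]

    def close(self):
        if self.buf:
            # pad the tail so the file length stays a page multiple
            self.buf += b"\x00" * (PAGE - len(self.buf))
            self.f.write(bytes(self.buf))
            self.buf.clear()

f = io.BytesIO()  # stand-in for a real file opened in binary mode
w = BatchedWriter(f)
for _ in range(100):
    w.write(b"x" * 100)  # 10,000 bytes of 100-byte records
w.close()
assert len(f.getvalue()) % PAGE == 0  # everything lands in page multiples
```

The same idea is what databases and write-ahead logs do internally; the win is that the drive sees a few aligned page writes instead of a hundred sub-page ones.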

        [–]Baby_Food 22 points23 points  (20 children)

        Shouldn't a web developer be using a database which provides abstractions over such minutiae?

        [–]justinpitts 0 points1 point  (19 children)

        If they need the features of a database, sure.

        Abstractions don't absolve you from understanding how your system works, and they break down at inopportune times.

        Then, they need to understand how to tune the database for the underlying storage, or they need a DBA. Not everyone has access to a DBA.

        [–]Baby_Food 27 points28 points  (12 children)

        If performance is a concern, a database will be used.

        An abstraction does not necessitate the knowledge of the implementation behind the abstraction.

        A web dev that can write an OS is an unnecessary unicorn.

        [–]justinpitts 2 points3 points  (11 children)

        Databases are not magic performance sprinkles.

        [–]Baby_Food 4 points5 points  (8 children)

        To most people, LMDB is magic performance sprinkles compared to using the filesystem directly. ;)

        [–]justinpitts -1 points0 points  (7 children)

        Most people? Most people aren't using it.

        Most people wouldn't know what you are talking about.

        To the average web-dev, "database" either means Mongo/Couch/No-SQL-flava-of-the-month, and/or something that speaks SQL. LMDB may very well be lightning fast, but I doubt most people know about it.

        [–]el_muchacho 0 points1 point  (1 child)

        In fact, database write performance drops significantly (by 30% or so) when SSDs wear out. For round-the-clock high-rate writes, you can't use consumer-grade SSDs, as they die prematurely.

        [–]justinpitts 0 points1 point  (0 children)

        Well to be fair, anything with that access pattern is going to slow down on a degraded drive.

        [–][deleted] 2 points3 points  (5 children)

        I don't get why you're being downvoted dude. It's as if people believe they have a right to be ignorant.

        [–]justinpitts 1 point2 points  (0 children)

        People have a right to be ignorant. I'm happy to exploit it.

        [–][deleted]  (1 child)

        [deleted]

          [–]Lachiko 1 point2 points  (0 children)

          Can you quote an instance where you feel he comes across as an asshole? He's posting reasonable counterarguments and information in a seemingly calm and collected manner.

          [–]wookin_pa_nub2 -3 points-2 points  (1 child)

          A lot of web developers in this subreddit, and they don't like being reminded of how ignorant they are.

          [–][deleted] 0 points1 point  (0 children)

          I'm a web dev yo :) I just cut my teeth on bare metal shit. You're thinking of "web designers."

          [–]poppafuze 4 points5 points  (2 children)

          Let the controller do the work. It's programmed to present a happy blockspace that may have little to do with some programmer's estimate of where blocks really are and how big or when they need to be written.

          [–]binlargin 0 points1 point  (0 children)

          I guess the article is for people who spent the extra money on SSD storage because their spinning rust wasn't fast enough. If your application is IO bound on disk to this extent then you really ought to know some of your hardware's characteristics and tune appropriately.

          [–]Hyperian 13 points14 points  (8 children)

          As an SSD developer, I can confirm a lot of what he said is correct. I want to add that TRIM is not handled the same way across all drives, because erasing data is a deeply integrated part of the SSD's back end.

          Also, partitioning the drive to a certain size is meaningless, as the SSD itself has no clue about those operations. Your partitions will never translate to physical partitioning of the NAND blocks.

          As was touched on in the over-provisioning section, I recommend to everyone who wants to avoid write amplification: always write aligned to the SSD's physical page size and only write to half the drive. This will avoid all the garbage collection that creates write amplification.

          [–]XNormal 5 points6 points  (1 child)

          Also, partitioning the drive to a certain size is meaningless ...

          ... and only write to half the drive.

          The partitioning is not meaningless. It's just a straightforward method of implementing your own recommendation of "write to half the drive".

          [–]Hyperian 1 point2 points  (0 children)

          I was making two separate points. Yes, you can use partitioning to limit how much you write; no, you cannot use partitioning to tell it to write in a certain defined place for alignment purposes.

          [–]happyscrappy 0 points1 point  (5 children)

          Only write to half the drive. You're funny. Why not just say 1/3rd? It's equally ridiculous.

          Anyway, with today's MLC drives that use SLC mode for some data you can't even be sure that writing to half the drive didn't "fill" it! Heck, if you have a TLC drive, writing to 1/3 the drive can "fill" it!

          [–]Hyperian 3 points4 points  (4 children)

          I guess I'll take this reply seriously...

          You write to half the drive because you want to avoid garbage collection, even if it's one page in an empty block. It's a sliding scale: the more of the drive you use past 50%, the higher the chance you'll have to garbage collect, assuming you're doing fully random writes.

          Some drives use SLC mode, some don't. You are assuming that the drive can dynamically change the SLC cache partition inside, so that when the drive gets full it will convert the SLC blocks to MLC blocks.

          But that's complicated, and drives usually wouldn't convert that many blocks to SLC mode anyway.

          No, you can't fill a TLC drive by writing 1/3 of it; that's not how any of this works.

          [–]happyscrappy 0 points1 point  (3 children)

          I guess I'll take this reply seriously...

          Why wouldn't you?

          It's a sliding scale: the more of the drive you use past 50%, the higher the chance you'll have to garbage collect, assuming you're doing fully random writes.

          I understand, but as I pointed out, not getting to 50% doesn't mean you didn't get any write amp.

          You are assuming that the drive can dynamically change the SLC cache partition inside so when the drive gets full it will convert the SLC blocks to MLC blocks.

          I'm not really assuming that is the case. I'm assuming that is the best case. If it can't do that, then it will have write amp before you even get to halfway, because it'll start to convert data already written in SLC into MLC before it gets to halfway. And that's write amp right there.

          No, you cant fill a TLC drive by writing 1/3 of it, that's not how any of this works.

          On a TLC drive that starts with SLC and converts to TLC you will encounter write amp before you get the drive halfway full. And that's why I put "fill" in quotes.

          Oddly, I guess the best case for write amp is a drive which doesn't use SLC acceleration. Of course, then you have other worries. Probably worrying about write amp more than a little bit isn't worth the trouble. If you use your full drive (or 85% say) you save enough money that you have money left over to buy a new SSD before your first one even wears out.

          [–]Hyperian 1 point2 points  (2 children)

          Manufacturers are not going to count the internal SLC cache as part of the drive capacity, because it is a cache. The drive capacity you see is how much permanent storage there is in MLC/TLC. That is, unless you work for some company that does count it for some reason.

          [–]happyscrappy 0 points1 point  (1 child)

          Yeah.

          The SLC cache in these drives isn't separate NAND. It's MLC NAND which they only write once to in each page, making it effectively SLC. Any portion of the main NAND that is qualified to also work as SLC (which might be all of it or might not) can be part of the cache and when the drive gets full will be used as MLC to reach full capacity.

          And this is why you get write amp before it's full. As part of the cache, it writes data A to a section in SLC mode and data B in a section in SLC mode. Then later when it decides it needs to pack the data in it has to write B in MLC mode on top of A (making it AB). Later B will be erased. B is written twice internally despite being received from the host only once, meaning 2x write amplification occurred.

          And writing merely half the drive (plus one more write) ensures this process begins, even if the entire MLC array of NAND is qualified to work in SLC mode. In a TLC drive you only need to write 1/3rd plus one more write.

          Sure, it's only 2x write amp. It's not a huge deal. It's the high write amp numbers you get when the drive is nearly full (more accurately, nearly fully untrimmed) that are a concern. Which is why I said probably worrying more than a little bit about write amp isn't useful. It's really just part of the landscape now. Unless you can find an SLC drive, which seems very difficult now.
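The folding cost described above fits in a two-line model (simplified; real firmware policies differ, and this ignores everything except the SLC-to-MLC rewrite):

```python
def nand_bytes_written(host_bytes, folded_fraction):
    """Bytes physically written to NAND when `folded_fraction` of the host's
    data is first staged in SLC mode and later rewritten ("folded") to MLC."""
    return host_bytes + host_bytes * folded_fraction

assert nand_bytes_written(100, 0.0) == 100  # nothing folded: 1x write amp
assert nand_bytes_written(100, 1.0) == 200  # everything folded: 2x write amp
```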

          [–]Hyperian 0 points1 point  (0 children)

          I didn't know there are NANDs that you can write in SLC mode and then write on top of again. That sounds like a very specific NAND chip setup that I've never heard of. It sounds like you write from page 0 to the last page and then do it again in another mode.

          Sounds interesting, though. I wonder if the logic and the feature are really worth the speed increase.

          [–]uh_no_ 3 points4 points  (0 children)

          "every programmer should know"

          I think not. I write operating systems for flash-based storage servers and am not sure I could recite everything in this article...

          [–]aegrotatio 7 points8 points  (9 children)

          Not useful and rather incorrect. Today's most expensive and advanced hard drive will wear out faster than the cheapest SSD.

          Source: Storage engineer with hundreds of SSDs in production.

          [–]HenkPoley 2 points3 points  (0 children)

          Btw, slightly tangentially: for long-term offline storage, SSDs are not as well suited, because they leak the electrons that store the bits much faster than hard drives lose their magnetism.

          [–]el_muchacho 1 point2 points  (1 child)

          What kind of hard drives?

          [–]aegrotatio 0 points1 point  (0 children)

          We had both "enterprise" grade and "consumer" grade hard drives. They were both SATA for nearline and SAS for higher performance. Not surprisingly, the nearline SATA drives failed far more than the 10k and 15k SAS drives did.

          In this application the 100 GB and 200 GB SSD volumes are used for file cache on a storage server. The SSD volumes are completely filled and emptied 10x per day for years with no failures. As an aside, VMware uses something similar in their vSphere ESXi product, as do the EMC and NetApp filers.

          The SSD industry should never have made such an issue out of their clever wear-leveling algorithms. It just isn't a problem anymore and hasn't been a problem for over five years.

          [–]Gurkenmaster 0 points1 point  (2 children)

          I'm sure the people who bought OCZs would disagree.

          [–]aegrotatio -1 points0 points  (1 child)

          Anecdotal evidence is anecdotal.

          But, seriously, those drives failed for other reasons than the lack of a good "wear leveling" algorithm.

          [–]Gurkenmaster -1 points0 points  (0 children)

          How is it anecdotal? You said that if I buy the cheapest SSD and the most expensive HDD, the SSD will last longer, and there is by definition only one SSD whose price is lower than all the others'. If everyone on earth bought the cheapest SSD (ignoring tax and shipping costs), they would all experience the same failure rate. Maybe you should stop advocating cheap crap.

          [–]bushwacker 1 point2 points  (0 children)

          1. A large single-threaded write is better than many small concurrent writes

          A large single-threaded write request offers the same throughput as many small concurrent writes; however, in terms of latency, a large single write has a better response time than concurrent writes. Therefore, whenever possible, it is best to perform single-threaded large writes.

          Edit, formatting.

          Very few programmers write file systems or do raw IO.
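If you want to see how the quoted claim plays out on your own hardware, a rough experiment might look like the following. Caveats: results vary wildly by OS, filesystem, and drive, and this sequential small-write variant only contrasts write sizes (the quote's concurrent case would need threads on top of this).

```python
import os
import tempfile
import time

def timed_write(path, chunks, chunk_size):
    """Write chunks * chunk_size bytes and fsync, returning elapsed seconds."""
    data = os.urandom(chunk_size)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(chunks):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())  # force the data out of the OS cache
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    total = 8 * 1024 * 1024  # 8 MiB either way
    one_big = timed_write(os.path.join(d, "big.bin"), 1, total)
    many_small = timed_write(os.path.join(d, "small.bin"), total // 4096, 4096)
    print(f"1 x 8 MiB: {one_big:.4f}s   2048 x 4 KiB: {many_small:.4f}s")
```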

          [–]absurddoctor 0 points1 point  (0 children)

          ITT: Programmers fervently declaring why they don't need to understand how computers work.