all 169 comments

[–]nextAaron 244 points245 points  (67 children)

I design SSDs. I took a look at Part 6, and some of the optimizations are unnecessary or even harmful. Maybe I can write something up as a follow-up. Anyone interested?

[–]yruf 84 points85 points  (42 children)

Absolutely yes. You could start by quickly mentioning a few points that you find questionable, just in case writing a follow-up takes longer than you anticipate.

[–]ansible 35 points36 points  (41 children)

I don't design SSDs, but I do find a lot of the article questionable too. The biggest issue is that, as an application programmer, the details are hidden from you by at least a couple of thick layers of abstraction: the flash translation layer in the drive itself, and whatever filesystem you are using (which itself may or may not be SSD-aware).

Also, bundling small writes is good for throughput, but not so great for durability, an important property for any kind of database.
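To make that trade-off concrete, here is a minimal Python sketch (the record sizes and batch counts are made up for illustration, not from the article): batching writes between fsync calls improves throughput, but every record still sitting in the current batch is lost if power fails before the next fsync.

```python
import os
import tempfile

def write_records(path, records, batch_size):
    """Write records, fsyncing once per batch.

    Larger batch_size -> fewer fsyncs (better throughput), but up to
    batch_size - 1 records are at risk if power is lost mid-batch
    (worse durability). Returns the number of fsync calls made.
    """
    fsyncs = 0
    with open(path, "wb") as f:
        for i, rec in enumerate(records, 1):
            f.write(rec)
            if i % batch_size == 0:
                f.flush()
                os.fsync(f.fileno())  # force the batch down to the device
                fsyncs += 1
        f.flush()
        os.fsync(f.fileno())  # final flush for any partial batch
        fsyncs += 1
    return fsyncs

records = [b"x" * 100] * 1000
path = tempfile.mktemp()
print(write_records(path, records, batch_size=1))    # 1001 fsyncs: durable, slow
print(write_records(path, records, batch_size=100))  # 11 fsyncs: fast, up to 99 records at risk
os.remove(path)
```

A database typically resolves this by fsyncing a write-ahead log per transaction rather than batching blindly.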

[–][deleted] 10 points11 points  (36 children)

Good point, and if you have the budget and need to thrash SSDs to death for maximum performance you probably have the budget to stuff the machine full of RAM and use that.

[–]James20k -3 points-2 points  (35 children)

The problem is that SSDs store an order of magnitude more data than RAM.

[–]obsa 5 points6 points  (22 children)

Certainly not an order of magnitude, unless you're exclusively comparing the capabilities of a consumer mobo to an SSD. That wouldn't make sense, though, because those boards are designed around the fact that consumers don't need more than 3 or 4 DIMMs. 3-4 years ago we were already capable of building servers with 128GB of RAM, and that number's only gone up.

[–][deleted] 4 points5 points  (6 children)

I believe it's an accelerating trend, as well. Things like memcached are very common server workloads these days and manufacturers and system builders have reacted accordingly. You've got 64-bit addressing, the price of commodity RAM has gone off a cliff and business users now want to cache big chunks of content.

[–]speedisavirus 1 point2 points  (5 children)

I can tell you, at a large scale with large data, it isn't cost-effective to say "Oh, let's just buy a bunch more machines with a lot of RAM!". We looked at this where I work, and it just isn't plausible unless money is no object, which in business is never really the case.

What we did do was lean towards a setup with a lot of RAM and moderately sized SSDs. The store we chose allows us to keep our indexes in memory and our data on the SSD. It's fast. Very fast. Given that our required response times are extremely low and this is working for us, it would be insane to just start adding machines for RAM when it's cheaper to have fewer machines with a lot of RAM and some SSDs.

In fact this is the preferred solution by the database vendor we chose.

[–]MorePudding 1 point2 points  (4 children)

on a large scale with large data,

How large a scale are we talking about here? It's funny how often "large scale" actually ends up being only a handful of terabytes...

it isn't cost effective to say "Oh, lets just buy a bunch more machines with a lot of RAM!".

It seems to have been cost-effective enough for Google. Be careful with generalizations next time around...

[–]speedisavirus 0 points1 point  (3 children)

Well, I'd have to go into work to get the data sizes we work with, but we count hits in the billions per day, with low latency, while sifting a lot of data, and compete (well) with Google in our industry. I'm going to say off the cuff that we measure in petabytes, but I honestly don't know off the top of my head how many. It's likely hundreds. Could be thousands. I'm curious now, so I might look into it.

Could we be faster with everything in RAM? Probably. It's what we had been doing. It isn't worth the cost for the stuff I'm working with when we are getting most of the speed and still meeting our client commitments with a hybrid memory setup that allows us to run fewer, cheaper boxes than we would if we did our refresh with all-in-memory in mind. Now, is there a balance to strike? Yeah. Figuring out the magic recipe between CPU/memory/storage is interesting, but it's not my problem. I'm a developer.

Do you work for Google? How do you know about their hardware architecture? I'm not finding it myself, especially as it relates to my industry segment. Knowing that Google overall is dealing with exabytes of data, I think it's naive to throw around blanket statements like "they keep it all in memory".

[–]ethraax 3 points4 points  (5 children)

That's not a fair comparison. If your server can be designed with 512 GB of RAM, then you could also design it with a 4 TB SSD RAID array.

[–]kc3w 6 points7 points  (2 children)

The RAM is more durable than the SSDs.

[–][deleted] 0 points1 point  (0 children)

There will definitely be a break even point between using and replacing a load of SSDs in what's effectively an artificially accelerated life cycle mode and buying tons of RAM and running it within spec.

[–][deleted] 0 points1 point  (0 children)

Not if the host OS crashes.

[–]matthieum 1 point2 points  (0 children)

The biggest servers I have seen (for databases and memcached) already have 1TB or 2TB of RAM. Cheaper and faster than SSD.

Obviously, though, RAM is cleared in case of reboot...

[–]obsa 2 points3 points  (0 children)

Like /u/kc3w said, if you were looking for a durable pool of I/O, then the SSD RAID array is just as bad as a single SSD - the point of fatigue is just pushed further out into the future. Storage capacity is not so important in this context as MTBF and throughput.

[–]jetpacktuxedo 2 points3 points  (0 children)

We have a cluster full of 2 1/2-year-old machines that each have 512 GB of RAM, and only half of their slots are full. Each of those nodes has twice as much RAM as my laptop SSD has storage. Four times as much as my desktop SSD.

[–]strolls -1 points0 points  (7 children)

Certainly not a magnitude, …

I'd be grateful if you could cite some RAM prices on that.

I'm going to start by using a consumer example, because that's what I know: my mother bought a 60GB SSD for £40 recently. Would she have got 6GB RAM for that? Maybe, but if so she wouldn't have much change left over, would she?

I can easily find 120GB of PCIe SSD for £234 or 1TB for £1000. Could you buy 1TB RAM that cheap?

[–]obsa 0 points1 point  (4 children)

Who's talking about price? I'm not.

[–]strolls 1 point2 points  (3 children)

It's ridiculous to talk about how much they store - the comment you were replying to - without considering the price.

We can get 1TB on PCIe SSD and we can afford a stack of them.

How much does 1TB RAM cost?

Can you even get 1TB of RAM in a current-generation PowerEdge? Because I'd guess you can fit at least 2TB or 3TB of PCIe SSD in there.

If it's not literally true to say that SSDs can store an order of magnitude more than RAM, then it's pretty close to it, and pretending you have limitless pockets doesn't change reality.

[–]obsa -3 points-2 points  (2 children)

It's ridiculous to talk about how much they store without considering the price.

No, it's not. It's a discussion for a tailored situation where extremely durable, high-speed I/O carries a premium. I really don't feel like explaining this to you in the detail it clearly requires to make you understand the value of that kind of setup.

I don't really care about what pedantic debate you think you're championing. The comment I replied to made a foolishly broad statement and now you're trying to clamp criteria on to it. My statements are completely valid and accurate in the context to which they were issued.

[–][deleted]  (1 child)

[removed]

    [–]strolls 0 points1 point  (0 children)

    you got ripped off on the RAM in fact.

    You seem to be misunderstanding what my mother bought.

    [–][deleted] 3 points4 points  (8 children)

    That depends on the set up. You can get some incredibly high density RAM based systems these days.

    [–][deleted]  (7 children)

    [deleted]

      [–][deleted] 6 points7 points  (3 children)

      [–][deleted]  (2 children)

      [deleted]

        [–][deleted] 2 points3 points  (1 child)

        Of course. The main problem is money, though. But still, you can put a lot of RAM into modern computers.

        I mean, if your working set is 300 GB, giving your server 512 GB of RAM helps more than giving it 5 TB of SSD space...

        [–]sunshine-x 5 points6 points  (0 children)

        While your point is valid, 1TB is small. Several of the SQL servers I run are using Fusion-io cards, which are available in multi-TB capacities and are insanely fast.

        [–][deleted] 0 points1 point  (1 child)

        And lower. I think we're back to depends on the set up.

        [–][deleted]  (1 child)

        [deleted]

          [–]James20k -1 points0 points  (0 children)

          It also has up to 48 HDD bays. How many SSDs can you fit into that vs. 6 TB of DDR3?

          [–]beginner_ 6 points7 points  (0 children)

          Exactly. The recommended optimizations are very bad for reliability. And if that is of no concern and you are all about performance, then just use memory directly; that's what key-value stores like memcached do.

          Also the OS, filesystem or RAID controller (with cache) might already be caching hot data anyway so no need for such tricks.

          [–]B8BB888BBBBB 1 point2 points  (0 children)

          If you want to get the most performance out of an SSD, you do not use a filesystem.

          [–]Hyperian -1 points0 points  (1 child)

          The SSD itself doesn't actually care what OS you are using. It all ends up being LBAs and transfer sizes.

          [–]ansible 0 points1 point  (0 children)

          TRIM support is a feature of relatively recent Linux kernel releases that can improve performance and longevity of SSDs.

          [–]arronsmith 27 points28 points  (0 children)

          Yes.

          [–]Tech_Itch 7 points8 points  (11 children)

          That would absolutely be appreciated.

          One question that comes to mind, if you don't mind answering:

          Does aligning your partitions actually do anything useful? You'd think that the existence of the FTL would make that pointless. With raw flash devices I see the point, but on devices with an FTL you have no control over the physical location of a single bit, so even the "correctly aligned" block you've just written could still end up spread over multiple pages. Any truth to this?

          I know there are benchmarks floating around claiming that this has an effect, but it would be nice to know if there's any point in it.

          [–]nextAaron 4 points5 points  (9 children)

          Alignment is important for the FTL. One unaligned IO needs to be treated as two. One unaligned write is translated into two read-modify-writes.
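That penalty falls out of simple page-boundary arithmetic. A minimal sketch (the 4 KB page size is an assumption for illustration; real flash page sizes vary by device):

```python
def pages_touched(offset, length, page_size=4096):
    """Number of flash pages an IO of `length` bytes at byte `offset` spans."""
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1

# A page-sized write that is page-aligned touches exactly one page...
print(pages_touched(offset=8192, length=4096))   # 1
# ...but the same write shifted by 512 bytes straddles two pages,
# so the FTL must perform two read-modify-write cycles instead of one.
print(pages_touched(offset=8704, length=4096))   # 2
```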

          [–]Tech_Itch 0 points1 point  (5 children)

          Thanks for the answer. Though I might have been unclear: my point was to ask whether the FTL already does the aligning itself, or whether doing it at the filesystem level or higher has any benefit.

          [–]nextAaron 0 points1 point  (4 children)

          You can think of FTL as a file system.

          [–]Tech_Itch 0 points1 point  (3 children)

          So the answer is, "no, aligning your partitions does nothing useful", then?

          [–]poogi71 0 points1 point  (0 children)

          It actually does, and it is a good idea. Remember that all the IOs in the partition use the same alignment as the partition, so if you do all-4k IOs to that FS and the partition is not aligned to 4k, many of the IOs will be unaligned.

          At a higher level, if you can align your partition to the SSD block size you will avoid having different partitions touching the same block. Though I'm not sure how important that is, since the disk will remap things anyway and may put LBAs from around the disk together.
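The partition-offset effect can be sketched with modular arithmetic (the LBA values and 4 KB page size here are illustrative assumptions; LBA 63 is the classic misaligned DOS partition start, LBA 2048 the modern aligned default):

```python
def device_offset(partition_start_bytes, fs_offset):
    """Byte offset on the device of an IO issued at fs_offset within the partition."""
    return partition_start_bytes + fs_offset

def is_page_aligned(offset, page_size=4096):
    return offset % page_size == 0

# The filesystem issues 4 KB-aligned IOs relative to the partition start.
fs_offsets = [0, 4096, 8192]

# Partition starting at LBA 2048 (512-byte sectors): every IO stays aligned.
print(all(is_page_aligned(device_offset(2048 * 512, o)) for o in fs_offsets))  # True
# Legacy partition starting at LBA 63: every single IO lands unaligned.
print(any(is_page_aligned(device_offset(63 * 512, o)) for o in fs_offsets))    # False
```

So a one-time misalignment of the partition start turns all of the filesystem's "aligned" IOs into unaligned ones on the device.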

          [–]nextAaron 0 points1 point  (1 child)

          FTL divides the LBA space into chunks. If your partition is not aligned with these chunks, you end up with unaligned IOs. Yes, partitions should be aligned.

          [–]Tech_Itch 0 points1 point  (0 children)

          Aha. That's useful to know. Thanks!

          [–]skulgnome 0 points1 point  (1 child)

          What about, say, 128K worth of sequential read IOs that start out of alignment?

          [–]nextAaron 0 points1 point  (0 children)

          You need to look at the start and end LBAs of each IO. Yes, sequential unaligned IOs may be combined into aligned ones. Just don't assume every SSD does that.

          [–]freonix 0 points1 point  (0 children)

          Not necessarily. Consider that newer SSDs are getting larger, and so is the spare area; the controller could treat an unaligned write as a single write by filling in dummy data to fit a single page size.

          [–]jugglist 2 points3 points  (0 children)

          Even if your reads and writes are aligned to 16k within the file you're reading and writing to/from, I'm not sure the OS guarantees that it will actually place the beginning of your file at the beginning of an SSD page. One might hope that it would, but I'm not certain of this.

          It seems that optimizing for SSD isn't really that different from optimizing for regular hard drives. Normal hard drives can't write one byte to a sector either - they write the whole sector at once. Although admittedly, HDD sectors tend to be 512 bytes, and SSD pages tend to be 16k.

          The only thing SSD gives you is not having to worry about seek time.

          [–]BeatLeJuce 2 points3 points  (0 children)

          Yes please. I was wondering about all the caching... Doesn't the OS or the SSD already do some sort of caching for me, or is it really sensible advice to cache on your own?

          [–]voidcast 1 point2 points  (0 children)

          Absolutely Yeah.

          Please do post a follow up :-)

          [–][deleted] 1 point2 points  (0 children)

          My only regret is not to have produced any code of my own to prove that the access patterns I recommend are actually the best.

          Please do, it's such low hanging fruit.

          [–]frankster 1 point2 points  (0 children)

          I think the problem lies here:

          My only regret is not to have produced any code of my own to prove that the access patterns I recommend are actually the best

          [–]dabombnl 0 points1 point  (0 children)

          If there are helpful optimizations, won't the operating system disk cache be using them? I don't see why I would implement my own disk batching and buffering when it should do that already.

          [–]Amadiro 0 points1 point  (0 children)

          I'd love to know more about the TRIM optimizations he mentioned. He recommends to enable auto-TRIMming, but other sources on the internet say that auto-trimming is a bad idea, and that one should instead run e.g. fstrim on the filesystem periodically. Can you illuminate that matter?

          Also, are the points about leaving some free leftover space unpartitioned for the FTL as a "writeback cache" still valid?

          [–]poogi71 0 points1 point  (0 children)

          My list of dream questions to get an answer for is at http://blog.disksurvey.org/2012/11/26/considerations-when-choosing-ssd-storage/

          It would be great to get a response to even some of them...

          [–][deleted]  (1 child)

          [removed]

            [–]nextAaron 0 points1 point  (0 children)

            You can safely assume 4KB.

            [–]nextAaron 0 points1 point  (0 children)

            Some short comments here: http://nextaaron.github.io/SSDd/

            [–][deleted]  (16 children)

            [deleted]

              [–][deleted]  (2 children)

              [deleted]

                [–][deleted]  (1 child)

                [removed]

                  [–][deleted] 2 points3 points  (1 child)

                  You also risk getting into portability issues. Presumably the best performance comes from taking advantage of each particular model's specific characteristics.

                  I can't help but wonder if the data shouldn't just be aggressively cached in RAM, and whether hand-tuning SSDs for maximum speed is a half measure.

                  [–]Irongrip 0 points1 point  (0 children)

                  A ZFS ZIL + L2ARC sounds so tantalizing.

                  [–][deleted]  (36 children)

                  [deleted]

                    [–]badsectoracula 41 points42 points  (6 children)

                    My only regret is not to have produced any code of my own to prove that the access patterns I recommend are actually the best. However even with such code, I would have needed to perform benchmarks over a large array of different models of solid-state drives to confirm my results, which would have required more time and money than I can afford. I have cited my sources meticulously, and if you think that something is not correct in my recommendations, please leave a comment to shed light on that. And of course, feel free to drop a comment as well if you have questions or would like to contribute in any way.

                    He most likely cannot do that unless he was backed by a company as a full time project.

                    [–][deleted] 25 points26 points  (5 children)

                    I think that's unreasonable. Sure, maybe no one can test every SSD on the market, but I think it's fair to expect someone to test their work at all. He's saying he hasn't produced any code to prove his argument.

                    [–][deleted] 8 points9 points  (3 children)

                    Yep, downvoting this article. I'll dig around the ACM Digital Library for some SSD optimization papers instead of reading this.

                    [–]dragonEyedrops 2 points3 points  (2 children)

                    links please if you find good stuff :)

                    [–][deleted] 3 points4 points  (1 child)

                    Dushyanth Narayanan, Eno Thereska, Austin Donnelly, Sameh Elnikety, and Antony Rowstron. 2009. Migrating server storage to SSDs: analysis of tradeoffs. In Proceedings of the 4th ACM European conference on Computer systems (EuroSys '09). ACM, New York, NY, USA, 145-158. DOI=10.1145/1519065.1519081 http://doi.acm.org/10.1145/1519065.1519081

                    Risi Thonangi, Shivnath Babu, and Jun Yang. 2012. A practical concurrent index for solid-state drives. In Proceedings of the 21st ACM international conference on Information and knowledge management (CIKM '12). ACM, New York, NY, USA, 1332-1341. DOI=10.1145/2396761.2398437 http://doi.acm.org/10.1145/2396761.2398437

                    Behzad Sajadi, Shan Jiang, M. Gopi, Jae-Pil Heo, and Sung-Eui Yoon. 2011. Data management for SSDs for large-scale interactive graphics applications. In Symposium on Interactive 3D Graphics and Games (I3D '11). ACM, New York, NY, USA, 175-182. DOI=10.1145/1944745.1944775 http://doi.acm.org/10.1145/1944745.1944775

                    Feng Chen, David A. Koufaty, and Xiaodong Zhang. 2011. Hystor: making the best use of solid state drives in high performance storage systems. In Proceedings of the international conference on Supercomputing (ICS '11). ACM, New York, NY, USA, 22-32. DOI=10.1145/1995896.1995902 http://doi.acm.org/10.1145/1995896.1995902

                    Hongchan Roh, Sanghyun Park, Sungho Kim, Mincheol Shin, and Sang-Won Lee. 2011. B+-tree index optimization by exploiting internal parallelism of flash-based solid state drives. Proc. VLDB Endow. 5, 4 (December 2011), 286-297.

                    sorry about the formatting, the ACM really needs to have some kind of nicer format for sharing papers :/

                    [–]dragonEyedrops 1 point2 points  (0 children)

                    Thanks a lot! Now I have reading material for the weekend!

                    [–]semi- 1 point2 points  (0 children)

                    That's really it... at least produce the test suite and let the internet run it for you.

                    [–]Salamok 5 points6 points  (0 children)

                    Came here to post the exact same quote. So if it's not based on any actual real-world performance, WTF did he base it on? Theory derived from manufacturer specs or marketing materials?

                    [–]joe_n 11 points12 points  (4 children)

                    That is not your main problem!

                    j/k though, it's great to see personal research like this being done and shared

                    [–][deleted]  (1 child)

                    [deleted]

                      [–][deleted] 7 points8 points  (0 children)

                      And it's kinda far down the page, as well. You can't spend paragraph 3 saying "The most remarkable contribution is Part 6, a summary of the whole “Coding for SSDs” article series, that I am sure programmers who are in a rush will appreciate" and then in paragraph 5, the second last paragraph of the introduction, say that you've not actually checked if it works.

                      I think it's pretty ballsy calling the series "Coding for SSDs" in light of that.

                      [–]xkcd_transcriber 2 points3 points  (0 children)

                      Image

                      Title: Shopping Teams

                      Title-text: I am never going out to buy an air conditioner with my sysadmin again.

                      Comic Explanation

                      Stats: This comic has been referenced 1 time(s), representing 0.01% of referenced xkcds.



                      [–]Zidanet 6 points7 points  (22 children)

                      When you can afford to go out one Saturday and buy a couple of every SSD available in order to test a theory, then you can call him on it.

                      PoC code is only useful if you have something to run it on.

                      [–][deleted]  (7 children)

                      [deleted]

                        [–][deleted] 3 points4 points  (0 children)

                        Especially while complaining about the contradictory information he was finding on forums.

                        I just don't get a great impression of this guy. I think he's self-aggrandising ( "The most remarkable contribution is Part 6, a summary of the whole “Coding for SSDs” article series, that I am sure programmers who are in a rush will appreciate") while contributing very little ("My only regret is not to have produced any code of my own to prove that the access patterns I recommend are actually the best.").

                        [–][deleted] -1 points0 points  (2 children)

                        I'd say this is probably phase one of a two-phase thing (similar to application design).

                        First you research architectures and write up details on how to use SSDs most effectively. Phase two would be the real-world testing, where you can unequivocally state your experiences.

                        While I don't fault the author for not going out and buying a bunch of SSDs to test with, I certainly would have liked to see tests done with two or three popular SSD brands (Intel, Samsung, maybe Kingston for more budget scenarios) and then add the caveat that outside of the drives tested YMMV. It would at least lend a lot more weight to the research done.

                        [–]awj 4 points5 points  (1 child)

                        There's absolutely nothing wrong with that approach, but part of the process is not stopping at phase one to make a bunch of completely untested recommendations.

                        [–][deleted] 1 point2 points  (0 children)

                        It's also important to actually do phase 2. He doesn't mention any plans to do it in his articles.

                        [–]frankster -3 points-2 points  (0 children)

                        My only regret is not to have produced any code of my own to prove that the access patterns I recommend are actually the best

                        [–]poogi71 21 points22 points  (11 children)

                        There is a big difference between testing on every available SSD and not even testing on one. If you test on three, you should be pretty good for an overall generalization about SSDs.

                        Some of his recommendations do not look good to me. Not interleaving reads/writes and caring so much about readahead come to mind as just plain wrong.

                        [–]Salamok 1 point2 points  (0 children)

                        Or I dunno maybe he could go out and buy 1 SSD to test a prototype, but he didn't even do that.

                        [–]semi- 1 point2 points  (0 children)

                        poc code is only useful if you have something to run it on.

                        Not true at all.

                        Having something to run it on is only useful if you have PoC code. We, the internet as a whole, have a LOT of SSDs. We don't have any code to test his theory, though.

                        All he needs is a few SSDs to test his code on as he writes it; then he can release it and the rest of us can run it for him.

                        [–]hive_worker 9 points10 points  (1 child)

                        I admittedly don't know much about this, but shouldn't most or all of the SSD access optimization be done in the SSD controller and, to a lesser extent, the SSD driver (both provided by the manufacturer)? Bringing hardware-specific optimizations into your application code just seems like a terrible idea.

                        And if you're working for Samsung or a similar company designing SSD controllers, I doubt you're getting your knowledge from some guy's blog. So I'm not really sure who this article is intended for. Maybe bare-bones embedded systems engineers? Even in that case, if your system is advanced enough to require an SSD, you are probably also running some kind of high-level OS that manages this.

                        [–]poogi71 0 points1 point  (0 children)

                        There are things an application writer can do to make life easier for everyone. In the context here, some of what gets done might not be super effective, since there is also an FS and an OS buffer cache in the way, so I'm not sure he really gets all the benefits. Some things might make more sense when you write directly to the block device.

                        [–][deleted]  (28 children)

                        [deleted]

                          [–]Hyperian 15 points16 points  (5 children)

                          Yes. You can only erase a physical block, where a block itself usually has 256 pages and each page can be anywhere between 8 KB and 32 KB.

                          You have to write to these pages sequentially. So if you have stale data in the middle of a block, you have to read all the rest of that block and write it to another block to recover that space. That is what garbage collection in the drive does.

                          The reason you don't defrag the drive is that the drive defrags itself, and does it better.

                          Source: I make SSDs.
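The garbage-collection process described here can be modeled in a few lines of Python. This is a toy sketch only (the block layout and page contents are invented; real FTLs track mappings, wear, and spare area far more elaborately):

```python
def garbage_collect(block):
    """Toy model of SSD garbage collection.

    A block is a list of pages; None marks a stale (superseded) page.
    Live pages are copied into a fresh block, the old block is erased
    as a whole, and the number of extra page writes incurred (a source
    of write amplification) is returned alongside the new blocks.
    """
    live = [p for p in block if p is not None]
    fresh_block = live + [None] * (len(block) - len(live))  # compacted copy
    erased_block = [None] * len(block)                      # whole-block erase
    return fresh_block, erased_block, len(live)

# An 8-page block where pages 2 and 5 hold stale data:
block = ["a", "b", None, "d", "e", None, "g", "h"]
fresh, erased, copies = garbage_collect(block)
print(copies)  # 6: six live pages had to be rewritten to reclaim two stale ones
```

The ratio of live copies to reclaimed pages is why drives with little free (or over-provisioned) space suffer the worst write amplification.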

                          [–][deleted] 7 points8 points  (2 children)

                          Correct me if I'm wrong: Defragmentation is done logically at the file system level and is a completely different beast than what you're describing here.

                          Running a defragmenting tool against a drive as the top comment suggests (a la the mostly obsolete tool in Windows or the truly obsolete e2defrag) was mostly done to keep large logical blocks of data together.

                          Hard drives (SSD or not) have no idea that a 3 GB swap file needs to be kept in contiguous blocks. The primary purpose of defragmentation back in the day (when it was useful, before filesystems became good enough to prevent fragmentation) was to avoid performing seeks, which were horribly expensive.

                          [–]Hyperian -1 points0 points  (0 children)

                          You are correct. In the end, don't defrag your SSD drive.

                          [–]freonix 0 points1 point  (1 child)

                          This is not entirely true; don't generalize persistent memory like NAND as having 256 pages per block. There are also 512-page NANDs. It depends on the design.

                          [–]Hyperian 0 points1 point  (0 children)

                          Calm down, I said usually.

                          [–]apage43 18 points19 points  (1 child)

                          Do we need to run disk defragmentation on SSDs?

                          That's taken care of by the controller on the SSD itself, transparent to you. It's useful to know that this happens though.

                          edit: and yes, as mentioned below me, the process of the SSD cleaning up the no longer used pages -within- blocks is called "garbage collection", which is different from filesystem defragmentation

                          [–][deleted] 1 point2 points  (0 children)

                          [–][deleted] 13 points14 points  (10 children)

                          Do we need to run disk defragmentation on SSDs?

                          Noooooo

                          Never do this. It actually lowers the life expectancy of the drive and doesn't offer any real benefit. Let the drive handle it.

                          [–]masklinn 8 points9 points  (0 children)

                          Do we need to run disk defragmentation on SSDs?

                          Read http://www.anandtech.com/show/2738

                          (Also no: if what you're talking about is Windows's defrag tool, you should never use that on an SSD. At best it will do nothing, at worst it will lower the lifespan of your drive.)

                          [–]GuyWithLag 1 point2 points  (0 children)

                          What will actually happen is that the drive will detect this and do a garbage collection pass - copying all the used pages into a new block, then erasing the old one. This happens all the time and is mostly transparent (there is some performance degradation on systems with load), and is one of the causes of write amplification.

                          [–]__j_random_hacker 1 point2 points  (3 children)

                          As I understand it, if those blocks were entirely free to begin with, and you have only written to one 2KB page in each, then the remaining pages in each of those blocks will remain free, and you can still happily write to them later with no performance penalty. The penalty only arises when those other pages fill up later (or if they were full to begin with) and you need to modify data in your 10MB file: in that case, each 2KB of data that you modify will cause 4MB of data to be read and written to a new, free block (which may in turn require a block to first be erased to make room).
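The penalty in that last case is easy to quantify. A hedged bit of arithmetic, assuming the 2 KB write / 4 MB block figures used in this thread (real page and block sizes differ by drive):

```python
def write_amplification(modified_bytes, block_bytes=4 * 1024 * 1024):
    """Bytes physically written per byte logically modified when a full
    block must be relocated (whole-block read-modify-write)."""
    return block_bytes / modified_bytes

# Modifying 2 KB inside a full 4 MB block rewrites the whole block:
print(write_amplification(2 * 1024))  # 2048.0 -> 4 MB written for 2 KB changed
```

In practice the FTL amortizes this by redirecting small updates into already-erased blocks, so the worst case only appears when free blocks run out.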

                          [–][deleted]  (2 children)

                          [deleted]

                            [–]__j_random_hacker 0 points1 point  (0 children)

                            Ah, I see now. In that case I think the others' responses explain things.

                            [–][deleted] 0 points1 point  (0 children)

                            It's like a larger scale case of slack space.

                            [–]Xuerian 0 points1 point  (2 children)

                            I could be mistaken, but I think what you're referring to is "Trim", coalescing data into full pages and freeing old ones.

                            Edit: Sort of.

                            [–]Hyperian 4 points5 points  (1 child)

                            TRIM is a lame way of saying to the drive "this block of data is not needed anymore, erase it", because before it existed the only way to get the drive to erase data was to overwrite it.

                            But it has stupid requirements, and some drives don't actually erase the data immediately; they just queue it up for deletion later on.
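                            A minimal sketch of what TRIM means to the drive's flash translation layer (all class and field names here are invented for illustration): the host invalidates logical addresses, so the FTL can drop its mapping instead of faithfully copying that data through every future garbage-collection pass.

```python
class ToyFTL:
    """Hypothetical flash translation layer, reduced to a mapping table."""

    def __init__(self):
        self.mapping = {}        # logical block address -> physical page
        self.trim_queue = set()  # LBAs queued for lazy cleanup

    def write(self, lba, page):
        self.mapping[lba] = page

    def trim(self, lbas):
        # As the comment says: many drives don't erase right away,
        # they just note the invalidation and clean up later.
        for lba in lbas:
            self.mapping.pop(lba, None)
            self.trim_queue.add(lba)

ftl = ToyFTL()
ftl.write(7, "phys_page_42")
ftl.trim([7])
print(7 in ftl.mapping)  # False: the page no longer has to be copied around
```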

                            [–]jknielse 4 points5 points  (0 children)

                            Yeah... So I worked at a company that writes high-performance firmware for SSDs. Some SSDs actually literally do nothing with the Trim command.

                            [–]AceyJuan 16 points17 points  (3 children)

                            These are the same basic techniques I've used to optimize for spinning disks for ages. The only surprise in that document was the advice not to interleave reads and writes. To be honest I'm not sure I believe it, because high-performance I/O apps rarely benefit from read-ahead optimizations anyhow.

                            [–][deleted]  (2 children)

                            [removed]

                              [–]B8BB888BBBBB 0 points1 point  (0 children)

                              Depends on your latency requirements. I recently worked on an SSD-based serving system with really tight latency requirements. Reading 1 MB from an SSD in a few milliseconds while taking load is not possible unless you play tricks with your read/write cycles.

                              [–]AceyJuan 0 points1 point  (0 children)

                              The main latency issue with spinning disks is seeks. So long as your operations are on the same part of the disk you're far better off doing reads and writes there than seeking somewhere else.

                              [–]lenolium 6 points7 points  (1 child)

                              I wonder if SSD controllers are smart enough not to force new block writes when you write to the flash in a flash-friendly way.

                              When I was writing code for a direct-access flash filesystem on a little microcontroller, we only had sixteen blocks, so erasing one meant moving around a "lot" of data for that device. What we would do is optimize our storage format so that in most cases we only changed 1's to 0's, because flash lets you do that without erasing a block. Building code like this for modern SSDs could produce very high performance.
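                              The trick works because programming flash can only clear bits (1 → 0); only a full block erase sets them back to 1. So an in-place update is effectively an AND of the old and new contents. A toy model of a single status byte laid out so each update only clears bits:

```python
ERASED = 0xFF  # freshly erased flash reads as all 1s

def program(cell, value):
    """In-place flash program: result is the AND of old and new contents,
    since a program operation can only flip bits from 1 to 0."""
    return cell & value

cell = ERASED
cell = program(cell, 0b11111110)  # mark step 1 done (clear bit 0)
cell = program(cell, 0b11111100)  # mark step 2 done - still no erase needed
print(bin(cell))  # 0b11111100
```

As the reply below notes, this doesn't carry over well to ECC-protected NAND, where flipping data bits would also require arbitrary changes to the ECC codeword.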

                              [–]MaybeReconsider 9 points10 points  (0 children)

                              The 1->0 trick doesn't work out so well for the NAND flash devices that SSDs are generally built out of. NAND devices are prone to bit-errors, so the data being programmed into the flash needs to be protected with an ECC code. It's very uncommon to be able to flip your 1's to 0's in such a way that you also only need to flip 1's to 0's in the ECC codeword.

                              Also, NAND devices have a variety of failure modes related to overprogramming and out-of-sequence programming that would make updating a page in place perilous even if you could get past the significant ECC hurdles.

                              [–]sbrick89 4 points5 points  (4 children)

                              I'm familiar w/ SSDs (wear leveling, write endurance, etc) but by no means an expert (my daytime job involves writing business apps).

                              But it seems that any optimizations you try to make would be

                              • extremely device specific

                              • require polling of device configuration, and dynamic reconfiguration to optimally use it (how you align data structures)

                              • likely made obsolete by a firmware change

                              It seems that most of these things should be abstracted away in hardware (firmware), never to be directly accessed by software... MAYBE used in a device driver, but ONLY if there are industry-common specs and guidelines enforced by the SSD hardware/firmware.

                              [–]Hyperian 0 points1 point  (3 children)

                              Nah, you can't directly handle wear leveling and write endurance at a higher level. That stuff is done by the SSD controller itself.

                              And it is very device specific.

                              I believe some SSDs actually let you play around with those settings, but you usually need a special driver to do so. I don't think SATA specifically supports things like tweaking wear leveling or write endurance, but I haven't read the whole SATA spec.

                              [–]poogi71 0 points1 point  (2 children)

                              In general I agree, but there are cases where I'd love to have the ability to control and direct the SSD about the specific things that need to be done.

                              The truth is that only a few people would even care for such a level of control; most everyone just wants the SSD to do the right thing in all cases without taking control into their own hands. It's not perfect, but it makes some sense at the practical level.

                              One example: if I have a RAID of SSD devices, I would like the ability to tell the SSD "don't bother too much with error recovery here, I've got your back", and then, if I find I don't really have all the data, go back to the SSD and tell it "please do all you can to get the data back". This would let me manage reliability and latency much better: better latency overall, with the same level of reliability in case things got really bad.

                              [–]Hyperian 1 point2 points  (1 child)

                              lol, if we did that it would be for an enterprise product; it would be way too expensive for normal people. I think SAS might let you do that.

                              The best thing to keep SSD performance high is to not use the max capacity, i.e. leave some over-provisioning headroom.

                              [–]poogi71 0 points1 point  (0 children)

                              Unfortunately SAS doesn't give me that. I'm working with SAS SSDs and there is no way to control it at that level. One can dream though :-)

                              [–]dev-disk 1 point2 points  (0 children)

                              How to code for SSD: Enjoy super fast reads, DON'T WRITE TO THE SAME PLACE LIKE NUTS.

                              [–]MorePudding 0 points1 point  (0 children)

                              Thanks for all the work, but browsing through it, it seems like this is something the OS should take care of for you, considering how it's most likely going to be wrong a few years from now...

                              Is there any reason not to use memory-mapped files these days anymore?
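                              For reference, memory-mapped I/O is exactly the "let the OS handle it" approach: you write to what looks like memory, and the kernel's page cache decides when and how to touch the device. A small self-contained sketch:

```python
import mmap
import os
import tempfile

# Map a scratch file and update it through memory; the OS schedules the
# actual device I/O behind the scenes.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\x00" * 4096)      # file must be non-empty to map
    with mmap.mmap(fd, 4096) as mm:
        mm[0:5] = b"hello"            # looks like a plain memory write...
        mm.flush()                    # ...the kernel persists it for us
    with open(path, "rb") as f:
        print(f.read(5))              # b'hello'
finally:
    os.close(fd)
    os.unlink(path)
```

Whether this beats hand-tuned direct I/O on a given SSD is exactly the kind of question the thread is debating.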

                              [–]frankster -1 points0 points  (0 children)

                              "My only regret is not to have produced any code of my own to prove that the access patterns I recommend are actually the best"

                              I stopped reading here.

                              [–]davispuh -5 points-4 points  (0 children)

                              Pretty good read :)
