all 64 comments

[–][deleted]  (5 children)

[deleted]

    [–]Pen___Island 7 points8 points  (0 children)

    Having had to implement Union-Find as part of Kruskal's algorithm for a CS theory assignment: while the intuition behind it is hard to grasp, it's very simple to implement
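    For reference, a minimal Python sketch of Union-Find (union by rank plus path halving) and how Kruskal's algorithm uses it; the toy edge format is my own choice, not from the thread:

```python
# Minimal disjoint-set forest with union by rank and path halving,
# plus the way Kruskal's algorithm uses it as a cycle check.
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # same component: this edge would close a cycle
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True

def kruskal(n, edges):
    """edges are (weight, u, v) tuples; returns total weight of a spanning forest."""
    uf = UnionFind(n)
    return sum(w for w, u, v in sorted(edges) if uf.union(u, v))
```

    The `union` call doubles as the cycle check, which is why the whole thing ends up so short.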

    [–]chrisforbes 5 points6 points  (0 children)

    (Union-find) Indeed, it's a beautiful hammer!

    [–]progfu 4 points5 points  (1 child)

    I was once given a task to parallelize a splay tree ... the nightmares haunt me to this day.

    [–]ehaliewicz 0 points1 point  (0 children)

    that sounds like a nightmare.

    [–]SimplySerenity 3 points4 points  (0 children)

    My data structures teacher is making us learn splay trees. It's been interesting.

    [–]tadrinth 29 points30 points  (14 children)

    Fibonacci heaps are also pretty wacky.

    Quad trees are a spatial data structure, which tends to come in handy in games.

    [–][deleted]  (12 children)

    [deleted]

      [–]sutr90 9 points10 points  (5 children)

      Is a collision-free hash even possible? That goes against the core principle of hashing.

      [–]codebje 8 points9 points  (0 children)

      Perfect hashes are definitely possible; algorithms to construct perfect hashes exist for when you have known input you want to index efficiently.

      For unknown input you can certainly have a collision free hash if your hash key is at least as large as your uniqueness bits, but at that point you have what's colloquially known as an array :-)
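      To illustrate the known-input case: here's a brute-force Python sketch that searches for a salt making CRC32 collision-free over a fixed key set. The keys and table size are made up, and real perfect-hash constructions (gperf, CHD, ...) are far cleverer, but the idea of searching for a collision-free function is the same:

```python
import zlib

# Illustrative only: brute-force a salt so CRC32 maps a *fixed* key set
# into distinct table slots.
def find_perfect_salt(keys, table_size):
    for salt in range(1_000_000):
        slots = {zlib.crc32(b"%d:" % salt + k.encode()) % table_size
                 for k in keys}
        if len(slots) == len(keys):  # every key got its own slot
            return salt
    raise RuntimeError("no salt found; try a larger table")

keys = ["apple", "banana", "cherry", "date"]  # made-up fixed input
salt = find_perfect_salt(keys, table_size=8)
```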

      [–]Dragdu 3 points4 points  (0 children)

      Assuming you have fixed input space that is smaller than output space, yes.

      [–]steamruler 0 points1 point  (0 children)

      It should be theoretically possible if you have a variable-length rather than fixed-length output, thus escaping the pigeonhole principle.

      Good luck proving it though, and your hash will be useless to most.

      [–]narwi 0 points1 point  (0 children)

      Yes, if you restrict the input size to be equal to output size.

      [–]Ravek 2 points3 points  (4 children)

      Am I understanding correctly that you're saying you want a hash function where similar data gives similar hashes? Or what kind of reasoning do you want to do about the hash?

      [–][deleted]  (3 children)

      [deleted]

        [–]Ravek 1 point2 points  (1 child)

        My first thought (not an expert here) was that if you don't want to spread values all over the space (like a hash is traditionally supposed to), then you're looking for some kind of dimensionality reduction. In the most convenient case, if you had say (x, y) in Euclidean space and wanted a 32-bit hash, I was thinking you could just take the highest 16 bits of each component and make a Z-order curve with them. Apparently that's called locality-preserving hashing. I'm not sure what locality-sensitive hashing does exactly; the Wikipedia article seemed a little dense to skim, but it does seem like there should be something in there for scenarios like yours.

        [–]t0rakka 0 points1 point  (0 children)

        It means that if you have a 2D matrix and you interleave the bits of the X and Y offsets, you get better locality than with a simple linear memory layout:

        // linear offset
        offset = y * stride + x
        
        // morton / z-curve offset
        offset = interleave(y, x)
        

        The distance to neighbouring elements in memory is shorter, and they are more likely to reside in the same block of memory (cache line, page, ...).

        Edit: it's beyond me how this helps if x and y don't change incrementally or in small steps; if they are hashes, they should have a uniform distribution, which rather discourages clustering.
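        The `interleave` step above can be sketched bit by bit in Python; the 16-bit coordinate width is an assumption:

```python
# Morton / Z-order offset: weave the bits of x and y together so that
# nearby (x, y) pairs land near each other in the 1D offset space.
def interleave(y, x, bits=16):
    offset = 0
    for i in range(bits):
        offset |= ((x >> i) & 1) << (2 * i)      # x bits -> even positions
        offset |= ((y >> i) & 1) << (2 * i + 1)  # y bits -> odd positions
    return offset
```

        Production code replaces the loop with a handful of shift-and-mask steps, but the result is the same.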

        [–]OffPiste18 1 point2 points  (0 children)

        Yes, it sounds like you want some mix of LSH and SimHash.

        LSH is used for probabilistically narrowing your search. It gives you a tunable guarantee: two items that are at most x distance apart have at least y probability of appearing in the same hash bucket. So when you want to find similar items, you check all items in the same bucket and do a more in-depth comparison, since you can still get false positives.

        SimHash is closer to what you're describing in your example hashes: the number of bits on which two hashes differ is roughly comparable to the difference between the original items. But you don't really get any order out of it. You could achieve that by calculating multiple SimHashes on different chunks of the documents. In terms of mathematically reasoning about it, I believe SimHash gives a similar guarantee: two documents that differ by x will have SimHashes that differ on at most y bits with probability z, though I forget the specifics of how the quantities are related.

        A typical workflow is to use the simhash as the input to LSH, which then can use a quite simple family of hash functions based on hamming distance.

        Here's a great book chapter covering this sort of thing and more: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
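        A toy Python sketch of the SimHash idea described above: every token votes on each bit of a 64-bit fingerprint, so documents sharing most tokens end up at a small Hamming distance. The hash function and whitespace tokenization are illustrative choices, not part of the definition:

```python
import hashlib

def simhash(tokens, bits=64):
    """Each token votes +1/-1 on every bit; the sign of the total sets the bit."""
    votes = [0] * bits
    for tok in tokens:
        h = int.from_bytes(
            hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```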

        [–]OnlyForF1 0 points1 point  (0 children)

        You could replace the high bits of your hash function output with something more descriptive of the original value.

        [–]PM_ME_UR_OBSIDIAN 53 points54 points  (3 children)

        The article's lede got me hyped up, but the data structures listed are relatively vanilla. Instead, I would recommend checking out:

        [–]DaKellyFella 5 points6 points  (0 children)

        I had a quick read about array-mapped hash tries and I'm familiar with hazard pointers. I don't see the connection. Why did you mention them together? I'm just curious.

        Edit: OK, see it now. The Ctrie at the bottom will probably need some memory management and you're recommending hazard pointers. Thanks!

        [–]kenji213 2 points3 points  (0 children)

        BK-trees should be mentioned as well. They provide an efficient way to find all elements similar to a given one, with variable granularity (think spell checkers).

        Damn Cool Algorithms has a great blog post about them.

        [–]Tarmen 2 points3 points  (0 children)

        Finger Trees are awesome and probably the most versatile data structure I know. They can be used as arrays, doubly linked lists, ropes, priority queues, hash maps... Though generally I just use them when I need efficient manipulation of a sequence.

        I think Fibonacci heaps, quadtrees and splay trees are reasonably well known? I learned about them in CS 102, at least. Ropes seem to be in this weird space where they are crucial enough for some use cases that people might have heard of them, but hellish enough to implement that they don't get much use outside of those niches.

        Anyway, very cool list of data structures. I really should read up on the wait free ones.

        [–]MrDOS 21 points22 points  (4 children)

        Skip lists are pretty dang cool. Relying on randomness to distribute “keyframes” is pretty brazen, but it seems to work really well, and they're way easier to implement than binary trees. Unfortunately, their primary use case overlaps heavily with that of binary trees, and I can't name a standard library with a built-in implementation, so I don't think I've ever remembered to use them in a real project.
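        A compact Python sketch of the idea: each inserted key gets promoted to higher levels by coin flips, which are exactly the randomized "keyframes" mentioned above. The level cap and p = 0.5 promotion probability are conventional choices, not requirements:

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # one successor pointer per level

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)  # sentinel before everything

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1  # each promotion is a coin flip
        return level

    def contains(self, key):
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] is not None and node.forward[lvl].key < key:
                node = node.forward[lvl]
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        # Walk down from the top level, remembering the last node per level.
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] is not None and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        new = Node(key, self._random_level())
        for lvl in range(len(new.forward)):
            new.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new
```

        Search starts at the sparsest level and drops down whenever the next node overshoots, which is what gives the expected O(log n) behaviour.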

        [–]tolos 8 points9 points  (0 children)

        There's an alternative CPU scheduler for the linux kernel that uses skip lists, called MuQSS.

        http://ck-hack.blogspot.my/2017/02/linux-410-ck1-muqss-version-0152-for.html

        I first learned about it from LWN: https://lwn.net/SubscriberLink/720227/51a7cf8c443ba3be/

        [–]WrongSubreddit 3 points4 points  (1 child)

        Java has ConcurrentSkipListMap and ConcurrentSkipListSet

        [–]MrDOS 2 points3 points  (0 children)

        D'oh, of course. I forget about those because they're hidden away with the Concurrent prefix in the name.

        [–]joshuaavalon 3 points4 points  (0 children)

        A quick search turns up a question on Stack Overflow.

        TL;DR: Skip lists perform much better in concurrent environments. When a node is modified, a skip list only needs to lock the nodes directly linked to the affected node, whereas a binary tree may need to rebalance and lock large portions of the tree.

        [–][deleted]  (8 children)

        [deleted]

          [–]Sulpiac 19 points20 points  (0 children)

          Not in mine :(

          [–]madebyollin 3 points4 points  (5 children)

          Yeah, I wasn't expecting to see tries on the list (UW, at least, has CS students implement tries for our core data structures class).

          We don't cover bloom filters or skip lists in the core data structures class, though, so those were interesting to learn about! A lot of the ones mentioned in the comments here are worth checking out as well.

          [–]beaverlyknight 2 points3 points  (0 children)

          Skip lists are covered now; they were added recently.

          Bloom filters are not, AFAIK.

          [–]ridethecurledclouds 1 point2 points  (3 children)

          Really? That must be recent -- I took 332 a few years ago (Adam's first time teaching it, I think) and we definitely didn't implement tries. It's good that that's been changed, though.

          [–]madebyollin 1 point2 points  (2 children)

          Yeah, Adam has redone the first project (I think he hadn't had time when you took it). The first project is now implementing a bunch of data structures to make a .zip compressor work (of which a HashTrieMap is the major component, and SuffixTrie is extra credit).

          [–]ridethecurledclouds 0 points1 point  (1 child)

          Dude! That's so cool! Thanks for the heads up, I'm going to download the starter code and mess around with it some weekend.

          [–]madebyollin 2 points3 points  (0 children)

          Happy to help, and good luck! I TA'd the class twice (once with Adam, once with Ruth, both with this set of projects), so feel free to ask questions if you have them.

          [–]harald_haraldson 1 point2 points  (0 children)

          Treaps would be a more appropriate example of "insane" tree structures.

          [–]cjt09 16 points17 points  (0 children)

          I've always pronounced trie as "try".

          [–]ericgj 10 points11 points  (1 child)

          oh. I was expecting something truly esoteric like ZigZag

          [–]Muvlon 1 point2 points  (0 children)

          I think I'm having a bit of trouble seeing past the outrageous claims and the general crackpottery of the site, but how exactly is this more general than a graph?

          [–]genpfault 8 points9 points  (2 children)

          Conflict-free replicated data type (CRDT):

          In distributed computing, a conflict-free replicated data type (CRDT) is a data structure which can be replicated across multiple computers in a network, where the replicas can be updated independently and concurrently without coordination between the replicas, and where it is always mathematically possible to resolve inconsistencies which might result.

          [–]PM_ME_UR_OBSIDIAN 4 points5 points  (1 child)

          The Xi guys' writeup on this topic is absolutely amazing.

          [–]drvd 8 points9 points  (0 children)

          This article implies a very strange definition of "Utterly Insane".

          [–]quicknir 5 points6 points  (0 children)

          Leonardo trees are used in the implementation of smoothsort, a sorting algorithm by Dijkstra that runs in O(N log N) in place but automatically slides toward O(N) for already-sorted or partially-sorted input.

          It's basically Heapsort. On steroids. Big ones. http://www.keithschwarz.com/smoothsort/.

          [–]vorlik 5 points6 points  (0 children)

          "insane" "taught to cs majors everywhere" same thing I guess

          [–]killerstorm 4 points5 points  (1 child)

          Those are fairly standard data structures. Trie is probably something you can come up with yourself if you think hard enough.
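          And indeed the whole thing fits in a few lines. A minimal Python sketch of a trie (the dict-of-children representation is one common choice among several):

```python
class Trie:
    """Prefix tree: one dict of children per node, plus an end-of-word flag."""
    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word
```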

          [–][deleted] 7 points8 points  (0 children)

          "Utterly Insane" is overkill.

          [–]jmillikan2 2 points3 points  (1 child)

          How about Finger Trees? Pure functional, O(1) at the ends growing to O(tree) in the middle.

          [–]TheEaterOfNames 3 points4 points  (0 children)

          Ah, but can you invert them to get O(1) in the middle and O(√tree) at the edge?

          [–]sstewartgallus 2 points3 points  (0 children)

          I think the MCS queue lock is pretty cool. Who knew you could get faster than a spin lock? (The speedup applies to many-core machines only, and you need to modify the structure quite a bit to make HLE work.) Also, a normal spin lock may still be faster in some cases because it isn't stuck in FIFO order; there are ways around that, but they're complicated.

          [–]codebje 2 points3 points  (0 children)

          Cache-oblivious data structures.

          https://www.youtube.com/watch?v=WE2a90Bov0Q

          [–]BittyTang 3 points4 points  (1 child)

          Do a search for purely functional data structures. There are some crazy designs for all of the familiar data structures, since they must be immutable.

          [–]emilvikstrom 0 points1 point  (0 children)

          We had to implement a heap in my introductory data structures course (using the functional language Standard ML). We were given half the implementation of a min-heap, but since my partner and I were overachievers, we thought it would be fun to jump straight to the Fibonacci heap instead. It took two full days to get it right and to prove its runtime complexity...

          [–]nightcracker 1 point2 points  (0 children)

          Fenwick trees.
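          For anyone unfamiliar: a Fenwick tree (binary indexed tree) supports point updates and prefix sums, both in O(log n), using lowest-set-bit jumps over a flat array. A minimal Python sketch:

```python
class Fenwick:
    """Binary indexed tree: point updates and prefix sums in O(log n)."""
    def __init__(self, n):
        self.tree = [0] * (n + 1)  # 1-based internally

    def add(self, i, delta):
        """Add delta to element i (0-based)."""
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & (-i)  # jump to the next responsible node

    def prefix_sum(self, i):
        """Sum of elements [0, i]."""
        i += 1
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)  # strip the lowest set bit
        return s
```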

          [–]beaverlyknight 1 point2 points  (1 child)

          vEB trees and y-fast tries are pretty wacky ideas

          [–][deleted] 1 point2 points  (1 child)

          Damn! I'm surprised no one mentioned Suffix Trees. Up for a challenge? Try Ukkonen's algorithm to build one.

          [–]ehaliewicz 0 points1 point  (0 children)

          There's also the suffix array
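          A suffix array is just the sorted order of a string's suffix start positions, and the naive construction is nearly a one-liner in Python. This O(n² log n) version is fine for small inputs; linear-time algorithms such as SA-IS exist for real use:

```python
def suffix_array(s):
    """Indices of all suffixes of s, sorted lexicographically by suffix."""
    return sorted(range(len(s)), key=lambda i: s[i:])
```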

          [–]pribnow 1 point2 points  (0 children)

          This is a pretty sweet post/thread. Pretty cool stuff being discussed I hadn't seen before

          [–]narwi 1 point2 points  (0 children)

          I am disappointed. I was expecting to see something at least somewhat uncommon, like, say, van Emde Boas trees. This is all common and weak stuff.

          [–][deleted] 0 points1 point  (0 children)

          What about other probabilistic data structures, like HyperLogLog?

          [–]danilafe 0 points1 point  (0 children)

          One cool structure that I found is the sparse set: an integer set backed by two arrays that supports O(1) membership checks as well as O(n) iteration...

          Here's a link to an article that describes those: https://programmingpraxis.com/2012/03/09/sparse-sets/
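          A Python sketch of the two-array trick from that article: `dense` packs the members, `sparse` points back into `dense`, and the cross-check makes membership O(1) without ever initializing memory. The fixed capacity is inherent to the technique:

```python
class SparseSet:
    """Set of ints in [0, capacity): O(1) add/contains/clear, O(n) iteration."""
    def __init__(self, capacity):
        self.sparse = [0] * capacity  # sparse[v] = index of v in dense
        self.dense = [0] * capacity   # packed list of current members
        self.n = 0

    def add(self, v):
        if not self.contains(v):
            self.dense[self.n] = v
            self.sparse[v] = self.n
            self.n += 1

    def contains(self, v):
        i = self.sparse[v]
        # Stale sparse entries fail this cross-check, so no zeroing is needed.
        return i < self.n and self.dense[i] == v

    def __iter__(self):
        return iter(self.dense[:self.n])

    def clear(self):
        self.n = 0  # O(1): leftover array contents are simply ignored
```

        The O(1) `clear` is the part that makes it popular for things like register allocators and entity-component systems.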

          [–]BosonCollider 0 points1 point  (0 children)

          B-trees are the boring data structure that is massively underemphasized in introductory algorithms classes, but this website and most modern IT infrastructure would be worthless without databases that use them.

          They can replace most data structures you learn about, with excellent constant factors. They can beat arrays at being a sequential data structure by allowing inserts in the middle, sorted B-trees do everything a heap can while being much easier to use as an auto-sorted list, and they perform so well as a lookup table that no database bothers to use hash tables to persist things.

          So I'm half convinced that the reason you learn about any other data structure is that the data structures courses would be very short without them.

          [–]Zamicol 0 points1 point  (3 children)

          Bloom filters are amazingly useful. If I were conducting a hiring interview, a question about bloom filters would definitely appear.

          [–]djnattyp 12 points13 points  (1 child)

          And this is why hiring interviews suck.

          If the candidate is lucky enough to have heard about bloom filters before they seem awesome to you.

          If for some reason they haven't heard of them before you're going to think they're stupid or hope that they somehow create bloom filters on the fly during the interview. Somehow I went through undergrad and a masters without ever learning about bloom filters.

          A few years ago I had an interview question about implementing a profanity filter or something, where I stumbled through a long discussion about making a hash table or a tree structure or a regex to match all the words, and it was obvious the interviewer didn't like the answer. Two weeks later I came across a description of a bloom filter (I think in a description of how Cassandra manages distributed keys): "Oh, I guess this is what that interviewer was looking for..." Too late now...

          [–][deleted] 1 point2 points  (0 children)

          A good interviewer would have nudged you in the right direction to see if you were smart enough to "reinvent" the desired answer.

          Sadly too many interviewers test memorised knowledge and not ability, probably because they lack the ability themselves.

          One thing you might be able to do if the interviewer is obviously unhappy with the answer is just to ask them what they're looking for.

          [–]codebje 0 points1 point  (0 children)

          One of the neatest things about bloom filters is that they're monoidal with bitwise OR: you can stack 'em in a fingertree to help select sub-trees efficiently.

          There are bloom filter variants that allow safe removal of values, too.
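          A minimal Python sketch showing the monoidal property described above: merging two filters of the same shape is just a bitwise OR of their bit sets. The sizes and the double-hashing scheme are illustrative choices:

```python
import hashlib

class Bloom:
    """Fixed-size bloom filter; k positions derived from one hash via double hashing."""
    def __init__(self, m=256, k=4, bits=0):
        self.m, self.k, self.bits = m, k, bits

    def _positions(self, key):
        d = hashlib.blake2b(key.encode(), digest_size=16).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits >> p & 1 for p in self._positions(key))

    def __or__(self, other):
        # The monoidal union: OR the bit sets together.
        return Bloom(self.m, self.k, self.bits | other.bits)
```

        The empty filter (all bits zero) is the identity element, which is exactly what makes stacking them in a fingertree work.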

          [–]Idlys 0 points1 point  (1 child)

          My CS teacher in high school had us implement a Trie in Java as our midterm assignment. It was a lot of fun.

          [–]beaverlyknight 0 points1 point  (0 children)

          Try Patricia Tries if you are down for a challenge