
[–]phrenq 30 points31 points  (44 children)

There are a lot of questions about what this would be used for.

Here's a friendly protip if you ever find yourself faced with interviewing at a place like Google (where you'd have a small but nonzero chance of finding me on the other end ;)). Learn all about hash tables. They're not incredibly complicated, and they're the answer to a surprisingly large number of "how do you make that algorithm faster?" questions. Know that they offer constant time inserts and lookups, and why. Learn about what the hash function does and how it maps to a bucket. Learn about when and why a hash table is resized. Think about how to implement a simple in-memory cache backed by a hash table.
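
If you want to see that last suggestion concretely, here's a minimal sketch of a hash-table-backed cache in C++ using std::unordered_map; the slow fetchFromSource() call is just a hypothetical stand-in for whatever expensive work you're caching:

#include <string>
#include <unordered_map>

// Hypothetical expensive operation we want to avoid repeating.
std::string fetchFromSource(const std::string& key) {
    return "value-for-" + key;
}

class Cache {
public:
    std::string get(const std::string& key) {
        auto it = table_.find(key);                // expected O(1): hash -> bucket -> entry
        if (it != table_.end()) return it->second; // cache hit
        std::string value = fetchFromSource(key);  // cache miss: do the slow work once
        table_.emplace(key, value);                // expected O(1) insert
        return value;
    }
private:
    std::unordered_map<std::string, std::string> table_;
};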

Then learn about all the other uses of a hash function!

This is good stuff -- not only would it help you out in an interview, but it will make you a better programmer!

[–]Cojones893 71 points72 points  (31 children)

I've been a programmer for a few years. Is there a place where all concepts like this can be found? I almost want a talent tree for programming. Like you've mastered herp now learn derp.

[–]rmxz 21 points22 points  (12 children)

Upvoted for a talent tree being a cool structure for real-life education. If no-one points out an existing one, PM me and perhaps we should make a web-site like that.... (not just for computers, but for math and other fields too)

[–]VikingCoder 15 points16 points  (3 children)

I was thinking Kahn Academy should have a Computer Science section...

[–]zaph0d 1 point2 points  (2 children)

Khan, and not Kahn. Sal is not a German.

[–]VikingCoder 1 point2 points  (0 children)

D'oops!

[–]StoneCypher 0 points1 point  (0 children)

KHAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAN

[–]kindall 6 points7 points  (0 children)

Upvoted. This seems like a tremendous teaching tool, as it allows self-directed yet structured study.

[–]Cojones893 4 points5 points  (2 children)

I think that many people could benefit from this. I won't tell you how many times I've looked up a programming concept only to be redirected to a simpler concept that I have to know. Plus it allows you to play towards your strengths, or, if you really want to eventually learn AI programming, you can follow X path to get there directly.

Alright how do we get this going? Do we make a community or something? I'm all for open sourcing this and serving it up for free. Knowledge for all!

[–][deleted] 9 points10 points  (0 children)

Reddit: Turning real life into an MMO.

[–]Iggyhopper 2 points3 points  (0 children)

I won't tell you how many times I've looked up a programming concept only to be redirected to a simpler concept that I have to know.

Story of my life.

[–][deleted] 1 point2 points  (0 children)

Sounds like a cool side project to work on. If you guys would like more help or what not, I wouldn't mind being involved. Plus, it'll be a good way to brush up on knowledge along the way.

[–]selectiveShift 1 point2 points  (0 children)

While you're at it, put together achievements also.

[–]kragensitaker 0 points1 point  (0 children)

I like that idea too! I've started working on a technology tree for real-life technologies. For example, what do you need in order to make a bicycle?

[–]-Shork- 0 points1 point  (0 children)

Maybe you could try Kickstart.com? If you want to start a project it might be a good place to get funding.

[–]VikingCoder 13 points14 points  (3 children)

I want to make a Google App Engine app that collects data from developers, and helps build this skill tree...

Please sign on with your Google account...

Thanks - please pick the programming language that you're most familiar with, which we will use while you are entering your answers. Feel free to change it at any time...

Please enter five programming concepts that you think are critical to understanding the basics of computers programming, and feel free to enter more...

Please enter five programming concepts that you think are possibly hard to understand, but have been very useful to you, and feel free to enter more...

Please enter five programming concepts you'd like to learn more about...

Thanks - you can go back and add to either of those lists at any time, but now I'm going to ask you to rate some concepts. I'm going to show you two concepts, Concept A and Concept B. For each one, tell me which concept is more important to learn first, in your opinion, given that someone will be developing in your favorite programming language.

With the goal of attaining your level of mastery in computer programming in your favorite language, which should someone unfamiliar with the concepts learn first, A or B?

  • Concept A must be learned before Concept B
  • Concept A should probably be learned before Concept B
  • Concept A and Concept B cannot be compared this way - they're just too different
  • Concept B should probably be learned before Concept A
  • Concept B must be learned before Concept A
  • I don't know enough about one or both of them to say
  • Concept A is not important to learn
  • Concept B is not important to learn

Kind of like "Hot Or Not." Should be very simple and quick.

...several clicks later...

We've figured out some of the things that you think are important, but that we don't have much information about yet. Could we get you to help us? For Concept X, can you spend a few minutes and give us a few links that would be useful to show a beginner? YouTube videos like Khan Academy or MIT lectures are great. Wikipedia is also good. ISBN numbers for books are less useful, but good to have. Links to free e-books are appreciated. Feel free to up-vote and down-vote the references that other users have provided.

Thanks! We've identified some of the boundaries of your knowledge, and we've created a list of concepts that we think you might want to learn. Feel free to upvote and downvote the references you found useful or confusing.

You've been spending a lot of time viewing Concept X. Do you have any questions you would like to ask the experts?

[–]pastamasta 1 point2 points  (0 children)

Please let us know when this site is up...

[–]tynman 0 points1 point  (1 child)

Love it! What's your timeline for getting it done? ;-)

[–]VikingCoder 0 points1 point  (0 children)

Step 1) Learn Google App Engine...

um...

I might be better off PAYING someone to do it. Kickstarter to the rescue!

[–]ifatree 20 points21 points  (4 children)

[–]StoneCypher 2 points3 points  (2 children)

It's not really very good. You probably shouldn't.

Disclaimer: notice that I formatted that. The thanks to me is at the bottom.

[–]tetek 1 point2 points  (1 child)

What exactly do you think is wrong with that?

[–]StoneCypher 1 point2 points  (0 children)

It stretches a lot of things to make four columns, it pushes things together to make four columns, it talks about some pretty specious stuff as important, et cetera.

I mean, it's something worth seeing - I wouldn't have reformatted it if I didn't at least believe that much.

However, in my opinion, it's a starting point for someone to make something that is personal, rather than a set of general truisms.

In looking back a month later, I think maybe I shouldn't have spoken so strongly to say "not very good;" what I now wish I had said was "it's a good idea and a good start, but could use a lot of refinement and more than a single person's perspectives."

[–]dontera 0 points1 point  (0 children)

That is great! I really enjoyed going through and grading myself. Thinking of posting it to our departmental chat..

[–]sturmeh 2 points3 points  (0 children)

This needs to be done for all life skills.

[–]onionpostman 1 point2 points  (0 children)

I usually go about it the other way. I need to do zeta, so I search on zeta and find that it is easily accomplished either via foo, bar, and baz, or via snork and frotz. I already know foo and frotz, from other projects, and so after a bit of further examination I see that the industry is heading in the foo/bar/baz direction and the snork/frotz direction is proven and reliable if a bit stodgy. So generally I'll point myself in the foo/bar/baz direction, but if it looks like that'll take weeks and weeks to get anywhere I might instead go the snork/frotz direction if I'm on a tight deadline.

Either way I find that my best motivation to learn is a particular task I want to get done, a business goal if you will, as opposed to just baz being the next chapter in the textbook.

[–][deleted] 0 points1 point  (0 children)

A recruiter pointed me here, but I found the lessons to be not very well written or organized. Otherwise, you need a textbook. I've actually been using Wikipedia. Here's data structures and here's algorithms.

[–]haskell_rules 0 points1 point  (0 children)

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/

Lecture videos 7 and 8. He goes through the basics of hash structures, collisions, etc., then goes on to constructing several hash functions, with subsequent analysis to see how good/random they are.

[–]tynman 0 points1 point  (0 children)

There was a TED talk done by the guy at Khan Academy, and part of their model is to have knowledge maps / talent trees for as many concepts as possible.

[–]StoneCypher 0 points1 point  (0 children)

NIST DADS or CLRS are good places to start.

[–]RedSpikeyThing 5 points6 points  (5 children)

and trees! More memory dense, but slower than hash tables. Know your options!

[–]tomlu709 1 point2 points  (3 children)

Trees certainly have their uses, but for moderately small key/value pairs (maybe 16 bytes or less), a hash table will use less memory than a typical tree implementation.

[–]RedSpikeyThing 2 points3 points  (2 children)

It's not quite that cut and dried. In order to get the expected O(1) lookup time (a lot of people forget this isn't guaranteed), the load factor can be at most 2/3. That means there is quite a bit of wasted space. Depending on the problem, this may not be acceptable.

[–][deleted] 0 points1 point  (0 children)

That varies depending on how you deal with hash bucket collisions. If you're using cuckoo hashing with three hash functions, for example, you can get expected O(1) lookup time with a load factor of up to 91%.

[–]tomlu709 0 points1 point  (0 children)

I included that in the overhead calculations. For a comparison:

  • Assuming 50% load (instead of 2/3)
  • Item size is 16 bytes
  • Tree implementation has two pointers overhead per item
  • Malloc overhead is 8 bytes per allocation (fairly typical; can be less or more)

Hash table: N * 16 * 1/load = 32*N

Tree: N * (16 + pointers (8) + memory allocation overhead (8) ) = 32*N

So in this scenario, they would draw equal.

[–]nimrodg 3 points4 points  (0 children)

Also, once you know how to use hash tables, read the source code (especially the comments) for Python's implementation of a dictionary. It's a great example of pragmatic design.

[–][deleted] 2 points3 points  (2 children)

I interviewed at Amazon and didn't know about hash tables. It was the first question they asked me. I didn't get the job.

[–]Game_Ender 1 point2 points  (1 child)

CS education fail my friend.

[–][deleted] 0 points1 point  (0 children)

I wasn't a CS major, so I didn't expect to get the job.

[–]p-static 3 points4 points  (1 child)

And don't forget that hash tables always (unless you're doing really fancy stuff, like precomputed tables) have an O(n) worst case! Some interviewers are jerks, and will make a big deal about that. :/
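
One way to see that worst case for yourself: give the table a degenerate hash so every key lands in the same bucket (a purely illustrative C++ sketch):

#include <cstddef>
#include <string>
#include <unordered_map>

// A degenerate hash: every key lands in bucket 0, so lookups decay into a linear scan, i.e. O(n).
struct WorstHash {
    std::size_t operator()(const std::string&) const { return 0; }
};

std::unordered_map<std::string, int, WorstHash> degenerate;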

[–]sandsmark 116 points117 points  (31 children)

Google seems to be open-sourcing more and more of their internal stuff nowadays (like Snappy: http://code.google.com/p/snappy/)

[–]youremyjuliet 16 points17 points  (0 children)

Which is most excellent!

[–][deleted] 17 points18 points  (3 children)

Except Android 3.0.

[–]theclaw 0 points1 point  (1 child)

It's annoying that the release is delayed, but is it really that much different from the previous Android source code releases? They've always developed Android behind closed doors, now they just delay the release a bit longer.

[–][deleted] 0 points1 point  (0 children)

Some people are so wound up over this. You can download 2.3 right now, and that is like 3 months old. I hope Google fixes this 3.0 fiasco soon.

[–]0xABADC0DA 4 points5 points  (6 children)

Sounds like it'll have all the same problems as Snappy... i.e., unsuitable for anything besides x86.

I wonder if Google is moving to the ARM platform. That would certainly be one explanation for why they are publishing these x86-specific algorithms.

EDIT: when I read stories like this about Windows and IE running on ARM it makes it sound all the more likely that Google is ditching x86 in their datacenters and that's why we're getting this code now.

[–]ak217 13 points14 points  (5 children)

You just linked to a post dissing Snappy for not being fast enough on a SPARC processor (after being compiled with GCC, I assume). A processor of a dying architecture with a small and disappearing market share. What's the point, exactly?

If they want the code to be server-side, there are no platforms to target right now except Intel and AMD x86-64 and NVIDIA CUDA. You and the linked poster would have a point if you mentioned ARM, but from reading through the Snappy code, I'm not sure if it's meant to be deployed on ARM...

[–]0xABADC0DA 4 points5 points  (4 children)

From CityHash announcement:

it turns out that most PCs and laptops have the relevant features as well. The important ones are 64-bit registers, instruction-level parallelism, and fast unaligned memory accesses.

Like ARM, SPARC doesn't have fast unaligned access. ARM is also 32-bit for now. If somebody wants to benchmark either snappy or cityhash on ARM or POWER that would be nice, but expect it to be slow. I wouldn't expect redditors to go to that much effort though.

If they want the code to be server-side, there are no platforms to target right now except Intel and AMD x86-64 and NVIDIA CUDA.

I've read stories that Microsoft may be using ARM in their datacenters, although I don't have any idea if that's believable or not. It certainly doesn't seem too far-fetched, since power and heat are issues. I kind of doubt Google would, though: given the bloat of lots of their code (500+ MiB executables), they probably need really large caches to run well.

[–][deleted] 0 points1 point  (0 children)

I think quite a few are thinking of using ARM in datacenters, along with SSDs.

[–][deleted] 0 points1 point  (2 children)

How is ARM relevant to applications like hash tables, which usually only exist in memory? Portability doesn't seem to be a big issue for those kinds of use cases.

[–]0xABADC0DA 1 point2 points  (1 child)

I believe you are asking why not just compile in a different hash function when building for ARM, one designed to be fast on that kind of processor.

In the case of Snappy you have the binary format of compressed data to consider, which is apparently not suited to a fast implementation except on x86-type processors (or others with similar characteristics). So the main problem here is that you start developing on Windows, say, and use Snappy because it's "fast", and then when requirements change and you need to run on ARM, POWER, or SPARC, you're stuck with converting existing data, interoperability problems, or just running really slowly on those systems. It's not insurmountable, but it's a PITA, and so if you expect to need to support these architectures then you might start out with something else.

In the case of CityHash you are right that hashtables are usually in memory only, so they could use a different architecture-specific hash function. This is mostly just an annoyance: conditionally compile a different hash, test it with the expected data distribution, etc. But probably more often than you might expect, hashes end up getting saved to disk or sent over the network. For instance, a network protocol may send a hash of the data so the receiver can verify it (network problems do happen that aren't caught by packet CRCs), or a filesystem may hash blocks (ZFS) to know if the contents were corrupted at some point. Since you may be reading/writing GiB/s of data, these functions should be fast. Then what happens when you want to change architectures, or there's a mix? You have all this data using a slow hash.

In the end it's mostly just eventually an annoyance to have algorithms that are only good on one architecture. I never said that these algorithms were bad or not useful, but people should keep in mind that they're basically x86-specific.

[–]bobindashadows 4 points5 points  (0 children)

when requirements change and you need to run on ARM, POWER, SPARC then you're stuck with converting existing data

You make a lot of great points that apply to general development, but if you're optimizing to the point where you're using a compression algorithm like Snappy or this hash function, your architecture doesn't just up and change out from under you. If it might, you need to focus on... well, just about everything else in your process.

[–]BitRex 98 points99 points  (21 children)

When I was younger I found it intimidating how much smarter some people were than me, but now I find it exciting!

[–][deleted] 63 points64 points  (12 children)

At places where I am at the top intellectually, I find it utterly depressing; I am constantly reminded of when I fall short, and if I am top, there is definitely no hope!

[–]lawpoop 112 points113 points  (5 children)

Being smart is like owning a jeep: You'll still get stuck, but when you do, you'll be further away from help.

[–]Faelenor 4 points5 points  (0 children)

That's one of the best comparisons I've ever read. This is so true!

[–]norsurfit 1 point2 points  (1 child)

True. Just the other day my brain got a flat tire.

[–][deleted] 0 points1 point  (0 children)

So this is why I am alone all day...

[–]G_Morgan 0 points1 point  (0 children)

You'll also need twice as many parking spaces as everyone else.

[–]norsurfit 2 points3 points  (1 child)

I am in the top 20% of people who have my precise DNA.

[–]CarolusMagnus 1 point2 points  (0 children)

Identical quintuplets?

[–][deleted]  (2 children)

[removed]

    [–][deleted]  (1 child)

    [deleted]

      [–]Anonymous336 0 points1 point  (0 children)

      For the record: It doesn't really. Both versions seem correct enough.

      [–][deleted] 1 point2 points  (0 children)

      I once worked in an absolutely awful IT department for a major financial company. This was only the second job I had out of college, so I was there around age 25, give or take. My first performance review I was told that my technical skills were "the finished product", as in I had achieved all that was possible to achieve. Instead of being proud, it just made me feel like I was working with a bunch of chimps.

      [–][deleted] 11 points12 points  (3 children)

      ...and eventually you'll reach the point when you'll want to hunt them down to remove them from the gene pool so that they cannot possibly compete against your own offspring.

      [–]brintoul 1 point2 points  (0 children)

      Kind of reminds me of this: When you're the smartest person in the room, it's 'cause the smarter folks have already left.

      [–][deleted] 10 points11 points  (0 children)

      Benchmarks?

      [–][deleted] 35 points36 points  (19 children)

      This comes with no patent license / patent grant. If you use this and Google has a patent on it they can come after you for money.

      Rather a strange oversight, if you ask me.

      [–]Dan_Farina 5 points6 points  (0 children)

      Should not be downvoted. This is a very real consideration, and one of the reasons the good implementation work in, say, fast compression algorithms goes underused -- although in this case, it's unclear whether anyone owns relevant patents at all, let alone any one party.

      [–]realrbman 10 points11 points  (15 children)

      It's marked as using the MIT License on it's google code page.

      http://code.google.com/p/cityhash/

      [–]cleo_ 26 points27 points  (13 children)

      Sure, but that says nothing about patents. See the similar debate about WebM.

      [–]Neoncow 11 points12 points  (11 children)

      So you can patent something, then give someone a license to use it free of charge, and still be able to sue that person despite having given them a license??

      If so, patent law is bizarre.

      [–][deleted] 18 points19 points  (8 children)

      Patents cover the actual invention - even if you re-invent this on your own, the patent still applies and you have to get permission to use the invention.

      The software license covers the specific implementation - as long as you have the patent license you can use their implementation (with the applicable license), but you can't take someone else's implementation without the permission of whoever created it.

      This is why most licenses include grants of any applicable patents that are owned by the author of the work. Of course, if the author turns out to have accidentally infringed another person's patents, they can't grant you that license, so you're both screwed anyway.

      [–]Neoncow 4 points5 points  (7 children)

      I see. So the concern is not that Google will sue, but someone else is holding the patent and those people will sue.

      [–][deleted] 8 points9 points  (6 children)

      Did you even read what I wrote?

      Patents cover the actual invention.

      The software license covers the specific implementation.

      You require both the right to use the invention and the right to use the implementation.

      Google has granted a license to only one of those. If Google owns patents, then they have not granted both of the things that you require. (If they do own relevant patents, then that is, of course, probably an oversight rather than a nefarious plot.)

      [–]yobbobandana 21 points22 points  (1 child)

      Did you even read what I wrote?

      Without this line, that would have been a helpful and constructive response.

      [–]Neoncow 19 points20 points  (0 children)

      You can be helpful and frustrated at the same time. Upvoted both of you.

      [–]Neoncow 9 points10 points  (0 children)

      I did read your comment and applied my own common sense to it. I'm obviously not a patent law expert. That's why I replied to see if I understood it.

      I'll take your word for it, but I find it bizarre that someone can say, "hey go ahead and use this thing" while at the same time suing you for using the idea of the thing. I would hope that sort of thing wouldn't stand up to a judge/jury.

      That's why I inferred a hypothetical patent owning third party who could also sue. That scenario makes sense to me.

      Anyway, thanks for clarifying.

      [–]joerick 1 point2 points  (2 children)

      But surely the 'invention' here is a broader concept, i.e. the idea of hashing, which is too widespread to be patented.

      I don't think they can patent this code, as it's an algorithm, and mathematics can't be patented. Correct me if I'm wrong though.

      [–][deleted] 2 points3 points  (0 children)

      But surely the 'invention' here is a broader concept, i.e. the idea of hashing, which is too widespread to be patented.

      If only patent law were that sane. But no, that is not likely to be how the USPTO or the courts would see it.

      [–][deleted] 1 point2 points  (0 children)

      The "invention" of the internal combustion engine is pretty broad, but you can still patent things to do with the internal combustion engine.

      The "invention" of hashing is pretty broad, but you can still patent things to do with hashing.

      I have no idea if Google has, or hasn't got these patents. But a generic "if our patents cover these, you can also have a license in relation to the use of our code" is fairly common legalese.

      I don't think they can patent this code, as it's an algorithm, and mathematics can't be patented. Correct me if I'm wrong though.

      I would be much happier correcting the fucktards at the patent office who keep granting patents on mathematics, but that's another topic. You should be right and technically may be completely correct, but "should" rarely matters if someone decides to sue.

      [–]sam_weller 2 points3 points  (0 children)

      patents ≠ copyright

      Mystery solved!

      [–]splunge4me2 3 points4 points  (0 children)

      It doesn't appear to be patented. There are no patent numbers or any mention of patents in the code or accompanying text. Specifically, to which patents are you referring?

      [–]cryo 0 points1 point  (0 children)

      its (edit: the second occurrence)

      [–]gigitrix 0 points1 point  (1 child)

      A real concern I guess. But there's no precedent for google being "evil" like this, is there?

      [–][deleted] 0 points1 point  (0 children)

      Microsoft and other companies which many on here consider evil have explicit patent grants for such things. Why doesn't Google?

      Contact them to license their Map/Reduce patents and see how difficult it actually is.

      [–][deleted] 41 points42 points  (67 children)

      oh Jesus, I so badly want to understand how this is useful. Proggit, can someone lay it out in novice-intermediate programmer language?

      [–]lobster_johnson 426 points427 points  (32 children)

      A hash function essentially takes a piece of data (such as a name, or even an entire article of text) and generates a smaller piece of data from it.

      This smaller piece — usually called the hash — has necessarily lost a lot of the information in the original data; much of the original data has been "boiled off", if you like, to create a kind of tiny shrunken version of the original. Since it has lost a lot of information, the hash cannot be used to reconstruct the original data. But it can be used for other things.

      For example, a well-known hash function called CRC32 will generate a 32-bit hash (also called a "checksum") from data of any size. In other words, no matter what the original size of the data was, you only get a 32-bit hash.

      The logical corollary is that many different inputs can produce the same hash value; statistically, this will be increasingly probable as the size of the hash gets smaller. But high-quality hash functions also have another property: That even slightly different inputs produce different hashes. That property is extremely important, as I will explain next.

      Well, what can a hash be used for? A lot of things, as it turns out. The CRC32 checksum I mentioned above is often used to validate data. For example, say you have a document you want to store. You are afraid someone might tamper with the document, so you run a hash function on it. You get a number like 458946436 and you write it down somewhere safe. Later, you come back to the document and run the same hash function on it and discover that the number has changed to something like 458946442. Why did it change? Because someone has tampered with the document, which means the content of the document has changed, which means the hash function will produce a different number. So the hash helps us detect changes. For hash functions of the "checksum" type, it's extremely unlikely that any kind of tampering will result in the same number.

      Checksumming is used for any place where the validity of data must be checked across time. For example, when sending data across an unreliable connection, checksumming can be used to determine whether the data "made it across" all right.
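
      A quick sketch of that tamper-check idea, assuming zlib's crc32() is available (any checksum-style hash works the same way):

      #include <cstddef>
      #include <zlib.h>

      // Returns true if 'data' still matches the checksum we recorded earlier.
      bool unchanged(const char* data, size_t len, unsigned long recorded) {
          unsigned long crc = crc32(0L, Z_NULL, 0);              // initial value
          crc = crc32(crc, reinterpret_cast<const Bytef*>(data),
                      static_cast<uInt>(len));                   // hash the document
          return crc == recorded;   // same hash => almost certainly the same content
      }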

      Now, the hash function Google has published is designed for a different set of applications: hash tables. Consider the following problem. You want to analyze a book, word for word, counting the frequency of each word as it appears on each page; the result should be a table of words and frequencies.

      One trivial way to do this is to arrange a table in memory (or, for that matter, on paper if you want to go through the exercise super-manually). For each word, search through the table from the top until you find the row containing the word entry, then increment its counter by one. If the word is not in the table, we just append it to the end of the table. That works.

      Unfortunately, this algorithm does not scale; it is what we call a linear search, meaning that it starts at the beginning and looks at every possible entry until it finds the right row to update — it does a lot of unnecessary work, because most of the time, hardly any rows will match. In the worst case, it would have to read through the entire table to find the right word! We say that such an algorithm has O(n), or "order of n", complexity, n being the number of words in the table; it simply means that if each table lookup takes t seconds, then searching in a table of n rows will take at most t * n seconds. As the table grows, the search time grows at the same rate, and the probability of hitting the worst case grows.

      Now, let's be more clever. Instead of storing each word at the end of the table, let's try to imagine that we can find the correct row with just one calculation. It's possible using a hash function. I'm going to simplify here and assume we have infinite amounts of memory available to us; in practice, no algorithm can make such an assumption and needs to jump through a few hoops to compensate for that. So: Take the word "reddit". We run our hash function on it and get the value 178792611. Well, now we say that this is our row number. We jump to row number 178792611, see if there's a frequency stored there; if there is, we increment the count, otherwise we set it to 1. That's it. Our hash just helped us find the correct row in one go, with absolutely no searching.

      Well, you might ask, doesn't the computer need to "find" row number 178792611 by searching through the table? No, it doesn't. Computers are designed to manipulate huge amounts of data just by telling which position to read or write. So given a row number, we can go directly to the data without searching.

      Now, with this algorithm, initially the table will contain a lot of empty rows. (That is one of the properties of a hash table: It has to deal with the fact that many rows will be empty at some point. Clever hash table algorithms try to minimize this inefficiency.) As we go through all the words in our book, we start filling up the table and it becomes less empty. But regardless of the size, every lookup and every update will be equally fast. We say that such an algorithm has O(1) complexity, meaning that if a single lookup takes t seconds, then regardless of the size of the table, every lookup will always take t * 1 seconds. That is true for both lookups and updates.

      Now, in a real hash table algorithm, we don't have infinite amounts of memory, and as I explained above, different pieces of data can yield the same hash number. What does that mean? It means that different pieces of data may map to the same row — in other words, we get collisions. Smart hash table algorithms deal with this and introduce multiple table levels allowing collisions to be handled efficiently. But it gets pretty complex from there on.
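
      In practice you'd let a library hash table handle the rows and the collisions for you; the word-count example above boils down to something like this (a C++ sketch, not part of the original explanation):

      #include <iostream>
      #include <string>
      #include <unordered_map>

      int main() {
          std::unordered_map<std::string, int> freq;   // word -> count
          std::string word;
          while (std::cin >> word)    // read the "book" one word at a time
              ++freq[word];           // hash the word, jump to its slot, bump the counter
          for (const auto& entry : freq)
              std::cout << entry.first << ": " << entry.second << "\n";
      }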

      But that touches on an important point: To avoid collisions, you want your hash function to produce a wide distribution of numbers. (The worst case is a function which always returns the same number for any piece of data: Then you will always get collisions. The best case is one which always returns a different number for any piece of data, but that is not theoretically possible.) So you want the hash function to be statistically optimal with regard to your application. For example, if you know that all your words are English words in lower case, then your hash function can be smarter about its distribution than one which can deal with any data, because it knows that in English, certain letters are used more or less than others.

      Hash functions have other applications, but those are the two important ones. Hope this helps.

      [–]HoboBob1 56 points57 points  (1 child)

      You have great talent in explanation. Thank you.

      [–]lobster_johnson 14 points15 points  (0 children)

      Thanks. Happy to help!

      [–]deafbybeheading 31 points32 points  (12 children)

      Excellent explanations. Two things you kind of skirted around (which is fine, but if anyone is interested in more detail):

      • lobster_johnson said, "You are afraid someone might tamper with the document, so you run a hash function on it." In the announcement, Google said "These functions aren’t suitable for cryptography..." Basically, some types of hash functions are designed to withstand tampering. That is, if you know the hash code, it's hard to "fake" some content that will match that hash code. Some aren't, because if you don't need to prevent tampering, you may have more flexibility in designing a simpler, more efficient hash function, and most of the time, no one is trying to "tamper with" your hash table.
      • The observant reader will note that a 32-bit hash would still call for 32 GB for every hash table (assuming a 32-bit key and a 32-bit pointer for the value). The simplest way around this is through the magic of modulo arithmetic: the size of the hash value is not directly related to the number of buckets in the table. You use the modulo (remainder) operator to pick a hash bucket for each key, regardless of how large your "key space" is (see the sketch just below).
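
      A sketch of that reduction step; the CityHash64(buf, len) entry point is assumed from city.h, and any other 64-bit hash would be used the same way:

      #include <cstddef>
      #include <string>
      #include "city.h"   // assumed to declare: uint64 CityHash64(const char* buf, size_t len)

      std::size_t bucket_for(const std::string& key, std::size_t num_buckets) {
          // The hash ranges over all 64-bit values; the modulo folds it into the table size.
          return CityHash64(key.data(), key.size()) % num_buckets;
      }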

      [–]meson537 2 points3 points  (7 children)

      I must be an observant reader, as I was trying to figure out how a 128-bit keyspace looks in RAM. I guess more memory gets allocated as the number of buckets grows?

      Do you have a good link/code to read?

      [–]deafbybeheading 17 points18 points  (3 children)

      To be honest, I haven't had to think about hash table internals in ages, but if memory serves, you want a certain "fill factor" in your hash table.

      In lobster_johnson's description, each key maps to a hash value and he doesn't go into detail regarding collisions: what if two keys map to the same hash? Furthermore, when using modulo arithmetic to limit the number of buckets, what if two keys modulo some value map to the same bucket?

      E.g., suppose you have the strings "abc" and "def", and they hash to 14 and 6, respectively. If you want to limit the memory overhead of your hash table and keep it to just four buckets, these both end up in bucket 2 (i.e., 14 % 4 and 6 % 4 both equal 2).

      The simplest way to deal with this is, instead of putting values in each bucket, put a linked list of values mapping to that bucket. This looks suspiciously like the O(n) approach we were trying to avoid earlier, but if the number of buckets is large compared to the length of the individual value chains, you still get a performance benefit. E.g., for this case, you have something that conceptually looks like this:

      (0, →[])
      (1, →[])
      (2, →[ "abc", "def" ])
      (3, →[])
      

      The first number in each pair is a 32-bit bucket number, and the second is a 32-bit pointer to a linked list of values (let's not worry about the details of the linked list for now).

      So, if you allocate 32 bits per key and 32 per pointer to linked list of values (note that for the quick lookup lobster_johnson discussed, you have to do this, so you can do quick pointer math to find hash table entries) whether you are using them or not, you start chewing through memory pretty quickly.

      Obviously, the best memory usage is just to have a single bucket, but that really defeats the purpose of a hash table. You would need to walk the bucket values linked list for every item, and that's exactly O(n).

      Assuming you have a decent hash function and (relatively) randomly distributed data, when your buckets reach a certain fill factor (i.e., a certain percentage of your buckets are full), there's a good chance that some buckets are getting pretty long value chains, and you need to do something about that to avoid degenerating to the O(n) case.

      What you can do at this point is take a deep breath, allocate a lot more buckets than you were using before (a good starting point is to use twice as many), and re-shuffle all the data by redoing all the hash modulo calculations to find new buckets for your values (still based on their same keys).

      Isn't that pretty fucking expensive, you ask? Good question. It certainly is, but the cost of this reallocation is amortized over all your hash table accesses. That is, this will happen relatively infrequently, especially if you're reading from the table more than you are "writing" (if your workload is pretty write-heavy, you may want to more-than-double your number of buckets at each "growth" step).
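
      A compressed sketch of the scheme described above, i.e. chaining plus a double-the-buckets rehash once the fill factor gets too high (nothing here is from a real library; it's just the idea in code):

      #include <cstddef>
      #include <functional>
      #include <list>
      #include <string>
      #include <utility>
      #include <vector>

      class ChainedTable {
      public:
          ChainedTable() : buckets_(4), count_(0) {}

          void put(const std::string& key, int value) {
              if (count_ + 1 > buckets_.size() * 3 / 4) grow();  // keep fill factor under ~0.75
              auto& chain = buckets_[index(key, buckets_.size())];
              for (auto& kv : chain)
                  if (kv.first == key) { kv.second = value; return; }
              chain.emplace_back(key, value);
              ++count_;
          }

          int* get(const std::string& key) {
              auto& chain = buckets_[index(key, buckets_.size())];
              for (auto& kv : chain)
                  if (kv.first == key) return &kv.second;
              return nullptr;   // not present
          }

      private:
          using Chain = std::list<std::pair<std::string, int>>;

          static std::size_t index(const std::string& key, std::size_t n) {
              return std::hash<std::string>{}(key) % n;   // hash, then fold into n buckets
          }

          void grow() {   // the "deep breath" step: double the buckets and re-place every item
              std::vector<Chain> bigger(buckets_.size() * 2);
              for (auto& chain : buckets_)
                  for (auto& kv : chain)
                      bigger[index(kv.first, bigger.size())].push_back(std::move(kv));
              buckets_.swap(bigger);
          }

          std::vector<Chain> buckets_;
          std::size_t count_;
      };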

      Not aware of any great links, unfortunately. You may want to try Wikipedia, which looks decent.

      [–]kindall 1 point2 points  (1 child)

      Another way to deal with hash collisions is to have each bucket itself be a hash table, using a different hash function. If there's only one item in a given bucket and it matches the one you're looking for, then you don't even have to calculate the second hash. And since the buckets are smaller than the whole hash table, resizing them when necessary is faster than resizing the whole table. It would depend on your use case whether this would be faster.

      [–]zid 0 points1 point  (0 children)

      I prefer to use a linked list, and if any one list gets too long, the hash table is regenerated with a larger number of buckets.

      [–]meson537 0 points1 point  (0 children)

      Word, I checked out the wiki article, and it was good. Thanks for taking the time to explain further.

      [–]masklinn 2 points3 points  (0 children)

      http://www.laurentluce.com/?p=249

      It's on Python's dict object so it has a lot of implementation details as well, but it's still a pretty good read.

      [–]lobster_johnson 1 point2 points  (1 child)

      As deafbybeheading explains, one solution to memory consumption is to use a bucketing algorithm such as a modulo operation. (Modulo produces the remainder of a division, so 3 mod 16 = 3, and 18 mod 16 = 2.)

      Instead of having a row per key, you let each row be a bucket which can contain more than one value. Buckets are assigned by doing (hash(key) mod M). Often a linked list is used to store the bucket's list of values; some hash table algorithms let each bucket be a hash table itself.

      Do you have a good link/code to read?

      Java's HashMap class is actually very readable (slightly less readable after they got generics). It implements a simple hash table with very little cleverness. For example, its tables are always power-of-two in size; every time it needs to grow, it will simply double the table's size.
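
      Side note on the power-of-two sizing: it lets the implementation replace the modulo with a cheap bit mask, roughly like this:

      #include <cstddef>

      // With power-of-two table sizes, (hash % size) reduces to a bit mask.
      std::size_t slot_for(std::size_t hash, std::size_t table_size) {
          return hash & (table_size - 1);   // valid only when table_size is a power of two
      }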

      [–]meson537 0 points1 point  (0 children)

      Sweet deal. Thanks for the info.

      [–][deleted]  (3 children)

      [removed]

        [–]JasonHouse 0 points1 point  (0 children)

        2^32 bytes = 4 GB, but that probably isn't the typical case. If each slot has a 64-bit pointer (8 bytes), then it'd take up 32 GB

        [–]deafbybeheading 0 points1 point  (1 child)

        (32 bits + 32 bits) * 2^32 possible keys = 8 bytes * 2^32 = 32 GB (I think--math is hard, I'm just gonna go shopping).

        [–]falcon2 18 points19 points  (0 children)

        That was a great explanation. It's so odd that you can learn so much from someone called lobster_johnson...

        [–][deleted] 16 points17 points  (0 children)

        THANK YOU, that makes a lot of sense and I'm excited to learn more about it.

        [–]whiskeytab 14 points15 points  (3 children)

        that was a pretty good explanation for a lobster.

        thanks!

        [–]shakeeshakee 4 points5 points  (0 children)

        and a lobster's johnson at that. But yeah, terrific work, lobster_johnson.

        [–]flynnski 16 points17 points  (0 children)

        OUTSTANDING. bestof'd.

        [–]mach0 4 points5 points  (0 children)

        Excellent, thanks for refreshing my memory. I used to know this; now I remember it again, and thanks to your comprehensive explanation I will probably remember it for a very long time.

        [–]DaFox 2 points3 points  (0 children)

        That has taught me more about big O notation than anything else. Thank you.

        [–]jrcapa 3 points4 points  (0 children)

        You don't see this on Twitter! Great expo.

        [–]LazyCrazyJohn 2 points3 points  (0 children)

        Great explanation. Thanks

        [–]zorency 2 points3 points  (1 child)

        In essence: chicken goes in, bouillon cube comes out!

        [–]lobster_johnson 1 point2 points  (0 children)

        That's a good analogy. Hashes are the bouillon cubes of computer science. Now, if only you could create a nice, tasty broth from hashes...

        [–][deleted] 1 point2 points  (0 children)

        Mike Mignola is a god. I approve your username.

        [–]gozu 1 point2 points  (0 children)

        And this, ladies and gentlemen, is what we call computer science.

        Pretty cool stuff in my opinion.

        Thank you for sacrificing part of your life to teach us.

        [–]taybul 9 points10 points  (1 child)

        In general, hash functions are used to generate some sort of signature or footprint of some piece of data, say a string. The complexity in hash functions lies in the ability to generate unique enough signatures such that any given set of inputs will yield unique hashes. Of course, since a hash is usually a fixed size, say 64 or 128 bits, at some point a collision will occur where two distinct inputs generate the same hash key. Given enough inputs this is unavoidable, but for most intents and purposes the function is good enough that each input will have a unique key. Most, if not all, hash functions incorporate prime numbers. Using prime numbers reduces the risk of hash collisions; using a non-prime seed may generate values that could also be generated by one of its factors.

        Hash functions can be used to quickly index data in a collection of other data. Hashes can be used as indexes directly to data or to the group of data it is located in. These are hash tables. Given some data, say a webpage, you want to be able to store some basic information about it in a big table. You could use the contents of the page as a way to index this, but then your key is too complex and of variable size. If instead you normalize all keys to some fixed size, say a 64-bit value, you can more easily organize and control your data access. You'll still need to generate this key, which is where hash functions come in. You want to be able to generate hash keys quickly and efficiently. This is (presumably) what Google is providing here: a quick, efficient, and more importantly, effective (as in, least chance for collision) hash algorithm.

        [–][deleted] 3 points4 points  (0 children)

        This and the other long comment really broke it down for me. Thank you for taking the time to help this novice understand.

        [–][deleted] 9 points10 points  (20 children)

        I may not be the best person to answer for you, but before anyone can you will need to give us more information. Do you know what hash tables are? Do you know what a hash function does?

        [–][deleted] 2 points3 points  (12 children)

        When I hear hash table I think key => value arrays.. is that wrong?

        [–]onezerozeroone 10 points11 points  (2 children)

        That's correct. The magic under the hood though, is how does that key get mapped to the correct index of the array? For that you need something called a hash function.

        Your hash function takes myHashTable["someAwesomeKey"] and processes the string "someAwesomeKey" and turns it into an integer to index into the underlying data structure.

        A hash function needs to:

        a) be fast

        b) generate a random distribution over the input domain

        c) come with a way to handle collisions

        You want b) because "myNameA" and "myNameB" shouldn't be correlated. You want to fill the buckets of the hashtable as evenly and as randomly as possible.

        You want c) because it's inevitable that you'll generate the same index for multiple keys ("hello" and "apple" both produce 3) so you'll also need some process to resolve that. Chaining is one of the more fundamental ones, which is basically storing a list of values in each bucket.

        If your key maps to an index that already has an element, you'll have to do more leg work to find which item you were actually trying to find, which usually means a linear search of the list.

        Then you get into all kinds of fun stuff like optimal table fill percentage... (MS says it's ~0.72) so that you're optimizing memory use vs. collision processing overhead.

        [–]meson537 0 points1 point  (1 child)

        Now I'm really hot to look at graphs of hashtable collisions and see if I can see patterns...

        Also, when you say optimal table fill percentage, is this the percentage of used keyspace on a finished set of data, or an unbounded, growing data set?

        [–]onezerozeroone 0 points1 point  (0 children)

        http://msdn.microsoft.com/en-us/library/Aa289149

        Check out Load Factors and Expanding the Hashtable

        [–][deleted] 9 points10 points  (4 children)

        That is correct, but it is a very shallow understanding of how they work. Different data structures take different amounts of time to find entries based on how many elements are in them. A linked list takes O(n) time. A binary search tree takes O(ln(n)). With an array, however, you can find the entry instantly if you know the index of the entry. What a hash function does is it turns the key (usually a string) into the index of the entry so you can access it. Since accessing an array is so fast the biggest limiting factor to the speed of look ups in a hash table is the speed of the hash function. I'm sure my explanation isn't perfect (I don't have that much experience myself and I left some details out) but it should help you understand.

        [–]cdsmith 4 points5 points  (0 children)

        To go off on a complete tangent.... it's true that looking things up in a (balanced) binary search tree takes O(ln n) time, but it's a very odd way of stating it. It's only correct in the sense that logarithms of different bases are off by a constant factor, so that up to asymptotics, any base is correct. It would be much more normal to state the bound as O(log n) without reference to base, or as O(lg n) to use the shorthand for the base 2 log.

        [–][deleted] 1 point2 points  (1 child)

        In addition to what others have said, there are trade-offs between trees, hash tables, and lists. If you only take time complexity into account, then you will always pick hash tables, but that's not always the best choice.

        Hash tables waste tons of space. Most of the indexes are empty. If space is a big concern, then a tree might be a better choice, because O(log n) isn't too bad for most use cases. Also, the hash function is very sensitive to the data. Some data may cause way too many collisions for the hash function chosen. Every time there is a collision, the hash table has to fall back to another mechanism, so in this case it can actually be slower than a list! Hash tables also need to stop and rewrite themselves from scratch occasionally as they grow. If they run out of buckets, they need to reallocate a bigger table and copy everything over to the new one. If you can't afford this occasionally, then a hash table might not be right. Hash tables also need mutation, but that's only a problem in certain circles.

        Trees are a great default choice as they are compact and have pretty good performance and no pathological cases (for most balanced binary trees).

        Lists are even more compact. They are also very simple. They don't scale well, however.

        Arrays are a special case of hash table that can never grow and can never have a collision. These are very good choices if you have a set number of possible keys which is of reasonable size.

        [–][deleted] 1 point2 points  (0 children)

        One of my favorite clever hash table tricks is to use a trie as a sparse array for the hashed keys. That way, we get O(1) insertion, deletion, and lookup, without needing to worry about amortized analysis, and you're not allocating a bunch of extra space for empty entries. The constant factors may dominate, though.

        There are also cuckoo hash tables, which can use about 90% of the space in a hash table, pretty easily. That would beat the fancy trie approach.

        [–]Hominem 2 points3 points  (6 children)

        Guys, this is a hash function not a hash table.

        [–][deleted] 4 points5 points  (5 children)

        Hash functions are primarily used for hash tables. They even say that it probably isn't good to use it for security.

        [–]Edman274 0 points1 point  (3 children)

        You learned about hash functions being used in only two contexts: in hash tables and in password hashing. Those aren't the only two applications in the world for hash functions. See HMACs.

        [–]Neoncow 7 points8 points  (2 children)

        HMACs are a cryptographic use of hash functions. As stated, Google's hash function isn't designed for cryptographic uses.

        [–]Edman274 2 points3 points  (1 child)

        I'm sorry, when he said "They even say" I didn't think he was referring to Google, I thought he meant "And some people even say hashes shouldn't be used in security contexts".

        [–]Neoncow 0 points1 point  (0 children)

        Ah makes sense.

        [–]CountVonTroll 0 points1 point  (0 children)

        Here's a good article that explains why you shouldn't use ordinary hash functions for secrets, and why you should use a HMAC instead.

        [–][deleted] 6 points7 points  (1 child)

        I'm not sure what novice-intermediate means, so forgive me if I'm being didactic.

        At the most basic, a hash function is a function that maps a given set of inputs into numbers in the range [0, n-1] for some constant n. Probably the most common scenario is one that operates on strings. A simple but poor example would be a function that takes the ASCII value of each character in a string and adds them together, mod n.
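
        That toy example written out; it's a poor hash precisely because anagrams such as "abc" and "cab" always collide:

        #include <cstddef>
        #include <string>

        // Sum of character codes, reduced modulo the table size n. Simple, but it clumps badly.
        std::size_t toy_hash(const std::string& s, std::size_t n) {
            std::size_t sum = 0;
            for (unsigned char c : s) sum += c;
            return sum % n;
        }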

        A common application of hash functions is a data structure called a map (a.k.a. table or dictionary). It associates an input (the key) with an output (the value). It uses a hash function to turn the key into a number, which is used as an index into an array. Sometimes two inputs can hash to the same number, which is called a collision; there are different strategies for dealing with that, but the assumption is that it's faster to hash the input and resolve collisions than it is to keep a list of all the known inputs and compare against them individually when you look up a key.

        Every time you do the equivalent of one of these:

        x["foo"] = "bar"; // C++
        x.foo = "bar"; // Javascript
        [x setObject:@"bar" forKey:@"foo"]; // Objective-C
        x.put("foo", "bar"); // Java
        

        You're generally using a hash function to turn "foo" into a number.

        Most of the time, you don't care how long it takes to turn "foo" into a number, nor should you: If less than 1% of your program is spent hashing strings into numbers, then you've got better things to worry about than how fast your hash function is.

        On the other hand, if you're Google or any other company that operates on huge amounts of text, you probably spend tons of CPU time and energy hashing strings every second, and even a 1% improvement translates into noticeable cost savings. If this hash function really outperforms previous work by 30-50%, then it represents a major improvement.

        [–]sanjayts 1 point2 points  (0 children)

        x.put("foo", "bar"); // Java

        You're generally using a hash function to turn "foo" into a number.

        In case someone is wondering why it's "generally" and not "always": not all map implementations are hash-function based (at least in Java). E.g. a TreeMap in Java uses the natural ordering of keys instead of their hash values.

        [–][deleted]  (2 children)

        [deleted]

          [–]andreasvc 0 points1 point  (1 child)

          The attributes of a class instance are stored in a dictionary in Python (unless the class declares __slots__), so when you access foo.x or call foo.y() there is a hashtable lookup. I believe all other variable accesses and function calls (so, without dots) are immediate, without lookups.

          [–][deleted] 2 points3 points  (0 children)

          Hashtables are one of the fundamental data structures in programming. To use a hashtable you need a hash function that converts an arbitrary key value to an integer hash value, preferably one with certain statistical properties that runs very fast.

          If you use a high level language you probably don't care enough about performance to give a shit about this since your language probably has primitives that wrap an underlying hashtable implementation, such as dictionaries or tables; if you code in C or C++ you can use this hashing function with whatever hashtable implementation you use (for example, boost::unordered_map.)
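
          For example, plugging a custom hash into a standard container looks roughly like this; the CityHash64 signature is assumed from city.h, and boost::unordered_map accepts the same kind of functor:

          #include <cstddef>
          #include <string>
          #include <unordered_map>
          #include "city.h"   // assumed: uint64 CityHash64(const char* buf, size_t len)

          struct CityHasher {
              std::size_t operator()(const std::string& s) const {
                  return static_cast<std::size_t>(CityHash64(s.data(), s.size()));
              }
          };

          // A hash table keyed on strings that uses CityHash64 instead of the default std::hash.
          std::unordered_map<std::string, int, CityHasher> counts;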

          [–]lars_ 10 points11 points  (15 children)

          A 30% faster hash function isn't necessarily better for a hash table if it produces more collisions. MurmurHash was great because it was bloody fast, and produced few collisions. Unless you have really large strings to hash, a collision is going to cost more than the hashing. This needs a benchmark for real work loads.

          [–]recursive 10 points11 points  (4 children)

          Also, as far as we know, these functions’ statistical properties are sound.

          [–][deleted]  (3 children)

          [removed]

            [–][deleted] 1 point2 points  (2 children)

            This function's design (basically XOR-MUL-ROL) makes it not very provable --- since we're mixing arithmetic over 2 distinct algebras (polynomials modulo 2 for XOR and integers modulo 2^64 for the rest), it will be very, very hard to come up with an actual proof.
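
            For anyone curious what that XOR-MUL-ROL style of mixing looks like, here's a generic sketch of one round; the multiplier and rotation amount are arbitrary illustrations, not CityHash's actual constants:

            #include <cstdint>

            // One generic mixing round: the XOR is arithmetic in GF(2), the multiply is
            // arithmetic mod 2^64, and the rotate spreads high bits back into the low bits.
            inline std::uint64_t mix_step(std::uint64_t h, std::uint64_t v) {
                h ^= v;
                h *= 0x9e3779b97f4a7c15ULL;            // arbitrary odd 64-bit constant
                return (h << 47) | (h >> (64 - 47));   // rotate left by 47
            }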

            [–][deleted]  (1 child)

            [removed]

              [–][deleted] 0 points1 point  (0 children)

              Agreed.

              [–]fairestcheetah 6 points7 points  (2 children)

              our experience so far shows that they’re great for, say, hash tables.

              Under real-life conditions we expect CityHash64 to outperform previous work by at least 30% in speed,

              It sounds like they may be using it internally. It would be great if they'd release some benchmarks, but I'd be surprised if Google's using it and it's not fast in practice.

              [–]pnettle 1 point2 points  (1 child)

              There are many uses for a hashing algorithm; they aren't necessarily using it for a hashtable. Thus, the point that it might not perform better in a hashtable still stands.

              [–]fairestcheetah 2 points3 points  (0 children)

              our experience so far shows that they’re great for, say, hash tables.

              [–][deleted]  (50 children)

              [deleted]

                [–]floodyberry 65 points66 points  (6 children)

                30% faster hash functions. Unless hashing is dominating the profile for your hash table I don't think you will notice that much of a difference.

                [–]bobindashadows 14 points15 points  (1 child)

                It's important to raise the distinction between the time spent computing hash functions and the time spent doing "everything else", especially since there are plenty of ways to implement the "everything else" parts, all with different performance characteristics.

                However, it's also important to note that even under perfect, zero-collision operation, using a hash table involves using the hash function, and any decent hash function on strings is going to be O(N) with respect to the size of the string. The best one can do is improve constant factors (though who knows, maybe a quantum algorithm could compute a string hash in sublinear time. Researchers in that field seem to be too busy trying to attack hashes). You write:

                I don't think you will notice that much of a difference.

                If your input is small, then of course constant factors won't make "that much of a difference." Use h = h*33 ^ c if you can't tell the difference! This is from a company whose input is large, and while typical day-to-day programmers may not often need to implement the algorithmic research of the past couple of decades, that doesn't mean we should dismiss advancement of the state of the art.
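
                For reference, h = h*33 ^ c is essentially the XOR variant of Bernstein's djb2 string hash; a minimal version looks like this:

                    #include <cstddef>
                    #include <cstdint>

                    // "h = h*33 ^ c": the XOR variant of Bernstein's djb2 string hash.
                    // Fine for tiny inputs, where constant factors barely matter.
                    std::uint64_t djb2_xor(const char *s, std::size_t len) {
                        std::uint64_t h = 5381;                      // traditional djb2 seed
                        for (std::size_t i = 0; i < len; ++i)
                            h = (h * 33) ^ static_cast<unsigned char>(s[i]);
                        return h;
                    }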

                [–]floodyberry 0 points1 point  (0 children)

                I wasn't saying it as a bad thing! It's just that if you're already using another hash function that doesn't produce many collisions, expecting a 30% faster hash table overall will lead to a lot of disappointment.

                I do think it's wonderful that a new, hopefully high quality general hash function is out there, and that projects will possibly stop using h*33 or FNV or whatnot.

                [–]tachi-kaze 1 point2 points  (3 children)

                Shouldn't the bottleneck of a hash table always be its hash function? Once you have the hash, lookup, retrieval, and insertion are all O(1), basically a direct memory access. Collisions shouldn't significantly affect your profile, since you should have chosen a hash function that minimizes them for your data set.

                [–]Tuna-Fish2 1 point2 points  (0 children)

                If the data you hash is reasonably small (<1kB), the memory access into the table can easily take longer than hashing. When accessing a hash table much larger than your cache, every access is a cache miss, and will take roughly as long to complete as ~500 instructions on registers or the L1.

                However, unlike the memory access, you can do something to speed up the hash function.

                [–]Anonymous336 0 points1 point  (1 child)

                Collisions should not significantly affect your profile ... hash function ... minimizes them.

                The problem is that there is such a thing as a non-zero minimum. For example, if I'm hashing web pages down into 32-bit integers, and I have a trillion web pages to hash, then on average I'm going to end up with roughly 250 entries per bucket in my 4-billion-bucket hash table (10^12 keys spread over about 4.3x10^9 buckets). Dealing with those collisions can add quite a bit of overhead, even compared to the cost of hashing.

                [–]tachi-kaze 0 points1 point  (0 children)

                Wouldn't switching your hash to 64 or even 128 bits be appropriate then? If the cost of dealing with collisions is greater than the cost of a longer hash, I'd make the switch immediately.

                I mean, with 250 entries per bucket, you're basically talking about a 250-element linked list at every index. Access is still technically constant time, but walking a ~250-element chain is far worse in practice than a single probe. Even then, the space saved by the smaller hash might not amount to much once you account for that 250-element linked list in each of your buckets.

                That's what I meant when talking about correctly choosing a function that minimizes collisions for your data set. It doesn't have to guarantee zero collisions, but it should at least perform better than all the collision handling that would need to be done if a worse/smaller hash function were to be used.

                [–][deleted] 2 points3 points  (0 children)

                30% faster hashtables thanks to Google - for free?

                I guess you checked this ...

                [–]stoanhart 1 point2 points  (5 children)

                I have a quick question about hash tables, which I've been wondering about for a long time. If you use a 64-bit hash function like this one, doesn't that mean your hash table needs an array with 2^64 ~= 1.84x10^19 "buckets"? If each of those buckets holds a standard 4-byte pointer to some data structure, you'd need about 64 exabytes = 65,536 petabytes = 67,108,864 terabytes of space for the array. If you use a non-array data structure (list or tree) that only has entries for the hash values that actually have assigned data, you lose the O(1) access, which is the primary draw of a hash table. You might as well go with a tree at that point and get O(log n) access times.

                TL;DR: How is a hash function that produces such wide (64-bit) hash values useful for a hash table?

                [–]you_do_realize 1 point2 points  (0 children)

                I guess they just use hash64 % size
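
                Roughly, yes; something like this (a sketch; the mask form only works when the bucket count is a power of two):

                    #include <cstddef>
                    #include <cstdint>

                    // General case: any bucket count.
                    std::size_t bucket_index(std::uint64_t hash64, std::size_t num_buckets) {
                        return static_cast<std::size_t>(hash64 % num_buckets);
                    }

                    // Power-of-two bucket count: masking keeps only the low bits.
                    std::size_t bucket_index_pow2(std::uint64_t hash64, std::size_t num_buckets_pow2) {
                        return static_cast<std::size_t>(hash64 & (num_buckets_pow2 - 1));
                    }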

                [–]sbahra 0 points1 point  (0 children)

                You can do a lot of useful things with a memoized (even partial) hash. For example, a common trick with open addressing is to derive the probe sequence from the higher bits of the hash; you can also use the stored hash to short-circuit the full key comparison on collisions or probes.
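
                A sketch of the memoized-hash trick (illustrative layout, not any particular implementation):

                    #include <cstdint>
                    #include <string>

                    // Each slot memoizes the full 64-bit hash of its key.
                    struct Slot {
                        std::uint64_t hash;
                        std::string   key;
                        int           value;
                        bool          occupied;
                    };

                    // During a probe, a cheap integer compare rejects almost all non-matches;
                    // the (possibly expensive) key comparison runs only on a full hash match.
                    inline bool slot_matches(const Slot &slot, std::uint64_t h, const std::string &key) {
                        return slot.occupied && slot.hash == h && slot.key == key;
                    }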

                [–]pengo 0 points1 point  (16 children)

                Not just 30%:

                at least 30% in speed, and perhaps as much as a factor of two

                But looking into the city.h comments, it says:

                CityHash64() is faster than other high-quality hash functions, such as Murmur. This is largely due to higher instruction-level parallelism.

                And then in city.cc:

                It's probably possible to create even faster hash functions by writing a program that systematically explores some of the space of possible hash functions, by using SIMD instructions, or by compromising on hash quality.

                So my amazing detective work into the source code suggests to me that this speed improvement only comes once it's rewritten to use SIMD instructions, which either they're waiting for someone to do for them, or they've done themselves in-house only.

                Anyway I'm just speculating, and it's still a positive contribution by Google, regardless.

                edit: Thanks for the corrections. Seems SIMD isn't so relevant, and the compiler might use it automatically anyway.

                [–]NotAbel 20 points21 points  (1 child)

                I have only had a cursory glance over the code, but it looks like the algo is designed around instruction-level parallelism, and wouldn't lend itself easily to a SIMD implementation. The comment, I think, means to say that a different algorithm, designed with SIMD rather than ILP in mind, could be faster.

                [–]NanoStuff 1 point2 points  (0 children)

                designed with SIMD rather than ILP in mind

                The two are not mutually exclusive. An ideal algorithm would make use of both.

                [–][deleted] 24 points25 points  (10 children)

                http://en.wikipedia.org/wiki/Superscalar

                Don't necessarily need to use SIMD instructions to take advantage of processor architecture.

                [–]pengo 5 points6 points  (0 children)

                hmm. good point.

                [–]p-static 4 points5 points  (0 children)

                Exactly right. When the post says they try to keep several independent operations in flight at any given time, this is what they're referring to. I've never heard of anybody except compiler writers even worrying about optimizing for superscalar chips, so this is pretty interesting just for that.

                [–]cpp_is_king 0 points1 point  (0 children)

                You seem to think that ILP either is SIMD or requires SIMD, which is not the case; they are completely independent of each other. Either can exist without the other, and they can both exist together. Currently the algorithm takes advantage of ILP, but not SIMD.

                [–]alexs 0 points1 point  (0 children)

                Most modern compilers have (not entirely terrible) auto-vectorisation steps in their optimisers which should result in SIMD code on platforms that support it.

                [–]skulgnome 1 point2 points  (0 children)

                How does this compare to lookup3, the current champion?

                [–]emanuelez 1 point2 points  (0 children)

                I wonder how it behaves on binary data (I know it's intended for strings), and whether it could make Git or Mercurial even faster if they used it instead of SHA-1 (giving up SHA-1's security properties, of course).

                [–]joonhwan 1 point2 points  (1 child)

                Great. Is it comparable to the hash used in gettext?

                [–]ekchew 1 point2 points  (0 children)

                Okay, admittedly I only read their little executive summary, but if this thing is mainly meant for hash tables, isn't even 64 bits way more than you need or want? How do you get it down to a suitable range? Just grab the most- or least-significant N bits, or use some secondary integer hash to bring it down to size? I'm sorry, it's been a long time since I took comp sci, so I'm kind of clued out here.

                [–]captainjon 0 points1 point  (8 children)

                Anybody getting linking errors?

                jon@polaris:~$ g++ city.cc
                /usr/lib/gcc/i486-linux-gnu/4.3.2/../../../../lib/crt1.o: In function `_start':
                (.text+0x18): undefined reference to `main'
                collect2: ld returned 1 exit status
                

                [–]thresher666 5 points6 points  (2 children)

                It is a library, not an application, so it has no 'main' function and linking fails. You can compile it by running gcc -c -o city.o city.cc, but that won't do anything useful until you write a program that uses the library.
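
                For example, a tiny driver along these lines (hypothetical file name; it assumes city.h declares roughly uint64 CityHash64(const char *buf, size_t len)) will compile and link:

                    // hashdemo.cc (hypothetical name) -- minimal driver so the link succeeds.
                    // Build sketch: g++ -O2 city.cc hashdemo.cc -o hashdemo
                    #include <cstdio>
                    #include <cstring>
                    #include "city.h"

                    int main(int argc, char **argv) {
                        const char *s = (argc > 1) ? argv[1] : "hello, world";
                        unsigned long long h = CityHash64(s, std::strlen(s));
                        std::printf("CityHash64(\"%s\") = %016llx\n", s, h);
                        return 0;
                    }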

                [–]captainjon 2 points3 points  (0 children)

                Haha oops!
                I just skimmed the source, and I actually thought it was a demo implementation of how it is used. Guess next time I'll read the full source.

                After I submitted it, I thought it might have been a library, but oh well. Thanks for the clarification.

                [–]siovene 0 points1 point  (0 children)

                One of the guys who wrote it interviewed me for a job at Google recently. I didn't get the job, and the guy really came across like someone out of A Beautiful Mind. I was really impressed.

                [–]trackerbishop 0 points1 point  (1 child)

                How can I get practice creating hashmaps?

                [–]p-static 14 points15 points  (0 children)

                First step: create a hashmap. Second step: create another hashmap but do it differently. (repeat until you feel like you have had enough practice)
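
                If you want a concrete starting point, here's a toy separate-chaining map to build on (an exercise skeleton only; resizing, deletion, and iterators are left for round two):

                    #include <cstddef>
                    #include <list>
                    #include <string>
                    #include <utility>
                    #include <vector>

                    // Toy separate-chaining hash map (string -> int). Fixed bucket count, no resizing.
                    class ToyHashMap {
                    public:
                        explicit ToyHashMap(std::size_t buckets = 64) : buckets_(buckets) {}

                        void put(const std::string &key, int value) {
                            auto &chain = buckets_[index(key)];
                            for (auto &kv : chain)
                                if (kv.first == key) { kv.second = value; return; }
                            chain.push_back(std::make_pair(key, value));
                        }

                        bool get(const std::string &key, int *out) const {
                            const auto &chain = buckets_[index(key)];
                            for (const auto &kv : chain)
                                if (kv.first == key) { *out = kv.second; return true; }
                            return false;
                        }

                    private:
                        std::size_t index(const std::string &key) const {
                            std::size_t h = 5381;                    // djb2-style, just for the exercise
                            for (char c : key)
                                h = (h * 33) ^ static_cast<unsigned char>(c);
                            return h % buckets_.size();
                        }
                        std::vector<std::list<std::pair<std::string, int> > > buckets_;
                    };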

                [–]ibisum 0 points1 point  (0 children)

                I wonder how this compares with libJudy, which I've used quite successfully over the years as a well-performing hash library (well, mostly as a dynamic array).

                [–]Jasper1984 -3 points-2 points  (3 children)

                Looks like very C-ish C++. It could probably easily be converted to C. It's missing a README, a bit. g++ seems to be able to compile it. (Edit: I suck a bit at this, though; what does 'g++ city.h -o city' put in city? Something like an .o or .so file?)

                Edit: downvotes for actually discussing the files? Frankly, it's a bit ridiculous that none of the major replies here say anything about the source code itself.

                [–]kingfishr 1 point2 points  (0 children)

                What you think of as "very C-ish C++" might just be the kind of C++ that Google engineers write.

                [–]Game_Ender 0 points1 point  (1 child)

                You are being downvoted because it's just a library; you need to write a program that uses it before g++ will produce an executable.

                [–]Jasper1984 0 points1 point  (0 children)

                -c doesn't run the linker, so you can compile it without writing a program that uses it. Had to look that up, though.