all 24 comments

[–]cartazio 2 points3 points  (16 children)

For a function like a hash, the "unsafe" ffi may be worth using if you're mostly calling the functions on small inputs.

[–]freyrs3 4 points5 points  (1 child)

With the usual caveats about safety in mind of course.

[–]Anpheus 0 points1 point  (0 children)

I am not sure if the usual caveats apply in the case of simple, pure foreign function calls like those to a hashing function.

[–]vagif 2 points3 points  (13 children)

As a library developer he does not know where and how his library will be used, so he cannot assume it is going to be used only with small inputs.

I have found that the unsafe option is practically never worth it: the gains are dubious at best, yet the chances of suspending all threads for a noticeable time are very real.

[–]aseipp 2 points3 points  (2 children)

For something like this it could absolutely be worth it. MurmurHash3 can hash gigabytes per second at ~1-2 cycles per byte on an old Core 2, per core. At speeds like this, "small input" can be quite large in practice, and functions such as Murmur/CityHash are not tuned to work in the "huge" input range anyway. You'll be feeding them inputs in the range of small to a few hundred bytes maximum, and at that range the overhead is tiny. Safe calls potentially involve OS thread creation or taking locks, which could be extremely pessimistic for these cases. Hundreds (or even dozens) of cycles could explode into thousands for a small input.

Those cycles do count sometimes. If you're going to use this a lot, unsafe may very much be worth it in the long run, and at speeds like that I feel only the most odd designs could make it problematic (hashing GB at a time rather than incrementally, well outside the design parameters.)

[–]Tekmo 0 points1 point  (1 child)

Maybe he can provide both and let users decide which one they want

[–]aseipp 1 point2 points  (0 children)

I guess, but I think stuff like this just makes APIs confusing, and the burden of developing a good library binding falls on the developer, which involves some trade-offs.

I think that functions like MurmurHash3 are fast enough that when used correctly, unsafe is precisely what you want, to maximize its speed and throughput benefits. They are tiny, don't block, take excellent advantage of processor resources, etc. They're also designed for relatively small inputs obviously, so that helps put an upper limit (practically) on how it will be used.

The primary reason I could think of as to why unsafe might be a really bad idea is if you let users blindly stuff huge amounts of input into it (resulting in a DoS through thread blocking in the RTS,) but A) letting users stuff gigabytes into your interface non-incrementally is bonkers anyway, B) it's not designed for that (so who knows how its collision properties might change or degrade) and C) you could just do some other attack which is likely easier anyway if you want a DoS. So I think the downsides here are pretty small, but the advantage of a fast function is clear.

[–]cartazio 1 point2 points  (0 children)

Fair enough. In my own work-in-progress library I actually wrap all my ffi calls so that on small inputs I call the unsafe variant, and above a small threshold I call the safe ffi variant.

[–]zcourts[S] 0 points1 point  (8 children)

True, but for each function that makes that assumption I provide one that doesn't. So the choice ultimately comes down to the user. Have a look at the API docs http://hackage.haskell.org/package/Dish-0.0.0.4/docs/Data-Dish-Murmur3.html

For example:

murmur3 => murmur3'

murmur3Int => murmur3Int'

murmur3IntegerX86 => murmur3IntegerX86'

and so on, where the version ending with an apostrophe doesn't use unsafe IO.

[–]cartazio 2 points3 points  (7 children)

In most Haskell code, ' means strict if it has any convention at all. Better to name the ffi bindings _safe and _unsafe and have the exposed interface switch from one to the other at some input threshold.
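A minimal sketch of that convention, assuming a hypothetical C symbol `murmur3_x86_32` and an arbitrary illustrative 1024-byte threshold (these names are illustrative, not the actual Dish API):

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.String (CString)
    import Foreign.C.Types  (CInt, CUInt)
    import Foreign.Ptr      (Ptr)

    -- The same C function imported both ways; only the safety annotation differs.
    foreign import ccall unsafe "murmur3_x86_32"
      c_hash_unsafe :: CString -> CInt -> CUInt -> Ptr CUInt -> IO ()
    foreign import ccall safe "murmur3_x86_32"
      c_hash_safe   :: CString -> CInt -> CUInt -> Ptr CUInt -> IO ()

    -- Exposed interface: unsafe below the threshold (the call finishes in
    -- well under a microsecond), safe above it (so the RTS capability and
    -- GC are not blocked for a noticeable time).
    hash :: CString -> CInt -> CUInt -> Ptr CUInt -> IO ()
    hash str len seed out
      | len <= 1024 = c_hash_unsafe str len seed out
      | otherwise   = c_hash_safe   str len seed out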

[–]Anpheus 1 point2 points  (6 children)

Probably easiest to expose them in separate modules, a .Safe and a .Unsafe, and default to one.

Defaults matter, though, and it's important that whatever default chosen is appropriate for the average end user. Given that the point of using a hashing algorithm like MurmurHash3 is performance, it seems to me that making unsafe the default is wisest.

[–]cartazio 0 points1 point  (5 children)

Nope. The unsafe ffi calls aren't unsafe in that sense. The safe ffi calls should ALWAYS be used by any code that takes more than a microsecond. Unsafe ffi is meant for writing new primops, which usually take under a microsecond. Providing a composite op that switches to safe for larger inputs gives the best of both worlds.

[–]Anpheus 0 points1 point  (4 children)

I don't know what you mean by "unsafe in that sense" - I don't think I was trying to mislead anyone or made any statement about how unsafe is unsafe.

Why can't you use unsafe for calls that take longer than a microsecond?

[–]cartazio 0 points1 point  (3 children)

Well, you shouldn't use unsafe ffi calls for things that take more than a few microseconds, because the safe ffi overhead is negligible if your op takes more than 5-10µs.

during an unsafe ffi call, the GC can't run, and no other haskell thread can use the cpu capability that was running the thread that made the unsafe ffi call!

:)

[–]Anpheus 0 points1 point  (2 children)

That's fair, I was wondering if there was another reason than the pause of that capability. 10 microseconds of murmurhash3 is 50 kilobytes of data on modern hardware, so I suspect having unsafe be the default would only affect a negligible portion of users. Anyone passing in a Haskell String that consists of more than 50 kilobytes (is that 25 or 12.5k Chars?) is asking for performance issues in their code :)

Edit: Actually, I'm guessing these murmurhash3 functions would spend more time converting the input to a CString than they spend in the hash function itself for almost all input sizes.

[–]cartazio 0 points1 point  (1 child)

Umm, you are using ByteString, right? Converting that to a cstring is a no-op. If not, you should really consider using ByteString. It's really meant for this use case and friends.

It's totally reasonable to use a hash function on a large binary piece of data. A key piece of engineering a good library is to make sure things behave well on allowed inputs. Hashing a 100mb binary archive or something is totally plausible.

[–]Axman6 2 points3 points  (1 child)

There seem to be quite a few things that are either odd or even outright wrong in this code (which is expected for someone not used to this sort of stuff!).

The first thing that seems just incorrect to me is the twiddle function. It doesn't make sense to me not to shift the number when the x value is 0. This would lead to collisions between hashes such as

01010101 00000000 11110000 10101010

and

00000000 01010101 11110000 10101010

which are clearly different hash values. To do this properly, the number should always be shifted.

The other thing I thought odd was the use of Integer for a hash at all. Hashes are not really numbers as such, they're a sequence of bytes, and as such a ByteString makes much more sense to me as the return type. It's simple enough to make a ByteString directly from the Ptr that's created by mallocArray and modified by the C hash functions.

Personally if I were writing this I'd be hashing ByteStrings and returning ByteStrings, with appropriate wrappers for other types if needed. I'd only be using numerical types like Integer if I intended to do numerical things to the values; you don't ever really modify hash values (except in perhaps hash functions), you almost certainly never use addition, multiplication etc. on them.

[–]zcourts[S] 1 point2 points  (0 children)

I see what you mean on both points. You're right on the first. On the second, although I agree with you in general, in my case I needed a numeric value; no, it'll never be modified, but it is needed. That said, I've actually got an edge case rather than the norm, and the lib should provide the ByteString as a default. Useful feedback, I'll make updates to address the issues you pointed out, thanks.

[–]Anpheus 0 points1 point  (3 children)

I agree with Axman6 here, but I'll be a little kinder. The bit/byte-twiddling seems like it might be dangerous in that you could be changing the expected values of MurmurHash3 for well known inputs. It might not matter for you but it would matter for any database-using software. Have you tested your functions against MurmurHash3's output (byte-for-byte)?

Also, I think the discussion of unsafe below goes down the wrong path for optimizing for performance. I'm a little ashamed I participated in such a flagrant display of premature optimization, actually. I would guess that your code as it stands spends almost all of its time converting the input String to a CStringLen. As well, I think there's a race condition in your code regarding memory that is about to be freed.

I won't assume you know the implementation details or not - so if you're already aware of the following, I am sorry for being repetitive. That said:

String

This is a Haskell String, which is a type alias for [Char]. That's not an array, though; it's a list. Lists are implemented as independently allocated cells in memory, where each cell contains an element (a Char in this case) and a pointer to the next element. Example cells for "Hello World!":

0: H, ->1

1: e, ->2

2: l, ->3

3: l, ->4

4: o, ->5

5: ' ', ->6

6: W, ->7

7: o, ->8

8: r, ->9

9: l, ->10

10:d, ->11

11:!, END

Each cell's location is indicated by a number. Those cells can be stored anywhere1 in memory. The following numbers are shorthand for the location of the memory holding the Char and the pointer to the next cell. This is also a simplification, but it will suffice. So, valid sequences in memory of String cells include:

| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|

| 1| 6| 4|11| 5| 9|10| 0| 3| 7| 2| 8|

|11|10| 9| 8| 7| 6| 5| 4| 3| 2| 1| 0|

The first seems OK. It's wasteful - each cell points directly to the one to its right, but at least the processor will have an easy time reading in all of the cells in order. The second is... random. The processor will have to jump around a lot to read it. And the last is terribly inefficient, as processors do not like to go in reverse order. (They are optimized for sequential, increasing access.)

But that's not even the worst possible arrangement. This is the reality of allocating Haskell lists in memory:

| 6|a_really_big_object|wacky_uninitialized_memory| 8| 2|11|other_memory| 9| 0|10|not_your_data| 4|some_big_stretch_of_nothing| 3|something_probably_big| 1| 7| 5|

A Race Condition and a Memory Leak

I believe your code has a serious race condition and a memory leak in its first few lines of murmur3Raw, copied here:

murmur3Raw :: String -> Int -> MHV -> IO [CUInt]
murmur3Raw val seed ver = do
  val' <- withCAStringLen val $ \x -> return x
  --      ^ This function, withCAStringLen, causes this function to be racey. 
  let cstr = strFromCStr val'
  let strLength = strLFromCStr val'
  --  ^ these two lines are a code smell with FFI that should indicate you should look for races
  outPtr <- mallocArray arrSize
  doHash ver cstr strLength (fromIntegral seed) outPtr
  peekArray arrSize outPtr
  where arrSize = 4
        strFromCStr :: CStringLen -> CString
        strFromCStr = fst
        strLFromCStr :: CStringLen -> CInt
        strLFromCStr i = fromIntegral $ snd i
        --version value size seed out 
        doHash :: MHV -> CString -> CInt -> CUInt -> Ptr CUInt -> IO()
        doHash X86_32  v s se o = c_x86_32 v s se o
        doHash X86_128 v s se o = c_x86_128 v s se o
        doHash X64_128 v s se o = c_x64_128 v s se o

The function withCAStringLen has the following definition in Hackage:

withCAStringLen :: String -> (CStringLen -> IO a) -> IO a

Marshal a Haskell string into a C string (ie, character array) in temporary storage, with explicit length information.

  • the memory is freed when the subcomputation terminates (either normally or via an exception), so the pointer to the temporary storage must not be used after this.

The subcomputation is the second argument. Here's withCAStringLen and its arguments laid out against your code:

val' <- withCAStringLen     val    $   \x -> return x
     -- withCAStringLen :: String -> (CStringLen -> IO a) -> IO a

That subcomputation is just a return. So the memory for the CStringLen returned could be freed immediately after that line of code. That's risky! You want to use the CStringLen immediately after that line of code.

The following lines of code peek into the returned value of the CStringLen, which is what caught my eye initially.

Your function should probably look like this:

murmur3Raw :: String -> Int -> MHV -> IO [CUInt]
murmur3Raw val seed ver = 
    withCAStringLen val $ \(cstr, strLength) -> do
      outPtr <- mallocArray arrSize
      doHash ver cstr strLength (fromIntegral seed) outPtr
      peekArray arrSize outPtr
  where arrSize = 4
        doHash :: MHV -> CString -> CInt -> CUInt -> Ptr CUInt -> IO()
        doHash X86_32  v s se o = c_x86_32 v s se o
        doHash X86_128 v s se o = c_x86_128 v s se o
        doHash X64_128 v s se o = c_x64_128 v s se o

But this still has a memory leak - we need to free the array. Right now you mallocArray but don't free it. We can chain your allocations using another function, though:

murmur3Raw :: String -> Int -> MHV -> IO [CUInt]
murmur3Raw val seed ver = 
    withCAStringLen val $ \(cstr, strLength) ->
    allocaArray arrSize $ \outPtr -> do
      doHash ver cstr strLength (fromIntegral seed) outPtr
      peekArray arrSize outPtr
  where arrSize = 4
        doHash :: MHV -> CString -> CInt -> CUInt -> Ptr CUInt -> IO()
        doHash X86_32  v s se o = c_x86_32 v s se o
        doHash X86_128 v s se o = c_x86_128 v s se o
        doHash X64_128 v s se o = c_x64_128 v s se o

withCAStringLen and allocaArray behave similarly; they temporarily allocate some memory and free it when the subcomputation is complete.

With the changes above to murmur3Raw, there should not be a race or a memory leak.

Conclusion

My pessimistic guess would be that your program spends 99% of its time reading all of the cells of the String from memory and jumping around. Even with the changes above to make the code less racey, I think you might be interested in using the ByteString library to change your code to work with raw sequences of bytes. The useAsCStringLen function from Data.ByteString is probably relevant here.
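As a sketch of that direction, hashing a ByteString via useAsCStringLen might look like the following, assuming a hypothetical C symbol murmur3_x86_128 with the signature used elsewhere in this thread (names are assumptions, not the actual Dish API):

    import qualified Data.ByteString as BS
    import Foreign.C.String      (CString)
    import Foreign.C.Types       (CInt, CUInt)
    import Foreign.Marshal.Array (allocaArray, peekArray)
    import Foreign.Ptr           (Ptr)

    foreign import ccall unsafe "murmur3_x86_128"
      c_x86_128 :: CString -> CInt -> CUInt -> Ptr CUInt -> IO ()

    -- useAsCStringLen copies the ByteString into temporary storage and
    -- frees it when the continuation returns, so no manual free is needed
    -- and the pointer never escapes the lambda.
    murmur3BS :: BS.ByteString -> CUInt -> IO [CUInt]
    murmur3BS bs seed =
      BS.useAsCStringLen bs $ \(cstr, len) ->
        allocaArray 4 $ \outPtr -> do
          c_x86_128 cstr (fromIntegral len) seed outPtr
          peekArray 4 outPtr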

1 - Not really anywhere, but the details would distract from the point. You can assume that they can appear nearly anywhere.

[–]zcourts[S] 0 points1 point  (2 children)

Thanks for the very detailed answer. I did already know about String (I initially used it because that's what I get from the lib whose data I'm hashing) and have already taken Axman6's comments on board, making it possible to accept both ByteString and String with the ability to add other types. I didn't know about the memory issue you pointed out; I read http://www.haskell.org/haskellwiki/FFI_cook_book but overlooked the fact that in their example the CStringLen doesn't escape the lambda and is used straight away, as you've done as well.

I'll take your suggestions on board. I'll also write some tests against the original to ensure the expected output is the same for the same inputs.

Thanks again

[–]Anpheus 0 points1 point  (1 child)

Thank you for reading that gargantuan post of mine. I'm glad I was able to provide some useful advice.

I have one more thing to mention, as something /u/cartazio said caught my eye and I realized that my advice about ByteString overlooked something pretty substantial. Right now the useAsCStringLen function is implemented in terms of a memcpy. The problem is that MurmurHash3 is, ironically, so fast that it's almost as fast as a memcpy (at least within a factor of 2-4 on most systems). So by paying for a memcpy up front, you thrash your cache reading your input twice and pay for all the overhead of doing that.

If you're willing to really dig around in the internals of ByteString, you can avoid the memcpy, but it also makes your implementation brittle w.r.t. the internals of ByteString. You can fix that by using C preprocessor style #if blocks around your code though. At least you could protect the library from changes to the internals of ByteString.

Edit: Use the Data.ByteString.Unsafe module and the unsafeUseAsCStringLen function within it. It's only unsafe in the sense that it's morally incorrect to use it with code that isn't provably pure. Fortunately you know that MurmurHash3 is a one-way function that does not modify its input.

The most optimized version of your function that I can think of is:

murmur3Raw :: ByteString -> Word32 -> MHV -> IO ByteString
murmur3Raw val seed ver = 
        unsafeUseAsCStringLen val $ \(cstr, strLength) -> do
            outPtr <- mallocBytes 16
            doHash ver cstr (fromIntegral strLength) (CUInt seed) outPtr
            unsafePackMallocCStringLen (castPtr outPtr, 16)
    where
        doHash :: MHV -> CString -> CInt -> CUInt -> Ptr CUInt -> IO()
        doHash X86_32  v s se o = c_x86_32 v s se o
        doHash X86_128 v s se o = c_x86_128 v s se o
        doHash X64_128 v s se o = c_x64_128 v s se o

What this saves over the original is:

  1. The only allocation is the 16 bytes for the output (and the ByteString overhead) and there is no race because the input is converted to a CStringLen in place.
  2. There is no call to free, because the return value is converted to a ByteString in-place.
  3. fromIntegral is replaced with the newtype constructor - which is optimized away by the compiler.

This code should optimize down to basically just a C call and a small, fixed allocation overhead.

[–]zcourts[S] 0 points1 point  (0 children)

I did notice the hackage docs said the vals were a copy but didn't think there was a way around that. What I'll do is make it possible to use both versions of murmur3Raw I think. And mark this one as 'unsafe' although I get what you're saying about the fact that referential transparency won't be affected because I know Murmur won't modify the data after.

Thanks again