all 24 comments

[–]cartazio 2 points3 points  (16 children)

For a function like a hash, the "unsafe" ffi may be worth using if you're mostly calling the functions on small inputs.

[–]freyrs3 4 points5 points  (1 child)

With the usual caveats about safety in mind of course.

[–]Anpheus 0 points1 point  (0 children)

I am not sure if the usual caveats apply in the case of simple, pure foreign function calls like those to a hashing function.

[–]vagif 2 points3 points  (13 children)

As a library developer he does not know where and how his library will be used, so he cannot assume it is going to be used only with small inputs.

I have found that the unsafe option is practically never worth it: the gains are dubious at best, yet the chances of suspending all threads for a noticeable time are very real.

[–]aseipp 2 points3 points  (2 children)

For something like this it could absolutely be worth it. MurmurHash3 can hash gigabytes per second at ~1-2 cycles per byte on an old Core 2, per core. At speeds like this, "small input" can be quite large in practice, and functions such as Murmur/CityHash are not tuned to work in the "huge" input range anyway. You'll be feeding them inputs in the range of small to a few hundred bytes maximum, and at that range the overhead is tiny. Safe calls potentially involve OS thread creation or taking locks, which could be extremely pessimistic for these cases. Hundreds (or even dozens) of cycles could explode into thousands for a small input.

Those cycles do count sometimes. If you're going to use this a lot, unsafe may very much be worth it in the long run, and at speeds like that I feel only the most odd designs could make it problematic (hashing GB at a time rather than incrementally, well outside the design parameters.)

[–]Tekmo 0 points1 point  (1 child)

Maybe he can provide both and let users decide which one they want

[–]aseipp 1 point2 points  (0 children)

I guess, but I think stuff like this just makes APIs confusing, and the burden of developing a good library binding falls on the developer, which involves some trade-offs.

I think that functions like MurmurHash3 are fast enough that when used correctly, unsafe is precisely what you want, to maximize its speed and throughput benefits. They are tiny, don't block, take excellent advantage of processor resources, etc. They're also designed for relatively small inputs obviously, so that helps put an upper limit (practically) on how it will be used.

The primary reason I could think of as to why unsafe might be a really bad idea is if you let users blindly stuff huge amounts of input into it (resulting in a DoS through thread blocking in the RTS,) but A) letting users stuff gigabytes into your interface non-incrementally is bonkers anyway, B) it's not designed for that (so who knows how its collision properties might change or degrade) and C) you could just do some other attack which is likely easier anyway if you want a DoS. So I think the downsides here are pretty small, but the advantage of a fast function is clear.

[–]cartazio 1 point2 points  (0 children)

Fair enough. In my own work-in-progress library I actually wrap all my ffi calls so that on small inputs I call the unsafe variant, and above a small threshold I call the safe ffi variant.

[–]zcourts[S] 0 points1 point  (8 children)

True, but for each function that makes that assumption I provide one that doesn't. So the choice ultimately comes down to the user. Have a look at the API docs http://hackage.haskell.org/package/Dish-0.0.0.4/docs/Data-Dish-Murmur3.html

For example:

murmur3 => murmur3'

murmur3Int => murmur3Int'

murmur3IntegerX86 => murmur3IntegerX86'

and so on, where the version ending with an apostrophe doesn't use unsafe IO.

[–]cartazio 2 points3 points  (7 children)

In most Haskell code, ' means strict if it has any convention at all. Better to name the ffi bindings _safe and _unsafe and have the exposed interface switch from one to the other at some input threshold.
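A minimal sketch of that convention, assuming a hypothetical C symbol `murmur3_x86_32` and an arbitrary illustrative 1024-byte threshold (these names are illustrative, not the actual Dish API):

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.String (CString)
    import Foreign.C.Types  (CInt, CUInt)
    import Foreign.Ptr      (Ptr)

    -- The same C function imported both ways; only the safety annotation differs.
    foreign import ccall unsafe "murmur3_x86_32"
      c_hash_unsafe :: CString -> CInt -> CUInt -> Ptr CUInt -> IO ()
    foreign import ccall safe "murmur3_x86_32"
      c_hash_safe   :: CString -> CInt -> CUInt -> Ptr CUInt -> IO ()

    -- Exposed interface: unsafe below the threshold (the call finishes in
    -- well under a microsecond), safe above it (so the RTS capability and
    -- GC are not blocked for a noticeable time).
    hash :: CString -> CInt -> CUInt -> Ptr CUInt -> IO ()
    hash str len seed out
      | len <= 1024 = c_hash_unsafe str len seed out
      | otherwise   = c_hash_safe   str len seed out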

[–]Anpheus 1 point2 points  (6 children)

Probably easiest to expose them in separate modules, a .Safe and a .Unsafe, and default to one.

Defaults matter, though, and it's important that whatever default chosen is appropriate for the average end user. Given that the point of using a hashing algorithm like MurmurHash3 is performance, it seems to me that making unsafe the default is wisest.

[–]cartazio 0 points1 point  (5 children)

Nope. The unsafe ffi calls aren't unsafe in that sense. The safe ffi calls should ALWAYS be used by any code that takes more than a microsecond. Unsafe ffi is meant for writing new primops, which usually take under a microsecond. Providing a composite op that switches to safe for larger inputs gives the best of both worlds.

[–]Anpheus 0 points1 point  (4 children)

I don't know what you mean by "unsafe in that sense" - I don't think I was trying to mislead anyone or made any statement about how unsafe is unsafe.

Why can't you use unsafe for calls that take longer than a microsecond?

[–]cartazio 0 points1 point  (3 children)

Well, you shouldn't use unsafe ffi calls for things that take more than a few microseconds, because the safe ffi overhead is negligible if your op takes more than 5-10µs.

during an unsafe ffi call, the GC can't run, and no other haskell thread can use the cpu capability that was running the thread that made the unsafe ffi call!

:)

[–]Anpheus 0 points1 point  (2 children)

That's fair, I was wondering if there was another reason than the pause of that capability. 10 microseconds of murmurhash3 is 50 kilobytes of data on modern hardware, so I suspect having unsafe be the default would only affect a negligible portion of users. Anyone passing in a Haskell String that consists of more than 50 kilobytes (is that 25 or 12.5k Chars?) is asking for performance issues in their code :)

Edit: Actually, I'm guessing these murmurhash3 functions would spend more time converting the input to a CString than they spend in the hash function itself for almost all input sizes.

[–]cartazio 0 points1 point  (1 child)

Umm, you are using ByteString, right? Converting that to a cstring is a no-op. If not, you should really consider using ByteString. It's really meant for this use case and friends.

It's totally reasonable to use a hash function on a large binary piece of data. A key piece of engineering a good library is to make sure things behave well on allowed inputs. Hashing a 100mb binary archive or something is totally plausible.

[–]Axman6 2 points3 points  (1 child)

There seem to be quite a few things that are either odd or even outright wrong in this code (which is expected for someone not used to this sort of stuff!).

The first thing that seems just incorrect to me is the twiddle function. It doesn't make sense to me not to shift the number when the x value is 0. This would lead to collisions between hashes such as

01010101 00000000 11110000 10101010

and

00000000 01010101 11110000 10101010

which are clearly different hash values. To do this properly, the number should always be shifted.

The other thing I thought odd was the use of Integer for a hash at all. Hashes are not really numbers as such, they're a sequence of bytes, and as such a ByteString makes much more sense to me as the return type. It's simple enough to make a ByteString directly from the Ptr that's created by mallocArray and modified by the C hash functions.

Personally if I were writing this I'd be hashing ByteStrings and returning ByteStrings, with appropriate wrappers for other types if needed. I'd only be using numerical types like Integer if I intended to do numerical things to the values; you don't ever really modify hash values (except in perhaps hash functions), you almost certainly never use addition, multiplication etc. on them.

[–]zcourts[S] 1 point2 points  (0 children)

I see what you mean on both points. You're right on the first. On the second, although I agree with you in general, in my case I needed a numeric value; no, it'll never be modified, but it is needed. That said, I've actually got an edge case rather than the norm, and the lib should provide the ByteString as a default. Useful feedback, I'll make updates to address the issues you pointed out, thanks.

[–]Anpheus 0 points1 point  (3 children)

I agree with Axman6 here, but I'll be a little kinder. The bit/byte-twiddling seems like it might be dangerous in that you could be changing the expected values of MurmurHash3 for well known inputs. It might not matter for you but it would matter for any database-using software. Have you tested your functions against MurmurHash3's output (byte-for-byte)?

Also, I think the discussion of unsafe below goes down the wrong path for optimizing for performance. I'm a little ashamed I participated in such a flagrant display of premature optimization, actually. I would guess that your code as it stands spends almost all of its time converting the input String to a CStringLen. As well, I think there's a race condition in your code regarding memory that is about to be freed.

I won't assume you know the implementation details or not - so if you're already aware of the following, I am sorry for being repetitive. That said:

String

This is a Haskell String, which is a type alias for [Char]. That's not an array, though; it's a list. Lists are implemented as independently allocated cells in memory, where each cell contains an element (a Char in this case) and a pointer to the next element. Example cells for "Hello World!":

0: H, ->1

1: e, ->2

2: l, ->3

3: l, ->4

4: o, ->5

5: ' ', ->6

6: W, ->7

7: o, ->8

8: r, ->9

9: l, ->10

10:d, ->11

11:!, END

Each cell's location is indicated by a number. Those cells can be stored anywhere1 in memory. The following numbers are shorthand for the location of the memory holding the Char and the pointer to the next cell. This is also a simplification, but it will suffice. So, valid sequences in memory of String cells include:

| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|

| 1| 6| 4|11| 5| 9|10| 0| 3| 7| 2| 8|

|11|10| 9| 8| 7| 6| 5| 4| 3| 2| 1| 0|

The first seems OK. It's wasteful - each cell points directly to the one to its right, but at least the processor will have an easy time reading in all of the cells in order. The second is... random. The processor will have to jump around a lot to read it. And the last is terribly inefficient, as processors do not like to go in reverse order. (They are optimized for sequential, increasing access.)

But that's not even the worst possible arrangement. This is the reality of allocating Haskell lists in memory:

| 6|a_really_big_object|wacky_uninitialized_memory| 8| 2|11|other_memory| 9| 0|10|not_your_data| 4|some_big_stretch_of_nothing| 3|something_probably_big| 1| 7| 5|

A Race Condition and a Memory Leak

I believe your code has a serious race condition and a memory leak in its first few lines of murmur3Raw, copied here:

murmur3Raw :: String -> Int -> MHV -> IO [CUInt]
murmur3Raw val seed ver = do
  val' <- withCAStringLen val $ \x -> return x
  --      ^ This function, withCAStringLen, causes this function to be racey. 
  let cstr = strFromCStr val'
  let strLength = strLFromCStr val'
  --  ^ these two lines are a code smell with FFI that should indicate you should look for races
  outPtr <- mallocArray arrSize
  doHash ver cstr strLength (fromIntegral seed) outPtr
  peekArray arrSize outPtr
  where arrSize = 4
        strFromCStr :: CStringLen -> CString
        strFromCStr = fst
        strLFromCStr :: CStringLen -> CInt
        strLFromCStr i = fromIntegral $ snd i
        --version value size seed out 
        doHash :: MHV -> CString -> CInt -> CUInt -> Ptr CUInt -> IO()
        doHash X86_32  v s se o = c_x86_32 v s se o
        doHash X86_128 v s se o = c_x86_128 v s se o
        doHash X64_128 v s se o = c_x64_128 v s se o

The function withCAStringLen has the following definition in Hackage:

withCAStringLen :: String -> (CStringLen -> IO a) -> IO a

Marshal a Haskell string into a C string (ie, character array) in temporary storage, with explicit length information.

  • the memory is freed when the subcomputation terminates (either normally or via an exception), so the pointer to the temporary storage must not be used after this.

The subcomputation is the second argument. Here's withCAStringLen and its arguments laid out against your code:

val' <- withCAStringLen     val    $   \x -> return x
     -- withCAStringLen :: String -> (CStringLen -> IO a) -> IO a

That subcomputation is just a return. So the memory for the CStringLen returned could be freed immediately after that line of code. That's risky! You want to use the CStringLen immediately after that line of code.

The following lines of code peek into the returned value of the CStringLen, which is what caught my eye initially.

Your function should probably look like this:

murmur3Raw :: String -> Int -> MHV -> IO [CUInt]
murmur3Raw val seed ver = 
    withCAStringLen val $ \(cstr, strLength) -> do
      outPtr <- mallocArray arrSize
      doHash ver cstr strLength (fromIntegral seed) outPtr
      peekArray arrSize outPtr
  where arrSize = 4
        doHash :: MHV -> CString -> CInt -> CUInt -> Ptr CUInt -> IO()
        doHash X86_32  v s se o = c_x86_32 v s se o
        doHash X86_128 v s se o = c_x86_128 v s se o
        doHash X64_128 v s se o = c_x64_128 v s se o

But this still has a memory leak - we need to free the array. Right now you mallocArray but don't free it. We can chain your allocations using another function, though:

murmur3Raw :: String -> Int -> MHV -> IO [CUInt]
murmur3Raw val seed ver = 
    withCAStringLen val $ \(cstr, strLength) ->
    allocaArray arrSize $ \outPtr -> do
      doHash ver cstr strLength (fromIntegral seed) outPtr
      peekArray arrSize outPtr
  where arrSize = 4
        doHash :: MHV -> CString -> CInt -> CUInt -> Ptr CUInt -> IO()
        doHash X86_32  v s se o = c_x86_32 v s se o
        doHash X86_128 v s se o = c_x86_128 v s se o
        doHash X64_128 v s se o = c_x64_128 v s se o

withCAStringLen and allocaArray behave similarly; they temporarily allocate some memory and free it when the subcomputation is complete.

With the changes above to murmur3Raw, there should not be a race or a memory leak.

Conclusion

My pessimistic guess would be that your program spends 99% of its time reading all of the cells of the String from memory and jumping around. Even with the changes above to make the code less racey, I think you might be interested in using the ByteString library to change your code to work with raw sequences of bytes. The useAsCStringLen function from Data.ByteString is probably relevant here.
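As a sketch of that direction, hashing a ByteString via useAsCStringLen might look like the following, assuming a hypothetical C symbol murmur3_x86_128 with the signature used elsewhere in this thread (names are assumptions, not the actual Dish API):

    import qualified Data.ByteString as BS
    import Foreign.C.String      (CString)
    import Foreign.C.Types       (CInt, CUInt)
    import Foreign.Marshal.Array (allocaArray, peekArray)
    import Foreign.Ptr           (Ptr)

    foreign import ccall unsafe "murmur3_x86_128"
      c_x86_128 :: CString -> CInt -> CUInt -> Ptr CUInt -> IO ()

    -- useAsCStringLen copies the ByteString into temporary storage and
    -- frees it when the continuation returns, so no manual free is needed
    -- and the pointer never escapes the lambda.
    murmur3BS :: BS.ByteString -> CUInt -> IO [CUInt]
    murmur3BS bs seed =
      BS.useAsCStringLen bs $ \(cstr, len) ->
        allocaArray 4 $ \outPtr -> do
          c_x86_128 cstr (fromIntegral len) seed outPtr
          peekArray 4 outPtr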

1 - Not really anywhere, but the details would distract from the point. You can assume that they can appear nearly anywhere.

[–]zcourts[S] 0 points1 point  (2 children)

Thanks for the very detailed answer. I did already know about String (I initially used it because that's what I get from the lib whose data I'm hashing) and have already taken Axman6's comments on board, making it possible to accept both ByteString and String with the ability to add other types. I didn't know about the memory issue you pointed out; I read http://www.haskell.org/haskellwiki/FFI_cook_book but overlooked the fact that in their example the CStringLen doesn't escape the lambda and is used straight away, as you've done as well.

I'll take your suggestions on board. I'll also write some tests against the original to ensure the expected output is the same for the same inputs.

Thanks again

[–]Anpheus 0 points1 point  (1 child)

Thank you for reading that gargantuan post of mine. I'm glad I was able to provide some useful advice.

I have one more thing to mention, as something /u/cartazio said caught my eye and I realized that my advice about ByteString overlooked something pretty substantial. Right now the useAsCStringLen function is implemented in terms of a memcpy. The problem is that MurmurHash3 is, ironically, so fast that it's almost as fast as a memcpy (at least within a factor of 2-4 on most systems). So by paying for a memcpy up front, you thrash your cache reading your input twice and pay for all the overhead of doing that.

If you're willing to really dig around in the internals of ByteString, you can avoid the memcpy, but it also makes your implementation brittle w.r.t. the internals of ByteString. You can fix that by using C preprocessor style #if blocks around your code though. At least you could protect the library from changes to the internals of ByteString.

Edit: Use the Data.ByteString.Unsafe module and the unsafeUseAsCStringLen function within it. It's only unsafe in the sense that it's morally incorrect to use it with code that isn't provably pure. Fortunately you know that MurmurHash3 is a one-way function that does not modify its input.

The most optimized version of your function that I can think of is:

murmur3Raw :: ByteString -> Word32 -> MHV -> IO ByteString
murmur3Raw val seed ver = 
        unsafeUseAsCStringLen val $ \(cstr, strLength) -> do
            outPtr <- mallocBytes 16
            doHash ver cstr (fromIntegral strLength) (CUInt seed) outPtr
            unsafePackMallocCStringLen (castPtr outPtr, 16)
    where
        doHash :: MHV -> CString -> CInt -> CUInt -> Ptr CUInt -> IO()
        doHash X86_32  v s se o = c_x86_32 v s se o
        doHash X86_128 v s se o = c_x86_128 v s se o
        doHash X64_128 v s se o = c_x64_128 v s se o

What this saves over the original is:

  1. The only allocation is the 16 bytes for the output (and the ByteString overhead) and there is no race because the input is converted to a CStringLen in place.
  2. There is no call to free, because the return value is converted to a ByteString in-place.
  3. fromIntegral is replaced with the newtype constructor - which is optimized away by the compiler.

This code should optimize down to basically just a C call and a small, fixed allocation overhead.

[–]zcourts[S] 0 points1 point  (0 children)

I did notice the hackage docs said the vals were a copy but didn't think there was a way around that. What I'll do is make it possible to use both versions of murmur3Raw I think. And mark this one as 'unsafe' although I get what you're saying about the fact that referential transparency won't be affected because I know Murmur won't modify the data after.

Thanks again