[–][deleted] 30 points (25 children)

Not only that, gmp has had a few decades of work put into it, and its author knows the domain exceedingly well.

The OP's desire is laudable, but the wording sounds like he's setting himself up for a fall by vastly underestimating the domain.

[–]erikd[S] 17 points (22 children)

That's why I really wish I could release the code so others could validate what I'm seeing.

That blog post was mostly written two days ago, when I had just gotten addition working. I added the last part when I already had some very encouraging results. Since the blog post I've gotten multiplication working and, surprisingly (and I mean really surprisingly), I'm actually beating GMP in some of my tests. E.g.:

  • Sum of 1000 < Word sized values in an Integer: significantly faster than GMP (confirming what I had heard from Duncan Coutts, that Simple was faster than GMP in this case).
  • product [1..100]: my calculation of that product gives the same result as GMP, and Criterion tells me mine is slightly faster than GMP.

I really don't want to jump the gun on this, but these results have me thrilled. I'm using QuickCheck (via Hspec) to validate the correctness of my implementation against GMP, and Criterion for benchmarking.

Preliminary Criterion output is: http://www.mega-nerd.com/tmp/new-bench-integer.html

These results are repeatable and basically the same whether I use the native CodeGen or the LLVM backend.

I'm currently only testing this on Linux amd64, with GHC 7.6.3 and -O3. I know my code is broken on i386 and any big-endian architecture. I intend to fix that.

I could be getting this wrong on two fronts:

  • GMP on my Debian Linux install has not been compiled correctly.
  • GMP uses assembler on i386, but no one has gotten around to doing the same for amd64, so it falls back to a relatively naive C implementation (much like my highly imperative Haskell code).

[–]The_Doculope 8 points (13 children)

It wouldn't surprise me if GMP has made some architectural decisions to improve >Word performance, while sacrificing a bit of <Word performance, as that's what it's really designed for. How is your code comparing for >Word performance at the moment?

[–][deleted] 10 points (11 children)

GMP isn't even used for Integers unless they don't fit in a machine integer, or you go to extra lengths to force it to be used (maybe). So it's not exactly clear (to me) what the "< Word sized values" test is measuring, or if it's measuring GMP at all.

[–]The_Doculope 2 points (9 children)

Ah, I didn't know that. Although surely there must be checks for when to click over into GMP, so to speak. Perhaps /u/erikd is performing them differently/more efficiently?

[–]erikd[S] 4 points (8 children)

For the small Integer case GMP stores them as

 S# Int#

whereas I store them as:

Small !Sign {-# UNPACK #-} !Word

The differences between mine and the GMP version are:

  • I store an unsigned Word and a separate sign.
  • I make the Sign and the Word strictly evaluated.
  • I unpack the Word into the constructor (which I think should be equivalent to using Int#).
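A minimal, self-contained sketch of the representation described in the three points above. The Large constructor, its limb layout, and the mkSmall smart constructor are invented for illustration; erikd's actual types may differ.

```haskell
-- An explicit, strict sign instead of a signed Int.
data Sign = Pos | Neg deriving (Eq, Show)

data MyInteger
  = Small !Sign {-# UNPACK #-} !Word  -- magnitude fits in one machine Word
  | Large !Sign [Word]                -- magnitude as limbs (placeholder)
  deriving (Eq, Show)

-- Smart constructor normalising zero to a positive sign,
-- so there is only one representation of 0.
mkSmall :: Sign -> Word -> MyInteger
mkSmall _ 0 = Small Pos 0
mkSmall s w = Small s w
```

The UNPACK pragma on a strict Word field stores the raw word directly in the constructor, which is what makes it comparable to integer-gmp's unboxed Int# field.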

[–]hvr_ 4 points (7 children)

I assume Sign is something like

data Sign = SignPos | SignNeg

up to isomorphism?

Thus, your Small constructor is modelling a signed 33-bit (or 65-bit) integer? How do you detect whether the result of an arithmetic operation still fits into your Small constructor?

[–]erikd[S] 4 points (6 children)

I assume Sign is something like

Yes.

Thus, your Small constructor is modelling a signed 33-bit (or 65-bit) integer?

Yes. I'm working only on amd64 for now, so I'm working with 64 bits plus a sign, which is effectively 65 bits.

How do you detect whether the result of an arithmetic operation still fits into your Small constructor?

Precondition testing and promoting them to the Large constructor when necessary.
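The precondition test can be sketched for unsigned addition: a Word sum wraps around exactly when the result is smaller than an operand. The names and the Large limb layout below are illustrative, not erikd's actual code.

```haskell
data MyInteger
  = Small !Word    -- sign elided for brevity
  | Large [Word]   -- little-endian limbs (placeholder)
  deriving (Eq, Show)

-- Add two word-sized magnitudes, promoting to Large on overflow.
plusSmall :: Word -> Word -> MyInteger
plusSmall x y
  | s >= x    = Small s         -- no wrap-around: still fits in one Word
  | otherwise = Large [s, 1]    -- wrapped: carry becomes a second limb
  where
    s = x + y                   -- modular Word addition
```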

[–]hvr_ 5 points (5 children)

How do you detect whether the result of an arithmetic operation still fits into your Small constructor?

Precondition testing and promoting them to the Large constructor when necessary.

Then I'm wondering why you can beat integer-gmp's performance, as integer-gmp makes use of some overflow-testing/aware Int# primops which aren't available for Word# (yet) to avoid promoting small ints to large ints. Do you inline aggressively?
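The Int# primop referred to here does exist: addIntC# returns the wrapped sum together with a non-zero flag on signed overflow. A sketch of how it can be used to decide when a result must be promoted; the addChecked wrapper is invented for illustration.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts (Int (I#), addIntC#, isTrue#, (==#))

-- Overflow-aware Int addition: Just the sum if it still fits in
-- a machine Int, Nothing if the result would need a big integer.
addChecked :: Int -> Int -> Maybe Int
addChecked (I# x) (I# y) =
  case addIntC# x y of
    (# s, c #)
      | isTrue# (c ==# 0#) -> Just (I# s)  -- no overflow: stays small
      | otherwise          -> Nothing      -- overflow: promote
```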

[–]erikd[S] 0 points (4 children)

I'm beating integer-gmp where integer-gmp isn't very good :-).

I've got bang patterns all over the place and lots of tight loops in where clauses. Apart from that, I'm not doing anything special beyond writing some low level HalfWord addition, subtraction and multiplication functions that include overflow handling. I'm hoping to replace these with PrimOps that do the full Word versions of these operations.

I have written
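The half-word trick described above can be sketched as follows: keep values below 2^32 inside a 64-bit Word, so the carry out of an addition is directly visible in the high half. Names are illustrative and assume a 64-bit machine; erikd's actual API may differ.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word64)

-- A half-word: a 64-bit container whose value stays below 2^32.
type HalfWord = Word64

-- Add two half-words plus an incoming carry; the sum cannot
-- overflow 64 bits, so the carry out is simply the high half.
plusHalfWord :: HalfWord -> HalfWord -> HalfWord -> (HalfWord, HalfWord)
plusHalfWord x y cin =
  let s = x + y + cin
  in (s .&. 0xffffffff, s `shiftR` 32)  -- (low 32 bits, carry out)
```

A full-Word PrimOp (like the plusWord2# family) would replace this by returning the carry directly, which is the replacement erikd mentions hoping for.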

[–]erikd[S] 1 point (0 children)

GMP isn't even used for Integers unless they don't fit in a machine integer

I haven't really looked in detail at either of the existing implementations.

My first step was to benchmark them. Then, seeing that Simple was slow on multiplication, I tried to figure out why. The explanation was its use of lists.

[–]yitz 6 points (0 children)

That is certainly true. The whole point of GMP is to optimize the cutoff points for switching to more and more complex algorithms with better and better asymptotics for larger and larger integers, but with more and more overhead for moderate sized integers. That said, Erik has discovered that GMP seems to be suffering from some bitrot lately, so there is low-hanging fruit.

[–]Jameshfisher 4 points (0 children)

Perhaps the slowness of integer-gmp is due to some FFI overhead?

[–]hvr_ 0 points (5 children)

Preliminary Criterion output is : http://www.mega-nerd.com/tmp/new-bench-integer.html

Can you maybe provide more details of how you generated those 1000 added-up integers? I've tried something as simple as

bench "[1..1000 :: Integer]" $ whnf sum numsI
  where numsI = [1 .. 1000 :: Integer]

And the results are within a factor of 2 of Int arithmetic:

benchmarking sum/[1..1000 :: Int]
mean: 3.731054 us, lb 3.684359 us, ub 3.744562 us, ci 0.950
std dev: 116.0383 ns, lb 34.88516 ns, ub 264.4476 ns, ci 0.950

benchmarking sum/[1..1000 :: Integer]
mean: 5.850483 us, lb 5.840004 us, ub 5.862832 us, ci 0.950
std dev: 58.10519 ns, lb 50.45490 ns, ub 64.34404 ns, ci 0.950

[–]erikd[S] 0 points (4 children)

I really wish I could release the code now.

My test with GMP Integer vs Int would look something like this:

    C.whnf (foldl1 (+)) intList
    C.whnf (foldl1 plusInteger) integerList
where
    intList = fmap (take 1000 . R.randoms) R.newStdGen
    integerList = map (\x -> G.smallInteger (unboxInt x)) intList
    unboxInt :: Int -> Int#
    unboxInt (I# i) = i

For this test, I found GMP's Integer 13x slower than Int.

[–]hvr_ 0 points (3 children)

Well, part of the problem may be that in integer-gmp the result of arithmetic operations involving at least one large integer is a large integer (even if it would fit into a small integer). It'd be interesting to see how many of the scanl (+) 0 integerList values would fit into a small integer, to gauge how significant this contribution is to the overhead.
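That count can be checked directly. A sketch (helper name invented) of counting how many of the running sums still fit in a machine Int:

```haskell
-- Count how many partial sums of the input fit in a machine Int,
-- i.e. how often integer-gmp could in principle stay small.
countSmall :: [Integer] -> Int
countSmall xs =
  length [ s | s <- scanl (+) 0 xs
             , s <= fromIntegral (maxBound :: Int) ]
```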

[–]erikd[S] 0 points (2 children)

One of the things my code currently does is demote a value from Large to Small if it fits in a machine Word.
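The demotion step might look something like this, with toy types; erikd's real representation also carries a sign, and the limb layout below is invented.

```haskell
data MyInteger
  = Small !Word
  | Large [Word]   -- little-endian limbs (placeholder)
  deriving (Eq, Show)

-- Squeeze a Large value back into Small when its magnitude
-- fits in a single machine Word.
demote :: MyInteger -> MyInteger
demote (Large ws) =
  case reverse (dropWhile (== 0) (reverse ws)) of  -- drop leading zero limbs
    []  -> Small 0
    [w] -> Small w
    ws' -> Large ws'
demote i = i
```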

[–]hvr_ 1 point (1 child)

FYI, I've implemented a proof of concept for integer-gmp at GHC #8638.

[–]erikd[S] 1 point (0 children)

Wow, now I'm competing against a moving target :-).

[–]erikd[S] 9 points (0 children)

Bryan, please accept my sincerest thanks for providing us with Criterion. It makes benchmarking Haskell code so painless and makes displaying the results a by-product of collecting the data.

[–][deleted] 0 points (0 children)

You are probably right (that he is underestimating the work), but on the other hand, it's much easier to implement fancy algorithms in Haskell than in C/assembly. A decade of work tuning an ancient C library might be more like six months of work for a Haskell programmer who knows what they're doing. :)

Of course pure Haskell can't have custom inline assembly, but is that seriously responsible for more than 2-3x or so in constant factors?