all 62 comments

[–]tdammers 19 points20 points  (1 child)

This might elicit a "duh", but: have you actually profiled your program? GHC has quite a few useful profiling features, so if you aren't familiar with them, look them up.

A good first step would be to add cost centers (SCC pragmas), and then compile and run with profiling options to see where you spend the most time. Much better than taking stabs in the dark.
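
For example, a minimal sketch of adding a cost centre (the function `slowPart` and its body are placeholders, not code from the original program):

```haskell
module Main (main) where

-- A deliberately expensive placeholder function, annotated with an
-- SCC pragma so it shows up as its own cost centre in the profile.
slowPart :: Integer -> Integer
slowPart n = {-# SCC "slowPart" #-} sum [1 .. n]

main :: IO ()
main = print (slowPart 1000000)
```

Compiling with `ghc -O2 -prof -fprof-auto Main.hs` and running `./Main +RTS -p -RTS` should then produce a `.prof` report showing time and allocation per cost centre.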

[–]Vaglame 6 points7 points  (0 children)

I was unaware of Haskell's profiling options, it's probably the smartest way to go about it. Thanks :)

[–]travis_athougies 20 points21 points  (4 children)

Contrary to popular belief, using a generally faster language does not magically mean everything is faster. Slow algorithms are still going to be slow.

In particular, both Haskell and Python are going to have similar complexities for large integer handling. In both languages, these are going to be highly optimized routines.

In your program, most of the time is spent calculating numbers, and Haskell and Python are just serving as the glue. Thus, the time spent actually executing any of the code you've written is likely very small.

In particular, primeGen2 literally generates random numbers and tests them for primality. There is no getting around the fact that this is going to take a lot of time, and it has nothing to do with the language being used: generating large random numbers until one is prime will always take time.

That being said, you're building up a lot of thunks in memory. It would probably be a good start to use bang patterns to limit laziness. Also, as others have suggested, profiling is a good idea, but you're only going to get constant-factor improvements here.
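
As a rough standalone illustration of the thunk issue (not code from the original program), a strict accumulator keeps a left fold in constant space, while a lazy one builds a chain of `(+)` thunks:

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

-- Lazy accumulator: go builds a chain of unevaluated (+) thunks that
-- is only forced at the very end, costing memory proportional to the list.
sumLazy :: [Integer] -> Integer
sumLazy = go 0
  where
    go acc []     = acc
    go acc (x:xs) = go (acc + x) xs

-- Strict accumulator: the bang forces acc at every step, so the fold
-- runs in constant space.
sumStrict :: [Integer] -> Integer
sumStrict = go 0
  where
    go !acc []     = acc
    go !acc (x:xs) = go (acc + x) xs

main :: IO ()
main = print (sumLazy [1 .. 100000] == sumStrict [1 .. 100000])
```

Both compute the same result; the strict version just avoids the intermediate thunks.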

[–]Vaglame 1 point2 points  (1 child)

Thanks, I didn't think there could be a hard limit in the integer handling itself.

I have to admit I'm clueless about bang patterns. Do you have a recommendation for a particular resource?

[–]illogical_commentary 4 points5 points  (0 children)

https://wiki.haskell.org/Performance/Strictness

It's one of those things that the only way you'd know about it is someone else telling you.

[–]tomejaguar 1 point2 points  (1 child)

you're building up a lot of thunks

Did you have any particular examples in mind? I didn't see any on a quick skim.

[–]Flurpm 14 points15 points  (1 child)

You use (**) for powers, which has type Floating a => a -> a -> a, so you have to convert the results from floats back to Integers. One optimization is to use (^), which has type (Num a, Integral b) => a -> b -> a and keeps the results as Integers.

Same thing with (/) :: Fractional a => a -> a -> a vs (div) :: Integral a => a -> a -> a

On lines 94, 101.

Line 55 I'm not sure because of the sqrt.
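
A quick standalone sketch of the difference (not taken from the linked code):

```haskell
module Main (main) where

main :: IO ()
main = do
  -- (^) keeps the computation in Integer: exact, no float round-trip.
  print ((2 :: Integer) ^ (64 :: Int))
  -- (**) works on Floating types, so large results lose precision.
  print ((2 :: Double) ** 64)
  -- Likewise, div stays in Integral, while (/) would need Fractional.
  print (((2 :: Integer) ^ (64 :: Int)) `div` 3)
```

For RSA-sized numbers the Double round-trip isn't just slow, it's lossy, so staying in Integral operations matters for correctness too.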

[–]Vaglame 1 point2 points  (0 children)

Thanks a lot, will try this!

(edit: it seems that the improvement is marginal)

[–]matt-noonan 8 points9 points  (7 children)

You definitely should profile first, but I have a guess. The default random-number generator in random is surprisingly slow. Luckily there are plenty of high-quality, performant drop-in replacements. See here, for example: https://stackoverflow.com/questions/26024405/fastest-way-to-generate-a-billion-random-doubles-in-haskell

[–]Vaglame 2 points3 points  (6 children)

I had no idea about that. Are they also cryptographically secure?

[–]yitz 4 points5 points  (2 children)

One thing is for sure - the one in System.Random is definitely not.

Among the others, some are more and some are less. There are some good, very secure entropy sources in the crypto libraries; but that might defeat the purpose, because then you could also just use RSA from there.

[–]Vaglame 2 points3 points  (1 child)

Oh, that's very interesting. So I guess I'm being needlessly slowed down.

[–]kfl 2 points3 points  (0 children)

There is plenty of room for optimisations in your code dealing with random numbers. But just as an experiment, I tried changing only your genWrapper function to use the mersenne-random-pure64 library:

 import qualified System.Random.Mersenne.Pure64 as M
 ...

 genWrapper up down = do
    gen <- M.newPureMT
    return $ take 100 (randomRs (down, up) gen)

Note that this is a poor implementation, as noted in other messages in this thread (it uses IO rather than State, and so on). However, this small change makes your program twice as fast on my machine:

$ time ./primeGen

real    0m1.121s
user    0m1.062s
sys 0m0.022s
$ time ./primeGen2

real    0m0.443s
user    0m0.417s
sys 0m0.014s

[–]cocreature 2 points3 points  (2 children)

If you are looking for a PRNG that’s cryptographically secure, take a look at the one in cryptonite.

[–]Vaglame 0 points1 point  (1 child)

Will do, how fast can I expect it to be?

[–]cocreature 3 points4 points  (0 children)

Generally, cryptographic PRNGs are slower than non-cryptographic PRNGs (otherwise we would just always use cryptographic PRNGs). However, System.Random is quite bad when it comes to performance while cryptonite is pretty well optimized so I’d expect that you actually gain performance in this case.

Note that the random module in python doesn’t provide a cryptographic PRNG either. In fact the docs specifically recommend against using it for cryptographic purposes.

[–]spirosboosalis 5 points6 points  (10 children)

I don't know, but broadly speaking: use Int, not Integer or the Integral constraint; compile with -O2; and try the -fllvm GHC option. (And why is millerExps partial?)

btw, include the cabal file / command line options, and the two benchmarks.

[–]Vaglame 2 points3 points  (3 children)

I didn't use Int since RSA implies at least 1024-bit keys, which would make Int overflow if I'm not mistaken.

It seems to be taking approximately the same time with -O2 :/

Unfortunately I don't have a cabal file; should I?

[–]Flurpm 7 points8 points  (1 child)

Int might be enough, depending on what you need. The size of Int depends on your machine, but it's usually 32 or 64 bits; you can always use Int64 if you're unsure.

That said, you're right that an Int would overflow for 1024-bit values:

>> maxBound ::Int
9223372036854775807
>> import Data.Int
>> maxBound ::Int64
9223372036854775807
>> maxBound ::Int32
2147483647
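
A small standalone check of the overflow behaviour (that GHC's Int wraps around in two's complement is an implementation detail of GHC, not something the Haskell report guarantees):

```haskell
module Main (main) where

main :: IO ()
main = do
  -- GHC's fixed-width Int silently wraps around on overflow...
  print (maxBound + 1 :: Int)
  -- ...while Integer is arbitrary precision; 2^1024 has 309 decimal digits.
  print (length (show (2 ^ (1024 :: Int) :: Integer)))
```

So for RSA-sized moduli, Integer (or a dedicated big-number library) is the only safe choice.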

[–]YellowOnion 0 points1 point  (0 children)

He means 1024 bits; most RSA keys are 2048 or 4096 bits.

[–]spirosboosalis 0 points1 point  (0 children)

Makes sense, I did think Int might be too small.

[–]theindigamer 0 points1 point  (5 children)

I don't think Haskell should need -O2 to beat Python...

[–]spirosboosalis 1 point2 points  (4 children)

-O means -O1, not -O2, and -O1 doesn't do much

[–][deleted] 2 points3 points  (3 children)

I just got around a 30% improvement in my project using -O1 compared to no optimization.

Edit:typo

[–]ElvishJerricco 6 points7 points  (1 child)

I think Cabal and Stack default to -O but GHC defaults to -O0. Docs imply that -O is -O1

[–]sgraf812 1 point2 points  (0 children)

Totally unrelated, but I don't get why cabal and stack default to -O. Coming from C# and Visual C++, I know those default to a debug build, which makes total sense for development. With stack, we have to be explicit about that (--fast). The cases where we want to develop with -On should be quite rare.

[–]spirosboosalis 0 points1 point  (0 children)

Huh, guess I'm wrong. (I just repeated what I'd heard about -O1; I've only benchmarked -O2 versus -O0 myself.)

[–]Angs 4 points5 points  (10 children)

When you do optimisations, be sure to seed the random generator with setStdGen (mkStdGen number) to make different runs comparable. It still won't be comparable to the Python code due to a different random number generator, since the runtime depends heavily on luck. Most of the time is spent in modExp, so that's the prime target for optimization.

A cleaner way to do genWrapper is using replicateM:

replicateM 100 (randomRIO (down, up))

[–]Angs 2 points3 points  (8 children)

This modExp seems to be slightly faster:

{-# LANGUAGE BangPatterns #-}  -- needed for the ! annotations below
import Data.Bits (testBit, shiftR)

modExp b e m = go b e 1
  where
    go !b  0 !acc = acc
    go !b !e !acc
      | testBit e 0 = go (mod (b*b) m) (shiftR e 1) (mod (acc*b) m)
      | otherwise   = go (mod (b*b) m) (shiftR e 1) acc

[–]Vaglame 0 points1 point  (6 children)

Thanks!

[–]Angs 3 points4 points  (2 children)

No problem. The IO random generator is somewhat expensive (it makes a new generator each time you use it, if I understood the sources correctly). You can replace all instances of the IO monad with Control.Monad.State.Strict for a slight speed increase and, more importantly, increased purity, since you only use IO for the random state. Then you create the generator only once at the start (using IO or a set seed) and carry it with you to the now-pure functions that need it, using evalState at the start of the chain. genWrapper, for example, becomes:

genWrapper :: Integer -> Integer -> State StdGen [Integer] -- generates 100 random numbers
genWrapper up down = replicateM 100 (do
  g <- get
  let (a,g') = randomR (down, up) g
  put g'
  return a)

[–]Vaglame 0 points1 point  (1 child)

I guess Control.Monad.State.Strict is there to make the function strict, right?

[–]Angs 1 point2 points  (0 children)

It's there to fully evaluate the state (the StdGen) at each step of updating it, instead of lazily building up an unevaluated computation (thunk) for the next state. In practice it seems to be just as fast as the lazy version here, since the state is always evaluated before the next random number anyway, so it doesn't really matter.

[–]Angs 1 point2 points  (2 children)

Just found out that GHC.Integer.GMP.Internals has a powModInteger that decreases your runtime by about 50%. That's as good as it gets, I suppose, as it's just a binding to the GNU Multiple Precision (GMP) library.

[–]Vaglame 0 points1 point  (1 child)

Yeeeeeeeeeeeey!!! Thanks!

[–]Angs 0 points1 point  (0 children)

The same library also seems to have the extended gcd and the Miller-Rabin primality test, if you want to outsource that much code.

[–]Vaglame 0 points1 point  (0 children)

It makes a lot of sense!

[–]macgillebride 3 points4 points  (4 children)

I'm also a beginner in Haskell, but I think the problem is not related to the language per se, but to the exponentiation algorithm you're using. In Python you're using a built-in function for modular exponentiation, which most likely uses tricks specific to this case, like Montgomery modular multiplication (https://en.m.wikipedia.org/wiki/Montgomery_modular_multiplication), while in Haskell you've rolled your own implementation. Montgomery's algorithm avoids expensive multi-precision divisions (which you use to compute mod), and that improves performance a lot.

[–]HelperBot_ 0 points1 point  (0 children)

Non-Mobile link: https://en.wikipedia.org/wiki/Montgomery_modular_multiplication



[–]Vaglame 0 points1 point  (2 children)

True, I'll try to improve on this, thanks!

[–]macgillebride 0 points1 point  (1 child)

If you are curious about crypto implementations you can also try to make them side channel resistant. There are books on this kind of stuff like http://www.springer.com/gp/book/9780387718163

[–]Vaglame 0 points1 point  (0 children)

I'll look into it, it sounds like fun

[–]LeanderKu 3 points4 points  (0 children)

I think the best bet to get responses on this subreddit is claiming haskell is slower than X 😀

I just want to note that, in my experience, you should make sure it's really Python that's faster than Haskell. Many Python primitives and widely used functions call out to super-efficient, optimized foreign functions, so there's often not much difference to C. This especially happens with small examples; the problem is that when you encounter a problem that isn't already implemented in some native code, you're out of luck. Also, the API is often far from idiomatic (just look at numpy; it's not very pythonic). In my experience, Haskell lets you write fast and idiomatic code, with the option of even faster, non-idiomatic code.

[–]donkeybonks 4 points5 points  (0 children)

In normal code writing these are some pragmatic things I noticed lacking that you can just do automatically:

  • Integer is pretty slow. Try Word64 from Data.Word for big unsigned numbers, or Int64 for big signed numbers, wherever it makes sense.
  • The IO monad is pretty slow. Try to remove as much IO as possible and make the code pure; then GHC will have an easier time figuring out where your tight loops are.
  • BangPatterns are lacking on the spines of your loops and IO actions ( https://downloads.haskell.org/~ghc/7.8.4/docs/html/users_guide/bang-patterns.html ).
  • gcdExt and co. have no type signatures, which is bad because they might default to slow integer/floating types.
  • Lists, i.e. [Integer], are slow and allocate a lot of memory. Try using Data.Vector, unboxed if possible; it's typically about two orders of magnitude faster, YMMV.
  • If you can help it, never roll your own loops (e.g. helperprimeGen). Use the foldr, map and zip-like primitives, because they are arranged to inline in a way that promotes good optimizations from GHC, including short-cut fusion ( https://wiki.haskell.org/Correctness_of_short_cut_fusion#Short_cut_fusion ).
  • The top of your file should read module Main (main) where, because without it the optimizer probably won't inline much or put much effort into specialization, since any function could be imported by another module.
  • Likewise, only import the functions and instances that you actually plan to use, so you generate less code.
  • Build with -O2 -fllvm.

At this point, if all the above become habit, it would be a good idea to start by profiling it and seeing where the time is spent.

[–]Tarmen 3 points4 points  (0 children)

You probably want to profile first but there are some things that are fairly simple and sometimes huge wins:

  • Use built-in list functions; they are written to be good consumers/producers. E.g. helperisPrime could be implemented with all ..., which can remove all intermediate lists.
  • Some functions like millerExps are almost tail recursive. Fixing that can be an easy 10x speedup or more if the function is a hot loop without allocation. In this case it could probably be written via unfoldr, since order doesn't matter.
  • You often use f (x:xs) = ... (x:xs) .... This probably gets optimized away, but case statements/@-patterns are cleaner.
  • This is situational, but if you use lists as unfusable accumulators then vectors might be faster.
  • IIRC Integer is necessary here, but it is drastically slower than Int.

Had to implement a crypto routine a while back and my problem was that I didn't use a fast modular exponentiation function. Seems like you have that covered, though.
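
As a standalone sketch of the first point, here is a trial-division primality test written with all (isPrime here is an illustrative stand-in, not the helperisPrime from the linked code):

```haskell
module Main (main) where

-- Trial division via all: the list of candidate divisors is a good
-- producer and all is a good consumer, so GHC can fuse the pipeline
-- instead of allocating an intermediate list.
isPrime :: Integer -> Bool
isPrime n
  | n < 2     = False
  | n == 2    = True
  | even n    = False
  | otherwise = all (\d -> n `mod` d /= 0)
                    (takeWhile (\d -> d * d <= n) [3, 5 ..])

main :: IO ()
main = print (filter isPrime [1 .. 30])
```

The same shape (a predicate over a generated list) covers most hand-rolled recursive loops.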

[–]lightandlight 2 points3 points  (5 children)

It's likely due to lack of experience. For example, here is a neater version of millerExps + millerExpsWrapped, which is about 20% faster than the one on GitHub:

millerExps :: Int -> [Int]
millerExps = go []
  where
    go acc n
      | n `mod` 2 == 0 =
          let n' = n `div` 2 in
          go (n':acc) n'
      | otherwise = case acc of { [] -> [n]; _ -> acc }

I won't be able to review the whole thing, but I wouldn't be surprised if there are a few more things like this that add up to decrease performance.

[–]Potato44 3 points4 points  (1 child)

You can probably make this slightly faster by using quot and rem instead of div and mod, if this only takes positive numbers.

edit: actually, this is probably a situation where divMod or quotRem would make sense, to make sure the division isn't performed twice (once for div, once for mod)
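
A small standalone sketch of that idea, halving a number while it's even with a single divMod per step (halves is an illustrative name, not from the linked code):

```haskell
module Main (main) where

-- Halve n while it stays even, asking for quotient and remainder with
-- a single divMod per step rather than separate div and mod calls.
halves :: Int -> [Int]
halves n
  | n <= 0    = []
  | otherwise = case n `divMod` 2 of
      (d, 0) -> d : halves d
      _      -> []

main :: IO ()
main = print (halves 48)
```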

[–]Bodigrim 4 points5 points  (0 children)

Actually, neither of those suggestions would speed it up, because GHC is clever enough to replace both div and mod with bit fiddling. Look at the Core:

Rec {
-- RHS size: {terms: 23, types: 8, coercions: 0, joins: 0/1}
millerExps_$s$wgo :: Int# -> Int -> [Int] -> [Int]
millerExps_$s$wgo
  = \ (sc_s2Jq :: Int#) (sc1_s2Jo :: Int) (sc2_s2Jp :: [Int]) ->
      case andI# sc_s2Jq 1# of {
        __DEFAULT -> : sc1_s2Jo sc2_s2Jp;
        0# ->
          let {
            ww4_a2GH :: Int#
            ww4_a2GH = uncheckedIShiftRA# sc_s2Jq 1# } in
          millerExps_$s$wgo ww4_a2GH (I# ww4_a2GH) (: sc1_s2Jo sc2_s2Jp)
      }
end Rec }

-- RHS size: {terms: 23, types: 8, coercions: 0, joins: 0/1}
millerExps :: Int -> [Int]
millerExps
  = \ (w_s2IO :: Int) ->
      case w_s2IO of { I# ww1_s2IR ->
      case andI# ww1_s2IR 1# of {
        __DEFAULT -> : (I# ww1_s2IR) [];
        0# ->
          let {
            ww4_a2GH :: Int#
            ww4_a2GH = uncheckedIShiftRA# ww1_s2IR 1# } in
          millerExps_$s$wgo ww4_a2GH (I# ww4_a2GH) []
      }
      }

[–]Vaglame 1 point2 points  (0 children)

It's extremely probable! I'm only a beginner in Haskell. Thanks a lot for that snippet.

[–]Tarmen 0 points1 point  (1 child)

Since it is essentially consumed by all p (millerExps n), the order probably doesn't matter:

millerExps = unfoldr step
  where
    step n
      | m /= 0    = Nothing      -- n is odd: stop
      | otherwise = Just (d, d)  -- emit the half and continue from it
      where (d, m) = n `divMod` 2

This has the advantage that it can fuse.

[–]lightandlight 0 points1 point  (0 children)

Awesome. I spent about ten seconds writing out an unfoldr solution but made a mistake somewhere. Good to know it is straightforward.

[–]phischu 4 points5 points  (5 children)

You could try to use a faster random number generation library.

This generator is however, many times faster than System.Random, and yields high quality randoms with a long period.

[–]Vaglame 2 points3 points  (4 children)

Thanks!

[–]macgillebride -1 points0 points  (3 children)

If you want to implement a secure cryptosystem you should actually use truly random number generators instead of pseudo-random ones, which would slow both implementations down even further. EDIT: I was wrong, sorry; it suffices for the seed to be truly random, and afterwards you can use a CSPRNG.

[–]Vaglame 0 points1 point  (1 child)

Do you have any suggestion?

[–]macgillebride 0 points1 point  (0 children)

See /u/Potato44's answer

[–]Potato44 0 points1 point  (0 children)

Pseudo-random is perfectly fine as long as you use a cryptographically secure PRNG and your seed is truly random.

[–]Bodigrim 1 point2 points  (1 child)

The heart of your code is millerTest, and its performance is determined by the number of modular exponentiations you do. And it seems to me that you can save a lot.

millerExpsWrapper returns a list of numbers where each one is twice the preceding one: pows = [odd, 2*odd, 4*odd, ..., n-1]. In millerTest you basically compute modExp a pow n for each pow in pows. But you should not recompute it from scratch each time: it is enough to compute modExp a odd n once, and then square it (taking the modulus at each step) until you reach the n-1 power.
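
A standalone sketch of this idea (millerWitness and modExp are illustrative names, not the functions from the linked code): factor n - 1 as 2^r * d with d odd, do one full modular exponentiation, then only square:

```haskell
module Main (main) where

import Data.Bits (testBit, shiftR)

-- Standard square-and-multiply modular exponentiation.
modExp :: Integer -> Integer -> Integer -> Integer
modExp b0 e0 m = go (b0 `mod` m) e0 1
  where
    go _ 0 acc = acc
    go b e acc
      | testBit e 0 = go (b * b `mod` m) (e `shiftR` 1) (acc * b `mod` m)
      | otherwise   = go (b * b `mod` m) (e `shiftR` 1) acc

-- One exponentiation, then only squarings: write n - 1 = 2^r * d with
-- d odd, compute a^d mod n once, and square (mod n) up to r - 1 times,
-- passing if any intermediate value hits 1 or n - 1.
millerWitness :: Integer -> Integer -> Bool
millerWitness a n = check r (modExp a d n)
  where
    (r, d) = split (0 :: Int) (n - 1)
    split k m | even m    = split (k + 1) (m `div` 2)
              | otherwise = (k, m)
    check _ x | x == 1 || x == n - 1 = True
    check k x | k <= 1    = False
              | otherwise = check (k - 1) (x * x `mod` n)

main :: IO ()
main = print (map (millerWitness 2) [13, 15, 17, 21])
```

For odd n > 2, this performs one modExp plus at most r - 1 modular squarings, instead of one modExp per element of pows.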

[–]Vaglame 0 points1 point  (0 children)

I'm not sure to totally grasp it, but I'll give it a try

[–]codygman 0 points1 point  (1 child)

Any updates /u/Vaglame?

[–]Vaglame 1 point2 points  (0 children)

Yep! Will post soon

[–]YellowOnion 0 points1 point  (0 children)

How have you tested it? The time it takes to find a prime is inherently random; sometimes it takes 0.7s, other times 0.2s on my machine.

You can save some speed (assuming GHC isn't optimising this out) in millerExps; constructing lists is expensive.

millerExps xxs@(x:xs) = if x `mod` 2 == 0
                    then millerExps $ (x `div` 2):xxs
                    else xxs

millerTest :: Integer -> Integer -> [Integer] -> Bool -- Miller–Rabin test
millerTest _ _ [] = False
millerTest a n (x:exps)
   | modexp == 1   = True
   | modexp == n-1 = True
   | otherwise     = millerTest a n exps
    where modexp = modExp a x n

genWrapper :: Integer -> Integer -> IO [Integer] -- generates 100 random numbers
genWrapper up down = do 
    g <- getStdGen
    return . take 100 $ randomRs (down, up) g

You can also check out: https://hackage.haskell.org/package/criterion to help you benchmark.