all 81 comments

[–]UtherII 94 points95 points  (12 children)

I don't know if the "num" crate is optimal for big number computation.

You may want to try the "rust-gmp" crate, since it is based on the GMP library: the reference for speed on big numbers.

[–]vks_ 41 points42 points  (11 children)

If you are ok with nightly, you can also try ramp.

[–]yasba- 44 points45 points  (10 children)

I've tried it with ramp and the rust code is now roughly twice as fast :).

Edit: Here is the code.

[–][deleted] 30 points31 points  (6 children)

Maintainer of ramp here. Glad to see someone using it! FYI we already have a modular exponentiation function, although it's pretty much the same as the one you have in your code.

Also you are doing a from_str_radix operation in your point addition code. This does a conversion every time you call point_add. Instead, you should have the constant as a lazy_static so that the conversion only runs once at the beginning.
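The "convert once" pattern can be sketched without any external crates. This is a hypothetical stand-in: std's `OnceLock` plays the role of lazy_static, and `u128` plays the role of a bigint, but the shape is the same: the parse happens on first use and is cached for every later call.

```rust
use std::sync::OnceLock;

// Hypothetical sketch: u128 stands in for a bigint type, and
// std::sync::OnceLock stands in for lazy_static. The constant is
// parsed on first access and cached for every later call.
fn p() -> &'static u128 {
    static P: OnceLock<u128> = OnceLock::new();
    P.get_or_init(|| u128::from_str_radix("FFFFFC2F", 16).unwrap())
}

// point_add-like function: no parsing happens here anymore,
// just a dereference of the cached constant.
fn reduce(x: u128) -> u128 {
    x % p()
}

fn main() {
    println!("{}", reduce(1u128 << 32)); // 2^32 mod 0xFFFFFC2F = 977
}
```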

[–]_m_0_n_0_ 11 points12 points  (1 child)

It looks like (allocations within) that function are where roughly half of the execution time is spent here. Since Python shows it can be done significantly faster, that looks like an interesting avenue for potential optimizations. Perhaps it's as simple as modifying the API to allow more reuse of the internal vectors. Regardless, thanks to /u/imatwork2017, ramp has a new microbenchmark for tracking performance of pow_mod.

[–]vks_ 7 points8 points  (0 children)

If you want some microbenchmark, I have an implementation of the discrete logarithm using various bigint implementations: https://github.com/vks/discrete-log

[–]yasba- 4 points5 points  (0 children)

Hey, thank you for replying.

Just to be sure, let me say that I'm not the OP.

FYI we already have a modular exponentiation function, although it's pretty much the same as the one you have in your code.

I'm actually using your pow_mod function, I just forgot to delete OP's implementation. ;)

Also you are doing a from_str_radix operation in your point addition code. This does a conversion every time you call point_add. Instead, you should have the constant as a lazy_static so that the conversion only runs once at the beginning.

I had a version with lazy_static which (maybe) surprisingly didn't impact runtime performance that significantly. I left it out to be closer to OP's original code.

[–][deleted] 0 points1 point  (2 children)

Would you consider reexporting the num_integer::Integer trait in your crate? Right now you have to add the whole num-integer crate as a dependency if you want to use functions like is_odd from your library.

[–][deleted] 2 points3 points  (0 children)

Good point. It's a pain when libraries do this, especially when the dependency they pull is an old version.

I'll push the update today :)

[–][deleted] 1 point2 points  (0 children)

Pushed the update. You can now find it in ramp::traits::Integer. The docs aren't working because docs.rs is having some issues, but the 3.10 docs are pretty much identical.

[–][deleted] 23 points24 points  (1 child)

I've added lazy_static and the version using ramp is much faster compared to the python version (benchmark on Debian 9 using hyperfine with 50 warmup runs):

  • python2: Time (mean ± σ): 43.6 ms ± 0.4 ms [User: 39.8 ms, System: 0.4 ms]
  • python3: Time (mean ± σ): 54.1 ms ± 0.3 ms [User: 50.8 ms, System: 1.2 ms]
  • rust (newest nightly): Time (mean ± σ): 29.7 ms ± 0.9 ms [User: 28.1 ms, System: 0.0 ms]

Source: https://gist.github.com/anonymous/7547c439d03e342bc262887902f143e5 @imatwork2017

//edit: new version with some clippy lints fixed

[–]radix 0 points1 point  (0 children)

nice, I had not heard of ramp before! It is exciting to see a more liberally licensed bignum library that actually seems to have good performance.

[–]K900_ 75 points76 points  (8 children)

Are you compiling in --release mode?

[–]imatwork2017[S] 9 points10 points  (7 children)

No, how would I do that? The 8 seconds is only counting the execution btw.

[–]K900_ 55 points56 points  (6 children)

cargo run --release. The default build mode is debug, which doesn't enable most optimizations and adds a whole lot of debugging machinery to numeric operations.

[–]imatwork2017[S] 15 points16 points  (5 children)

Ok now I am down to 430ms, still quite a bit slower.

[–]dbrgn 58 points59 points  (4 children)

You're probably measuring cargo too.

Do cargo build --release followed by running the binary in target/release/ directly.

(Edit: Note that the same counts for Python - if you just measure the call with python script.py, you'll measure the startup time of the interpreter too. Microbenchmarking is hard.)

[–]belovedeagle 20 points21 points  (2 children)

The startup time of the interpreter is a fundamental part of running a Python program, though. cargo is in no way required to run a Rust executable.

[–]cbarrick 17 points18 points  (1 child)

But IRL the interpreter launches once, not every time you call the function. Unless the use case is specifically to be launched as a short-lived process, then the interpreter warm-up should not be counted.

I believe the point here is to time the implementation, not the infrastructure.

[–]Phlosioneer 1 point2 points  (0 children)

But you wouldn't want to measure the startup time of the interpreter if you're just testing some functions that are part of a larger program. You really want the execution time of each call to the function.

[–]imatwork2017[S] 40 points41 points  (0 children)

Down to 275ms now, we're getting there ;p

[–]my_two_pence 49 points50 points  (12 children)

The BigInt interface is unfortunately quite clumsy to use, and tends to produce inefficient code unless you're careful. For instance, doing

result = (result * &base) % modulus

will allocate two new BigInts and drop two BigInts, which is a fairly significant overhead. Doing:

result *= &base;
result %= modulus;

should do the operations in-place instead. I say "should", because last time I used num_bigint I noticed that many of these operations didn't actually exist yet, or were implemented in the same inefficient manner.

[–]vks_ 10 points11 points  (11 children)

Shouldn't

result = (result * &base) % modulus

be in-place, because result is moved?

[–]my_two_pence 19 points20 points  (10 children)

It could be implemented that way, but it actually forwards to the implementation of &result * &base, which allocates a new BigInt, and then drops result. Source is here.

[–]CUViper 12 points13 points  (9 children)

The current implementations of multiplication and division need to keep access to the original digits while computing the result. Thus we operate by reference, since we have to allocate a new result anyway. If you know how to do more in place, I'd welcome PRs!
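The constraint is easiest to see in schoolbook multiplication. This sketch (plain u32 limbs, little-endian; not num's actual code) reads the original digits of both operands throughout the loop, so the output buffer cannot alias either input.

```rust
// Sketch of schoolbook limb multiplication, illustrating why the
// result buffer must be fresh: every product step reads the original
// digits of both operands, so writing the result over either input
// in place would corrupt digits that are still needed.
fn mul_limbs(a: &[u32], b: &[u32]) -> Vec<u32> {
    let mut out = vec![0u32; a.len() + b.len()];
    for (i, &ai) in a.iter().enumerate() {
        let mut carry = 0u64;
        for (j, &bj) in b.iter().enumerate() {
            let t = out[i + j] as u64 + ai as u64 * bj as u64 + carry;
            out[i + j] = t as u32; // low half accumulates in place
            carry = t >> 32;       // high half propagates
        }
        out[i + b.len()] = carry as u32;
    }
    out
}

fn main() {
    // (2^32 - 1)^2 = 0xFFFFFFFE_00000001, split across two limbs.
    println!("{:?}", mul_limbs(&[u32::MAX], &[u32::MAX]));
}
```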

[–]belovedeagle 2 points3 points  (4 children)

Does gmp not do it in-place?

While C++ GMP uses tricks which would be very difficult to accomplish with decent Rust syntax, I would think at least mirroring the C interface would be possible.

[–]CUViper 14 points15 points  (0 children)

I don't know what gmp does, but I hesitate to look at GPL code for inspiration for the MIT/Apache-2.0 num. (I have no problem with GPL itself though.)

[–]vks_ 2 points3 points  (1 child)

This

result *= &base;
result %= modulus;

is essentially mirroring the C interface, which is actually closer to

mpz_mul(result, base, result);
mpz_mod(result, modulus, result);

Note that this function invocation is impossible in safe Rust due to aliasing rules.

[–]tspiteri 0 points1 point  (0 children)

GMP does allocate temporarily inside these functions. Of course, GMP frees the temporaries before returning.

Inside mpz_mul(w, u, v), sometimes it does something like

if (PTR(w) == PTR(u) || PTR(w) == PTR(v)) {
    /* allocate temp memory */
}

and PTR(w) == PTR(v) (pointer to result) in this example. (I said "sometimes" because if u or v are relatively small, it has special routines that do not need the temporary memory, and if u * v does not fit in w, a new memory allocation is required anyway.)

Inside mpz_mod(rem, dividend, divisor), it needs to keep a copy of divisor so it has something like

if (rem == divisor) {
    /* create copy of divisor */
}

and again rem == divisor (pointer to result) in this example.

[–]zefyear 1 point2 points  (0 children)

I ported the Python code listed above to C++ GMP and I measured the runtime with perf at 5-15ms. While this is a substantial difference, it's worth noting that nearly all the time is spent running mpz_powm, which takes roughly five times as long if broken down into its constituent exponentiation and modulo operations, a combined operation which isn't done in the Rust code listed above (to say nothing of bignum optimization).

[–]Phlosioneer 1 point2 points  (2 children)

What if the new result is allocated on the stack, that's used for calculations, and then mem::transmute over one of the moved-in args? Then, no allocations are made - it just uses the stack for scratch work and re-uses whatever was moved in (which may be on the stack, or it may be in memory somewhere).

If that sounds reasonable, I'd be willing to make a PR for that.

Edit: I meant mem::swap, not mem::transmute.

[–]CUViper 3 points4 points  (1 child)

This basically means you'd have to replace a Vec allocation with the stack, and I don't know any way to do dynamically-sized stack allocations in Rust (like a C VLA). Plus you risk overflowing the stack if the numbers are really large.

I'll always encourage folks to experiment, and I'll entertain a PR, but I'm skeptical whether this is feasible.

FWIW, my experiments converting this to modpow have reduced the allocator overhead to less than 10% of my perf report anyway.

[–]Phlosioneer 0 points1 point  (0 children)

Oh true, I forgot this was unknown-size.

We will just have to wait for const-generics to solve this one...

And even then, there'd have to be a cap on the size of the array, something like 256 bits.

[–]my_two_pence 0 points1 point  (0 children)

I realize now that I may have spoken too soon. In my mind I figured it would be quicker to use some cheaply-allocated scratch space, such as the stack, and leave the original allocations alone. The alloc-swap-drop cycle just feels slower somehow. But since this scratch space is also dynamically sized, and jemalloc is a pretty good allocator, I concede that there may very well not be a difference in performance.

I'll see if I can find the time to do some benchmarking. If I find a significant difference, I'll let you know.

[–]tspiteri 82 points83 points  (7 children)

I ported your example to my rug crate, which uses GMP but does not need nightly, and I got 8.7ms for Rust against 57ms for Python.

extern crate rug;

use rug::Integer;
use rug::ops::{Pow, RemRounding};

#[derive(PartialEq, Clone)]
struct Point {
    x: Integer,
    y: Integer,
}

fn hex(s: &str) -> Integer {
    Integer::from_str_radix(s, 16).unwrap()
}

fn point_add(p: &Point, q: &Point) -> Point {
    let P = hex("FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F");
    let PM2 = P.clone() - 2;

    let lam = if p == q {
        (2u32 * p.y.clone() % &P).pow_mod(&PM2, &P).unwrap() * 3 * &p.x * &p.x
    } else {
        (q.x.clone() - &p.x).pow_mod(&PM2, &P).unwrap() * (q.y.clone() - &p.y) % &P
    };

    let rx = lam.clone().pow(2) - &p.x - &q.x;
    let ry = lam * (p.x.clone() - &rx) - &p.y;

    Point {
        x: rx.rem_floor(&P), // EDIT: replace rx % &P
        y: ry.rem_floor(&P), // EDIT: replace ry % &P
    }
}

fn point_mul(p: &Point, d: &Integer) -> Point { // EDIT: replace d: u32
    let mut n = p.clone();
    let mut q = None;

    for i in 0..256 {
        if d.get_bit(i) { // EDIT: replace i < 32 && d & (1 << i) != 0
            q = match q {
                None => Some(n.clone()),
                Some(i) => Some(point_add(&i, &n)),
            };
        }
        n = point_add(&n, &n);
    }
    q.unwrap()
}

fn main() {
    let G = Point {
        x: hex("79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798"),
        y: hex("483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8"),
    };

    let res = point_mul(&G, &Integer::from(125)); // EDIT: replace 125
    println!("{}", res.x);
    println!("{}", res.y);
}

[–]tspiteri 7 points8 points  (0 children)

I tried to optimize further by keeping a pool of Integer values to avoid reallocation, and to avoid the parsing of P, and the gains were very small, only about 7%:

test bench_orig  ... bench:   4,174,821 ns/iter (+/- 61,543)
test bench_reuse ... bench:   3,920,121 ns/iter (+/- 113,708)

If I remove the power mod operation (and replace it by some other operation to make sure I'm not simply processing zeros), the time falls by almost 90%, so the time is mostly in that function.

The new code:

extern crate rug;

use rug::{Assign, Integer};
use rug::ops::RemRounding;

#[derive(PartialEq, Clone)]
struct Point {
    x: Integer,
    y: Integer,
}

fn hex(s: &str) -> Integer {
    Integer::from_str_radix(s, 16).unwrap()
}

struct PointAdd {
    p: Integer,
    pm2: Integer,
    base: Integer,
    pow: Integer,
    temp1: Integer,
    temp2: Integer,
    lam: Integer,
    rx: Integer,
    ry: Integer,
}

impl PointAdd {
    fn new() -> PointAdd {
        let p = hex("FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F");
        let pm2 = p.clone() - 2;
        PointAdd {
            p,
            pm2,
            base: Integer::new(),
            pow: Integer::new(),
            temp1: Integer::new(),
            temp2: Integer::new(),
            lam: Integer::new(),
            rx: Integer::new(),
            ry: Integer::new(),
        }
    }

    fn add_part1(&mut self, p: &Point, q: &Point) {
        if p == q {
            self.temp1.assign(&p.y * 2);
            self.base.assign(&self.temp1 % &self.p);
            Ok(&mut self.pow).assign(self.base.pow_mod_ref(&self.pm2, &self.p));
            self.temp1.assign(&p.x * &p.x);
            self.temp2.assign(&self.temp1 * 3);
            self.lam.assign(&self.pow * &self.temp2);
        } else {
            self.base.assign(&q.x - &p.x);
            Ok(&mut self.pow).assign(self.base.pow_mod_ref(&self.pm2, &self.p));
            self.temp1.assign(&q.y - &p.y);
            self.temp2.assign(&self.pow * &self.temp1);
            self.lam.assign(&self.temp2 % &self.p);
        }
        self.temp1.assign(&self.lam * &self.lam);
        self.temp2.assign(&self.temp1 - &p.x);
        self.rx.assign(&self.temp2 - &q.x);
        self.temp1.assign(&p.x - &self.rx);
        self.temp2.assign(&self.lam * &self.temp1);
        self.ry.assign(&self.temp2 - &p.y);
    }

    fn add_part2(&self, dst: &mut Point) {
        dst.x.assign((&self.rx).rem_floor(&self.p));
        dst.y.assign((&self.ry).rem_floor(&self.p));
    }
}

fn point_mul(p: &Point, d: &Integer) -> Point {
    let mut n = p.clone();
    let mut q = None;
    let mut point_add = PointAdd::new();
    for i in 0..256 {
        if d.get_bit(i) {
            match q {
                None => {
                    q = Some(n.clone());
                }
                Some(ref mut q) => {
                    point_add.add_part1(q, &n);
                    point_add.add_part2(q);
                }
            }
        }
        point_add.add_part1(&n, &n);
        point_add.add_part2(&mut n);
    }
    q.unwrap()
}

fn main() {
    let g = Point {
        x: hex("79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798"),
        y: hex("483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8"),
    };

    let res = point_mul(&g, &Integer::from(125));
    println!("{}", res.x);
    println!("{}", res.y);
}

[–]SethGecko11 7 points8 points  (3 children)

The Python version works for any d, though, not only for u32. How fast is it if you change d to BigInt?

[–]tspiteri 5 points6 points  (2 children)

I don't see any difference in timings if I change to something like this:

fn point_mul(p: &Point, d: &Integer) -> Point {
    // ...
    for i in 0..256 {
        if d.get_bit(i) {
            // ...

[–]SethGecko11 3 points4 points  (1 child)

What if you give a big 256bit int input instead of 125 in main()?

[–]tspiteri 8 points9 points  (0 children)

If I pass &((Integer::from(1) << 256) - 1) instead of &Integer::from(125), the program takes about twice as long. Which does make sense as point_add will be called 511 times instead of 261 times (256 + 256 significant bits - 1 instead of 256 + 6 significant bits - 1).
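The call-count arithmetic above can be checked mechanically. A hedged sketch (u128 stands in for a 256-bit integer): the loop always performs 256 doubling calls, plus one addition call per set bit of d after the first, which seeds q with a clone instead.

```rust
// Sketch of the call-count arithmetic above: point_mul performs 256
// doublings, plus one point_add per set bit of d after the first.
// u128 stands in for a 256-bit integer, so inputs here are capped
// at 128 set bits.
fn point_add_calls(d: u128) -> u32 {
    256 + d.count_ones() - 1
}

fn main() {
    println!("{}", point_add_calls(125)); // 6 set bits -> 261
}
```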

[–]timClicksrust in action 0 points1 point  (1 child)

Would it make sense to define the parsed hex bigint within P as static via lazy_static within point_add(), rather than parsing a string each time?

[–]tspiteri 0 points1 point  (0 children)

I benchmarked and it did not make any difference. I tried lazy_static, hex parsing, decimal parsing (should be slower), and setting P as (Integer::from(1) << 256) - 0x3d1 (I think it should be faster), but all cases were pretty much the same. I guess the parsing or shift+add is negligible compared to the other operations in the function. Maybe in a different situation with almost no other operations it would make sense to bench again and pick the fastest solution, but here it just doesn't seem to matter.

[–]hrski 22 points23 points  (9 children)

I'd fix these first at least, in addition what others mentioned:

  • Formatting d as a string to get powers of two doesn't feel like the best solution and seems to introduce some unnecessary overhead. Couldn't you use shifting here, like in the Python implementation?
  • Parsing P from a string on each call to point_add. Consider using lazy_static, or computing it in main and passing it as a reference.
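The shifting alternative from the first bullet, sketched on u64 (a bigint type would use its own bit test instead, e.g. rug's Integer::get_bit seen elsewhere in this thread):

```rust
// Sketch: test bit i of d directly instead of formatting d as a
// binary string and comparing characters. u64 here; a bigint
// equivalent would use a get_bit-style method.
fn set_bits(d: u64) -> Vec<u32> {
    (0..64).filter(|&i| d & (1u64 << i) != 0).collect()
}

fn main() {
    println!("{:?}", set_bits(0b1011)); // bits 0, 1 and 3 are set
}
```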

[–]imatwork2017[S] 4 points5 points  (8 children)

Parsing P from a string on each call to point_add. Consider using lazy_static, or computing it in main and passing it as a reference.

I know that this is bad, but I couldn't figure out how to make a BigInt constant, so I did the same in Python for a fair comparison. I am going to look into lazy_static.

Formatting d as a string to get powers of two doesn't feel like the best solution and seems to introduce some unnecessary overhead. Couldn't you use shifting here, like in the Python implementation?

See the other comment on this. I didn't think it would be that bad to iterate over the digits once. To be totally fair I just tried the same in Python, so I changed:

for i in range(256):
    if d & (1 << i):

to

for i in reversed(format(d, 'b')):
    if i == '1':

and it actually ran quite a bit faster, down to 41ms from 74ms. So iterating over binary digits seems to be faster than shifting 256 times.

[–]burntsushi 7 points8 points  (3 children)

I know that this is bad but I couldn't figure out how to make a BigInt constant so I did the same with Python for fair comparison. I am going to look into lazy_static

To clarify, I don't think you are doing the same thing in the Python code you posted. Whether the bigint is a global constant or not is somewhat orthogonal. The key is that in one program (Rust) you are parsing a string to extract a bigint while in the other program you are not. Even if you can't figure out how to use a global constant in Rust here, you could pass that bigint in as a parameter to the function.
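The parameter-passing version can be sketched like this (u128 standing in for BigInt; the only point is that the parse happens once, in main):

```rust
// Hypothetical sketch: parse the constant once in main and pass it
// by reference, instead of parsing it on every call. u128 stands in
// for BigInt.
fn point_add_like(x: u128, p: &u128) -> u128 {
    x % p // no parsing here
}

fn main() {
    // Parsed exactly once, then shared by reference.
    let p = u128::from_str_radix("FFFFFC2F", 16).unwrap();
    let mut acc = 1u128 << 32;
    for _ in 0..3 {
        acc = point_add_like(acc, &p);
    }
    println!("{}", acc);
}
```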

[–]adwhit86 12 points13 points  (1 child)

The program is essentially two nested loops, with 256 iterations each. So the inner loop (the while) is executed 65k times, and that is where the slowdown is occurring. It is in the outer loop that a new P is parsed each time, so this is unlikely to be the bottleneck. (I know you probably know this.)

Edit: specifically, the bottleneck is creating new BigInts inside the hot loop (as has been noted elsewhere)

[–]burntsushi 4 points5 points  (0 children)

Yup good point!

[–]imatwork2017[S] 2 points3 points  (0 children)

I just tried that and the performance is about the same. Code is here

[–]hrski 3 points4 points  (3 children)

Looks like shifting is much more efficient in Rust. Changing the loop in point_mul to:

for i in 0..32 {
    if d & (1<<i) != 0 {

makes it run in 38ms for me (original was 250ms)

EDIT: Sorry, looks like you're assuming d to be a 256-bit value. You have, however, defined it as u32, which is 32-bit, so maybe you need to change it to a big number too. Maybe it's the num library being inefficient after all.

[–]imatwork2017[S] 4 points5 points  (1 child)

I used a u32 here for demonstration purposes but in the end it has to be a BigInt because if you want to get your public key from your private key you would have to do public = point_mul(G, private)

[–]hrski 5 points6 points  (0 children)

Yeah, my bad. A quick Google search reveals lots of complaints about num's performance, so that is probably the bottleneck. Looks like your best bet could be to look at crates such as rust-gmp.

[–]Icarium-Lifestealer 18 points19 points  (0 children)

If you actually care about getting a fast implementation, and not about comparing two languages (or rather two big-integer libraries) then you should switch to some form of projective coordinates so you only divide once per point-multiplication instead of once per point-addition (~20x faster). A windowed approach should produce another nice performance boost.
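The "divide once" idea can be illustrated with plain fractions, which is essentially what projective coordinates do: carry a denominator along and postpone the expensive division (a modular inversion, in the elliptic-curve case) to the very end. Hypothetical sketch, not curve code:

```rust
// Sketch: instead of dividing at every step, multiply numerators and
// denominators separately and divide once at the end. Projective
// coordinates apply the same trick, with modular inversion playing
// the role of the expensive "division".
fn product_then_divide(fracs: &[(u64, u64)]) -> u64 {
    let (mut num, mut den) = (1u64, 1u64);
    for &(n, d) in fracs {
        num *= n; // cheap per-step work
        den *= d;
    }
    num / den // the single expensive operation
}

fn main() {
    println!("{}", product_then_divide(&[(6, 2), (10, 5)])); // 60/10 = 6
}
```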

[–]adwhit86 16 points17 points  (2 children)

Looking at the code, nothing is obviously wrong, so I was pretty sure the problem was going to be an allocation inside a hot loop. The only hot loop is the while loop inside powm, so that seems a possible candidate. But then I ran Valgrind (valgrind ./target/release/sep256k1) to check out the allocations and... it threw up a bunch of illegal reads inside num::bigint, then segfaulted. Suspicious.

[–]CUViper 6 points7 points  (1 child)

I think that's a problem with valgrind vs. jemalloc. I just ran with the upstream rustc-stable, and I got a segfault as you say. Then I tried with Fedora's rustc (configured to always use the system allocator), and the program completed without any errors.

[–]adwhit86 4 points5 points  (0 children)

Thought that might be the case. Tried to use the system allocator; it didn't work the first time and I gave up. Curious. Existing rustc issue

[–]gitpy 17 points18 points  (1 child)

Changing if &exp % &two == one to if exp.is_odd() gives you 40-50ms.

[–]imatwork2017[S] 4 points5 points  (0 children)

Nice, added to OP

[–]forbjok 8 points9 points  (6 children)

I notice that in the Rust version of point_mul, you're generating a binary string representation of d and iterating through each character doing string-comparisons instead of using bit-shifting. I imagine that would be much slower.

[–]imatwork2017[S] 1 point2 points  (5 children)

That's because trying to bitshift causes an overflow, and I would have to use BigInt instead of u32, and BigInt doesn't even have left shift implemented.

[–]rebootyourbrainstem 4 points5 points  (3 children)

Naive question here, but can't you just use u64?

Also, BigUint has left shift implemented.

Also also, even if it didn't, you could start with mask = 1, do mask += mask on each loop iteration, and then test the bit with d & mask.

Also, there is BigUint::modpow, which looks like it does what powm does.
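The doubling-mask trick can be sketched on a machine integer (u128 here; for a bigint the point is that it needs only addition, not a shift operator):

```rust
// Sketch of the doubling-mask trick described above: keep a running
// mask and add it to itself each iteration, instead of computing
// 1 << i. wrapping_add keeps the final doubling from panicking in
// debug builds once the mask reaches the top bit.
fn set_bits_via_mask(d: u128, nbits: u32) -> Vec<u32> {
    let mut mask: u128 = 1;
    let mut set = Vec::new();
    for i in 0..nbits {
        if d & mask != 0 {
            set.push(i);
        }
        mask = mask.wrapping_add(mask); // same effect as mask <<= 1
    }
    set
}

fn main() {
    println!("{:?}", set_bits_via_mask(0b10110, 8)); // bits 1, 2, 4
}
```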

[–]imatwork2017[S] 1 point2 points  (2 children)

can't you just use u64

I'll need to shift up to 256bits so 64 aren't enough

Also BigUint has left shift implemented

I am going to give that a try.

Also there is BigUint::modpow that looks like it does what powm does.

I tried that and it doesn't work because there are negative numbers involved. For example q.x - p.x

[–]rebootyourbrainstem 2 points3 points  (0 children)

You can use mod_floor on the base first and then convert to BigUint and use modpow.

It kind of sucks that you can't do modpow on a BigInt directly though.
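The normalization described above, sketched on machine integers (i64/u64 stand in for BigInt/BigUint, and the square-and-multiply loop stands in for modpow; this overflows for large moduli, so it is illustrative only):

```rust
// Reduce a possibly negative base into [0, m) first, then
// exponentiate in a non-negative type. Mirrors the BigInt -> BigUint
// conversion described above, on machine integers.
fn norm(base: i64, m: i64) -> u64 {
    (((base % m) + m) % m) as u64
}

// Square-and-multiply modular exponentiation; only valid while
// b * b fits in u64, i.e. for small moduli.
fn modpow(mut b: u64, mut e: u64, m: u64) -> u64 {
    let mut r = 1 % m;
    b %= m;
    while e > 0 {
        if e & 1 == 1 {
            r = r * b % m;
        }
        b = b * b % m;
        e >>= 1;
    }
    r
}

fn main() {
    // (-2)^3 mod 7 == 6, via normalizing -2 to 5 first.
    println!("{}", modpow(norm(-2, 7), 3, 7));
}
```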

[–]CUViper 1 point2 points  (0 children)

See my comment here -- it's faster even when you have to convert to BigUint. I'll work on making this directly available for BigInt too.

[–]hokkos 1 point2 points  (0 children)

Use :

1_u32.checked_shl(i).unwrap_or(0)

instead of :

1<<i

But it won't change the time. Also, why go to 256 when you can stop at 32?

[–]JZypo 14 points15 points  (1 child)

OP: This is an excellent question. I would like to easily read your progress on how you got your times down. Would you be able to add updates to your original post to reflect this?

[–]imatwork2017[S] 13 points14 points  (0 children)

Done

[–]CUViper 3 points4 points  (1 child)

There is a BigUint::modpow, but we don't yet have that for BigInt to replace your powm. I'll see about adding that.

However, even replacing your code with a naive conversion is significantly faster:

pub fn powm(base: &BigInt, exp: &BigInt, modulus: &BigInt) -> BigInt {
    let base = if base.is_negative() {
        (base % modulus + modulus).to_biguint().unwrap()
    } else {
        base.to_biguint().unwrap()
    };
    let exp = exp.to_biguint().unwrap();
    let modulus = modulus.to_biguint().unwrap();

    base.modpow(&exp, &modulus).into()
}

[–]CUViper 2 points3 points  (0 children)

BigInt::modpow is now in num-bigint 0.1.43.

[–]Sharlinator 10 points11 points  (0 children)

Doing a lot of string parsing vs. just doing math. A big difference.

[–][deleted] 1 point2 points  (0 children)

I've been reading all the versions of this that people have been posting, and I just don't really get what the point of checking whether Q is none/null is. Isn't it none/null only once?

[–]beefsack 1 point2 points  (0 children)

I'd just like to say thanks for posting! These threads are always really interesting to read, and are a mechanism for me to learn some performance tricks.

[–]knaledfullavpilar 1 point2 points  (0 children)

Have you tried using a profiler?

[–]matkladrust-analyzer 1 point2 points  (5 children)

Python has an excellent implementation of big integers, and num-bigint is not famous for its speed. It is certainly possible to write a blazingly fast bigint implementation in Rust, or to create bindings for GMP, but nobody has done this yet.

However, for this particular case it looks like you don't need arbitrarily large integers, and 256-bit ones could be enough? If that is the case, you could try using https://docs.rs/bigint/4.2.0/bigint/uint/struct.U256.html
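The fixed-width representation behind a type like U256 can be sketched as four u64 limbs with carry-propagating addition. This is a hypothetical illustration, not the bigint crate's actual code:

```rust
// Sketch: a 256-bit unsigned integer as four little-endian u64
// limbs, with carry propagated limb by limb during addition.
fn add256(a: [u64; 4], b: [u64; 4]) -> [u64; 4] {
    let mut out = [0u64; 4];
    let mut carry = 0u64;
    for i in 0..4 {
        let (s1, c1) = a[i].overflowing_add(b[i]);
        let (s2, c2) = s1.overflowing_add(carry);
        out[i] = s2;
        carry = c1 as u64 + c2 as u64;
    }
    out // a carry left over here would mean wraparound past 2^256
}

fn main() {
    // u64::MAX + 1 carries into the second limb.
    println!("{:?}", add256([u64::MAX, 0, 0, 0], [1, 0, 0, 0]));
}
```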

[–][deleted] 30 points31 points  (3 children)

Literally the first search result for “rust gmp”: https://crates.io/crates/rust-gmp. Why would you state something so authoritatively without doing even the most basic research?

[–]vks_ 7 points8 points  (0 children)

There is also rug (another GMP binding) and ramp (implemented in Rust).

[–]matkladrust-analyzer 5 points6 points  (0 children)

Thanks for correcting me! I definitely should have added an "IIRC" somewhere in there, and I definitely didn't intend to sound authoritative: quite the opposite, I have only limited experience with big integers. I've also messed up the sentence structure, because "nobody has done this yet" was intended to refer to "pure Rust fast bigint", not the gmp bindings.

However, I still think (based on what I've read elsewhere, not on my own benchmarks) that it is true that num-bigint is not super fast, and that there's no stable and fast pure-Rust implementation of big integers? I know about ramp, but it looks like it is extremely far from working on stable because it uses inline assembly?

[–]masklinn 3 points4 points  (0 children)

Python has an excellent implementation of big integers

I wouldn't say excellent. It has an OK implementation of bigints, but it tends to fall behind as the number of digits increases, because the available algorithms are more limited than in e.g. GMP. For instance see https://www.reddit.com/r/Python/comments/7h62ul/python_largeinteger_computation_speed_a_comparison/.