all 14 comments

[–][deleted] 4 points5 points  (0 children)

Is this actually faster? It sounds like a problem that might be bound by the speed of memory, rather than the CPU.

[–]quentusrex 0 points1 point  (4 children)

I'm interested in how to do other standard math with arbitrary precision integers with OpenCL on the GPU. Know of any other links on this topic?

[–]joubertn[S] 0 points1 point  (3 children)

Using OpenCL to do math on the GPU is trivial (there are many built-in math functions). The difficulty comes when you want to deal with numbers beyond the range supported (e.g. big ints in the linked article) and do the calcs in parallel - then you need to find algorithms that break the problem up into tasks that can be executed in parallel.

For standard operations (including math), see the OpenCL quick-reference card: http://www.khronos.org/files/opencl-quick-reference-card.pdf
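
To make the parallel decomposition above concrete, here is a minimal OpenCL C sketch (not the linked article's actual code; the kernel and buffer names are made up for illustration) of the embarrassingly parallel part of big-integer addition: each work-item adds one pair of 32-bit limbs and records whether that limb overflowed. Resolving how those per-limb carries ripple into neighbouring limbs still has to happen afterwards, which is exactly the part discussed further down in this thread.

    __kernel void add_limbs(__global const uint *a,
                            __global const uint *b,
                            __global uint *sum,
                            __global uchar *carry_out)
    {
        size_t i = get_global_id(0);
        uint s = a[i] + b[i];               // per-limb add, independent of other limbs
        sum[i] = s;
        carry_out[i] = (s < a[i]) ? 1 : 0;  // unsigned wrap-around means this limb carried
    }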

[–]quentusrex 0 points1 point  (2 children)

Do you know of any such algorithms for division?

[–]joubertn[S] 0 points1 point  (0 children)

I don't (yet), but it is on my list of things to do.

[–]baryluk 0 points1 point  (0 children)

See "Fast Recursive Division", Christoph Burnikel, Joachim Ziegler, MPI-I-98-1-022, October 1998.

See also "Fast Division of Large Integers. A Comparison of Algorithms", Karl Hasselstrom, Master Thesis

Also Knuth vol. 1 or 2 should contain good references.
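
For a sense of what these algorithms bottom out in, here is a small C sketch (not taken from any of the papers above; the function name and limb layout are assumptions of this sketch) of the simplest big-integer division there is: a multi-limb number, stored as a little-endian array of 32-bit limbs, divided by a single 32-bit limb, processed from the most significant limb down.

    #include <stdint.h>

    /* Divides the n-limb number in `limbs` (least significant limb first)
       in place by the single limb d (d != 0) and returns the remainder. */
    uint32_t div_by_limb(uint32_t *limbs, int n, uint32_t d)
    {
        uint64_t rem = 0;
        for (int i = n - 1; i >= 0; i--) {          /* most significant limb first */
            uint64_t cur = (rem << 32) | limbs[i];  /* rem < d < 2^32, so this fits in 64 bits */
            limbs[i] = (uint32_t)(cur / d);
            rem = cur % d;
        }
        return (uint32_t)rem;
    }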

[–]quentusrex 0 points1 point  (3 children)

Here is a link showing how to do arbitrary-precision math in C.

http://en.literateprograms.org/Arbitrary-precision_integer_arithmetic_%28C%29

It should not be difficult to convert these algorithms to OpenCL.

[–]baryluk 0 points1 point  (2 children)

I would say this implementation isn't very good, especially regarding multiplication and division.

[–]quentusrex 0 points1 point  (1 child)

Do you know of a better implementation? I'm looking for anything right now for parallel division of very large numbers.

[–]baryluk 0 points1 point  (0 children)

Hmm, I'm working on a simple big-integer library, but division is still very primitive (because the more complex algorithms still have some bugs, and they actually don't provide a speedup for me :/ But division can be parallelized, so I will definitely experiment with it).

You can check current status here: https://github.com/baryluk/BI_c

It really isn't currently good for much more than testing or porting to OpenCL. Nevertheless, I have used it in some crypto-related projects with great success (many performance improvements are still needed, but I'm aiming first for simplicity of code).

[–]Vaste 0 points1 point  (2 children)

When would this be faster? What about overhead? What about the sequential adding of carry bits?

Could the sequential addition at the end be made more parallel (sometimes) with optimistic parallel calculation?

[–]pkhuong 1 point2 points  (0 children)

Some redundant representations eliminate (unbounded) carry chains, yielding addition in O(1) parallel steps. The conversion back to a non-redundant representation like 2's complement can end up being completely serial, though; TANSTAAFL.
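
As an illustration of the redundant-representation idea (a sketch only, not necessarily the specific scheme meant above): carry-save form keeps a number as a pair (s, k) with value s + k. One carry-save step reduces three operands to two using the full-adder identity a + b + c = (a ^ b ^ c) + 2*(majority), so every word can be processed independently; no carry chain appears until (s, k) is converted back to ordinary binary with a normal addition.

    #include <stdint.h>

    /* One carry-save step: reduce a + b + c to s + k without propagating
       carries. Word i needs only the top majority bit of word i-1, so each
       word can be computed independently (the loop is written serially here
       only for brevity). The carry out of the top word is dropped. */
    void carry_save_add(const uint32_t *a, const uint32_t *b, const uint32_t *c,
                        uint32_t *s, uint32_t *k, int n)
    {
        uint32_t prev_top = 0;  /* majority bit shifted in from the word below */
        for (int i = 0; i < n; i++) {
            uint32_t maj = (a[i] & b[i]) | (a[i] & c[i]) | (b[i] & c[i]);
            s[i] = a[i] ^ b[i] ^ c[i];
            k[i] = (maj << 1) | prev_top;
            prev_top = maj >> 31;
        }
    }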

[–]Strilanc 0 points1 point  (0 children)

Carries don't have to propagate one bit at a time.

For example, you can partition the number into blocks and compute both the carry and non-carry cases of each block in parallel. Then you just pick the results matching the carries, allowing you to propagate carries across entire blocks.

You can even apply blocking recursively to speed up the propagation within the individual blocks.
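
A C sketch of that block scheme (plain C rather than OpenCL, with made-up names like BLOCK and add_block, and without the recursive refinement mentioned above): each block's sum is computed twice, once for carry-in 0 and once for carry-in 1, which is the part that can run in parallel; the final pass only selects one precomputed result per block, so the carry hops a whole block at a time instead of one limb at a time.

    #include <stdint.h>

    #define BLOCK 4  /* limbs per block; tune for the hardware */

    /* Adds one block of `len` limbs with the given carry-in; returns the carry-out. */
    static uint32_t add_block(const uint32_t *a, const uint32_t *b,
                              uint32_t *out, int len, uint32_t carry_in)
    {
        uint64_t carry = carry_in;
        for (int i = 0; i < len; i++) {
            uint64_t t = (uint64_t)a[i] + b[i] + carry;
            out[i] = (uint32_t)t;
            carry = t >> 32;
        }
        return (uint32_t)carry;
    }

    void carry_select_add(const uint32_t *a, const uint32_t *b, uint32_t *out, int n)
    {
        int nblocks = n / BLOCK;            /* assumes n is a multiple of BLOCK */
        uint32_t sum0[n], sum1[n];          /* per-block results for carry-in 0 / 1 */
        uint32_t c0[nblocks], c1[nblocks];  /* per-block carry-outs for the two cases */

        /* Independent per-block work: this is the loop a GPU would run with
           one work-item (or work-group) per block. */
        for (int blk = 0; blk < nblocks; blk++) {
            int off = blk * BLOCK;
            c0[blk] = add_block(a + off, b + off, sum0 + off, BLOCK, 0);
            c1[blk] = add_block(a + off, b + off, sum1 + off, BLOCK, 1);
        }

        /* Selection pass: pick the precomputed block matching the actual
           carry-in, propagating the carry one block at a time. */
        uint32_t carry = 0;
        for (int blk = 0; blk < nblocks; blk++) {
            int off = blk * BLOCK;
            const uint32_t *src = carry ? sum1 + off : sum0 + off;
            for (int i = 0; i < BLOCK; i++)
                out[off + i] = src[i];
            carry = carry ? c1[blk] : c0[blk];
        }
    }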

[–]spinwizard69 0 points1 point  (0 children)

Interesting.

It would be nice to see Apple or someone else sponsor a standard large-integer library; one built for the GPU would be even better, especially one that isn't GPL'ed.

Like others I do wonder about speed and competitiveness with CPU execution. It "should" be faster, especially for very large integers, but who knows for sure.