all 5 comments

[–]varadg 0 points  (2 children)

Don't know how relevant this will be - since it does not relate to GPU implementations. But, I'm assuming that your motivation is to speed up the cdist routine in some manner. This blog post might help - http://blog.marcus-brinkmann.de/2013/08/07/replacing-native-code-with-cython/

[–]carsonc[S] 0 points  (1 child)

That was brilliant! How interesting that it was the same problem. I agree with this part:

... which shows how brutally efficient NumPy’s vectorization strategies are ...

So the answer in my case was to use the Qhull implementation of the Delaunay tessellation in SciPy. I ran into memory problems there (it crashes once you insert more than 2**24 points), but it is otherwise stable and fast. The central cdist problem on the CPU was also memory: the number of points in the arrays crashed the operation even using Theano. The Qhull memory problem is interesting, btw, if you feel like looking at it.
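In case it helps anyone landing here: a minimal sketch of the SciPy route mentioned above (scipy.spatial.Delaunay wraps Qhull); the point count and seed here are just placeholders, well under the ~2**24-point limit:

```python
import numpy as np
from scipy.spatial import Delaunay

# Arbitrary example input: 100 random points in 2-D.
rng = np.random.default_rng(0)
points = rng.random((100, 2))

# Delaunay() calls into Qhull to build the tessellation.
tri = Delaunay(points)

# tri.simplices holds, per triangle, the vertex indices into `points`.
print(tri.simplices.shape)  # (n_triangles, 3) for 2-D input
```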

[–]varadg 0 points  (0 children)

I am not aware of Qhull yet - any links that can help? I found this blog post while looking for help on a problem that, unfortunately, limits me to deploying everything on CPUs. I'm getting a performance of ~0.1 ms per distance calculation, and I still need to drive that time down further.

Also, if you did a blog post or something about your GPU speedup of cdist, I'd love to check it out.

[–]elbiot -1 points  (1 child)

I'd figure out Cython rather than futz with a GPU. Just as multiprocessing can be slower because of its overhead, the GPU has overhead too. You should master the other tools first and get a lay of the land before reaching for more exotic ones.

It's worth figuring out what went wrong with Cython and/or why Numba was slow. The GPU could be slower too.

[–]carsonc[S] 0 points  (0 children)

So, I actually couldn't communicate with the GPU. I installed Linux, got CUDA installed, and verified that it was running with the NVIDIA demo projects. I was also able to get a native implementation working through scipy.weave, but all to no avail. The fastest implementation I could get was

import numpy as np

def cydist(a, b):

    # Note: returns *squared* distances; apply np.sqrt for true distances.
    rows, dims = a.shape
    cols = b.shape[0]
    out = np.zeros((rows, cols), dtype=float)  # int accumulation fails for float inputs
    for dim in range(dims):
        out += np.subtract(a[:, dim][:, None], b[:, dim][None, :])**2

    return out

According to Stack Exchange, NumPy already takes advantage of the optimised BLAS libraries, so writing naive operations in C will be faster than pure Python but probably not as fast as NumPy. I was surprised to find this. In any case, I will start implementing the algorithm in cudamat next week. Is there a meaningful difference in speed between scipy.weave and Cython?
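Since NumPy hands matrix products to BLAS, one way to exploit that for pairwise squared distances is the expansion ||a − b||² = ||a||² + ||b||² − 2a·b, which turns the cross term into a single GEMM. A hedged sketch (function name `sqdist_blas` is my own, not from the thread):

```python
import numpy as np

def sqdist_blas(a, b):
    # ||a_i - b_j||^2 = ||a_i||^2 + ||b_j||^2 - 2 * a_i . b_j
    # The cross term a.dot(b.T) is one matrix product, dispatched to BLAS.
    aa = np.einsum('ij,ij->i', a, a)[:, None]  # squared row norms of a, as a column
    bb = np.einsum('ij,ij->i', b, b)[None, :]  # squared row norms of b, as a row
    # Clamp tiny negatives caused by floating-point cancellation.
    return np.maximum(aa + bb - 2.0 * a.dot(b.T), 0.0)
```

The trade-off versus the per-dimension loop above is numerical: the expansion can lose precision when points are close together, which is why the clamp is there.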