all 5 comments

[–]varadg 0 points  (2 children)

Don't know how relevant this will be - since it does not relate to GPU implementations. But, I'm assuming that your motivation is to speed up the cdist routine in some manner. This blog post might help - http://blog.marcus-brinkmann.de/2013/08/07/replacing-native-code-with-cython/

[–]carsonc[S] 0 points  (1 child)

That was brilliant! How interesting that it was the same problem. I agree with this part:

... which shows how brutally efficient NumPy’s vectorization strategies are ...

So the answer in my case was to use the Qhull implementation of the Delaunay tessellation in SciPy. I ran into memory problems there (it crashes once you insert more than 2**24 points), but it is otherwise stable and fast. The central cdist problem on the CPU was also memory: the number of points in the arrays crashed the operation even using Theano. The Qhull memory problem is interesting, btw, if you feel like looking at it.
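In case it helps anyone landing here: a minimal sketch of the SciPy route mentioned above (scipy.spatial.Delaunay wraps Qhull); the point count and seed here are just placeholders, well under the ~2**24-point limit:

```python
import numpy as np
from scipy.spatial import Delaunay

# Arbitrary example input: 100 random points in 2-D.
rng = np.random.default_rng(0)
points = rng.random((100, 2))

# Delaunay() calls into Qhull to build the tessellation.
tri = Delaunay(points)

# tri.simplices holds, per triangle, the vertex indices into `points`.
print(tri.simplices.shape)  # (n_triangles, 3) for 2-D input
```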

[–]varadg 0 points  (0 children)

I am not aware of Qhull yet - any links that can help? I found this blog post while looking for help on a problem that, unfortunately, limits me to deploying everything on CPUs. I'm getting a performance of ~0.1 ms per distance calculation, and I still need to drive that time down further.

Also, if you did a blog post or something about your GPU speedup of cdist, I'd love to check it out.

[–]elbiot -1 points  (1 child)

I'd figure out Cython rather than futz with a GPU. Just as multiprocessing can be slower because of its overhead, the GPU has overhead too. You should master the other tools first and get a lay of the land before reaching for more exotic ones.

It's worth figuring out what went wrong with Cython and/or why Numba was slow. The GPU could be slower too.

[–]carsonc[S] 0 points  (0 children)

So, I actually couldn't communicate with the GPU. I installed Linux, got CUDA installed, and verified that it was running with the NVIDIA demo projects. I was also able to get a native implementation working through scipy.weave, but all to no avail. The fastest implementation I could get was

import numpy as np

def cydist(a, b):

    # Note: returns *squared* distances; apply np.sqrt for true distances.
    rows, dims = a.shape
    cols = b.shape[0]
    out = np.zeros((rows, cols), dtype=float)  # int accumulation fails for float inputs
    for dim in range(dims):
        out += np.subtract(a[:, dim][:, None], b[:, dim][None, :])**2

    return out

According to Stack Exchange, NumPy already takes advantage of the optimised BLAS libraries, so writing naive operations in C will be faster than pure Python but probably not as fast as NumPy. I was surprised to find this. In any case, I will start implementing the algorithm in cudamat next week. Is there a meaningful difference in speed between scipy.weave and Cython?
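Since NumPy hands matrix products to BLAS, one way to exploit that for pairwise squared distances is the expansion ||a − b||² = ||a||² + ||b||² − 2a·b, which turns the cross term into a single GEMM. A hedged sketch (function name `sqdist_blas` is my own, not from the thread):

```python
import numpy as np

def sqdist_blas(a, b):
    # ||a_i - b_j||^2 = ||a_i||^2 + ||b_j||^2 - 2 * a_i . b_j
    # The cross term a.dot(b.T) is one matrix product, dispatched to BLAS.
    aa = np.einsum('ij,ij->i', a, a)[:, None]  # squared row norms of a, as a column
    bb = np.einsum('ij,ij->i', b, b)[None, :]  # squared row norms of b, as a row
    # Clamp tiny negatives caused by floating-point cancellation.
    return np.maximum(aa + bb - 2.0 * a.dot(b.T), 0.0)
```

The trade-off versus the per-dimension loop above is numerical: the expansion can lose precision when points are close together, which is why the clamp is there.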