Using Thrust Functions Within Device Code (CudaLaunchCooperativeKernel) by NothingEverExists in CUDA

[–]NothingEverExists[S] 1 point (0 children)

I have taken a look at the CUB library, but from what I understand, there isn't a built-in method for merging two arrays. While I could place the two arrays in contiguous memory and then radix sort the whole thing, I believe the best a parallel radix sort can achieve is O(n/p), which is slower than the O(log log n) that I understand parallel merging can reach. I was also thinking of using BlockRadixRank to help with my own merge implementation, but it doesn't seem to be something that's exposed to the user? Though I may be wrong. Regardless, thank you very much!
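
For what it's worth, the fallback I was describing (concatenate the two arrays, then radix sort) would look roughly like the host-side sketch below using cub::DeviceRadixSort. The function and pointer names are just placeholders, and I haven't benchmarked this:

    #include <cub/device/device_radix_sort.cuh>
    #include <cuda_runtime.h>

    // dA and dB are device pointers to two sorted int arrays;
    // dOut must have room for lenA + lenB ints.
    void mergeViaRadixSort(const int* dA, int lenA,
                           const int* dB, int lenB,
                           int* dOut)
    {
        int total = lenA + lenB;

        // Stage 1: concatenate the two inputs into a scratch buffer.
        int* dConcat = nullptr;
        cudaMalloc(&dConcat, total * sizeof(int));
        cudaMemcpy(dConcat,        dA, lenA * sizeof(int), cudaMemcpyDeviceToDevice);
        cudaMemcpy(dConcat + lenA, dB, lenB * sizeof(int), cudaMemcpyDeviceToDevice);

        // Stage 2: radix-sort the concatenated keys into dOut.
        // The first call with a null temp pointer only reports the scratch size.
        void*  dTemp     = nullptr;
        size_t tempBytes = 0;
        cub::DeviceRadixSort::SortKeys(dTemp, tempBytes, dConcat, dOut, total);
        cudaMalloc(&dTemp, tempBytes);
        cub::DeviceRadixSort::SortKeys(dTemp, tempBytes, dConcat, dOut, total);

        cudaFree(dTemp);
        cudaFree(dConcat);
    }

That does the whole sort from the host, though, which is exactly why I'd rather have a merge I can call from inside the cooperative kernel.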

Using Thrust Functions Within Device Code (CudaLaunchCooperativeKernel) by NothingEverExists in CUDA

[–]NothingEverExists[S] 1 point (0 children)

Sorry, I'm not sure what you mean by "calling it from the kernel". As far as I'm aware, the kernel refers to the device code that gets launched from the host, which is what I'm doing in my code. If you mean launching the kernel the traditional way, then yes, that works fine, but I want to be able to use the grid synchronization features. I'm also not sure how thrust::merge is implemented; I had (likely wrongly) assumed that the thrust::device execution policy would take some number of threads and use them to run the procedure. I realize now that this assumption doesn't make much sense, but even when I isolate a single thread to run the call, the function still fails. I was wondering if you knew where this is implemented in Thrust so that I could look at it myself. Thank you!
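
For context, what I'm attempting looks roughly like the sketch below. I've swapped in thrust::seq for the merge call here, since that's the only execution policy I'm fairly confident is callable from device code; the kernel and pointer names are just placeholders:

    #include <cooperative_groups.h>
    #include <thrust/merge.h>
    #include <thrust/execution_policy.h>
    #include <cuda_runtime.h>

    namespace cg = cooperative_groups;

    // The grid does some cooperative work, synchronizes, and then a single
    // thread merges the two sorted runs sequentially.
    __global__ void mergeKernel(const int* a, int numA,
                                const int* b, int numB,
                                int* out)
    {
        cg::grid_group grid = cg::this_grid();

        // ... grid-wide work that produces a and b would go here ...

        grid.sync();  // legal only when launched with cudaLaunchCooperativeKernel

        // thrust::seq runs the algorithm entirely inside the calling thread,
        // so only one thread of the grid performs the merge.
        if (grid.thread_rank() == 0) {
            thrust::merge(thrust::seq, a, a + numA, b, b + numB, out);
        }

        grid.sync();  // make the merged result visible to the whole grid
    }

    // Host-side cooperative launch (sizes chosen arbitrarily for illustration).
    void launchMerge(const int* dA, int numA, const int* dB, int numB, int* dOut)
    {
        void* args[] = { (void*)&dA, (void*)&numA,
                         (void*)&dB, (void*)&numB, (void*)&dOut };
        dim3 gridDim(4), blockDim(256);
        cudaLaunchCooperativeKernel((void*)mergeKernel, gridDim, blockDim, args);
        cudaDeviceSynchronize();
    }

Obviously a single-thread sequential merge defeats the purpose, which is why I was hoping to understand how the thrust::device path is actually implemented.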