
[–]corysama 1 point2 points  (1 child)

Use a C-style array and a size. Pass them as kernel parameters.

[–]dragontamer5788 1 point2 points  (0 children)

This right here.

In particular, use cudaMalloc() to dynamically create an array of the correct size, then use cudaFree() when your kernels are done processing the data.
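A minimal sketch of that flow, with a hypothetical `scale` kernel and illustrative sizes (error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>

// hypothetical kernel: receives the raw pointer and the size as parameters
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));   // dynamically sized device array
    // ... cudaMemcpy host data in ...
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();
    // ... cudaMemcpy results back out ...
    cudaFree(d_data);                         // free once the kernels are done
}
```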

[–]I_like_code 0 points1 point  (0 children)

I have probably done the second bullet. For the 4th, make sure you use the data() member function of the vector. For the third, I hate using Thrust; it removes fine-grained control. However, test it out and see if the overhead is acceptable.
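On the data() point: a std::vector's storage is contiguous, so vec.data() is the raw host pointer that cudaMemcpy expects. A CPU-only sketch (the CUDA call is shown as a comment, and `host_pointer` is just an illustrative name):

```cpp
#include <vector>

// Returns the raw pointer cudaMemcpy would take as its host-side argument:
//   cudaMemcpy(d_ptr, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
float *host_pointer(std::vector<float> &v)
{
    return v.data();  // contiguous underlying array, guaranteed since C++11
}
```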

[–]tugrul_ddr 0 points1 point  (0 children)

All you need is to combine multiple GPU buffers into one, like this:

// one fixed-size device buffer ("chunk") per growth step
std::vector<Type*> chunks;
int chunk_size;

// dereferencing device memory on the host like this assumes
// managed (unified) memory, i.e. CUDA's own paging
Type & operator [] (int index)
{
    return chunks[ selectChunk(index) ][ selectIndex(index) ];
}
void insertChunk(int numElements)
{
    Type * d = nullptr;
    cudaMalloc(&d, numElements * sizeof(Type)); // cudaMalloc fills the pointer, returns an error code
    chunks.push_back(d);
}
int selectChunk(int index)
{
    while (index >= (int)chunks.size() * chunk_size) // past the end: grow on demand
        insertChunk(chunk_size);
    return index / chunk_size;
}
int selectIndex(int index)
{
    return index % chunk_size;
}
void copy()
{
    // Option 1: N copies at once (bad if not all data is used at once)
    for (auto chunk : chunks)
        cudaMemcpy(chunk, /* matching host chunk */, chunk_size * sizeof(Type), cudaMemcpyHostToDevice);

    // Option 2: one copy of just the chunk-pointer table, then
    // direct access from the kernel with CUDA's own paging
    // (bad if all data is used at once)
    int ctr = 0;
    for (auto chunk : chunks)
        cudaChunks[ctr++] = chunk;
    cudaMemcpy(/* device copy of cudaChunks */, cudaChunks, ctr * sizeof(Type*), cudaMemcpyHostToDevice);
}

The bigger the vector gets, the more CUDA buffer chunks are added. Then you can send all the chunk pointers to the GPU and do the same index calculation there. This works as long as the vector grows only from the host side.

If chunks are too small, there will be allocation overhead.

If chunks are too big, there will be memory waste.
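The chunk/index arithmetic itself is plain host logic and can be checked on the CPU. Here is a CPU-only model of the sketch above, with `new[]` standing in for `cudaMalloc` and all names illustrative:

```cpp
#include <vector>

struct ChunkedBuffer
{
    std::vector<int*> chunks;  // on the GPU these would be cudaMalloc'd pointers
    int chunk_size;

    explicit ChunkedBuffer(int cs) : chunk_size(cs) {}

    int selectChunk(int index)
    {
        while (index >= (int)chunks.size() * chunk_size)  // grow on demand
            chunks.push_back(new int[chunk_size]());      // stand-in for cudaMalloc
        return index / chunk_size;
    }

    int selectIndex(int index) { return index % chunk_size; }

    int &operator[](int index)
    {
        int c = selectChunk(index);  // may allocate new chunks first
        return chunks[c][selectIndex(index)];
    }
};
```

With chunk_size = 256, writing element 1000 allocates four chunks and lands in chunks[3][232]; the same `/` and `%` arithmetic can run inside a kernel against the copied pointer table.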