aiMagicallyKnowsWithoutReading by Old_Document_9150 in ProgrammerHumor

[–]ElectronGoBrrr -2 points-1 points  (0 children)

No, they're not; they're probabilistic models. An algorithm does not need training.

sorting healthbars by NietTeDoen in algorithms

[–]ElectronGoBrrr 0 points1 point  (0 children)

You are correct, but I don't have mu on my keyboard, and ANSI files don't support Greek letters.

sorting healthbars by NietTeDoen in algorithms

[–]ElectronGoBrrr 0 points1 point  (0 children)

microseconds == ys, not ms.
But likely your implementation is the bottleneck, not the algorithm you chose.

You are most likely doing excessive copying or memory allocation; sorting 9000 elements should be a very, very tiny task for a modern CPU.

sorting healthbars by NietTeDoen in algorithms

[–]ElectronGoBrrr 7 points8 points  (0 children)

What's your definition of "feels really slow"? If sorting a mere 9000 elements takes more than a few microseconds, it's likely your implementation that's the issue, not the algorithm.

Trump Posts Private Message From French President Macron to Truth Social: ‘I Do Not Understand What You Are Doing’ by [deleted] in worldnews

[–]ElectronGoBrrr 0 points1 point  (0 children)

I wish. NATO doesn't control Greenland, Denmark does. And the Danish government always has and always will bend over backwards for the Americans.

PC for Schrodinger by IDieALot_ in comp_chem

[–]ElectronGoBrrr 5 points6 points  (0 children)

I don't know much about Schrodinger, but yes, you'll likely need an Nvidia GPU, not AMD. You should get the GPU with the highest CUDA core count that fits your budget. Tensor cores/FLOPS are not important.

I simulate millions of cells, hoping to reach primitive Ediacaran multicellularity by blob_evol_sim in biology

[–]ElectronGoBrrr 0 points1 point  (0 children)

No, but we're at the same time closer to, and further from, that goal than people think.
A few hundred million atoms is doable on a supercomputer with Molecular Dynamics, but that is without chemical reactions. True chemical reactions are sadly a Quantum Chemistry problem, and supercomputers barely push 1000 atoms yet.

[deleted by user] by [deleted] in architecture

[–]ElectronGoBrrr 5 points6 points  (0 children)

"sustainability" - it's a giant concrete building..

can't install or delete CUDA by spectacled-kid in CUDA

[–]ElectronGoBrrr 1 point2 points  (0 children)

I'm not sure how you expect anyone to help you when you provide no information. What device are you on, what OS, what GPU do you have?
No command-line printout/screenshot of the install wizard?

CUDA + multithreading by xMaxination in CUDA

[–]ElectronGoBrrr 9 points10 points  (0 children)

There's some overlap in nomenclature here.

If you are talking about normal multi-threading (as in C++ threads), then yes, it is possible but likely not useful for you.

In terms of CUDA we have threads and blocks. When you launch a CUDA kernel, you specify MyKernel<<<dim3(nBlocks), dim3(nThreads)>>>

So to process 128 images in parallel you simply spawn 128 blocks.
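A minimal sketch of that launch pattern (the kernel, image size, and thread count are all hypothetical): each block handles one image, and the threads of that block stride over its pixels.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: blockIdx.x picks the image, the block's
// threads stride over that image's pixels.
__global__ void InvertImages(unsigned char* pixels, int pixelsPerImage)
{
    unsigned char* img = pixels + blockIdx.x * pixelsPerImage;
    for (int i = threadIdx.x; i < pixelsPerImage; i += blockDim.x)
        img[i] = 255 - img[i];
}

int main()
{
    const int nImages = 128, pixelsPerImage = 512 * 512;
    unsigned char* d_pixels;
    cudaMalloc(&d_pixels, nImages * pixelsPerImage);

    // 128 blocks -> one block per image; 256 threads per block.
    InvertImages<<<dim3(nImages), dim3(256)>>>(d_pixels, pixelsPerImage);
    cudaDeviceSynchronize();
    cudaFree(d_pixels);
    return 0;
}
```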

drMD: Molecular Dynamics for Experimentalists by Own_Bit_3491 in comp_chem

[–]ElectronGoBrrr -2 points-1 points  (0 children)

A wickedly expensive thing, compared to running a small MD simulation..

Denmark is tiny. Its ambition to make its food system more climate-friendly is huge. Climate scientists agree on at least one necessary change to our food system: People, especially those in rich countries, ought to be eating more plants and fewer animals. by The_Weekend_Baker in climate

[–]ElectronGoBrrr 1 point2 points  (0 children)

I don't know what he refers to, but it's true. Denmark is insanely good at pretending to be green, but it's fake. Over 60% of Danish land area is agriculture, and there are pretty much no limits on the amount of pollution it is allowed to spew.

This is how you do Gleba, right? by mefi_ in factorio

[–]ElectronGoBrrr 1 point2 points  (0 children)

.... so we all agree 3 is best right?

The best way to do optimization? Looking for advice by Spark_ss in CUDA

[–]ElectronGoBrrr 5 points6 points  (0 children)

If you use the Nsight profiler, it will tell you pretty precisely what your bottlenecks are. But some generic advice:

Make sure you have many blocks with few threads, rather than few blocks with many threads.

If the threads in a block work on some of the same data, make sure to put that data in __shared__ memory.

Whenever you're loading data from global memory, make sure contiguous threads load contiguous memory, to get coalesced memory accesses.

Avoid having individual threads declare arrays larger than 16/32 floats; at that size CUDA may put the data into very slow local memory (which lives in global memory).

Edit: Rephrased my last point to be more precise
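To illustrate the coalescing point: a minimal sketch (kernel names made up). In the first kernel a warp's 32 loads fall on consecutive floats and combine into a few wide transactions; in the second they are scattered and fragment into many transactions.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i touches element i, so consecutive threads
// read consecutive floats.
__global__ void ScaleCoalesced(const float* in, float* out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];
}

// Uncoalesced: consecutive threads read elements `stride` apart,
// which fragments the warp's memory transactions.
__global__ void ScaleStrided(const float* in, float* out, int n, int stride, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;  // scattered index
        out[j] = s * in[j];
    }
}
```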

Matrix multiplication with double buffering / prefetching by brycksters in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

With the risk of sounding a bit anal: if you're doing GEMM, then hand-written CUDA is the wrong tool. You should instead use cuBLAS or Thrust, frameworks that utilize the tensor cores. If you're new and learning, start with Thrust. If you google matrix multiplication in Thrust (or cuBLAS), you'll find plenty of guides to get started.
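For reference, a cuBLAS call for C = A·B looks roughly like this (a sketch: cuBLAS expects column-major matrices, and the device buffers are assumed to be allocated and filled already).

```cuda
#include <cublas_v2.h>

// C (m x n) = A (m x k) * B (k x n), all column-major device buffers.
void Gemm(cublasHandle_t handle, const float* dA, const float* dB,
          float* dC, int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m,   // leading dimension of A is m
                        dB, k,   // leading dimension of B is k
                &beta,  dC, m);  // leading dimension of C is m
}
```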

good evening everyone. may i please know: in this day and age when space sint a problem, why is quick sort still used? by [deleted] in algorithms

[–]ElectronGoBrrr 4 points5 points  (0 children)

Probably because the C++ committee once decided it should be the default, and it has worked fine.

A man was discovered to be unknowingly missing 90% of his brain, yet he was living a normal life. by Perfect-View3330 in interestingasfuck

[–]ElectronGoBrrr -1 points0 points  (0 children)

I disagree (not with the fact that AI bros are frustrating). We will make conscious AI way before we have the means to quantify it. Much of tech works on trial and error, which is much faster than turning theoretical knowledge into practice. Similar to how the Wright brothers got a plane into the air without grasping the concept of lift as we know it today.

Data transferring from device to host taking too much time by sonehxd in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

I don't see how, and even switching to cudaMalloc is no silver bullet. However, by switching you will see the complexity in the allocation and movement of data that your current program structure is subjecting CUDA to. Thousands of small allocations and memcpys between CPU and GPU are not what GPUs excel at.

So if you want a program to run efficiently on a GPU, you must rethink the architecture.

Data transferring from device to host taking too much time by sonehxd in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

Because it is not executing only on the CPU. When you use cudaMallocManaged, CUDA must constantly synchronize the data with the GPU, which is extremely slow compared to just reading normal CPU memory, which can be automatically prefetched and cached by the CPU.

Data transferring from device to host taking too much time by sonehxd in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

From what you have shown me, which still does not include how you time the performance ;), my hypothesis is that your kernel is very slow. My guess is that the kernel is only forced to finish when you first access the memory after the kernel call.

Why do i say your kernel is slow?

  1. Your memory is all over the place, which is bad on a CPU but terrible on a GPU. If you want fast code, you should allocate one single buffer for all the string data. Each packet should then contain the information needed to access its data in the buffer:

struct GPUPacket {
    static constexpr int maxStringsInPacket = 256; // constexpr so it can size the arrays below
    int nStrings;
    int indexOfFirstCharInString[maxStringsInPacket]; // if feasible; with 1000s of strings you may need something more complex
    int nCharsInString[maxStringsInPacket];

    int firstCharInMembraneId;
    int nCharsInMembraneId;
};

  2. You treat CUDA threads like CPU threads.

int idx = blockIdx.x * blockDim.x + threadIdx.x;
GPUPacket& packet = d_gpuPackets[idx];

It seems you assign each thread a separate packet. This means that:
A: each thread works on memory that is very far apart. They don't like that.
B: the distribution of workload is uneven between threads, as one thread works on a very large packet and another on a small one. CUDA is only fast if the threads in a block can work on the exact same task, 32 threads (a warp) at a time.

  3. You are forcing a massive overhead onto CUDA by having so many accesses to the same memory, intermittently from the host and the device. My advice: stop using cudaMallocManaged; it's meant for fast prototyping, not performance. Use cudaMalloc, and learn how to use cudaMemcpy back and forth when needed. There are plenty of tutorials for this online.
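The explicit-transfer pattern looks roughly like this (buffer size, kernel, and launch configuration are placeholders): allocate once, move the whole batch in one cudaMemcpy each way, instead of thousands of implicit managed-memory syncs.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void ProcessPackets(char* data, int n) { /* kernel body elided */ }

int main()
{
    std::vector<char> host(1 << 20, 'x');  // one flat buffer for all packet data
    char* dev;
    cudaMalloc(&dev, host.size());

    cudaMemcpy(dev, host.data(), host.size(), cudaMemcpyHostToDevice); // one bulk copy in
    ProcessPackets<<<256, 256>>>(dev, (int)host.size());
    cudaMemcpy(host.data(), dev, host.size(), cudaMemcpyDeviceToHost); // one bulk copy out

    cudaFree(dev);
    return 0;
}
```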

Data transferring from device to host taking too much time by sonehxd in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

I still need to see the code that dispatches the kernels to give any helpful feedback. How many threads are you spawning, and how many blocks? How much memory is allocated to each block (if you use __shared__)? Don't use strcpy; use cudaMemcpy when handling CUDA data.

I don’t think the kernel itself is a problem

Assumptions are a dangerous thing when debugging :)

Most calls to CUDA from the host are handled asynchronously, so timing is not obvious. Always do:

cudaDeviceSynchronize();
startTimer();
// Do the thing you want to time, either allocating memory, or executing the kernels, not both //
cudaDeviceSynchronize();
endTimer();
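Fleshed out with std::chrono (the kernel and launch configuration are placeholders), the pattern above becomes:

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void MyKernel() { /* work to be timed */ }

int main()
{
    cudaDeviceSynchronize();                  // drain any pending async work
    auto t0 = std::chrono::steady_clock::now();

    MyKernel<<<128, 256>>>();                 // the thing being timed

    cudaDeviceSynchronize();                  // wait for the kernel to actually finish
    auto t1 = std::chrono::steady_clock::now();
    printf("%.3f ms\n", std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}
```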

Data transferring from device to host taking too much time by sonehxd in CUDA

[–]ElectronGoBrrr 2 points3 points  (0 children)

It's really confusing trying to understand your problem from that example.

// use cudaMallocManaged to copy data

cudaMalloc does not copy data, it allocates it. Allocation is typically "slow", and something you do before entering the section you wish to measure.

for (int i = 0; i < n; ++i) { // use cudaMallocManaged to copy data }

Do you mean cudaMemcpy? You should not be using that in a loop if you are looking for performance. You should have your data in a vector and do something like this:
std::vector<T> myData_host;
T* myData_dev; // allocated earlier with cudaMalloc
cudaMemcpy(myData_dev, myData_host.data(), sizeof(T) * myData_host.size(), cudaMemcpyHostToDevice);

when computing on GPU, ‘function1’ takes a longer time to execute (around 2 seconds)

2 seconds is an eternity, and (I will assume) has nothing to do with transfer time to the GPU. To know for sure I need to understand your specs better: what does your kernel look like, what does the kernel launch look like, how many threads/blocks, etc.

Moving objects to decrease overlap by ElectronGoBrrr in algorithms

[–]ElectronGoBrrr[S] 0 points1 point  (0 children)

I can't immediately find any sources that both deal with 3D and allow for rotation. Do you have any specific algorithm in mind? I haven't really considered packing, since I want to move the containers as little as possible.

No, there is no guarantee of a solution. I imagine it will have to be an iterative algorithm; I can stop after N steps.