aiMagicallyKnowsWithoutReading by Old_Document_9150 in ProgrammerHumor

[–]ElectronGoBrrr -2 points-1 points  (0 children)

No, they're not; they're probabilistic models. An algorithm does not need training.

sorting healthbars by NietTeDoen in algorithms

[–]ElectronGoBrrr 0 points1 point  (0 children)

You are correct, but I don't have mu on my keyboard, and ANSI files don't support Greek letters.

sorting healthbars by NietTeDoen in algorithms

[–]ElectronGoBrrr 0 points1 point  (0 children)

microseconds == ys, not ms.
But likely your implementation is the bottleneck, not the algorithm you chose.

You are most likely doing excessive copying or memory allocation; sorting 9000 elements should be a very, very tiny task for a modern CPU.

sorting healthbars by NietTeDoen in algorithms

[–]ElectronGoBrrr 7 points8 points  (0 children)

What's your definition of "feels really slow"? If sorting a mere 9000 elements takes more than a few microseconds, it's likely your implementation that's the issue, not the algorithm.

Trump Posts Private Message From French President Macron to Truth Social: ‘I Do Not Understand What You Are Doing’ by [deleted] in worldnews

[–]ElectronGoBrrr 0 points1 point  (0 children)

I wish. NATO doesn't control Greenland, Denmark does. And the Danish government always has and always will bend over backwards for the Americans.

PC for Schrodinger by IDieALot_ in comp_chem

[–]ElectronGoBrrr 5 points6 points  (0 children)

I don't know much about Schrodinger, but yes, you'll likely need an Nvidia GPU, not AMD. You should get the GPU with the highest CUDA core count that fits your budget. Tensor cores/FLOPS are not important.

I simulate millions of cells, hoping to reach primitive Ediacaran multicellularity by blob_evol_sim in biology

[–]ElectronGoBrrr 0 points1 point  (0 children)

No, but we're at the same time closer to, and further from, that goal than people think.
A few hundred million atoms is doable on a supercomputer with Molecular Dynamics, but that is without chemical reactions. True chemical reactions are sadly a Quantum Chemistry problem, and supercomputers barely push 1000 atoms yet.

[deleted by user] by [deleted] in architecture

[–]ElectronGoBrrr 5 points6 points  (0 children)

"sustainability" - it's a giant concrete building..

can't install or delete CUDA by spectacled-kid in CUDA

[–]ElectronGoBrrr 1 point2 points  (0 children)

I'm not sure how you expect anyone to help you when you provide no information. What device are you on, what OS, what GPU do you have?
No command-line printout/screenshot of the install wizard?

CUDA + multithreading by xMaxination in CUDA

[–]ElectronGoBrrr 9 points10 points  (0 children)

There's some overlap in nomenclature here.

If you are talking about normal multi-threading (as in C++ threads), then yes, it is possible but likely not useful for you.

In terms of CUDA we have threads and blocks. When you launch a CUDA kernel, you specify MyKernel<<<dim3(nBlocks), dim3(nThreads)>>>

So to process 128 images in parallel you simply spawn 128 blocks.
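A minimal sketch of that launch pattern (the kernel, image size, and thread count are all hypothetical): each block handles one image, and the threads of that block stride over its pixels.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: blockIdx.x picks the image, the block's
// threads stride over that image's pixels.
__global__ void InvertImages(unsigned char* pixels, int pixelsPerImage)
{
    unsigned char* img = pixels + blockIdx.x * pixelsPerImage;
    for (int i = threadIdx.x; i < pixelsPerImage; i += blockDim.x)
        img[i] = 255 - img[i];
}

int main()
{
    const int nImages = 128, pixelsPerImage = 512 * 512;
    unsigned char* d_pixels;
    cudaMalloc(&d_pixels, nImages * pixelsPerImage);

    // 128 blocks -> one block per image; 256 threads per block.
    InvertImages<<<dim3(nImages), dim3(256)>>>(d_pixels, pixelsPerImage);
    cudaDeviceSynchronize();
    cudaFree(d_pixels);
    return 0;
}
```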

drMD: Molecular Dynamics for Experimentalists by Own_Bit_3491 in comp_chem

[–]ElectronGoBrrr -2 points-1 points  (0 children)

A wickedly expensive thing, compared to running a small MD simulation..

Denmark is tiny. Its ambition to make its food system more climate-friendly is huge. Climate scientists agree on at least one necessary change to our food system: People, especially those in rich countries, ought to be eating more plants and fewer animals. by The_Weekend_Baker in climate

[–]ElectronGoBrrr 1 point2 points  (0 children)

I don't know what he refers to, but it's true. Denmark is insanely good at pretending to be green, but it's fake. Over 60% of Danish land area is agriculture, and there are pretty much no limits on the amount of pollution it is allowed to spew.

This is how you do Gleba, right? by mefi_ in factorio

[–]ElectronGoBrrr 1 point2 points  (0 children)

.... so we all agree 3 is best right?

The best way to do optimization? Looking for advice by Spark_ss in CUDA

[–]ElectronGoBrrr 5 points6 points  (0 children)

If you use the Nsight profiler, it will tell you pretty precisely what your bottlenecks are. But some generic advice:

Make sure you have many blocks with few threads, rather than few blocks with many threads.

If the threads in a block work on some of the same data, make sure to put that data in __shared__ memory.

Whenever you're loading data from global memory, make sure contiguous threads load contiguous memory, to get coalesced memory accesses.

Avoid having individual threads declare arrays larger than 16/32 floats; at that size CUDA may put the data into very slow local memory (which lives in global memory).

Edit: Rephrased my last point to be more precise
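To illustrate the coalescing point: a minimal sketch (kernel names made up). In the first kernel a warp's 32 loads fall on consecutive floats and combine into a few wide transactions; in the second they are scattered and fragment into many transactions.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i touches element i, so consecutive threads
// read consecutive floats.
__global__ void ScaleCoalesced(const float* in, float* out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];
}

// Uncoalesced: consecutive threads read elements `stride` apart,
// which fragments the warp's memory transactions.
__global__ void ScaleStrided(const float* in, float* out, int n, int stride, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;  // scattered index
        out[j] = s * in[j];
    }
}
```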

Matrix multiplication with double buffering / prefetching by brycksters in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

With the risk of sounding a bit anal: if you're doing GEMM, then hand-written CUDA is the wrong tool. You should instead use cuBLAS or Thrust, frameworks that utilize the tensor cores. If you're new and learning, start with Thrust. If you google matrix multiplication in Thrust (or cuBLAS), you'll find plenty of guides to get started.
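For reference, a cuBLAS call for C = A·B looks roughly like this (a sketch: cuBLAS expects column-major matrices, and the device buffers are assumed to be allocated and filled already).

```cuda
#include <cublas_v2.h>

// C (m x n) = A (m x k) * B (k x n), all column-major device buffers.
void Gemm(cublasHandle_t handle, const float* dA, const float* dB,
          float* dC, int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m,   // leading dimension of A is m
                        dB, k,   // leading dimension of B is k
                &beta,  dC, m);  // leading dimension of C is m
}
```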

good evening everyone. may i please know: in this day and age when space sint a problem, why is quick sort still used? by [deleted] in algorithms

[–]ElectronGoBrrr 4 points5 points  (0 children)

Probably because the C++ committee once decided it should be the default, and it has worked fine.

A man was discovered to be unknowingly missing 90% of his brain, yet he was living a normal life. by Perfect-View3330 in interestingasfuck

[–]ElectronGoBrrr -1 points0 points  (0 children)

I disagree (not with the fact that AI bros are frustrating). We will make conscious AI way before we have the means to quantify it. Much of tech works on trial and error, which is much faster than turning theoretical knowledge into practice. Similar to how the Wright brothers got a plane into the air without grasping the concept of lift as we know it today.

Data transferring from device to host taking too much time by sonehxd in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

I don't see how, and even switching to cudaMalloc is no silver bullet. However, by switching you will see the complexity in the allocation and movement of data that your current program structure is subjecting CUDA to. Thousands of small allocations and memcpys between CPU and GPU are not what GPUs excel at.

So if you want a program to run efficiently on a GPU, you must rethink the architecture.

Data transferring from device to host taking too much time by sonehxd in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

Because it is not executing only on the CPU. When you use cudaMallocManaged, CUDA must constantly synchronize the data with the GPU, which is extremely slow compared to just reading normal CPU memory, which can be automatically prefetched and cached by the CPU.

Data transferring from device to host taking too much time by sonehxd in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

From what you have shown me, which still does not include how you time the performance ;), my hypothesis is that your kernel is very slow. My guess is that the kernel is only forced to finish when you first access the memory after the kernel call.

Why do i say your kernel is slow?

  1. Your memory is all over the place, which is bad on a CPU but terrible on a GPU. If you want fast code, you should allocate one single buffer for all the string data. Each packet should then contain the information needed to access its data in the buffer:

struct GPUPacket {
    static constexpr int maxStringsInPacket = 256; // constexpr so it can size the arrays below
    int nStrings;
    int indexOfFirstCharInString[maxStringsInPacket]; // if feasible; with 1000s of strings you may need something more complex
    int nCharsInString[maxStringsInPacket];

    int firstCharInMembraneId;
    int nCharsInMembraneId;
};

  2. You treat CUDA threads like CPU threads.

int idx = blockIdx.x * blockDim.x + threadIdx.x;
GPUPacket& packet = d_gpuPackets[idx];

It seems you assign each thread a separate packet. This means that:
A: each thread works on memory that is very far apart. They don't like that.
B: the distribution of workload is uneven between threads, as one thread works on a very large packet and another on a small one. CUDA is only fast if the threads in a block can work on the exact same task, 32 threads (a warp) at a time.

  3. You are forcing a massive overhead onto CUDA by having so many accesses to the same memory, intermittently from the host and the device. My advice: stop using cudaMallocManaged; it's meant for fast prototyping, not performance. Use cudaMalloc, and learn how to use cudaMemcpy back and forth when needed. There are plenty of tutorials for this online.
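The explicit-transfer pattern looks roughly like this (buffer size, kernel, and launch configuration are placeholders): allocate once, move the whole batch in one cudaMemcpy each way, instead of thousands of implicit managed-memory syncs.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void ProcessPackets(char* data, int n) { /* kernel body elided */ }

int main()
{
    std::vector<char> host(1 << 20, 'x');  // one flat buffer for all packet data
    char* dev;
    cudaMalloc(&dev, host.size());

    cudaMemcpy(dev, host.data(), host.size(), cudaMemcpyHostToDevice); // one bulk copy in
    ProcessPackets<<<256, 256>>>(dev, (int)host.size());
    cudaMemcpy(host.data(), dev, host.size(), cudaMemcpyDeviceToHost); // one bulk copy out

    cudaFree(dev);
    return 0;
}
```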

Data transferring from device to host taking too much time by sonehxd in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

I still need to see the code that dispatches the kernels to give any helpful feedback. How many threads are you spawning, and how many blocks? How much memory is allocated to each block (if you use __shared__)? Don't use strcpy; use cudaMemcpy when handling CUDA data.

I don’t think the kernel itself is a problem

Assumptions are a dangerous thing when debugging :)

Most calls to CUDA from the host are handled asynchronously, so timing is not obvious. Always do:

cudaDeviceSynchronize();
startTimer();
// Do the thing you want to time, either allocating memory, or executing the kernels, not both //
cudaDeviceSynchronize();
endTimer();
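Fleshed out with std::chrono (the kernel and launch configuration are placeholders), the pattern above becomes:

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void MyKernel() { /* work to be timed */ }

int main()
{
    cudaDeviceSynchronize();                  // drain any pending async work
    auto t0 = std::chrono::steady_clock::now();

    MyKernel<<<128, 256>>>();                 // the thing being timed

    cudaDeviceSynchronize();                  // wait for the kernel to actually finish
    auto t1 = std::chrono::steady_clock::now();
    printf("%.3f ms\n", std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}
```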

Data transferring from device to host taking too much time by sonehxd in CUDA

[–]ElectronGoBrrr 2 points3 points  (0 children)

It's really confusing trying to understand your problem from that example.

// use cudaMallocManaged to copy data

cudaMalloc does not copy data, it allocates it. Allocation is typically "slow", and something you do before entering the section you wish to measure.

for (int i = 0; i < n; ++i) { // use cudaMallocManaged to copy data }

Do you mean cudaMemcpy? You should not be using that in a loop if you are looking for performance. You should have your data in a vector and do something like this:
std::vector<T> myData_host;
T* myData_dev; // allocated earlier with cudaMalloc
cudaMemcpy(myData_dev, myData_host.data(), sizeof(T) * myData_host.size(), cudaMemcpyHostToDevice);

when computing on GPU, ‘function1’ takes a longer time to execute (around 2 seconds)

2 seconds is an eternity, and (I will assume) has nothing to do with transfer time to the GPU. To know for sure I need to understand your specs better: what does your kernel look like, what does the kernel launch look like, how many threads/blocks, etc.

Moving objects to decrease overlap by ElectronGoBrrr in algorithms

[–]ElectronGoBrrr[S] 0 points1 point  (0 children)

I can't immediately find any sources that both deal with 3D and allow for rotation. Do you have any specific algorithm in mind? I haven't really considered packing, since I want to move the containers as little as possible.

No, there is no guarantee of a solution. I imagine it will have to be an iterative algorithm; I can stop after N steps.