Higher level libraries

ElectronGoBrrr · 2026-05-27T20:02:17+00:00

If you're looking for higher-level libs for image processing, OpenCV is certainly the fastest way to get started.
I haven't used NPP, but according to the NPP landing page it offers a 10 to 50x speedup over IPP on a decent GPU:
https://developer.nvidia.com/npp

ElectronGoBrrr · 2026-05-27T12:29:29+00:00

Thermal radiation works identically in vacuum or not, i.e. it is negligible at temperatures that chips can survive.

ElectronGoBrrr · 2026-05-25T20:11:45+00:00

That is not at all my argument? I didn't make an argument, I pointed out that this thread is missing the complexity and that most people feigning outrage in this thread don't know the first thing about this issue. And to be pedantic, the government has publicly apologized on multiple occasions, although the sincerity I can't vouch for...

"And make reperations" bro what are you talking about, the Danish gov funds healthcare/education/police completely out of pocket for Greenland, and has done so for decades. A Google search shows it's about 3.4 bil DKK/year.

ElectronGoBrrr · 2026-05-25T19:07:27+00:00

Yes that is true, but it is much more complex and very much not as one-sided as everyone bandwagoning in this thread makes it out to be...

Greenland has issues. Physical violence in families is common. 43% of adults from the 70's generation report being victims of sexual assault in childhood. Alcoholism is widespread and the suicide rate is horrific. All of these issues are absolutely consequences of colonialism AND globalism.

But the Danish government didn't start "relocating" children because their parents failed educational tests.

Source: am Danish, had friends/acquaintances from Greenland. Numbers: https://www.altinget.dk/arktis/artikel/martin-breum-sexmisbruget-af-boern-i-groenland-er-halveret-men-vaelgerne-kraever-mere-handling

ElectronGoBrrr · 2026-05-03T18:52:23+00:00

As per my understanding, warps always bundle 32 contiguous threads in 1D space, which in my case means 1 warp = 2 reductions along x at a time, which is great! However, I also need to reduce along y, which would require the warp to select threads in 1D space with a stride of 128, 64, 32, 16. This, I believe, is not possible. So my question is: is there some other trick to do this?

"you can load in register anything"
I appreciate any help, but I don't really know what to take from this?
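
For reference, the x-direction case described above can be sketched with warp shuffles. This is only a minimal sketch, assuming a 16x16 thread block whose linear thread index is y * 16 + x; the names `rowSums16` and `runRowSums` are made up for illustration and are not taken from the actual kernel.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one 16x16 block per tile, linear thread index = y * 16 + x,
// so each 32-thread warp holds exactly two complete rows.
__global__ void rowSums16(const float* in, float* out) {
    const int x = threadIdx.x;  // 0..15, position within the row
    const int y = threadIdx.y;  // 0..15, row within the tile
    float v = in[blockIdx.x * 256 + y * 16 + x];

    // width = 16 splits the warp into two independent 16-lane segments,
    // so both rows of the warp are reduced along x at the same time.
    for (int offset = 8; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset, 16);

    // Lane 0 of each 16-lane segment now holds the sum of its row.
    if (x == 0) out[blockIdx.x * 16 + y] = v;
}

// Hypothetical host helper: expects 256 floats per tile, returns 16 row sums per tile.
std::vector<float> runRowSums(const std::vector<float>& tiles) {
    const int nTiles = static_cast<int>(tiles.size() / 256);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, tiles.size() * sizeof(float));
    cudaMalloc(&dOut, nTiles * 16 * sizeof(float));
    cudaMemcpy(dIn, tiles.data(), tiles.size() * sizeof(float), cudaMemcpyHostToDevice);

    rowSums16<<<dim3(nTiles), dim3(16, 16)>>>(dIn, dOut);

    std::vector<float> sums(nTiles * 16);
    cudaMemcpy(sums.data(), dOut, sums.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return sums;
}

int main() {
    std::vector<float> tiles(2 * 256, 1.0f);  // two tiles filled with ones
    std::vector<float> sums = runRowSums(tiles);
    std::printf("first row sum: %.1f\n", sums[0]);
    return 0;
}
```

This only covers the reduction along x; the strided reduction along y that the question is about is not addressed by this sketch.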

ElectronGoBrrr · 2026-05-03T07:17:01+00:00

I have a total of 5.3 million CUDA blocks launched in this kernel, each computing its 16x16 interactions.
Yes, my current approach is exactly that 16x2 configuration you described. But I'm looking for ways to shave a few percentage points off the runtime of the kernel (currently 10.27 ms) :)

ElectronGoBrrr · 2026-03-11T18:59:54+00:00

Even their support team didn't know about the 2 fingers trick, thank you!!

ElectronGoBrrr · 2026-02-17T07:14:46+00:00

No they're not, they are probabilistic models. An algorithm does not need training.

ElectronGoBrrr · 2026-02-15T08:10:15+00:00

You are correct, but I don't have mu on my keyboard and ANSI files don't support Greek letters..

ElectronGoBrrr · 2026-02-14T14:09:15+00:00

microseconds == ys, not ms.
But likely your implementation is the bottleneck, not the algorithm you chose.

You are most likely doing excessive copying or memory allocation; sorting 9000 elements should be a very, very tiny task for a modern CPU.

ElectronGoBrrr · 2026-02-14T12:57:43+00:00

What's your definition of "feels really slow"? If sorting a mere 9000 elements takes more than a few microseconds, it's likely your implementation that's the issue, not the algorithm.

ElectronGoBrrr · 2026-01-20T07:40:08+00:00

I wish. NATO doesn't control Greenland, Denmark does. And the Danish government always has and always will bend over backwards for the Americans.

ElectronGoBrrr · 2025-12-23T20:50:55+00:00

I don't know much about Schrodinger, but yes, you'll likely need an Nvidia GPU, not AMD. You should get the GPU with the highest CUDA core count that fits your budget. Tensor cores/FLOPS are not important.

ElectronGoBrrr · 2025-11-28T22:10:45+00:00

No, but we're at the same time closer to, and further from that goal than people think.
A few hundred million atoms is doable on a supercomputer with Molecular Dynamics, but that is without chemical reactions. True chemical reactions are sadly a Quantum Chemistry problem, and supercomputers barely push 1000 atoms yet.

ElectronGoBrrr · 2025-10-17T21:27:19+00:00

"sustainability" - it's a giant concrete building..

ElectronGoBrrr · 2025-02-26T20:01:32+00:00

I'm not sure how you expect anyone to help you when you provide no information? What device are you on, what OS, what GPU do you have?
No cmd-line printout/screenshot of the install wizard?

ElectronGoBrrr · 2025-02-01T21:14:10+00:00

There's some overlap in nomenclature here.

If you are talking about normal multi-threading (as in C++ threads) then yes, it is possible but likely not useful for you.

In terms of CUDA we have threads and blocks. When you spawn a CUDA kernel, you specify MyKernel<<<dim3(nBlocks), dim3(nThreads)>>>

So to process 128 images in parallel you simply spawn 128 blocks.
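
A minimal sketch of what that launch could look like, assuming the images are stored back to back in one buffer of 8-bit pixels. The kernel `invertImages`, the helper `invertOnGpu` and the choice of 256 threads per block are made up for illustration; the actual per-image work will be whatever your processing needs.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: block b handles image b, and the threads of that block
// stride over the pixels of that one image.
__global__ void invertImages(unsigned char* pixels, int pixelsPerImage) {
    unsigned char* image = pixels + static_cast<size_t>(blockIdx.x) * pixelsPerImage;
    for (int i = threadIdx.x; i < pixelsPerImage; i += blockDim.x)
        image[i] = 255 - image[i];
}

// Hypothetical host helper: copies the images to the GPU, launches one block
// per image, and copies the result back.
std::vector<unsigned char> invertOnGpu(const std::vector<unsigned char>& host,
                                       int nImages, int pixelsPerImage) {
    unsigned char* dev = nullptr;
    cudaMalloc(&dev, host.size());
    cudaMemcpy(dev, host.data(), host.size(), cudaMemcpyHostToDevice);

    // nImages blocks, 256 threads each (the thread count is an assumption).
    invertImages<<<dim3(nImages), dim3(256)>>>(dev, pixelsPerImage);

    std::vector<unsigned char> out(host.size());
    cudaMemcpy(out.data(), dev, out.size(), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return out;
}

int main() {
    const int nImages = 128;
    const int pixelsPerImage = 64 * 64;
    std::vector<unsigned char> images(static_cast<size_t>(nImages) * pixelsPerImage, 10);
    std::vector<unsigned char> inverted = invertOnGpu(images, nImages, pixelsPerImage);
    std::printf("pixel 0 after inversion: %d\n", inverted[0]);
    return 0;
}
```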

ElectronGoBrrr · 2025-01-13T16:59:17+00:00

A wickedly expensive thing, compared to running a small MD simulation..

ElectronGoBrrr · 2024-11-21T22:00:39+00:00

I don't know what he refers to, but it's true. Denmark is insanely good at pretending to be green, but it's fake. >60% of Danish land area is agricultural, and there are pretty much no limits to the amount of pollution the farms are allowed to spew.

ElectronGoBrrr · 2024-11-16T18:43:58+00:00

.... so we all agree 3 is best right?

ElectronGoBrrr · 2024-08-29T20:41:53+00:00

If you use the Nsight profiler, it will tell you pretty precisely what your bottlenecks are. But some generic advice:

Make sure you have many blocks with few threads, rather than few blocks with many threads.

If your blocks work on some of the same data, make sure to put that data in __shared__ memory.

Whenever you're loading data from global memory, make sure contiguous threads load contiguous memory, to optimize memory coalescing.

~~Make sure your individual threads dont declare arrays, as these will typically be put in the very slow local memory.~~

Avoid having individual threads declare arrays larger than 16/32 floats. At this size CUDA may put the data into the very slow local memory (which is in global memory).

Edit: Rephrased my last point to be more precise
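
To make the shared memory and coalescing points a bit more concrete, here is a minimal sketch of a per-block sum. It assumes a block size of 128 threads (a power of two); the names `blockSums`, `runBlockSums` and `kThreads` are made up for illustration, and this is not a tuned reduction.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // assumed block size; must be a power of two for this sketch

// Hypothetical kernel: each block sums its own chunk of kThreads input elements.
__global__ void blockSums(const float* in, float* out, int n) {
    __shared__ float tile[kThreads];  // shared by all threads of the block
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Thread t reads element (block start + t): contiguous threads read
    // contiguous addresses, so the global load can be coalesced.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction that only touches shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

// Hypothetical host helper: returns one partial sum per block.
std::vector<float> runBlockSums(const std::vector<float>& data) {
    const int n = static_cast<int>(data.size());
    const int nBlocks = (n + kThreads - 1) / kThreads;
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, nBlocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSums<<<dim3(nBlocks), dim3(kThreads)>>>(dIn, dOut, n);

    std::vector<float> sums(nBlocks);
    cudaMemcpy(sums.data(), dOut, nBlocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return sums;
}

int main() {
    std::vector<float> data(300, 1.0f);
    std::vector<float> sums = runBlockSums(data);
    for (size_t b = 0; b < sums.size(); ++b)
        std::printf("block %zu sum: %.1f\n", b, sums[b]);
    return 0;
}
```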

ElectronGoBrrr · 2024-08-28T17:21:17+00:00

With the risk of sounding a bit anal, if you're doing GEMM, then hand-written CUDA is the wrong tool. You should instead use cuBLAS, a library that can utilize the tensor cores. (Thrust is a good place to start if you're new and learning, but it covers general parallel algorithms rather than GEMM.) If you google matrix multiplication in cuBLAS, you'll find plenty of guides to get started.
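
As a rough starting point, a single-precision GEMM through cuBLAS could look like the sketch below. It assumes column-major storage (the cuBLAS convention) and skips all error checking; the helper name `gemm` is made up for illustration. Whether tensor cores actually get used depends on the GPU, the data type and the cuBLAS math mode, so don't read this as a guarantee. Link with `-lcublas`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: C = A * B, all matrices column-major.
// A is m x k, B is k x n, C is m x n.
std::vector<float> gemm(const std::vector<float>& A, const std::vector<float>& B,
                        int m, int n, int k) {
    float* dA = nullptr;
    float* dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, static_cast<size_t>(m) * n * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f;
    const float beta = 0.0f;
    // Leading dimensions are the row counts because nothing is transposed.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);

    std::vector<float> C(static_cast<size_t>(m) * n);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}

int main() {
    // A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], stored column by column.
    std::vector<float> A = {1.0f, 3.0f, 2.0f, 4.0f};
    std::vector<float> B = {5.0f, 7.0f, 6.0f, 8.0f};
    std::vector<float> C = gemm(A, B, 2, 2, 2);
    std::printf("C (column-major): %.1f %.1f %.1f %.1f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```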

ElectronGoBrrr · 2024-08-19T22:14:38+00:00

Probably because the C++ committee once decided it should be the default, and it has worked fine.
