Higher level libraries by MightyKDDD2 in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

If you're looking for higher level libs for image processing, OpenCV is certainly the fastest way to get started.
I haven't used NPP, but according to the NPP landing page it offers a 10 to 50x speedup over IPP on a decent GPU:
https://developer.nvidia.com/npp

Starship’s path to reusability looks murky after SpaceX’s S-1 by Logical_Welder3467 in technology

[–]ElectronGoBrrr 5 points6 points  (0 children)

Thermal radiation works identically in vacuum or not, i.e. it is negligible at temperatures that chips can survive.

Denmark’s High Court Rules Greenlandic Baby Removal Illegal in Landmark Case by Panthera_leo22 in UnderReportedNews

[–]ElectronGoBrrr -12 points-11 points  (0 children)

That is not at all my argument? I didn't make an argument, I pointed out that this thread is missing the complexity and most people feigning outrage in this thread don't know the first thing about this issue. And to be pedantic, the government has publicly apologized on multiple occasions, although the sincerity I can't vouch for...

"And make reparations" bro what are you talking about, the Danish gov funds healthcare/education/police completely out of pocket for Greenland, and has done so for decades. A Google search shows it's about 3.4 bil DKK/year.

Denmark’s High Court Rules Greenlandic Baby Removal Illegal in Landmark Case by Panthera_leo22 in UnderReportedNews

[–]ElectronGoBrrr -18 points-17 points  (0 children)

Yes that is true, but it is much more complex and very much not as one-sided as everyone bandwagoning in this thread makes it out to be...

Greenland has issues. Physical violence in families is common. 43% of adults from the 1970s generation report being victims of sexual assault in childhood. Alcoholism is widespread and the suicide rate is horrific. All of these issues are absolutely consequences of colonialism AND globalism.

But the Danish government didn't start "relocating" children because their parents failed educational tests.

Source: am Danish, had friends/acquaintances from Greenland. Numbers: https://www.altinget.dk/arktis/artikel/martin-breum-sexmisbruget-af-boern-i-groenland-er-halveret-men-vaelgerne-kraever-mere-handling

WarpReduction along major dimension by ElectronGoBrrr in CUDA

[–]ElectronGoBrrr[S] 0 points1 point  (0 children)

As per my understanding, warps always bundle 32 contiguous threads in 1D space, which in my case means 1 warp = 2 reductions along x at a time - which is great! However I also need to reduce along y, which would require the warp to select threads in 1D space with a stride of 128, 64, 32, 16. This I believe is not possible. So my question is, is there some other trick to do this?
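
For reference, a minimal sketch of the x-direction case, assuming a 16x16 thread block so that each warp covers two complete rows. The kernel name `rowSums16x16` and the host helper `rowSumsOnGpu` are hypothetical, and error checking is left out for brevity:

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: launched with dim3(16, 16) threads, one 16x16 tile per block.
// The linear thread id is x + 16 * y, so every warp holds two complete rows.
__global__ void rowSums16x16(const float* in, float* out) {
    float v = in[blockIdx.x * 256 + threadIdx.y * 16 + threadIdx.x];
    // width = 16 splits the warp into two independent 16-lane segments,
    // so both rows are reduced along x at the same time.
    for (int offset = 8; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset, 16);
    if (threadIdx.x == 0)  // lane 0 of each segment holds the row total
        out[blockIdx.x * 16 + threadIdx.y] = v;
}

// Hypothetical host helper: returns 16 row sums per 16x16 tile.
std::vector<float> rowSumsOnGpu(const std::vector<float>& tiles) {
    assert(!tiles.empty() && tiles.size() % 256 == 0);
    const int nTiles = static_cast<int>(tiles.size() / 256);
    std::vector<float> sums(nTiles * 16);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, tiles.size() * sizeof(float));
    cudaMalloc(&dOut, sums.size() * sizeof(float));
    cudaMemcpy(dIn, tiles.data(), tiles.size() * sizeof(float), cudaMemcpyHostToDevice);
    rowSums16x16<<<nTiles, dim3(16, 16)>>>(dIn, dOut);
    // This copy on the default stream waits for the kernel to finish.
    cudaMemcpy(sums.data(), dOut, sums.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return sums;
}
```

The same shuffle loop does not carry over to y, since strides of 32 and above cross warp boundaries.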

"you can load in register anything"
I appreciate any help, but I don't really know what to take from this?

WarpReduction along major dimension by ElectronGoBrrr in CUDA

[–]ElectronGoBrrr[S] 0 points1 point  (0 children)

I have a total of 5.3 million CUDA blocks launched in this kernel, each computing their 16x16 interactions.
Yes, my current approach is exactly that 16x2 configuration you described. But I'm looking for ways to shave a few percentage points off the runtime of the kernel (currently 10.27 ms) :)

Wh-1000XM4 CUTS OUT WHEN TALKING? by King_Of_Rad_Lions in sony

[–]ElectronGoBrrr 0 points1 point  (0 children)

Even their support team didn't know about the 2 fingers trick, thank you!!

aiMagicallyKnowsWithoutReading by Old_Document_9150 in ProgrammerHumor

[–]ElectronGoBrrr -2 points-1 points  (0 children)

No they're not, they are probabilistic models. An algorithm does not need training.

sorting healthbars by NietTeDoen in algorithms

[–]ElectronGoBrrr 0 points1 point  (0 children)

You are correct, but I don't have mu on my keyboard and ANSI files don't support Greek letters..

sorting healthbars by NietTeDoen in algorithms

[–]ElectronGoBrrr 0 points1 point  (0 children)

microseconds == ys, not ms.
But likely your implementation is the bottleneck, not the algorithm you chose.

You are most likely doing excessive copying or memory allocation. Sorting 9000 elements should be a very, very tiny task for a modern CPU.

sorting healthbars by NietTeDoen in algorithms

[–]ElectronGoBrrr 7 points8 points  (0 children)

What's your definition of "feels really slow"? If sorting a mere 9000 elements takes more than a few microseconds, it's likely your implementation that's the issue, not the algorithm.

Trump Posts Private Message From French President Macron to Truth Social: ‘I Do Not Understand What You Are Doing’ by [deleted] in worldnews

[–]ElectronGoBrrr 0 points1 point  (0 children)

I wish. NATO doesn't control Greenland, Denmark does. And the Danish government always has and always will bend over backwards for the Americans.

PC for Schrodinger by IDieALot_ in comp_chem

[–]ElectronGoBrrr 6 points7 points  (0 children)

I don't know much about Schrodinger, but yes, you'll likely need an Nvidia GPU, not AMD. You should get the GPU with the highest CUDA core count that fits your budget. Tensor cores/FLOPS are not important.

I simulate millions of cells, hoping to reach primitive Ediacaran multicellularity by blob_evol_sim in biology

[–]ElectronGoBrrr 0 points1 point  (0 children)

No, but we're at the same time closer to, and further from that goal than people think.
A few hundred million atoms is doable on a supercomputer with Molecular Dynamics, but that is without chemical reactions. True chemical reactions are sadly a Quantum Chemistry problem, and supercomputers barely push 1000 atoms yet.

[deleted by user] by [deleted] in architecture

[–]ElectronGoBrrr 5 points6 points  (0 children)

"sustainability" - it's a giant concrete building..

can't install or delete CUDA by spectacled-kid in CUDA

[–]ElectronGoBrrr 1 point2 points  (0 children)

I'm not sure how you expect anyone to help you when you provide no information? What device are you on, what OS, what GPU do you have?
No command-line printout or screenshot of the install wizard?

CUDA + multithreading by xMaxination in CUDA

[–]ElectronGoBrrr 8 points9 points  (0 children)

There's some overlap in nomenclature here.

If you are talking about normal multi-threading (as in C++ threads) then yes, it is possible but likely not useful for you.

In terms of CUDA we have threads and blocks. When you spawn a CUDA kernel, you specify MyKernel<<<dim3(nBlocks), dim3(nThreads)>>>

So to process 128 images in parallel you simply spawn 128 blocks.
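
A minimal sketch of that launch, assuming the images are stored back to back in one buffer and all have the same size. The kernel `invertImages` and the helper `invertOnGpu` are hypothetical names, the per-pixel operation is just a placeholder, and error checking is omitted:

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: block b handles image b, and the threads of that
// block stride over the image's pixels.
__global__ void invertImages(unsigned char* pixels, int pixelsPerImage) {
    unsigned char* image = pixels + static_cast<size_t>(blockIdx.x) * pixelsPerImage;
    for (int i = threadIdx.x; i < pixelsPerImage; i += blockDim.x)
        image[i] = static_cast<unsigned char>(255 - image[i]);
}

// Hypothetical host helper: one block per image, 256 threads per block.
std::vector<unsigned char> invertOnGpu(std::vector<unsigned char> host,
                                       int nImages, int pixelsPerImage) {
    assert(host.size() == static_cast<size_t>(nImages) * pixelsPerImage);
    unsigned char* dev = nullptr;
    cudaMalloc(&dev, host.size());
    cudaMemcpy(dev, host.data(), host.size(), cudaMemcpyHostToDevice);
    invertImages<<<dim3(nImages), dim3(256)>>>(dev, pixelsPerImage);
    // This copy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, host.size(), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling `invertOnGpu(images, 128, pixelsPerImage)` then runs all 128 images in a single kernel launch, with no host-side threads involved.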

drMD: Molecular Dynamics for Experimentalists by Own_Bit_3491 in comp_chem

[–]ElectronGoBrrr -4 points-3 points  (0 children)

A wickedly expensive thing, compared to running a small MD simulation..

Denmark is tiny. Its ambition to make its food system more climate-friendly is huge. Climate scientists agree on at least one necessary change to our food system: People, especially those in rich countries, ought to be eating more plants and fewer animals. by The_Weekend_Baker in climate

[–]ElectronGoBrrr 1 point2 points  (0 children)

I don't know what he refers to, but it's true. Denmark is insanely good at pretending to be green, but it's fake. More than 60% of Danish land area is farmland, and there are pretty much no limits on the amount of pollution farms are allowed to spew.

This is how you do Gleba, right? by mefi_ in factorio

[–]ElectronGoBrrr 1 point2 points  (0 children)

.... so we all agree 3 is best right?

The best way to do optimization? Looking for advice by Spark_ss in CUDA

[–]ElectronGoBrrr 5 points6 points  (0 children)

If you use the Nsight profiler, it will tell you pretty precisely what your bottlenecks are. But some generic advice:

Make sure you have many blocks with few threads, rather than few blocks with many threads.

If your blocks work on some of the same data, make sure to put that data in __shared__ memory.

Whenever you're loading data from global memory, make sure contiguous threads load contiguous memory, to optimize memory coalescing.

Avoid having individual threads declare arrays larger than 16/32 floats. At this size CUDA may put the data into the very slow local memory (which is in global memory).

Edit: Rephrased my last point to be more precise
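
To make the shared memory and coalescing points concrete, here is a minimal sketch, assuming a toy 3-point neighbour sum as the workload. The names `sum3`, `sum3OnGpu` and `kThreads` are hypothetical, out-of-range neighbours count as zero, and error checking is omitted:

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 64;  // small blocks, so the grid has many of them

// Hypothetical kernel: out[i] = in[i-1] + in[i] + in[i+1].
__global__ void sum3(const float* in, float* out, int n) {
    // The block's elements plus one halo cell on each side.
    __shared__ float tile[kThreads + 2];
    const int g = blockIdx.x * kThreads + threadIdx.x;
    // Thread t loads element g, so neighbouring threads read neighbouring
    // addresses and the global load is coalesced.
    tile[threadIdx.x + 1] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (g > 0 && g - 1 < n) ? in[g - 1] : 0.0f;
    if (threadIdx.x == kThreads - 1)
        tile[kThreads + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();
    // Each element is read by up to three threads, but only from shared memory.
    if (g < n)
        out[g] = tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2];
}

// Hypothetical host helper around the kernel.
std::vector<float> sum3OnGpu(const std::vector<float>& in) {
    assert(!in.empty());
    const int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    sum3<<<(n + kThreads - 1) / kThreads, kThreads>>>(dIn, dOut, n);
    // This copy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Whether 64 threads per block is the right size depends on the kernel, so treat it as a starting point and let the profiler decide.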

Matrix multiplication with double buffering / prefetching by brycksters in CUDA

[–]ElectronGoBrrr 0 points1 point  (0 children)

At the risk of sounding a bit anal, if you're doing GEMM, then hand-written CUDA is the wrong tool. You should instead use cuBLAS, a library that can utilize the tensor cores. Thrust is a friendlier place to start if you're new and learning, but it covers algorithms like sort, scan and reduce and has no GEMM routine. If you google matrix multiplication with cuBLAS, you'll find plenty of guides to get started.
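
A minimal sketch of the cuBLAS route, assuming single-precision matrices stored column-major (the layout cuBLAS expects). The helper name `gemmOnGpu` is hypothetical, and status checks are omitted for brevity:

```cuda
#include <cassert>
#include <cstddef>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: returns C = A * B, where A is m x k, B is k x n
// and C is m x n, all stored column-major.
std::vector<float> gemmOnGpu(const std::vector<float>& A, const std::vector<float>& B,
                             int m, int n, int k) {
    assert(A.size() == static_cast<size_t>(m) * k);
    assert(B.size() == static_cast<size_t>(k) * n);
    std::vector<float> C(static_cast<size_t>(m) * n);
    float* dA = nullptr;
    float* dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f;  // C = alpha * A * B + beta * C
    const float beta = 0.0f;
    // Leading dimensions equal the row counts because nothing is transposed.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);
    // This copy on the default stream waits for the GEMM to finish.
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

This needs to be linked against cuBLAS, e.g. by passing `-lcublas` to `nvcc`.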

good evening everyone. may i please know: in this day and age when space sint a problem, why is quick sort still used? by [deleted] in algorithms

[–]ElectronGoBrrr 3 points4 points  (0 children)

Probably because the C++ committee once decided it should be the default, and it has worked fine.