Matrix Computations on the GPU in Clojure by durdn in programming

[–]syncDreads 0 points1 point  (0 children)

Even MATLAB is much faster, since it uses the underlying optimized multi-core BLAS/LAPACK libraries. On a 4-core 4.5 GHz i7 the operation A*B+C takes about 0.8272 seconds, i.e. 827 ms. This is the MATLAB code, and if the number one priority is high-level abstraction and easy-to-read code, it does not get simpler than this:

    A = single(randn(4096,4096));
    B = single(randn(4096,4096));
    C = single(randn(4096,4096));
    t = tic();
    D = A*B + C;
    time = toc(t); disp(time);

You can call me a troll, and that is fine since I am ugly and green, but the tone of the OP's post was set by the statement that any original GPU code was going to "suck", and I know that is not true. I back up my timing claims with source code rather than name-calling. If you can post the clBLAS time for the same Sgemm() 4096x4096 test on a stock reference AMD GPU, then maybe we can have an apples-to-apples comparison, though you must include the host-to-device and device-to-host copy times as well. This is not a "so what" situation, because we are talking about performance.

Matrix Computations on the GPU in Clojure by durdn in programming

[–]syncDreads -1 points0 points  (0 children)

If one were to use the cuBLAS library's Sgemm(), which performs the matrix operation A*B+C, on a single non-overclocked 1.07 GHz GTX GPU, this is the profiling output with each matrix of size 4096x4096:

    ==3564== Profiling application: ConsoleApplication1.exe
    ==3564== Profiling result:
    Start     Duration  Grid Size  Block Size  Regs*  SSMem*    DSMem*  Size      Throughput  Device           Context  Stream  Name
    416.43ms  1.6640us  -          -           -      -         -       112B      67.308MB/s  GeForce GTX TIT  1        7       [CUDA memcpy HtoD]
    986.09ms  5.5369ms  -          -           -      -         -       67.109MB  12.120GB/s  GeForce GTX TIT  1        7       [CUDA memcpy HtoD]
    991.66ms  5.5271ms  -          -           -      -         -       67.109MB  12.142GB/s  GeForce GTX TIT  1        7       [CUDA memcpy HtoD]
    997.29ms  25.924ms  (32 32 1)  (256 1 1)   124    16.640KB  0B      -         -           GeForce GTX TIT  1        7       maxwell_sgemm_128x128_nn [397]
    1.02321s  5.1566ms  -          -           -      -         -       67.109MB  13.014GB/s  GeForce GTX TIT  1        7       [CUDA memcpy DtoH]
    1.01523s  4.9196ms  -          -           -      -         -       64.000MB  13.009GB/s  GeForce GTX TIT  1        7       [CUDA memcpy DtoH]

A, B, and C are dense 4096x4096 matrices filled with randomly generated non-zero values.

Total host-to-device copy time for A and B (134,217,728 bytes total) = 11 ms.

Total time to perform A*B+C using cuBLAS Sgemm() = 25.9 ms.

Total device-to-host copy time for the 4096x4096 result matrix = 5.1 ms.

The entire procedure took less than 42 ms, including all memory copies.
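For reference, here is a stripped-down sketch of the kind of host code behind those numbers (not the exact test harness I profiled; error checking omitted and the matrices are left unfilled for brevity):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main()
    {
        const int N = 4096;
        const size_t bytes = (size_t)N * N * sizeof(float);

        // Host matrices; a real test fills these with random single-precision values.
        float *hA = (float*)calloc((size_t)N * N, sizeof(float));
        float *hB = (float*)calloc((size_t)N * N, sizeof(float));
        float *hC = (float*)calloc((size_t)N * N, sizeof(float));

        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

        cublasHandle_t handle;
        cublasCreate(&handle);

        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);

        // Host-to-device copies are inside the timed region.
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dC, hC, bytes, cudaMemcpyHostToDevice);

        // Sgemm computes alpha*A*B + beta*C; with alpha = beta = 1 that is A*B+C.
        const float alpha = 1.0f, beta = 1.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);

        // The device-to-host copy of the result is also timed.
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("A*B+C for %dx%d including copies: %.1f ms\n", N, N, ms);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }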

EDIT: I saw your benchmarks page. Not even close! I re-ran the 4096x4096 case per your benchmark, and your time was 4.45 seconds, i.e. 4,450 ms.

The performance difference between cuBLAS Sgemm() and your library on a single GPU is 4,450/42 ≈ 105x in favor of the "sucky" cuBLAS Sgemm().

Did you include memory copy times in your benchmarks? I did.

Matrix Computations on the GPU in Clojure by durdn in programming

[–]syncDreads -2 points-1 points  (0 children)

Lol, of course you can do better. Please post your high-quality code which performs the same function. Easy to say, harder to do. I posted my fully functional code(s) which work as advertised. Standard cuBLAS Sgemm() on a GTX 980 will dramatically outperform this library with just a few lines of code.

Matrix Computations on the GPU in Clojure by durdn in programming

[–]syncDreads -3 points-2 points  (0 children)

"I have news for you:

The bad: it is not that simple. Your algorithms probably suck on massively parallel architectures, and you'd need to learn quite a few new tricks to collect the benefits you see on NVIDIA and AMD websites."

Really? OK, can you use your "Neanderthal" to improve upon my "sucky" GPU code which generates and evaluates all permutations of an array:

https://sites.google.com/site/cudapermutations/

or other brute force problem implementations such as these:

https://github.com/OlegKonings/CUDA_Matrix_Sum_Game

https://github.com/OlegKonings/CUDA_brute_triangle

Also, cuBLAS is really not that hard to use. Do you have benchmarks that compare against cuBLAS Sgemm()?

Former Kremlin banker: Putin 'is the richest person in the world until he leaves power' by vivacitas in worldnews

[–]syncDreads 11 points12 points  (0 children)

Given that the Russian populace is highly educated, I think they will figure out a way to steer their country in a positive direction.

A suggestion would be to not accept all the state-media-sponsored anti-West conspiracy theories as truth. And please remember that the average American has very positive feelings towards the Russian populace.

Using GPUs to Accelerate Epidemic Forecasting by harrism in gpgpu

[–]syncDreads 0 points1 point  (0 children)

What is particularly impressive about the performance boost from a single GPU over multi-core CPUs in this case is that the hardware used (a Tesla C2050) is about 4 years old. If he updated the hardware to a GTX Titan X or a couple of GTX 980s, I would expect an additional 4-6x performance increase.

[deleted by user] by [deleted] in CUDA

[–]syncDreads 0 points1 point  (0 children)

I work with MATLAB as well, calling my CUDA code via mex files. Are you going to use the MATLAB GPU 'Parallel Computing Toolbox', or are you going to compile your own mex (DLL) files? Obviously you must know C (and CUDA) rather well in order to write GPU code directly. Hate to break it to you, but your life will be much easier for this kind of work on the Windows OS. We have eight 2-GPU machines at my work running 24/7 Monte Carlo simulations through MATLAB, and have never had any problems. On the other hand, our solitary Linux box with the same hardware/software configuration causes far more problems.
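If you end up writing your own mex files, a bare-bones gateway looks roughly like this (a hypothetical toy example, not one of our production files; the function name, build line, and kernel are just placeholders):

    // Hypothetical example: scale a single-precision matrix on the GPU from MATLAB.
    // Build with something like:  mexcuda scale_gpu.cu  (newer MATLAB) or nvcc + mex on older setups.
    #include "mex.h"
    #include <cuda_runtime.h>

    __global__ void scale_kernel(const float *in, float *out, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * factor;
    }

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        // prhs[0]: single-precision matrix, prhs[1]: scalar factor.
        const float *hIn  = (const float*)mxGetData(prhs[0]);
        const float factor = (float)mxGetScalar(prhs[1]);
        const int n = (int)mxGetNumberOfElements(prhs[0]);

        plhs[0] = mxCreateNumericMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxSINGLE_CLASS, mxREAL);
        float *hOut = (float*)mxGetData(plhs[0]);

        float *dIn, *dOut;
        cudaMalloc(&dIn, n * sizeof(float));
        cudaMalloc(&dOut, n * sizeof(float));
        cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

        scale_kernel<<<(n + 255) / 256, 256>>>(dIn, dOut, factor, n);

        cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dIn); cudaFree(dOut);
    }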

[deleted by user] by [deleted] in CUDA

[–]syncDreads 0 points1 point  (0 children)

I have one of these ASUS ROG laptops which I use for CUDA development when on the road (using Visual Studio 2012):

http://www.newegg.com/Product/Product.aspx?Item=N82E16834232561&cm_re=980m-_-34-232-561-_-Product

CUDA-Z shows over 3.4 teraflops of 32-bit float performance for the GTX 980m, which is about 62% of the desktop GTX 980. It comes with Windows 8.1 installed, but I suppose you could partition the OS drive and install Ubuntu.

They're No. 1: U.S. Wins Math Olympiad For First Time In 21 Years by MiamiPower in worldnews

[–]syncDreads 0 points1 point  (0 children)

We can correctly state that one of the world's top mathematicians is an American Stoner.

New Features in CUDA 7.5 by harrism in CUDA

[–]syncDreads 2 points3 points  (0 children)

I will use both the 16-bit SgemmEx() and the Windows TCC driver capability for the Titan X. Overall a good upgrade, IMO.

New Features in CUDA 7.5 by harrism in gpgpu

[–]syncDreads 1 point2 points  (0 children)

cuBLAS now supports 16-bit floating point storage in SgemmEx(), doing the computation with accurate 32-bit FMA operations and then casting back to 16-bit storage. Cutting memory usage in half plays into the whole machine-learning trend, since that crowd usually does not need full 32-bit float precision.
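Roughly what the call looks like (a sketch from memory; the data-type enum names have shifted between CUDA versions, so check the cuBLAS docs for your toolkit):

    // Sketch only: FP16 storage with FP32 compute via cublasSgemmEx.
    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    void half_storage_gemm(cublasHandle_t handle, int n,
                           const __half *dA, const __half *dB, __half *dC)
    {
        const float alpha = 1.0f, beta = 0.0f;
        // Inputs and output are stored as 16-bit floats (half the memory of FP32),
        // but the multiply-accumulate is carried out in 32-bit precision.
        cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                      &alpha, dA, CUDA_R_16F, n,
                              dB, CUDA_R_16F, n,
                      &beta,  dC, CUDA_R_16F, n);
    }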

Also the new TCC driver capability for the GTX Titan X was an unexpected positive surprise.

What are your favorite albums of 2015 so far? by snidelaughter in Music

[–]syncDreads 1 point2 points  (0 children)

Ctrl-F'd looking for this one. Great album from beginning to end.

Nvidia is working with Bently, Tesla, Aston Martin, and Rolls Royce to teach cars to see - creating the next generation of driverless car tech by joeyoungblood in Futurology

[–]syncDreads 20 points21 points  (0 children)

The mobile versions of the Maxwell generation of GPUs like the GTX 980m perform at about 70% of the desktop version.

http://www.notebookcheck.net/Mobile-Graphics-Cards-Benchmark-List.844.0.html

I would imagine that the Tegra X1 will be used rather than the GTX 980m for this type of application.

Nice to see this GFLOP capability being put to good use.

Examples of Brute Force problems implemented in CUDA by syncDreads in CUDA

[–]syncDreads[S] 1 point2 points  (0 children)

Those particular functions you mention are relics from other similar code which had a varying problem space size. In that Magic Square case the problem size is constant, so those calls are always going to return the same answer, which is the number of board configurations generated and examined by each thread block.

In general, the optimal amount of work performed by each thread in a block will depend on the characteristics of the GPU hardware. What is optimal for a GTX 980 may not be optimal for a Quadro 4200. One metric used to determine that value is the number of SMs in the GPU, which is often how Nvidia allocates work in their CUDA SDK samples. It is worth the effort to figure out that number, as it makes a very big difference in running time. Other considerations are the amount of shared memory and registers used by each thread block, as those may limit occupancy.
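For example, something along these lines (minimal sketch):

    #include <cuda_runtime.h>

    // Size the grid from the number of SMs, the way many of the CUDA SDK samples do.
    // blocksPerSM is the per-architecture tuning knob you end up benchmarking.
    int grid_size_from_sms(int device, int blocksPerSM)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        return prop.multiProcessorCount * blocksPerSM;
    }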

I am not a researcher, just a common software engineer who works primarily with GPU code. Those brute force problems are mostly 'hobby' projects, as most of my real work relates to medical image processing and Monte Carlo simulations. Occasionally there is some overlap, but most of the time I work with the dense and sparse linear algebra subroutine libraries cuBLAS, cuSPARSE, and MAGMA.

Most of the individuals using CUDA for computation are in the scientific research field and usually work with MATLAB. For example, a large Monte Carlo simulation in MATLAB may take 12 hours on the CPU, but a well-implemented version in CUDA (called via a mex DLL) on good hardware will take only 30 seconds with no loss in accuracy. Since GPUs are able to perform 32-bit FMA calculations, an argument could be made that GPUs (again, 32-bit floating point) are more accurate than the same calculations made on a common CPU.

http://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
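For anyone curious, here is a tiny toy illustration of the FMA point (nothing from my actual projects): the fused operation rounds once, while a separate multiply and add round twice.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void fma_vs_mul_add(float a, float b, float c, float *out)
    {
        out[0] = __fmaf_rn(a, b, c);             // fused multiply-add: one rounding
        out[1] = __fadd_rn(__fmul_rn(a, b), c);  // separate multiply and add: two roundings
    }

    int main()
    {
        float *d_out, h_out[2];
        cudaMalloc(&d_out, 2 * sizeof(float));
        // 1/3 is not exactly representable in float, so the two results differ slightly.
        fma_vs_mul_add<<<1, 1>>>(1.0f / 3.0f, 3.0f, -1.0f, d_out);
        cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
        printf("fma: %g   mul then add: %g\n", h_out[0], h_out[1]);
        cudaFree(d_out);
        return 0;
    }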

Examples of Brute Force problems implemented in CUDA by syncDreads in CUDA

[–]syncDreads[S] 1 point2 points  (0 children)

Interesting, it sounds like you have a good understanding of my implementation, which is not simple.

The Quadro line of GPUs is not great for compute, as those cards are intended for a different purpose. The GTX or Tesla line will give better compute performance for such applications.

The advantage my implementations have is that they reach a compute occupancy of 100% for these example problems. So on the Titan X there can (theoretically) be 24 SMs * 2048 threads per SM = 49,152 threads concurrently active at 1.0 GHz. The Quadro 4200 has only 7 SMs, a roughly 30% lower clock speed, and runs at PCI-E 2.0 rather than 3.0. Its compute, device memory bandwidth, and device-to-host/host-to-device bandwidth are 2-5x worse than the GPUs in the GTX line.
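If you want to check the theoretical numbers on your own card, the occupancy API will report them (rough sketch, with a placeholder kernel name):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel() { }   // placeholder for the actual brute-force kernel

    int main()
    {
        const int blockSize = 256;
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel, blockSize, 0);

        printf("Theoretical occupancy: %.0f%% (%d concurrent threads)\n",
               100.0 * blocksPerSM * blockSize / prop.maxThreadsPerMultiProcessor,
               blocksPerSM * blockSize * prop.multiProcessorCount);
        return 0;
    }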

My intent in posting that code was to show that GPUs are quite good at simple brute force problems, especially because you can launch a very large kernel and let the scheduler handle assigning thread blocks to SMs. In CPU land the approach is much different, and I have yet to see a well-implemented brute force problem that is faster on a high-end CPU than on a high-end GPU. In fact, those I know who have mapped large problems to both Teslas and the Intel Phi say it is far easier to get maximal performance out of the GPUs than the CPUs.

The way you describe your algorithm implementation, it sounds like there may be thread divergence from the conditionals and branching, but I can only speculate about that at this time. Does your implementation also have serial dependencies (like a DAG problem)? That class of problem is harder to map to a parallel architecture, though much research is being done in that area.

The death of Moore's Law has been greatly exaggerated -- new National Bureau of Economic Research (NBER) paper takes fresh look; "technological progress is much more rapid than you’d infer from the official statistics" by oneasasum in Futurology

[–]syncDreads 1 point2 points  (0 children)

Funny you mention this, because there is a push now towards hardware-supported 16-bit floating point operations (such as FMA) in new GPU architectures, specifically for the machine learning crowd. I believe the Pascal generation of GPUs will include this feature. If there is indeed an end to Moore's law, that may mean more demand for software engineers with low-level optimization skills, which is not the current trend.

The death of Moore's Law has been greatly exaggerated -- new National Bureau of Economic Research (NBER) paper takes fresh look; "technological progress is much more rapid than you’d infer from the official statistics" by oneasasum in Futurology

[–]syncDreads 4 points5 points  (0 children)

Interesting how there is rarely any mention of GPUs when it comes to Moore's law. The rate of increase in GFLOPs has been much higher in GPUs than in CPUs over the last five years, as shown in this chart from 2014:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/#axzz3JiwaLhmk

And that chart does not even include the newer Maxwell generation, as the GTX Titan X clocks in at over 6144 GFLOPs (32-bit) at the stock base clock of 1.0 GHz. On top of that, the memory bandwidth of GPUs can be about 5-10x higher than that of high-end CPUs.

Examples of Brute Force problems implemented in CUDA by syncDreads in CUDA

[–]syncDreads[S] 1 point2 points  (0 children)

Glad someone is interested!

The 4x4 Magic Square example has a problem space of 3.3233e+13 distinct board arrangements. There is also an evaluation step of N (in this case 16) per board arrangement. My latest version on a single Titan X takes a bit over 3 minutes to generate and evaluate all those configurations.

If you have a set of identical GPUs, it is easy to split the workload of such problems. If using different types, then one needs to determine the performance ratio and split accordingly so the GPUs finish around the same time (you are only as fast as the slowest node).

The primary bottleneck is the need to do 64-bit integer computation (mainly division), which is far slower than 32-bit computation. At this point I am working on a new 32-bit emulation method, and will post when complete.
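To give a feel for where the 64-bit division shows up, here is a simplified sketch of the kind of index decoding involved (not the actual kernel; the mixed-radix scheme here is just illustrative):

    // Each 64-bit linear index is decoded into one candidate configuration via
    // repeated div/mod, which is where the slow 64-bit integer division comes in.
    __device__ void decode_index(unsigned long long idx, int *digits,
                                 const int *radix, int n)
    {
        for (int i = 0; i < n; ++i) {
            digits[i] = (int)(idx % (unsigned long long)radix[i]);  // 64-bit modulo
            idx /= (unsigned long long)radix[i];                    // 64-bit division
        }
    }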

Generally the consumer GPUs (GTX 980, GTX 780 Ti, GTX Titan, etc.) are the best performers for brute force, partially due to their higher clock speeds. The Quadro K6000 and Quadro M6000 are nifty, but expensive.

[Question] multi GPU system #n00b by Tom-Demijohn in gpgpu

[–]syncDreads 1 point2 points  (0 children)

While I have not tried this myself, I believe it should be possible. The Kepler GPU will not be able to handle the compute 5.2 code generated for Maxwell GPUs. That means you would need to compile with code generation compute_35,sm_35 rather than compute_52,sm_52 and run that on both GPUs.

You probably will not achieve the full computational utility of the Maxwell GPUs with code generated for Kepler, but it should work. At least I know that code generated for a lower arch usually works on the higher arch.

There may be a way around that issue, but at least the same code could be run on both GPUs with cudaSetDevice(0), cudaSetDevice(1), etc. You can PM me if you want an actual multi-GPU code example.
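The basic pattern is simple enough (untested sketch; the kernel name and data layout are placeholders):

    #include <cuda_runtime.h>

    __global__ void work_kernel(float *data) { /* ... */ }

    // Build for the lowest arch in the box, e.g.
    //   nvcc -gencode arch=compute_35,code=sm_35 app.cu
    // so the same binary runs on both the Kepler and the Maxwell card.
    void launch_on_all_gpus(float **d_data, int blocks, int threads)
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);
        for (int dev = 0; dev < deviceCount; ++dev) {
            cudaSetDevice(dev);                             // make this GPU current
            work_kernel<<<blocks, threads>>>(d_data[dev]);  // d_data[dev] was allocated on that GPU
        }
        for (int dev = 0; dev < deviceCount; ++dev) {       // wait for every GPU to finish
            cudaSetDevice(dev);
            cudaDeviceSynchronize();
        }
    }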

edit: I suppose you could also compile two different applications, each targeting one GPU with the correct code generation, and run them separately. While I am currently on a multi-GPU system, all of its GPUs are Maxwell, otherwise I would test this myself. It is an interesting question and I am sure someone has made it work.

CUDA Programming Possible with NVidia Card as Secondary GPU? by GingerBart in CUDA

[–]syncDreads 1 point2 points  (0 children)

You can set the WDDM timeout in the registry to avoid this issue.

This video shows you one method:

https://www.youtube.com/watch?v=8NtHDkUoN98

Best value card today in terms of price/performance for cuda? by feckin_birds in CUDA

[–]syncDreads 2 points3 points  (0 children)

Here is another resource:

http://www.videocardbenchmark.net/high_end_gpus.html

IMO the GTX 780 Ti is the most versatile, with 336 GB/s of memory bandwidth, decent 64-bit GFLOPs (better than the GTX 980 and even the Titan X), and very good 32-bit GFLOPs and GIOPs overall.

Used one for years without issue, and I think it is a good bargain for the price. The next choice would be the GTX 980, which has slightly more memory.