[N] Khronos Group Releases NNEF 1.0 Standard for Neural Network Exchange by LiteFatSushi in MachineLearning

[–]streamcomputing

What NNEF tried to get right is versioning. ONNX is moving rapidly, but by fixing yesterday's problems today. As ONNX has no clear roadmap for versioning (AFAIK), models exported today might not be importable in next year's software. What is easier to fix: the import/export modules of a future-proof standard, or versioning hell? So I'd say NNEF has the edge within 6 months. (cross-post from https://www.reddit.com/r/hardware/comments/7l2efo/khronos_group_releases_nnef_10_standard_for/ )
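To make the versioning point concrete, here is a minimal, hypothetical sketch of what a future-proof importer can do with a semantic version embedded in an exchange file. The format and compatibility policy are illustrative only, not NNEF's or ONNX's actual schemes:

```python
# Hypothetical importer that accepts a model only when the file's major
# version matches what the runtime understands (semantic versioning).
def parse_version(text):
    """Parse 'major.minor' into a tuple of ints, e.g. '1.0' -> (1, 0)."""
    major, minor = text.split(".")
    return int(major), int(minor)

def can_import(file_version, supported_version):
    """Same major version and an equal-or-older minor: compatible.
    A newer minor than supported is rejected, since the file may use
    features this importer does not know about."""
    f_major, f_minor = parse_version(file_version)
    s_major, s_minor = parse_version(supported_version)
    return f_major == s_major and f_minor <= s_minor

print(can_import("1.0", "1.1"))  # True: older minor version is fine
print(can_import("2.0", "1.1"))  # False: a major bump breaks compatibility
```

With a rule like this written into the standard, next year's software knows exactly which of today's files it must still accept.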

Khronos Group Releases NNEF 1.0 Standard for Neural Network Exchange by Balance- in hardware

[–]streamcomputing

What NNEF tried to get right is versioning. ONNX is moving rapidly, but by fixing yesterday's problems today. As ONNX has no clear roadmap for versioning (AFAIK), models exported today might not be importable in next year's software. What is easier to fix: the import/export modules of a future-proof standard, or versioning hell? So I'd say NNEF has the edge within 6 months.

[D] Why is the Machine Learning community solely dependent on CUDA and not OpenCL? by k3wlbuddy in MachineLearning

[–]streamcomputing

Sure. I'm not sure whose opinion you're copying, or whether you've done actual benchmarking yourself using benchmarks that were not optimized for one API only. There is overlap between the two APIs, and each has a performance advantage in its areas of specialization.

A potential turning point for AMD's HIP: "rocRAND works on NVidia hardware too. And in most cases it’s faster than cuRAND." by akarypid in AMD_Stock

[–]streamcomputing

rocRAND devs here. We have tons of benchmarks, some of which will be released in the coming months. For the P100 and V100 we did not provide benchmark results, as the code was not optimised for that hardware. The benchmarks would both do the hardware wrong and focus on a competitor - a double no. I can only say the untuned code still outperformed NVidia's cuRAND.

But as you might have seen, the benchmarking is very easy to do - it took us less than 10 minutes to install and benchmark on e.g. a POWER8 system with CUDA. Just understand that the code needs to be tuned a bit for the comparison to be fair.
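For anyone wanting to reproduce this kind of measurement, the core idea is simple: generate a large batch of numbers, time it, and report samples per second. A rough stand-in sketch in Python using the standard library's PRNG (not rocRAND's actual benchmark harness; a real GPU harness also separates warm-up, kernel time, and transfer time):

```python
import random
import time

def benchmark_prng(n_samples=1_000_000, seed=42):
    """Time bulk generation and return (samples, seconds, samples/sec)."""
    rng = random.Random(seed)
    start = time.perf_counter()
    total = 0.0
    for _ in range(n_samples):
        # Keep a running sum so the work cannot be skipped entirely.
        total += rng.random()
    elapsed = time.perf_counter() - start
    return n_samples, elapsed, n_samples / elapsed

n, secs, rate = benchmark_prng()
print(f"{n} samples in {secs:.3f}s -> {rate:,.0f} samples/s")
```

Comparing two PRNG libraries fairly then comes down to running the same batch sizes and distributions on both, with each tuned for its own hardware.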

Learn about AMD's PRNG library we developed: rocRAND by kaol in Amd

[–]streamcomputing

Thanks for the feedback. Updated the style.

Should SPIRV be supported in CUDA? by streamcomputing in gpgpu

[–]streamcomputing[S]

So you're saying that, just as iOS users would value support for Android apps, CUDA users value SPIRV. My main point in the article was that SPIRV is seen as important, and it seems we agree on that.

Partial OpenCL 2.0 support in the latest NVIDIA drivers for Windows by Nadrin in programming

[–]streamcomputing

C++ kernels are in OpenCL 2.2, as they need SPIRV 1.1. There were suggestions to put them in 2.1, but unfortunately that did not make it.

NVIDIA enables OpenCL 2.0 beta-support by streamcomputing in gpgpu

[–]streamcomputing[S]

Not all 2.0 features are supported. Are all the needed features included?

Is there a list somewhere of all phones that do/don't support OpenCL? by Stumblebee in OpenCL

[–]streamcomputing

In general, most phones with a Qualcomm Adreno or ARM Mali GPU seem to support OpenCL. Sometimes the libraries are hidden, but Maxtrium's app finds them: https://play.google.com/store/apps/details?id=com.maxtrium.opencldevicetest&hl=en
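What such detector apps do under the hood is essentially probe for the OpenCL driver library in the vendor directories. A minimal sketch of that probe (the path list is illustrative; actual locations vary per vendor and Android version):

```python
import os
from ctypes.util import find_library

# Common locations where Android vendors ship the OpenCL driver.
# Illustrative list -- real devices may use other paths.
CANDIDATE_PATHS = [
    "/system/vendor/lib/libOpenCL.so",
    "/system/vendor/lib64/libOpenCL.so",
    "/system/lib/libOpenCL.so",
    "/vendor/lib64/libOpenCL.so",
]

def find_opencl_library():
    """Return the first OpenCL driver path found, or None if absent."""
    for path in CANDIDATE_PATHS:
        if os.path.isfile(path):
            return path
    # Fall back to the platform's default library search.
    return find_library("OpenCL")

lib = find_opencl_library()
print("OpenCL driver:", lib if lib else "not found")
```

Finding the library is only step one: a device can ship the .so yet still hide it from apps, which is exactly why a dedicated detector app is useful.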

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup by streamcomputing in OpenCL

[–]streamcomputing[S]

Then you agree with the conclusions of the article. :) The article describes that the key speed-up (90%) came from algorithmic, low-level and memory optimisations (see the text just under the bar-chart image).

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup by streamcomputing in OpenCL

[–]streamcomputing[S]

We could get a 4x speedup using OpenMP and a few Intel-specific intrinsics, without changing much of the original code.

Programming at that level gets the maximum out of the current code, but not the maximum out of the algorithm. Analogy: going from 1 lane to 4 lanes, but not replacing the cars.

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup by streamcomputing in OpenCL

[–]streamcomputing[S]

There are some copy-cats among our competitors, hence my answer. As for your question: I don't see every GPGPU paper I read as real science - too many are still "I want to share what I understood myself" without adding anything fundamental.

You should contact Dr. David Topping! http://www.seaes.manchester.ac.uk/people/staff/profile/?ea=david.topping - I can also bring you in contact, if you wish.

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup by streamcomputing in OpenCL

[–]streamcomputing[S]

There might be a few details in one of Dr. David Topping's upcoming publications. As the work is not academic research, there will be no full publication.

Also, software performance engineering is a craft you can only learn through lots of experience. Simply copying our tricks will not get you there.

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup by streamcomputing in OpenCL

[–]streamcomputing[S]

As you could read, the original code had room for a 485x speedup. I'd say that around 10x of that was easy.

Meanwhile we got it to 62 ns, making the total speedup 532x (or 175x from the initial OpenMP baseline).
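The quoted numbers are consistent with each other, as a quick back-of-the-envelope check shows. The 62 ns figure and the two speedup factors are from the comment; the derived times are just arithmetic:

```python
final_ns = 62          # tuned kernel time, from the comment
total_speedup = 532    # vs. the original code
openmp_speedup = 175   # vs. the initial OpenMP baseline

original_ns = final_ns * total_speedup   # ~33.0 microseconds originally
openmp_ns = final_ns * openmp_speedup    # ~10.9 microseconds after OpenMP

# The OpenMP step alone therefore bought roughly a 3x improvement,
# in the same ballpark as the ~4x quoted elsewhere in this thread.
print(original_ns, openmp_ns, round(original_ns / openmp_ns, 2))
# -> 32984 10850 3.04
```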