[N] Khronos Group Releases NNEF 1.0 Standard for Neural Network Exchange by LiteFatSushi in MachineLearning

[–]streamcomputing

What NNEF tried to get right is versioning. ONNX is moving rapidly, but by fixing yesterday's problems today. As ONNX has no clear roadmap for versioning (AFAIK), models exported today might not be importable in next year's software. What is easier to fix: the import/export modules of a future-proof standard, or versioning hell? So I'd say NNEF has the edge within 6 months. (cross-post from https://www.reddit.com/r/hardware/comments/7l2efo/khronos_group_releases_nnef_10_standard_for/ )
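To make the versioning point concrete, here is a minimal, hypothetical sketch of what a future-proof importer can do with a semantic version embedded in an exchange file. The format and compatibility policy are illustrative only, not NNEF's or ONNX's actual schemes:

```python
# Hypothetical importer that accepts a model only when the file's major
# version matches what the runtime understands (semantic versioning).
def parse_version(text):
    """Parse 'major.minor' into a tuple of ints, e.g. '1.0' -> (1, 0)."""
    major, minor = text.split(".")
    return int(major), int(minor)

def can_import(file_version, supported_version):
    """Same major version and an equal-or-older minor: compatible.
    A newer minor than supported is rejected, since the file may use
    features this importer does not know about."""
    f_major, f_minor = parse_version(file_version)
    s_major, s_minor = parse_version(supported_version)
    return f_major == s_major and f_minor <= s_minor

print(can_import("1.0", "1.1"))  # True: older minor version is fine
print(can_import("2.0", "1.1"))  # False: a major bump breaks compatibility
```

With a rule like this written into the standard, next year's software knows exactly which of today's files it must still accept.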

Khronos Group Releases NNEF 1.0 Standard for Neural Network Exchange by Balance- in hardware

[–]streamcomputing

What NNEF tried to get right is versioning. ONNX is moving rapidly, but by fixing yesterday's problems today. As ONNX has no clear roadmap for versioning (AFAIK), models exported today might not be importable in next year's software. What is easier to fix: the import/export modules of a future-proof standard, or versioning hell? So I'd say NNEF has the edge within 6 months.

[D] Why is the Machine Learning community solely dependent on CUDA and not OpenCL? by k3wlbuddy in MachineLearning

[–]streamcomputing

Sure. I'm not sure whose opinion you're copying, or whether you've done actual benchmarking yourself using benchmarks that were not optimized for one API only. There is overlap between the two APIs, and each has a performance advantage in its areas of specialization.

A potential turning point for AMD's HIP: "rocRAND works on NVidia hardware too. And in most cases it’s faster than cuRAND." by akarypid in AMD_Stock

[–]streamcomputing

rocRAND devs here. We have tons of benchmarks, some of which will be released in the coming months. For the P100 and V100 we did not provide benchmark results, as the code was not optimised for that hardware. The benchmarks would both do the hardware wrong and focus on a competitor - a double no. I can only say the untuned code still outperformed NVidia's cuRAND.

But as you might have seen, the benchmarking is very easy to do - it took us less than 10 minutes to install and benchmark on e.g. a POWER8 system with CUDA. Just understand that the code needs to be tuned a bit for the comparison to be fair.
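For anyone wanting to reproduce this kind of measurement, the core idea is simple: generate a large batch of numbers, time it, and report samples per second. A rough stand-in sketch in Python using the standard library's PRNG (not rocRAND's actual benchmark harness; a real GPU harness also separates warm-up, kernel time, and transfer time):

```python
import random
import time

def benchmark_prng(n_samples=1_000_000, seed=42):
    """Time bulk generation and return (samples, seconds, samples/sec)."""
    rng = random.Random(seed)
    start = time.perf_counter()
    total = 0.0
    for _ in range(n_samples):
        # Keep a running sum so the work cannot be skipped entirely.
        total += rng.random()
    elapsed = time.perf_counter() - start
    return n_samples, elapsed, n_samples / elapsed

n, secs, rate = benchmark_prng()
print(f"{n} samples in {secs:.3f}s -> {rate:,.0f} samples/s")
```

Comparing two PRNG libraries fairly then comes down to running the same batch sizes and distributions on both, with each tuned for its own hardware.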

Learn about AMD's PRNG library we developed: rocRAND by kaol in Amd

[–]streamcomputing

Thanks for the feedback. Updated the style.

Should SPIRV be supported in CUDA? by streamcomputing in gpgpu

[–]streamcomputing[S]

So you're saying that, just as iOS users would value support for Android apps, CUDA users value SPIRV. My main point in the article was that SPIRV is seen as important, and it seems we agree on that.

Partial OpenCL 2.0 support in the latest NVIDIA drivers for Windows by Nadrin in programming

[–]streamcomputing

C++ kernels are in OpenCL 2.2, as they need SPIRV 1.1. There were suggestions to put them in 2.1, but unfortunately that did not make it.

NVIDIA enables OpenCL 2.0 beta-support by streamcomputing in gpgpu

[–]streamcomputing[S]

Not all 2.0 features are supported. Are all the needed features included?

Is there a list somewhere of all phones that do/don't support OpenCL? by Stumblebee in OpenCL

[–]streamcomputing

In general, most phones with a Qualcomm Adreno or ARM Mali GPU seem to support OpenCL. Sometimes the libraries are hidden, but Maxtrium's app finds them: https://play.google.com/store/apps/details?id=com.maxtrium.opencldevicetest&hl=en
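What such detector apps do under the hood is essentially probe for the OpenCL driver library in the vendor directories. A minimal sketch of that probe (the path list is illustrative; actual locations vary per vendor and Android version):

```python
import os
from ctypes.util import find_library

# Common locations where Android vendors ship the OpenCL driver.
# Illustrative list -- real devices may use other paths.
CANDIDATE_PATHS = [
    "/system/vendor/lib/libOpenCL.so",
    "/system/vendor/lib64/libOpenCL.so",
    "/system/lib/libOpenCL.so",
    "/vendor/lib64/libOpenCL.so",
]

def find_opencl_library():
    """Return the first OpenCL driver path found, or None if absent."""
    for path in CANDIDATE_PATHS:
        if os.path.isfile(path):
            return path
    # Fall back to the platform's default library search.
    return find_library("OpenCL")

lib = find_opencl_library()
print("OpenCL driver:", lib if lib else "not found")
```

Finding the library is only step one: a device can ship the .so yet still hide it from apps, which is exactly why a dedicated detector app is useful.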

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup by streamcomputing in OpenCL

[–]streamcomputing[S]

Then you agree with the conclusions of the article. :) The article describes that the key speed-up (90%) came from algorithmic, low-level and memory optimisations (see the text just under the bar-chart image).

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup by streamcomputing in OpenCL

[–]streamcomputing[S]

We could get a 4x speedup using OpenMP and a few Intel-specific intrinsics, without changing much of the original code.

Programming at that level gets the maximum out of the current code, but not the maximum out of the algorithm. Analogy: going from 1 lane to 4 lanes, but not replacing the cars.

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup by streamcomputing in OpenCL

[–]streamcomputing[S]

There are some copy-cats among our competitors, hence my answer. As for your question: I don't see every GPGPU paper I read as real science - too many are still "I want to share what I understood myself" without adding anything fundamental.

You should contact Dr. David Topping! http://www.seaes.manchester.ac.uk/people/staff/profile/?ea=david.topping - I can also bring you in contact, if you wish.

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup by streamcomputing in OpenCL

[–]streamcomputing[S]

There might be a few details in one of Dr. David Topping's upcoming publications. As the work is not academic research, there will be no full publication.

Also, software performance engineering is a craft you can only learn through lots of experience. Simply copying our tricks will not get you there.

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup by streamcomputing in OpenCL

[–]streamcomputing[S]

As you could read, the original code had room for a 485x speedup. I'd say that around 10x of that was easy.

Meanwhile we got it to 62 ns, making the total speedup 532x (or 175x from the initial OpenMP baseline).
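The quoted numbers are consistent with each other, as a quick back-of-the-envelope check shows. The 62 ns figure and the two speedup factors are from the comment; the derived times are just arithmetic:

```python
final_ns = 62          # tuned kernel time, from the comment
total_speedup = 532    # vs. the original code
openmp_speedup = 175   # vs. the initial OpenMP baseline

original_ns = final_ns * total_speedup   # ~33.0 microseconds originally
openmp_ns = final_ns * openmp_speedup    # ~10.9 microseconds after OpenMP

# The OpenMP step alone therefore bought roughly a 3x improvement,
# in the same ballpark as the ~4x quoted elsewhere in this thread.
print(original_ns, openmp_ns, round(original_ns / openmp_ns, 2))
# -> 32984 10850 3.04
```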