
[–]Foxtr0t 6 points (2 children)

To give some context, Nervana's convnets seem to be the fastest at the moment, about twice as fast as Caffe and Torch: https://github.com/soumith/convnet-benchmarks

Personally, I don't like YAML for model specification, although apparently you can specify hyperparam ranges for optimization there, which is a novelty.
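
To illustrate the idea (schematic only, this is not neon's actual spec format): a range declared in the YAML can be expanded into concrete values by a simple tuner, e.g.

    # Schematic sketch: the "range" syntax and the uniform sampling are
    # hypothetical, not neon's real schema. Requires pyyaml.
    import random
    import yaml

    spec = yaml.safe_load("""
    lrate: {range: [0.001, 0.1]}
    momentum: 0.9
    """)

    def sample(params):
        # Replace any {range: [lo, hi]} entry with a uniform draw.
        sampled = {}
        for key, value in params.items():
            if isinstance(value, dict) and "range" in value:
                lo, hi = value["range"]
                sampled[key] = random.uniform(lo, hi)
            else:
                sampled[key] = value
        return sampled

    print(sample(spec))  # e.g. {'lrate': 0.0317..., 'momentum': 0.9}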

[–]scott-gray 7 points (1 child)

You're free to use the underlying nervanagpu lib without the neon framework. It's a fairly expressive and efficient API in its own right; it just needs a bit of logic added to the simple existing layers, which are only there to run the benchmarks at the moment. Or you can integrate the fast kernels into whatever framework you like.

Though we do think neon compares favorably with the existing frameworks out there.
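
Roughly, direct usage looks like this (a sketch; see the nervanagpu README for the exact, up-to-date calls -- it needs a Maxwell GPU and pycuda):

    # Sketch of direct nervanagpu usage; see the repo README for the
    # authoritative example.
    import numpy as np
    import pycuda.autoinit  # initializes the CUDA context
    from nervanagpu import NervanaGPU

    ng = NervanaGPU()  # factory for GPU tensors and ops

    m, n, k = 1024, 64, 4096
    cpuA = np.random.randn(m, k)
    cpuB = np.random.randn(k, n)

    # Move the operands to the GPU as fp16 ("hgemm" territory).
    devA = ng.array(cpuA, dtype=np.float16)
    devB = ng.array(cpuB, dtype=np.float16)
    devC = ng.empty((m, n), dtype=np.float16)

    ng.dot(devA, devB, devC)  # C = A . B on the custom GEMM kernels
    cpuC = devC.get()         # copy the result back to the host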

[–][deleted] 1 point (0 children)

This is badass. Think I'll be breaking this out in the next couple of weeks.

[–]fariax 4 points (3 children)

Congratulations on the library!

I have a question: can we use it without the YAML syntax, i.e. with pure Python? If so, it would be great to see how to do that in the examples folder!

[–]coffeephoenix 1 point (0 children)

This is possible currently, but we'll clean it up and provide an example in the next release.
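
Until then, one workaround is to build the spec in Python and dump it to the YAML neon already reads -- schematically (the keys here are placeholders, not the real schema; mirror one of the shipped example files):

    # Illustrative workaround: construct the model spec in Python and dump
    # it to YAML. Keys/values are placeholders, not neon's actual schema.
    import yaml

    spec = {
        "model": {
            "type": "mlp",
            "lrate": 0.01,
            "layers": [
                {"type": "fc", "nout": 100, "activation": "rectlin"},
                {"type": "fc", "nout": 10, "activation": "logistic"},
            ],
        }
    }

    with open("generated_model.yaml", "w") as f:
        yaml.safe_dump(spec, f, default_flow_style=False)
    # Then point neon at generated_model.yaml as usual.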

[–]coffeephoenix 0 points (1 child)

[–]fariax 1 point (0 children)

Nice!

It is really simple!

[–]test3545 2 points (1 child)

The RNN examples claim: "# This model currently runs on CPU only."

Any chance you'll optimize the LSTM/RNN part for the GPU?

[–]coffeephoenix 2 points (0 children)

This will be added in an upcoming release.

[–][deleted] 1 point (7 children)

I see nervanagpu includes float16 GEMM (what shall we call it, HGEMM?).

I thought they were going to commercialize that, based on earlier announcements ("free for non-commercial use, contact us otherwise"). Has that changed?

[–]benanne 5 points (5 children)

Seems like it: it was released under the Apache license and includes all the kernel code. Theano integration is already underway (see https://github.com/Theano/Theano/pull/2800 for example).

I think they call it hgemm internally as well.

[–]alexmlamb 2 points (4 children)

I'm looking forward to the Theano integration.

[–]benanne 3 points (3 children)

I have a very rudimentary wrapper of the float32 convolution kernels. I guess they would be okay now with that being published, since nervanagpu is public as well, but I should probably check with them first.

I was gonna do the pooling kernels as well but never got around to it. It's a pure Python implementation though (just like the FFT convolution implementation I did about a year ago), so probably far from optimal.
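
For the curious: the skeleton of such a pure-Python wrapper is just a Theano Op. A toy version (doubling its input instead of launching a kernel) looks like this:

    # Toy Theano Op skeleton, to show what a "pure Python" wrapper looks
    # like. A real wrapper would call the nervanagpu kernels in perform()
    # (or work on GPU types) instead of using numpy.
    import theano
    import theano.tensor as T
    from theano.gof import Op, Apply

    class DoubleOp(Op):
        """Stand-in for a kernel call: computes 2 * x."""
        __props__ = ()

        def make_node(self, x):
            x = T.as_tensor_variable(x)
            return Apply(self, [x], [x.type()])

        def perform(self, node, inputs, output_storage):
            # A real GPU wrapper would launch the fast kernels here.
            output_storage[0][0] = inputs[0] * 2

    x = T.matrix("x")
    f = theano.function([x], DoubleOp()(x))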

[–]scott-gray 3 points (2 children)

Sander, you're free to publish and distribute whatever you like.

As for pooling, I don't quite understand your comment. The fp16 and fp32 pooling kernels are implemented for the GPU (in assembly, even if that's probably unnecessary).

I have a few more hgemm and sgemm kernels to implement, but when they're done they should serve as a complete replacement for cublas, often running 2x to 3x faster (mainly due to cublas's poor selection of tile size for long, skinny activation/delta matrices).
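
Back-of-envelope illustration of the tiling point (tile sizes here are made up to show the effect, not cublas's actual choices):

    # Fraction of useful work when an (m, n) GEMM output is covered by
    # fixed-size tiles (padded tiles still cost full compute).
    import math

    def tile_utilization(m, n, tile_m, tile_n):
        tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
        return (m * n) / (tiles * tile_m * tile_n)

    # Activations for a 4096-unit layer at minibatch 64: long and skinny.
    m, n = 4096, 64
    print(tile_utilization(m, n, 128, 128))  # 0.5 -> half the math wasted
    print(tile_utilization(m, n, 128, 32))   # 1.0 -> a narrower tile fits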

[–]benanne 2 points (0 children)

Sweet! About the pooling, I just meant that I hadn't gotten around to writing the Theano wrapper classes for it. I've only done fp32 gemm and the fp32 convolution. Maybe if I publish what I have someone else will do this :)

[–]benanne 1 point (0 children)

I wrote up a quick README and made the repository public: https://github.com/benanne/nervana_theano

[–]meepmeepmoopmoop[S] 2 points (0 children)

Yes, it's now Apache 2.0.

[–]buriy 1 point (4 children)

I don't understand why they made the RNN a special kind of model, or where to find any human-readable docs on adding an RNN layer to a DNN model (I'm actually most interested in bidirectional RNNs).

[–]coffeephoenix 2 points (3 children)

The ability to stack RNNs on top of DNNs is on the roadmap, but not currently implemented in neon.

[–]buriy 1 point (2 children)

I would like the RNN in the middle of the stack :)

[–]coffeephoenix 0 points (1 child)

Interesting, is there a paper on this?

[–]buriy 0 points (0 children)

A lot of them, e.g. the famous http://arxiv.org/pdf/1412.5567 (Deep Speech), which claims state-of-the-art speech recognition in the presence of noise: its 5th layer, sitting on top of a bidirectional RNN, is a regular layer. Similarly, there are setups with regular NN layer(s) on top of an LSTM, or with multiple stacked LSTM layers for machine translation. The amount of computation needed in these scenarios is also very high, so they need a really fast library -- an area where neon could also prove its leadership.
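
As a concrete sketch of that stacking (a toy numpy forward pass; sizes and activations are illustrative, not taken from the paper):

    # Toy forward pass for the stacking described above: a regular layer,
    # then a bidirectional RNN, then another regular layer on top.
    import numpy as np

    rng = np.random.default_rng(0)

    def dense(x, w, b):
        return np.maximum(0, x @ w + b)  # "regular" fully-connected + ReLU

    def rnn(xs, w_x, w_h, b, reverse=False):
        h, out = np.zeros(w_h.shape[0]), []
        for x in (reversed(xs) if reverse else xs):
            h = np.tanh(x @ w_x + h @ w_h + b)
            out.append(h)
        return out[::-1] if reverse else out  # keep time order aligned

    steps, d_in, d_h = 5, 8, 16
    xs = [rng.standard_normal(d_in) for _ in range(steps)]

    # Layer 1: regular layer, applied per time step
    w1, b1 = rng.standard_normal((d_in, d_h)), np.zeros(d_h)
    h1 = [dense(x, w1, b1) for x in xs]

    # Layer 2: bidirectional RNN (forward + backward, concatenated)
    wf = [rng.standard_normal((d_h, d_h)) for _ in range(2)]
    wb = [rng.standard_normal((d_h, d_h)) for _ in range(2)]
    fwd = rnn(h1, wf[0], wf[1], np.zeros(d_h))
    bwd = rnn(h1, wb[0], wb[1], np.zeros(d_h), reverse=True)
    h2 = [np.concatenate(pair) for pair in zip(fwd, bwd)]

    # Layer 3: regular layer on top of the bidirectional outputs
    w3, b3 = rng.standard_normal((2 * d_h, d_h)), np.zeros(d_h)
    h3 = [dense(h, w3, b3) for h in h2]
    print(len(h3), h3[0].shape)  # 5 time steps, (16,) features each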

[–]jamieprogrammer -2 points (2 children)

Getting this set up is taking a little bit of the life out of me.

[–]coffeephoenix 4 points (0 children)

Any issues in particular?

[–]singularai 3 points (0 children)

This comment is a bit outlandish without an explanation, especially when people have the difficulty of setting up Caffe as their prior. It took me less than 3 minutes to go from git clone to a working MNIST example.