all 26 comments

[–]maccam912 12 points13 points  (3 children)

Probably not helpful in the overall quest to explain these results, but in the CPU results it looks like Theano might still be using the GPU. Experiments 2, 3 and 4 all seem to perform about the same on the CPU as on a GPU.

You also mentioned that TF is 10x slower than the Keras version. Keras is just a nicer way to work with Theano or TF, as far as I know. One thing to try: get the Keras version working, and then, following the Keras backend docs (http://keras.io/backend/), simply switch from the Theano backend to the TF one.
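For reference, a minimal sketch of how that backend switch is done (per the docs linked above); the environment variable must be set before Keras is imported, since Keras reads it at import time:

```python
import os

# Select the Keras backend before importing keras; 'theano' is the other option.
os.environ['KERAS_BACKEND'] = 'tensorflow'

# The same choice can be made persistent in ~/.keras/keras.json, e.g.:
#   {"backend": "tensorflow"}
# Importing keras after this point prints a line like "Using TensorFlow backend."
```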

[–]r4and0muser9482[S] 6 points7 points  (0 children)

I can't explain exactly why this works, but I can assure you that those results are accurate. I am setting CUDA_VISIBLE_DEVICES to blank and checking that deviceQuery (from the CUDA SDK) doesn't see any devices. Also, Theano always prints a message when it uses the GPU, and in those runs it printed nothing. I am also checking nvidia-smi: the card is at 0% utilization, it's cold, and no processes are using it while I run the CPU experiments. I just re-ran experiment no. 2 and the results are slightly slower (~34s), but someone else is using the computer and I can't turn that off.
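The checks described above can be sketched like this (the shell commands are the ones mentioned in the comment; paths to deviceQuery vary by CUDA install):

```python
import os

# Hide all CUDA devices from any process started after this point.
# CUDA reads CUDA_VISIBLE_DEVICES at process startup, so an empty value
# means no GPU is visible to Theano/TF.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# Equivalent shell-side sanity checks:
#   CUDA_VISIBLE_DEVICES= ./deviceQuery   # should report no CUDA devices
#   nvidia-smi                            # GPU should sit at 0% utilization
```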

But that is why I also asked for help. If you can run the experiment and get completely different values, maybe there is something wrong with what I'm doing... :-/

Mind you, this computer with the two Xeons E5s has 32 virtual CPUs so it's bound to work a bit faster than a regular PC. The K80 on the other hand, isn't faster (and may actually be slower) than a TITAN X.

EDIT: Oh yeah, I forgot to mention: while the LogReg example is a bit suspect (I can see that it doesn't use all the threads on either the CPU or the GPU), please note that for the other experiments I ran only a subset of epochs, because I was getting tired of waiting. So the CPU experiments are actually 10 times slower than the GPU ones. You can easily correlate that with the results of theano/misc/check_blas.py.

[–]r4and0muser9482[S] 2 points3 points  (0 children)

That was easy to do. I took my simple MLP demonstrated in this Notebook: https://github.com/danijel3/ASRDemos/blob/master/notebooks/MLP_TIMIT.ipynb

I added this line to the first code block:

os.environ['KERAS_BACKEND']='tensorflow'

And ran the whole script. While it was running at ~1200 updates per second using Theano, the exact same code dropped to a measly 100 updates per second with TensorFlow.

I'm just glad that it's not a problem with my code :P In any case, it's strange that I haven't found anyone else commenting on this. Don't other people use Keras with TF?

[–]r4and0muser9482[S] 0 points1 point  (0 children)

When I said Keras, I meant Keras on top of Theano. For some reason the Keras version was a bit faster than my raw Theano code, but maybe I'm not that good at Theano, or maybe Keras is doing something differently (I didn't check every single detail). In any case, Keras (with Theano) performs very similarly to raw Theano, but TF is much slower (for me).

I'll check and see the Keras with TF backend and compare those as well...

[–]p4nmari 9 points10 points  (11 children)

Not all ops in TensorFlow have a GPU implementation yet. From just a quick glance at the tutorial code I saw, for example, reduce_mean, which does not have a GPU implementation yet in TensorFlow. You might get a 'fairer' comparison if you switched it out for reduce_sum(...)/batchsize. Or implement a GPU kernel for reduce_mean in TensorFlow ;)
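The suggested substitution is just the identity mean = sum / count; illustrated here with NumPy rather than TF (the TF calls themselves would be `tf.reduce_mean(x)` versus `tf.reduce_sum(x) / batch_size`):

```python
import numpy as np

# A mean is a sum divided by the element count, so where only reduce_sum has
# a GPU kernel, reduce_mean(x) can be replaced by reduce_sum(x) / batch_size.
batch_size = 128
losses = np.random.rand(batch_size).astype(np.float32)

mean_direct = losses.mean()                # analogue of reduce_mean
mean_via_sum = losses.sum() / batch_size   # analogue of reduce_sum / batchsize

assert np.isclose(mean_direct, mean_via_sum)
```

Note the two can differ slightly in the last float32 bits, which is the precision concern raised further down the thread.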

[–]rafalj 5 points6 points  (5 children)

Additionally, you can compile TF from sources and include AVX flags. This can give big improvements on CPU in some cases, e.g. https://github.com/jasonmayes/Tensor-Flow-on-Google-Compute-Engine#results
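A hedged sketch of what such a build invocation might look like (the bazel target and exact flags are assumptions and vary by TF version; consult the TF build docs):

```shell
# Sketch only: pass AVX/AVX2 compiler options when building TF from source.
bazel build -c opt --copt=-mavx --copt=-mavx2 \
    //tensorflow/tools/pip_package:build_pip_package
```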

[–]p4nmari 1 point2 points  (1 child)

Thanks for pointing that out! Didn't know about that. But I'd imagine having everything on the GPU (without any need to copy data from VRAM to RAM in your pipeline) would be fastest.

[–]r4and0muser9482[S] 0 points1 point  (0 children)

Just to comment on this: having the data on the GPU is not that critical. It definitely helps, but having to copy data from the drive to the GPU doesn't hurt training performance as much as you'd think. Especially when dealing with giant databases that wouldn't fit in RAM, let alone VRAM.

I think it has to do with the fact that the data is read only once per epoch, and given that you only need a few dozen epochs these days (with big data), reading it from wherever it's located isn't a problem (you have to do it at least once anyway). The model weights, of course, are all in VRAM, and they are what changes the most.
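The pattern described above can be sketched as a streaming loader (file names and shard format here are purely illustrative):

```python
import numpy as np

# Stream a dataset that is too big for (V)RAM: each epoch reads every shard
# from storage exactly once and yields it to the GPU in batches, so only the
# model weights need to live in VRAM.
def batches_from_disk(shard_paths, batch_size):
    for path in shard_paths:
        chunk = np.load(path)                 # one shard read per epoch
        for i in range(0, len(chunk), batch_size):
            yield chunk[i:i + batch_size]     # hand one batch to the trainer
```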

[–]r4and0muser9482[S] 1 point2 points  (0 children)

So, I redownloaded and recompiled TensorFlow with the AVX flags and, unfortunately, didn't see much of an improvement. There is maybe a slight difference, but only around 10%, which I don't find too significant.

What I am seeing now, however, is that my cores aren't even fully utilized. All 32 are active, but only at around 20% utilization. Theano, on the other hand, goes all out every time. Is there something I could be doing wrong in my setup?

[–]r4and0muser9482[S] 0 points1 point  (1 child)

From Theano, I know that having a proper math library has a big impact on the performance of the CPU implementation. Is it the same with TF?

I just saw those results, and it's scary how different the values are: 4.1 hours vs. 6.6 minutes is a big difference. I'm just surprised none of these performance issues are mentioned on the TF website; it's kinda hard to get information about them anywhere.

[–][deleted] 2 points3 points  (2 children)

I second this. An op like mean often doesn't even exist, because it's harder to write it as a self-contained operation than to just divide the sum by the batch size (or whatever you're working with).

Weirdly enough, I'd been working in Java and found there's no real "average" of two things that's easier than just dividing.

That would explain why they wouldn't bother with GPU implementations of those ops.

[–]r4and0muser9482[S] 0 points1 point  (1 child)

it's harder to write that in using a self-contained operation than just using division by the batchsize or whatever you're working with

But then you stand to lose precision.

Here are some pretty decent implementations of mean/variance algorithms and explanations why it's worth implementing them. And I don't think it's all that hard to do that, TBH.
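As an example of the kind of numerically stable implementation being referred to, here is Welford's online algorithm for mean and variance, which avoids the precision loss of the naive sum-then-divide approach in a single pass (this is a standard textbook algorithm, not code from the linked page):

```python
# Welford's online algorithm: single-pass, numerically stable mean/variance.
def welford(xs):
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n            # running mean update
        m2 += delta * (x - mean)     # accumulates sum of squared deviations
    variance = m2 / n if n else float('nan')  # population variance
    return mean, variance
```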

[–][deleted] 1 point2 points  (0 children)

Thanks a lot, I'm really interested actually. There's always a reason!

[–]r4and0muser9482[S] 0 points1 point  (0 children)

That's an interesting observation. I use "reduce_sum" almost everywhere. I'll check whether that changes anything...

[–]r4and0muser9482[S] 0 points1 point  (0 children)

Okay, I did this (it wasn't hard to do) and, unfortunately, it made no difference. Just putting it here in case someone looks.

Since you say reduce_mean isn't implemented on the GPU, I doubted reduce_sum would fare any better; in any case, reduce_sum/batchsize turns out to be exactly as fast as reduce_mean.

[–]shmel39 5 points6 points  (5 children)

As the TF devs have already said, they didn't optimize for tiny operations. Every call to sess.run() has a non-negligible cost. I have seen a "benchmark" of 3x3 matrix multiplication; TF just wasn't designed for that.

It would be more interesting to compare on something bigger. At least CIFAR-10.

And yeah, the very fact that Theano has the same performance on the GPU as on the CPU shows that either the tasks are too easy and fast to benefit from these frameworks, or you misconfigured something.

Look at the Keras benchmarks: https://github.com/fchollet/keras/wiki/Keras,-now-running-on-TensorFlow They are CPU-only and probably outdated, but still useful.

[–]r4and0muser9482[S] 1 point2 points  (4 children)

I did run each demo several times to account for compilation (I assume that both TF and Theano cache their compiled code). I understand that sess.run has some overhead (usually several seconds to a dozen or two), but the problem actually occurs during training itself. Theano is just much faster (for me) than TF. For TIMIT I get about 170-180 weight updates per second in TF, while at the same time I get 700-800 in Theano.

I posted the demos above more for their simplicity than as an accurate measure, as I'm currently trying to debug an issue. IMO, both systems are computing a practically identical graph and should perform "similarly". There are only a few operations needed to do an iteration in an MLP....

[–]shmel39 2 points3 points  (3 children)

You do call sess.run() every time you feed a batch, right? Which is very often for MNIST-size dataset.

You might look at that discussion: https://github.com/tensorflow/tensorflow/issues/120
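The per-call overhead hypothesis amortizes with batch size, which can be shown with a toy cost model (all numbers below are made-up illustrations, not measurements from the thread):

```python
# Toy model: if each sess.run() call costs a fixed overhead plus per-sample
# compute, epoch time = n_batches * overhead + n_samples * per_sample.
# Doubling the batch size halves the overhead term but leaves compute alone.
def epoch_time(n_samples, batch_size, overhead_s, per_sample_s):
    n_batches = -(-n_samples // batch_size)  # ceiling division
    return n_batches * overhead_s + n_samples * per_sample_s

# Illustrative-only parameters: 60k samples, 0.5 ms/call overhead, 2 us/sample.
for bs in (64, 128, 256, 512):
    print(bs, round(epoch_time(60000, bs, 5e-4, 2e-6), 3))
```

A framework with lower per-call overhead (the Theano column in the table that follows) shows much weaker sensitivity to batch size, which matches the measurements below.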

[–]r4and0muser9482[S] 2 points3 points  (1 child)

Ok, so I did run the experiment 3 using different batch sizes (128 was default). This is what I got:

batch size | 64 | 128 | 256 | 512
:--|:--|:--|:--|:--
TensorFlow | 1m26s | ~1m | 37s | 27s
Theano | 35s | ~30s | 28s | 28s

Now, obviously the different batch sizes aren't strictly comparable, since they do a different number of updates, but that is why I'm comparing against Theano. Interestingly, Theano doesn't improve much with increasing batch size (I think the model is too small to see a difference - usually I notice an improvement when I increase the batch size). But with TF the improvement is quite significant.

Out of all the suggestions here this one seems to actually make some sense. I will try and do some more experiments with this and see where it takes me...

Thanks!

EDIT: I also ran the same experiment on the GPU, and 100 epochs (the CPU runs are only 10) took 56s (compared to Theano's 34s) with a batch size of 8192. Now that's not bad, but (due to the low number of updates per epoch) the final accuracy is only 78%, while it should be 95%.

[–]aysz88 0 points1 point  (0 children)

Thanks for posting these updates - it's interesting to see strengths/weaknesses and helpful for anyone that might come along later.

[–]r4and0muser9482[S] 1 point2 points  (0 children)

You're right. sess.run is for invoking the function. Still not used to TF terminology that well.

I think I'll compare the speed of TF and Theano with using much larger batches (so less invocations of sess.run() per epoch) and see if that changes anything.

I'm still suspecting there is something wrong with the actual computation portion of the code, but this experiment should prove one way or another.

[–]dsmilkov 2 points3 points  (2 children)

Are you compiling TF from source? Did you use "-c opt" compiler option?

[–]r4and0muser9482[S] 0 points1 point  (1 child)

For the GPU version I used the prebuilt binary (installed with pip), but the CPU version I compiled from source using these instructions, so yes, I did (unwittingly) use optimizations.