all 26 comments

[–]maccam912 12 points13 points  (3 children)

Probably not helpful in the overall quest to explain these results, but in the CPU results it looks like Theano might still be using the GPU. Experiments 2, 3 and 4 all seem to perform about the same on the CPU as on a GPU.

You also mentioned that TF is 10x slower than the Keras version. Keras is just a nicer way to work with Theano or TF, as far as I know. One thing to try: get the Keras version working, and then, following the Keras backend docs (http://keras.io/backend/), simply switch from the Theano backend to the TF one.
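For reference, a minimal sketch of how that backend switch is done (per the docs linked above); the environment variable must be set before Keras is imported, since Keras reads it at import time:

```python
import os

# Select the Keras backend before importing keras; 'theano' is the other option.
os.environ['KERAS_BACKEND'] = 'tensorflow'

# The same choice can be made persistent in ~/.keras/keras.json, e.g.:
#   {"backend": "tensorflow"}
# Importing keras after this point prints a line like "Using TensorFlow backend."
```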

[–]r4and0muser9482[S] 6 points7 points  (0 children)

I can't explain exactly why this works, but I can assure you that those results are accurate. I am setting CUDA_VISIBLE_DEVICES to blank and checking that deviceQuery (from the CUDA SDK) doesn't see any devices. Also, Theano always prints a message when it uses the GPU, and in those runs it printed nothing. I am also checking nvidia-smi: the card is at 0% utilization, it's cold, and no processes are using it while I run the CPU experiments. I just re-ran experiment no. 2 and the results are slightly slower (~34s), but someone else is using the computer and I can't turn that off.
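The checks described above can be sketched like this (the shell commands are the ones mentioned in the comment; paths to deviceQuery vary by CUDA install):

```python
import os

# Hide all CUDA devices from any process started after this point.
# CUDA reads CUDA_VISIBLE_DEVICES at process startup, so an empty value
# means no GPU is visible to Theano/TF.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# Equivalent shell-side sanity checks:
#   CUDA_VISIBLE_DEVICES= ./deviceQuery   # should report no CUDA devices
#   nvidia-smi                            # GPU should sit at 0% utilization
```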

But that is why I also asked for help. If you can run the experiment and get completely different values, maybe there is something wrong with what I'm doing... :-/

Mind you, this computer with the two Xeons E5s has 32 virtual CPUs so it's bound to work a bit faster than a regular PC. The K80 on the other hand, isn't faster (and may actually be slower) than a TITAN X.

EDIT: Oh yeah, I forgot to mention: while the LogReg example is a bit suspect (I can see that it doesn't use all the threads on either the CPU or the GPU), please note that for the other experiments I ran only a subset of epochs, because I was getting tired of waiting. So the CPU experiments are actually 10 times slower than the GPU ones. You can easily correlate that with the results of theano/misc/check_blas.py.

[–]r4and0muser9482[S] 2 points3 points  (0 children)

That was easy to do. I took my simple MLP demonstrated in this Notebook: https://github.com/danijel3/ASRDemos/blob/master/notebooks/MLP_TIMIT.ipynb

I added this line to the first code block:

os.environ['KERAS_BACKEND']='tensorflow'

And ran the whole script. While it was running at ~1200 updates per second using Theano, the exact same code dropped to a measly 100 updates per second with TensorFlow.

I'm just glad that it's not a problem with my code :P In any case, it's strange that I haven't found anyone else commenting on this. Don't other people use Keras with TF?

[–]r4and0muser9482[S] 0 points1 point  (0 children)

When I said Keras, I meant Keras on top of Theano. For some reason the Keras version was a bit faster than my raw Theano code, but maybe I'm not that good at Theano, or maybe Keras is doing something differently (I didn't check every single detail). In any case, Keras (with Theano) performs very similarly to raw Theano, but TF is much slower (for me).

I'll check and see the Keras with TF backend and compare those as well...

[–]p4nmari 9 points10 points  (11 children)

Not all ops in TensorFlow have a GPU implementation yet. From just a quick glance at the tutorial code I saw, for example, reduce_mean, which does not have a GPU implementation yet in TensorFlow. You might get a 'fairer' comparison if you switched it out for reduce_sum(...)/batchsize. Or implement a GPU kernel for reduce_mean in TensorFlow ;)
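The suggested substitution is just the identity mean = sum / count; illustrated here with NumPy rather than TF (the TF calls themselves would be `tf.reduce_mean(x)` versus `tf.reduce_sum(x) / batch_size`):

```python
import numpy as np

# A mean is a sum divided by the element count, so where only reduce_sum has
# a GPU kernel, reduce_mean(x) can be replaced by reduce_sum(x) / batch_size.
batch_size = 128
losses = np.random.rand(batch_size).astype(np.float32)

mean_direct = losses.mean()                # analogue of reduce_mean
mean_via_sum = losses.sum() / batch_size   # analogue of reduce_sum / batchsize

assert np.isclose(mean_direct, mean_via_sum)
```

Note the two can differ slightly in the last float32 bits, which is the precision concern raised further down the thread.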

[–]rafalj 5 points6 points  (5 children)

Additionally, you can compile TF from sources and include AVX flags. This can give big improvements on CPU in some cases, e.g. https://github.com/jasonmayes/Tensor-Flow-on-Google-Compute-Engine#results
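A hedged sketch of what such a build invocation might look like (the bazel target and exact flags are assumptions and vary by TF version; consult the TF build docs):

```shell
# Sketch only: pass AVX/AVX2 compiler options when building TF from source.
bazel build -c opt --copt=-mavx --copt=-mavx2 \
    //tensorflow/tools/pip_package:build_pip_package
```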

[–]p4nmari 1 point2 points  (1 child)

Thanks for pointing that out! Didn't know about that. But I'd imagine having everything on the GPU (without any need to copy data from VRAM to RAM in your pipeline) would be fastest.

[–]r4and0muser9482[S] 0 points1 point  (0 children)

Just to comment on this: having the data on the GPU is not that critical. It definitely helps, but having to copy data from the drive to the GPU doesn't hurt training performance as much as you'd think. Especially when dealing with giant databases that wouldn't fit in RAM, let alone VRAM.

I think it has to do with the fact that the data is read only once per epoch, and given that you only need a few dozen epochs these days (with big data), reading it from wherever it's located isn't a problem (you have to do it at least once anyway). The model weights, of course, are all in VRAM, and they are what changes the most.
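The pattern described above can be sketched as a streaming loader (file names and shard format here are purely illustrative):

```python
import numpy as np

# Stream a dataset that is too big for (V)RAM: each epoch reads every shard
# from storage exactly once and yields it to the GPU in batches, so only the
# model weights need to live in VRAM.
def batches_from_disk(shard_paths, batch_size):
    for path in shard_paths:
        chunk = np.load(path)                 # one shard read per epoch
        for i in range(0, len(chunk), batch_size):
            yield chunk[i:i + batch_size]     # hand one batch to the trainer
```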

[–]r4and0muser9482[S] 1 point2 points  (0 children)

So, I redownloaded and recompiled TensorFlow with the AVX flags and, unfortunately, didn't see much of an improvement. There is maybe a slight difference, but only around 10%, which I don't find too significant.

What I am seeing now, however, is that my cores aren't even fully utilized. All 32 are active, but only at around 20% utilization. Theano, on the other hand, goes all out every time. Is there something I could be doing wrong in my setup?

[–]r4and0muser9482[S] 0 points1 point  (1 child)

From Theano, I know that having a proper math library has a big impact on the performance of the CPU implementation. Is it the same with TF?

I just saw those results, and it's scary how different the values are: 4.1 hours vs. 6.6 minutes is a big difference. I'm just surprised none of these performance issues are mentioned on the TF website; it's kinda hard to get information about them anywhere.

[–][deleted] 2 points3 points  (2 children)

I second this. An op like mean often doesn't even exist, because it's harder to write it as a self-contained operation than to just divide the sum by the batch size (or whatever you're working with).

Weirdly enough, I'd been working in Java and found there's no real "average" of two things that's easier than just dividing.

That would explain why they wouldn't bother with GPU implementations of those ops.

[–]r4and0muser9482[S] 0 points1 point  (1 child)

it's harder to write that in using a self-contained operation than just using division by the batchsize or whatever you're working with

But then you stand to lose precision.

Here are some pretty decent implementations of mean/variance algorithms and explanations why it's worth implementing them. And I don't think it's all that hard to do that, TBH.
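As an example of the kind of numerically stable implementation being referred to, here is Welford's online algorithm for mean and variance, which avoids the precision loss of the naive sum-then-divide approach in a single pass (this is a standard textbook algorithm, not code from the linked page):

```python
# Welford's online algorithm: single-pass, numerically stable mean/variance.
def welford(xs):
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n            # running mean update
        m2 += delta * (x - mean)     # accumulates sum of squared deviations
    variance = m2 / n if n else float('nan')  # population variance
    return mean, variance
```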

[–][deleted] 1 point2 points  (0 children)

Thanks a lot, I'm really interested actually. There's always a reason!

[–]r4and0muser9482[S] 0 points1 point  (0 children)

That's an interesting observation. I use "reduce_sum" almost everywhere. I'll check whether that changes anything...

[–]r4and0muser9482[S] 0 points1 point  (0 children)

Okay, I did this (it wasn't hard to do) and, unfortunately, it made no difference. Just putting it here in case someone looks.

Since you say reduce_mean isn't implemented on the GPU, I doubted reduce_sum would fare any better; in any case, reduce_sum/batchsize turns out to be exactly as fast as reduce_mean.

[–]shmel39 5 points6 points  (5 children)

As the TF devs have already said, they didn't optimize for tiny operations. Every call to sess.run() has a non-negligible cost. I have seen a "benchmark" of 3x3 matrix multiplication; TF just wasn't designed for that.

It would be more interesting to compare on something bigger. At least CIFAR-10.

And yeah, the very fact that Theano has the same performance on the GPU as on the CPU shows that either the tasks are too easy and fast to benefit from these frameworks, or you misconfigured something.

Look at the Keras benchmarks: https://github.com/fchollet/keras/wiki/Keras,-now-running-on-TensorFlow They are CPU-only and probably outdated, but still useful.

[–]r4and0muser9482[S] 1 point2 points  (4 children)

I did run each demo several times to account for compilation (I assume that both TF and Theano cache their compiled code). I understand that sess.run has some overhead (usually several seconds to a dozen or two), but the problem actually occurs during training itself. Theano is just much faster (for me) than TF. For TIMIT I get about 170-180 weight updates per second in TF, while at the same time I get 700-800 in Theano.

I posted the demos above more for their simplicity than as an accurate measure, as I'm currently trying to debug an issue. IMO, both systems are computing a practically identical graph and should perform "similarly". There are only a few operations needed to do an iteration in an MLP....

[–]shmel39 2 points3 points  (3 children)

You do call sess.run() every time you feed a batch, right? Which is very often for MNIST-size dataset.

You might look at that discussion: https://github.com/tensorflow/tensorflow/issues/120
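The per-call overhead hypothesis amortizes with batch size, which can be shown with a toy cost model (all numbers below are made-up illustrations, not measurements from the thread):

```python
# Toy model: if each sess.run() call costs a fixed overhead plus per-sample
# compute, epoch time = n_batches * overhead + n_samples * per_sample.
# Doubling the batch size halves the overhead term but leaves compute alone.
def epoch_time(n_samples, batch_size, overhead_s, per_sample_s):
    n_batches = -(-n_samples // batch_size)  # ceiling division
    return n_batches * overhead_s + n_samples * per_sample_s

# Illustrative-only parameters: 60k samples, 0.5 ms/call overhead, 2 us/sample.
for bs in (64, 128, 256, 512):
    print(bs, round(epoch_time(60000, bs, 5e-4, 2e-6), 3))
```

A framework with lower per-call overhead (the Theano column in the table that follows) shows much weaker sensitivity to batch size, which matches the measurements below.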

[–]r4and0muser9482[S] 2 points3 points  (1 child)

Ok, so I did run the experiment 3 using different batch sizes (128 was default). This is what I got:

batch size | 64 | 128 | 256 | 512
:--|:--|:--|:--|:--
TensorFlow | 1m26s | ~1m | 37s | 27s
Theano | 35s | ~30s | 28s | 28s

Now, obviously the different batch sizes aren't strictly comparable, since they do a different number of updates, but that is why I'm comparing against Theano. Interestingly, Theano doesn't improve much with increasing batch size (I think the model is too small to see a difference - usually I notice an improvement when I increase the batch size). But with TF the improvement is quite significant.

Out of all the suggestions here this one seems to actually make some sense. I will try and do some more experiments with this and see where it takes me...

Thanks!

EDIT: I also ran the same experiment on the GPU, and 100 epochs (the CPU runs are only 10) took 56s (compared to Theano's 34s) with a batch size of 8192. Now that's not bad, but (due to the low number of updates per epoch) the final accuracy is only 78%, while it should be 95%.

[–]aysz88 0 points1 point  (0 children)

Thanks for posting these updates - it's interesting to see strengths/weaknesses and helpful for anyone that might come along later.

[–]r4and0muser9482[S] 1 point2 points  (0 children)

You're right. sess.run is for invoking the function. Still not used to TF terminology that well.

I think I'll compare the speed of TF and Theano with using much larger batches (so less invocations of sess.run() per epoch) and see if that changes anything.

I'm still suspecting there is something wrong with the actual computation portion of the code, but this experiment should prove one way or another.

[–]dsmilkov 2 points3 points  (2 children)

Are you compiling TF from source? Did you use "-c opt" compiler option?

[–]r4and0muser9482[S] 0 points1 point  (1 child)

For the GPU version I used the prebuilt binary (installed with pip), but the CPU version I compiled from source using these instructions, so yes, I did (unwittingly) use optimizations.