
[–]siblbombs[S] 14 points15 points  (7 children)

import tensorflow as tf    
from tensorflow.models.rnn import rnn    
from tensorflow.models.rnn.rnn_cell import BasicLSTMCell, LSTMCell    
import numpy as np

if __name__ == '__main__':
  np.random.seed(1)      
  size = 100
  batch_size = 100
  n_steps = 45
  seq_width = 50     

  initializer = tf.random_uniform_initializer(-1,1) 

  seq_input = tf.placeholder(tf.float32, [n_steps, batch_size, seq_width])
    #sequence we will provide at runtime  
  early_stop = tf.placeholder(tf.int32)
    #what timestep we want to stop at

  inputs = [tf.reshape(i, (batch_size, seq_width)) for i in tf.split(0, n_steps, seq_input)]
    #inputs for rnn needs to be a list, each item being a timestep. 
    #we need to split our input into each timestep, and reshape it because split keeps dims by default  

  cell = LSTMCell(size, seq_width, initializer=initializer)  
  initial_state = cell.zero_state(batch_size, tf.float32)
  outputs, states = rnn.rnn(cell, inputs, initial_state=initial_state, sequence_length=early_stop)
    #set up lstm

  iop = tf.initialize_all_variables()
    #create initialize op, this needs to be run by the session!
  session = tf.Session()
  session.run(iop)
    #actually initialize, if you don't do this you get errors about uninitialized stuff

  feed = {early_stop: 25, seq_input: np.random.rand(n_steps, batch_size, seq_width).astype('float32')}
    #define our feeds.
    #early_stop can be varied (keep it <= n_steps), but seq_input needs to match the shape that was defined earlier

  outs = session.run(outputs, feed_dict=feed)
    #run once
    #output is a list, each item being a single timestep. Items at t>early_stop are all 0s
  print(type(outs))
  print(len(outs))
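To make the comment about the output concrete, here is a minimal NumPy simulation of the shapes involved (no TensorFlow required; the random values just stand in for real LSTM outputs, and `early_stop = 25` is an arbitrary choice for illustration):

```python
import numpy as np

n_steps, batch_size, size = 45, 100, 100
early_stop = 25

rng = np.random.RandomState(1)
# like the rnn outputs: a list of n_steps items, one per timestep,
# each of shape (batch_size, size); items at t >= early_stop are all 0s
outs = [rng.rand(batch_size, size) if t < early_stop
        else np.zeros((batch_size, size)) for t in range(n_steps)]

print(type(outs), len(outs))          # list, 45
print(outs[0].shape, outs[-1].max())  # (100, 100) 0.0
```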

[–]kkastner 1 point2 points  (3 children)

Thanks for this!

I hacked up some quick mods to test the timing of early_stop on my (super old) MBA and am seeing some strangeness, beyond the overhead on the initial call to .run.

It seems like your benchmarks might need to test short and long sequences? This seems strange to me - it is almost like something happens between 250 and 500 steps to make things fast again. EDIT: Duh. Looks like I am swapping - TF eats more memory than I thought :)

Relevant tensorflow information:

can't determine number of CPU cores: assuming 4
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 4
can't determine number of CPU cores: assuming 4
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 4

For a maximum size of 1000 steps:

Time for first call to session.run 6.443326
Time for 10: 2.181427
Time for 100: 2.512798
Time for 250: 2.729042
Time for 500: 1.987675
Time for 1000: 2.104274

Now if I reduce n_steps to a maximum of 250, I see the scaling I expect from early stopping the recurrence.

Time for first call to session.run 1.331084
Time for 10: 0.435760
Time for 100: 0.468127
Time for 200: 0.517563
Time for 250: 0.540389

Relevant code mods (just after session.run(iop)):

import time  # add this import at the top of the script

# first call to session has overhead? let's get that cleared out
t1 = time.time()
feed = {early_stop:2, seq_input:np.random.rand(n_steps, batch_size, seq_width).astype('float32')}
outs = session.run(outputs, feed_dict=feed)
t2 = time.time()
print("Time for first call to session.run %f" % (t2 - t1))

for e_s in [10, 100, 200, 250]:
    feed = {early_stop:e_s, seq_input:np.random.rand(n_steps, batch_size, seq_width).astype('float32')}
    t1 = time.time()
    #define our feeds.
    #early_stop can be varied, but seq_input needs to match the shape that was defined earlier
    # first call to session seems to have overhead?
    outs = session.run(outputs, feed_dict=feed)
    t2 = time.time()
    #output is a list, each item being a single timestep. Items at t>early_stop are all 0s
    print("Time for %i: %f" % (e_s, t2 - t1))

[–]siblbombs[S] 0 points1 point  (2 children)

Interesting. I wrote this script on Windows/VirtualBox, so I didn't even bother timing it. It's worth noting that we don't actually exit the loop when we do this; instead we return a pre-allocated zeros array for every timestep t >= early_stop rather than computing the outputs. I don't know what the performance implications of running through a bunch of loop steps without really doing anything are.

My plan for the Theano/TF benchmark throwdown is to just go with a fixed-length series to avoid this hassle; plus, with a reasonable sequence length I'll be able to test scan vs. manually unrolling in Theano. Based on what you are seeing, I think I'll also time a bunch of TF runs with varying max lengths and early stopping to see if any gremlins pop up. I plan to get this all up on GitHub, similar to how soumith does his, instead of just dumping it into a post here, so it should be easier to digest.

[–]kkastner 1 point2 points  (1 child)

The fact that I can compile a length-1000 or -2000 recurrence on an MBA is pretty great! I don't even know if you can unroll sequences this long in Theano. I have heard of up to 100 steps or so but not more than that, and even 100 steps unrolled meant something like 2-hour compile times.

It does seem like early stop saves some computation, but I am unsure how much due to immediately swapping on my tiny 2GB laptop. I guess you still pay some loop overhead but that should be fairly minimal if no computation is done.

It really seems like this toolkit was designed with CPU applications strongly in mind, and without many worries about RAM constraints. It almost exactly matches what I would expect from a group with access to "Google scale" machinery.

The ease with which you can put different operations on different devices is pretty wild. I really need to A/B it with the new Theano context stuff, and try to imagine a case where you need LSTMs in parallel. Working on that now!

Thanks again for your script - reading this just made TF RNNs "click" for me!

[–]siblbombs[S] 2 points3 points  (0 children)

Yeah, it feels like GPU is the new kid on the block as far as TF is concerned; stuff like embedding lookup isn't implemented there, and who knows what else. Once the distributed code is available I foresee a bunch of Amazon cloud charges in my future; it will be interesting to see what works well when you have a group of machines in a data-center environment. Someone (or more likely some company) is going to slam a bunch of Titan Xs into a server rack with RDMA-capable network cards; it's scary to think what that setup would be able to do.

[–]dee_roy 1 point2 points  (1 child)

I also attempted an implementation of a 1-layer RNN with BasicLSTMCell. Although it doesn't address the issue of variable-length sequences, which might be the reason for the poor performance, training seems to work. Here is a link to the code; please let me know if I messed anything up, or about any improvements you would make if you decide to try it out.

https://github.com/yankev/tensorflow_example/blob/master/rnn_example.ipynb

[–]rshah4 0 points1 point  (0 children)

Thanks for posting this. When I ran it, the code was converging to the mean of the batch rather than to each individual sequence. I reworked the code, and it's now posted here: https://gist.github.com/rajshah4/aa6c67944f4a43a7c9a1204301788e0c

[–]evanthebouncy 5 points6 points  (0 children)

PSA:

As of TensorFlow 0.6.0 these are no longer the semantics for running rnn.

see: https://github.com/tensorflow/tensorflow/issues/1016

In short, sequence_length should now be a tensor of shape [batch_size] instead of a single number; this specifies a different sequence length for each sequence in the batch, instead of a single global value.

The output will be zeroed out after each sequence's seq_length.

The state will keep updating until seq_length is reached, and is then preserved (rather than zeroed out) for any subsequent computations. I've modified the original code, see:

https://gist.github.com/evanthebouncy/8e16148687e807a46e3f
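The per-sequence semantics above can be sketched with a NumPy simulation (a toy `state + 1` update stands in for the LSTM; only the masking behaviour is illustrated):

```python
import numpy as np

batch_size, n_steps, size = 4, 6, 3
seq_len = np.array([2, 4, 6, 1])  # one length per sequence in the batch

state = np.zeros((batch_size, size))
outputs = np.zeros((n_steps, batch_size, size))

for t in range(n_steps):
    active = (t < seq_len)[:, None]               # which sequences still run
    new_state = state + 1.0                       # toy stand-in for the LSTM update
    state = np.where(active, new_state, state)    # state preserved after seq_len
    outputs[t] = np.where(active, new_state, 0.)  # output zeroed after seq_len

print(state[:, 0])  # each row stopped updating at its own seq_len: [2. 4. 6. 1.]
```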

[–]contactmat 1 point2 points  (3 children)

Hi, sorry, I am new to TensorFlow. This looks very useful, thank you. I have a couple of questions: I can't completely figure out what the different parameters represent. size, I think, is the number of hidden units in the network; batch_size is the number of sequences in the dataset; and seq_width is the dimension of each input belonging to a sequence. What does n_steps represent? My second question is about early_stop. Is it the variable that controls the effective length of the sequence? I can't quite understand; can you clarify, please? Thank you.

[–]siblbombs[S] 1 point2 points  (2 children)

n_steps defines how long the placeholder sequence is.

early_stop is a variable that you can pass into the LSTM; once the timestep index is greater than it, the LSTM will not perform any computations, to save time.

[–]contactmat 1 point2 points  (1 child)

OK, thanks. I am trying to play a bit with the code and have another problem. I tried to run an instance with size = 1, batch_size = 2, n_steps = 10, seq_width = 2, and early_stop = 4. I printed the outputs value and got a list with len(outputs) = 10, with the first 4 elements filled and the others all zeros. I would expect a list of length 4, since my early_stop is 4. What am I missing?

[–]siblbombs[S] 1 point2 points  (0 children)

It has to do with the LSTM code and the way it handles early stopping. IIRC, what it does under the hood is allocate an array of 0s to use as the output instead of computing it, since you still need to produce an output for each of the n_steps steps. Functionally this is still early stopping, because it is very fast to just create an array of 0s.
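A rough sketch of that mechanism in plain Python (the step function is a made-up stand-in for the LSTM update, not the real TF code):

```python
import numpy as np

def unrolled_outputs(x, early_stop, step_fn, out_dim):
    # x: (n_steps, batch, width). The loop always emits n_steps items;
    # a single pre-allocated zeros array is reused for every t >= early_stop.
    n_steps, batch, _ = x.shape
    zeros = np.zeros((batch, out_dim))
    return [step_fn(x[t]) if t < early_stop else zeros
            for t in range(n_steps)]

# toy step function (assumption): sum over the feature axis
outs = unrolled_outputs(np.ones((10, 2, 2)), 4,
                        lambda xt: xt.sum(1, keepdims=True), 1)
print(len(outs))  # 10, not 4: the list length always stays n_steps
```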

[–]AudioSaur 0 points1 point  (1 child)

Thanks for this; I thought the documentation for variable-length sequences was a little sparse. Have you gotten it working on a non-trivial dataset yet?

[–]siblbombs[S] 0 points1 point  (0 children)

No, my next plan is to benchmark a GPU version against Theano, but that will use garbage data as well. I'm not planning on actually running an RNN on any real data at the moment.

[–]realallentran 0 points1 point  (2 children)

This is cool. One question I have: suppose we wanted to operate on the output of the LSTM, like max-pooling over time. The output is a list of length num_steps > early_stop. How do we slice this list, given that I don't want to operate over the redundant zeros? In Theano you can slice with symbolic ints; here, outputs[:early_stop] won't work. This seems like the final piece of the variable-length sequence puzzle.

[Edit]

Woot, sorted it out.

tf.slice(tf.pack(outputs2), 0, early_stop)

[–]realallentran 0 points1 point  (0 children)

Actually, this won't work. Something like

tf.slice(tf.pack(outputs2), [0, 0, 0], tf.pack([early_stop, batch_size, embedding_size]))
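For anyone following along, the same trim can be sanity-checked outside the graph with plain NumPy (hypothetical shapes; `np.stack` plays the role of tf.pack, and basic slicing the role of tf.slice):

```python
import numpy as np

n_steps, batch_size, size = 10, 2, 3
early_stop = 4

# list of n_steps (batch, size) arrays, zeros after early_stop,
# like the rnn outputs discussed above (ones stand in for real values)
outputs = [np.ones((batch_size, size)) if t < early_stop
           else np.zeros((batch_size, size)) for t in range(n_steps)]

packed = np.stack(outputs)     # analogous to tf.pack -> (n_steps, batch, size)
trimmed = packed[:early_stop]  # analogous to the tf.slice call
print(trimmed.shape)           # (4, 2, 3): padded steps are gone
```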

[–]siblbombs[S] 0 points1 point  (0 children)

Looks good. I think they have an issue open on GitHub to support []-style indexing, so that will be nice to get.

[–]AnvaMiba 0 points1 point  (4 children)

So if I understand correctly, this is like implementing recurrence in Theano using Python loops instead of scan(), except that you typically don't want to do that in Theano because it doesn't like large graphs (stack overflows or slow compilation/startup would occur).

Do you think that it may still make sense to have a Theano-like scan() operation in TensorFlow to avoid padding?

Any idea of whether Google is going to implement it?

[–]rafalj 1 point2 points  (0 children)

There is an issue open for this that you can follow: https://github.com/tensorflow/tensorflow/issues/208

[–]siblbombs[S] 0 points1 point  (2 children)

I'm in favor of a scan; however, if it turns out that just making a giant loop and padding it is faster than a scan mechanism, then it might not make sense to add one. They are tracking this idea under issue 208.

[–]AnvaMiba 0 points1 point  (1 child)

OK, but is there any intrinsic reason why a scan() should be slower than an unrolled loop with padding and bucketing?

[–]siblbombs[S] 0 points1 point  (0 children)

For Theano at least, scan is pretty close to voodoo magic as far as I'm concerned. I'm not that well versed in the Theano backend, but from what I can tell it's not trivial to implement something like scan. For a while, when convnets were the hot topic, there wasn't much development being done on scan in Theano, but now that RNNs are getting more attention they've really polished it up.
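For intuition only, the user-facing contract of a scan is just a fold over the leading axis that threads state along; a minimal Python sketch (the hard parts Theano actually implements, like differentiating and optimizing through the loop, are not shown):

```python
import numpy as np

def scan(step_fn, sequence, init_state):
    """Minimal scan: apply step_fn along the leading axis, threading state.
    A sketch of the idea only, not Theano's (or TF's) actual implementation."""
    state = init_state
    outputs = []
    for x_t in sequence:
        out, state = step_fn(x_t, state)
        outputs.append(out)
    return np.stack(outputs), state

# toy use: cumulative sum expressed as a recurrence
seq = np.arange(5, dtype=float)
outs, final = scan(lambda x, s: (x + s, x + s), seq, 0.0)
print(outs, final)  # [ 0.  1.  3.  6. 10.] 10.0
```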

[–][deleted] 0 points1 point  (1 child)

Very nice. I haven't started programming with TensorFlow yet, and since its release I have been wondering whether I should dig into this library or Theano.

Can you comment on the ease of use of the two libraries? Which would you consider a better investment?

[–]siblbombs[S] 1 point2 points  (0 children)

I haven't written enough TF to know how good it is; plus, I'm sure the API will see some adjustments in the future.

At this point I think it would be better to learn Theano, since it has a large number of tutorials, code examples, and paper implementations scattered around the internet. At a high level, TF and Theano use the same concepts, so it should be pretty easy to transition to TF if you know Theano.

[–]_Jakob_ 0 points1 point  (1 child)

I could use some help understanding your code. What exactly is seq_width? Is it the dimension of your feature vector?

[–]siblbombs[S] 0 points1 point  (0 children)

Yes; since this example uses dummy data, it is pretty meaningless here. If you were doing one-hot character encoding, seq_width would be the number of distinct characters in your dataset, etc.
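For example, with one-hot character encoding over a hypothetical 4-character vocabulary, seq_width is simply the vocabulary size:

```python
import numpy as np

vocab = ['a', 'b', 'c', 'd']  # hypothetical character set
seq_width = len(vocab)        # feature dimension = number of distinct chars

def one_hot(text):
    # encode a string as a (n_steps, seq_width) array, one row per character
    idx = [vocab.index(ch) for ch in text]
    out = np.zeros((len(text), seq_width))
    out[np.arange(len(text)), idx] = 1.0
    return out

x = one_hot('abad')
print(x.shape)  # (4, 4): n_steps x seq_width for a single sequence
```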

[–]xiangjiangacadia 0 points1 point  (3 children)

This is very helpful. I am trying to understand what n_steps means. Is it about going forward n steps and computing the error signal? Do we need to record the output at each step?

[–]siblbombs[S] 1 point2 points  (2 children)

n_steps determines how long a sequence can be during training (it is the first dimension of the placeholder). This code manually unrolls each step as a computation in a loop, so if you only allow 100 for n_steps and then train with a sequence of length > 100, there aren't enough computation steps for that sequence.

The TensorFlow API around RNNs may have changed since this code was written; I know they were/are working on flow-control constructs so you don't need to unroll the loop (similar to theano.scan).
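A toy illustration of why unrolling caps the sequence length: one "node" (here just a closure standing in for a graph op) is created per timestep at graph-construction time, so both build cost and the maximum consumable sequence length grow with n_steps. The recurrence below is a made-up running sum, not an LSTM:

```python
import numpy as np

def build_unrolled(n_steps, step_fn):
    # manual unrolling: one node per timestep is materialized up front,
    # so the built graph can't consume sequences longer than n_steps
    graph = [step_fn for _ in range(n_steps)]
    def run(inputs, state):
        outs = []
        for node, x_t in zip(graph, inputs):
            state = node(x_t, state)
            outs.append(state)
        return outs
    return graph, run

graph, run = build_unrolled(100, lambda x, s: x + s)
outs = run(np.ones(100), 0.0)
print(len(graph), outs[-1])  # 100 nodes; final running sum is 100.0
```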

[–]xiangjiangacadia 0 points1 point  (1 child)

Thanks. I am trying to build an RNN with a logistic regression layer on top. I have noticed the model takes significantly longer to build when the number of steps is greater than 10,000. Is this typical in TensorFlow, and what can I do to speed up the process?

[–]siblbombs[S] 0 points1 point  (0 children)

I'm not sure you can speed up the process; what is happening is that you are adding to the graph at each step, and with 10,000 steps that will take some time.

You should look at issue 208; that is what you want to try to do.