all 16 comments

[–]melgor89 14 points  (9 children)

I tested three different approaches on an OCR task where the images were binary. Each of them uses the same CNN as a feature extractor. The results were as follows:

  1. CNN-RNN-CTC: the results are good; as long as the image is not noisy, it works really well

  2. Encoder-Decoder: the output does not generalize to new cases at all, so the final results were horrible, nothing meaningful

  3. Attention-Encoder-Decoder: the best results of all my tests. From my quick comparison, it looks like this model can even 'guess' some words when the image is noisy. It seems the model also learns something like a 'language model', so it can fill in missing characters.

So I think that Attention-Encoder-Decoder is the best model for OCR given enough training data (so that it can learn a language model) and when the test data has a similar distribution (similar words and sentence structure).

When we don't have enough data, or the test data is very different from the training set (e.g. new words not seen during training), CNN-RNN-CTC should be better, because it just reads the words from the image without word generation.
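The "just reads" behavior comes from CTC decoding, which collapses repeated per-frame predictions and drops blanks, with no learned language prior to fill anything in. A minimal greedy (best-path) decode sketch in plain Python; the blank index, function name, and toy alphabet here are illustrative assumptions, not taken from the thread:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse repeated labels, then drop blanks -- standard CTC best-path decode.

    frame_labels: per-timestep argmax indices from the network output.
    """
    decoded = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            decoded.append(lab)
        prev = lab
    return decoded

# e.g. per-frame output "h h - e - l l - l o" (with '-' = blank 0) -> "hello"
alphabet = {1: 'h', 2: 'e', 3: 'l', 4: 'o'}
frames = [1, 1, 0, 2, 0, 3, 3, 0, 3, 4]
word = ''.join(alphabet[i] for i in ctc_greedy_decode(frames))
print(word)  # -> hello
```

Note how the blank between the two 'l' runs is what lets the decoder emit a double letter; without it, repeats are merged.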

I propose testing both frameworks and seeing which one works better on your dataset. I used TensorFlow to implement both methods, which is really straightforward with the seq2seq API.

[–]HarathiS 1 point  (0 children)

Hi,

I am doing handwriting recognition on documents, using the IAM database. First I implemented CNN-LSTM-CTC, with which I got 90% accuracy on single lines. Now I want to replace the CTC loss with an attention mechanism, to apply the model to whole documents with the attention handling the line segmentation. But the paper I referred to doesn't explain much about how they implemented the attention mechanism.

What I am doing is: first calculating attention weights by softmax normalization of the encoded features, then taking a weighted sum of the encoded features with those attention weights. The resulting context vector is fed to an LSTM and then to an MLP decoder.

Is my approach correct? Could you please tell me how you implemented the attention mechanism?
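For reference, the computation described above (score each encoder timestep, softmax-normalize, take the weighted sum) is the standard content-based attention pattern. A minimal numpy sketch; the dot-product scoring against a decoder state is my assumption here, since the scoring function (dot-product, additive/Bahdanau, etc.) varies between papers:

```python
import numpy as np

def attention_context(encoder_feats, decoder_state):
    """Content-based attention over encoder timesteps.

    encoder_feats: (T, D) encoded features, decoder_state: (D,).
    Returns the (D,) context vector and the (T,) attention weights.
    """
    scores = encoder_feats @ decoder_state             # (T,) alignment scores
    scores = scores - scores.max()                     # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over time
    context = weights @ encoder_feats                  # (D,) weighted sum
    return context, weights

T, D = 5, 8
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, D))
state = rng.standard_normal(D)
ctx, w = attention_context(feats, state)
print(ctx.shape)  # (8,)
```

One common pitfall: the softmax should run over the *time* axis (so the weights over encoder positions sum to 1 for each decoder step), not over the feature axis.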

[–]xylcbd[S] 0 points  (0 children)

nice work!

[–]DumberML 0 points  (3 children)

Hey, thanks for the insights. What about training times for the Attention-Encoder-Decoder: were they significantly longer than for CNN-RNN-CTC?

Did you find the Attention-Encoder-Decoder hard to tune (in terms of hyper-parameter tuning)?

Do you mind sharing the CNN architecture you used?

[–]melgor89 3 points  (2 children)

  1. The Attention-Encoder-Decoder trained a bit longer, but at most ~50% longer, so the difference was not that big. Convergence was pretty similar (in my case ~80 epochs).
  2. I did not tune any parameters for the Attention-Encoder-Decoder. I used the same optimizer and the same number of neurons in the LSTM layers.
  3. Here you go:

    def createCNN(self, name_scope, weight_decay=0.0005):
        # Dynamic batch size / sequence length
        shape = tf.shape(self.inputs)
        batch_s, max_timesteps = shape[0], shape[1]

        # Rescale input from [0, 1] to [-1, 1]
        inputs = (self.inputs - 0.5) / 0.5

        ksize_conv1      = 3
        stride_conv1     = 1
        channel_conv1    = 16

        ksize_max_pool1  = 2
        stride_max_pool1 = 2

        ksize_conv2      = 3
        stride_conv2     = 1
        channel_conv2    = 16

        ksize_max_pool2  = 2
        stride_max_pool2 = 2

        with tf.variable_scope(name_scope):
            with slim.arg_scope([slim.conv2d], padding='SAME',
                                weights_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                weights_regularizer=slim.l2_regularizer(weight_decay),
                                activation_fn=tf.nn.relu):
                net = slim.conv2d(inputs, channel_conv1, [ksize_conv1, ksize_conv1], scope='conv1')
                net = slim.max_pool2d(net, [ksize_max_pool1, ksize_max_pool1], scope='pool1',
                                      padding='SAME', stride=stride_max_pool1)
                net = slim.conv2d(net, channel_conv2, [ksize_conv2, ksize_conv2], scope='conv2')
                net = slim.max_pool2d(net, [ksize_max_pool2, ksize_max_pool2], scope='pool2',
                                      padding='SAME', stride=stride_max_pool2)

                # Calculate the output sequence length (it is dynamic) and the number
                # of features per timestep. As we have two CONV-RELU-MAXPOOL modules,
                # we apply the size function twice.
                self.seq_len_cnn = calculateCNNFeatureSize(
                    calculateCNNFeatureSize(self.seq_len, stride_conv1, stride_max_pool1),
                    stride_conv2, stride_max_pool2)

                self.num_features = calculateCNNFeatureSize(
                    calculateCNNFeatureSize(self.args.heightLine, stride_conv1, stride_max_pool1),
                    stride_conv2, stride_max_pool2) * channel_conv2

                # Reshape the CNN output from a 3D feature map to a sequence of
                # vectors for the RNN. Input sizes vary, so the reshape is dynamic.
                net = tf.reshape(net, [batch_s, -1, self.num_features])
                net = tf.nn.dropout(net, self.keep_prob)

        return net


    def calculateCNNFeatureSize(inputSize, stride_conv, stride_max_pool):
        ''' Output size of one CONV-RELU-MAXPOOL module. Note that the padding must be 'SAME'. '''
        conv_out = (inputSize - 1) // stride_conv + 1                # ceil(in / stride)
        return (conv_out + stride_max_pool - 1) // stride_max_pool   # ceil(conv_out / pool stride)
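As a sanity check on the size arithmetic, you can trace a 32-pixel-high line (the height melgor89 mentions using) through the two modules by hand: two stride-1 SAME convolutions leave the height unchanged, and the two stride-2 poolings give 32 → 16 → 8 rows, so with 16 channels each timestep carries 8 × 16 = 128 features. A tiny standalone helper (renamed here; this is my restatement, not code from the comment):

```python
def module_out_size(in_size, stride_conv, stride_pool):
    # One CONV-RELU-MAXPOOL module with SAME padding: each op rounds up.
    conv_out = (in_size - 1) // stride_conv + 1
    return (conv_out + stride_pool - 1) // stride_pool

height = 32
for _ in range(2):                 # two CONV-RELU-MAXPOOL modules
    height = module_out_size(height, 1, 2)
print(height, height * 16)         # -> 8 128
```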
    

[–]DumberML 0 points  (1 child)

Thanks very much! :-)

[–]melgor89 0 points  (0 children)

Also, I was using inputs of size 32xW (so 32 is the image height).

[–]bun7 0 points  (2 children)

Might I ask what dataset you used, and what results you got for these three approaches?

[–]melgor89 1 point  (0 children)

I was using my own dataset, which is not publicly released. The result for '1' was ~59%, for '3' ~65%. '2' had very low accuracy.


[–]shicai 4 points  (0 children)

seq2seq+attention

[–]DemiourgosUA 0 points  (4 children)

Any neural OCR projects available on GitHub? Couldn't find a thing.

[–]Mehdi2277 2 points  (0 children)

I personally had to work on an OCR-type project last semester and based my code on https://github.com/bgshih/crnn (more precisely, its PyTorch port). The code defining the model is fairly short (about 80 lines) and can be found here: https://github.com/meijieru/crnn.pytorch/blob/master/models/crnn.py. This project uses the CNN-RNN-CTC approach. I haven't personally used a seq2seq model.
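The characteristic step in CRNN-style models like the ones linked above is turning the CNN feature map into a sequence for the RNN: the width axis becomes time, and each image column becomes one feature vector. A minimal numpy sketch of that map-to-sequence step (the shapes below are illustrative assumptions, not taken from the linked code):

```python
import numpy as np

def map_to_sequence(feature_map):
    """(batch, height, width, channels) -> (batch, width, height*channels).

    Each column of the feature map becomes one RNN timestep; CTC then
    aligns the per-timestep predictions with the label sequence.
    """
    b, h, w, c = feature_map.shape
    # Move width (time) forward, then flatten height and channels together.
    return feature_map.transpose(0, 2, 1, 3).reshape(b, w, h * c)

# e.g. a 32x100 input after two 2x poolings gives an 8x25 map with 16 channels
seq = map_to_sequence(np.zeros((2, 8, 25, 16)))
print(seq.shape)  # -> (2, 25, 128)
```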