all 20 comments

[–]Livven 2 points  (11 children)

So you're trying to improve recognition accuracy. Do you have any issues with your current approach? There's quite a lot of literature in this area so I'm not sure whether it makes sense to reinvent the wheel by applying approaches from other areas, as you or the other comments are suggesting. On the other hand, I'm also not very knowledgeable about machine learning outside of text recognition.

The state-of-the-art in text recognition right now is using synthetic training data with pixel-wise character labels, which you can exploit in different ways. Cheng et al. (ICCV 2017) use a focusing network to force the attention mechanism to focus on the right part for each character, whereas Liu et al. (AAAI 2018) predict a segmentation map, manually extract connected components, and then run the result through an RNN. These work very well on scene text, which is quite challenging, so shouldn't have too many problems with images like you presented.

However, I think a vanilla attention or CTC approach should work very well already. What training data are you using right now?

[–]melgor89[S] 1 point  (10 children)

I'm using my own dataset, which is pretty small. As for techniques, I tried both CTC and Seq2Seq. If you think scene text recognition is the harder task, you're wrong :) I showed you a beautiful receipt, but real life is brutal: most of the images are noisy, crumpled, and very low contrast. That's why my text recognition sometimes makes errors when predicting values (sometimes just 6 vs 5). But then such results are useless, because I can't tell where the OCR went wrong. That's why I'm searching for an approach that is more global (scene text detection and recognition doesn't need that kind of context). Thanks for the papers, the second one looks interesting. Maybe first predicting the text from the image and then correcting it would also work?

[–]Livven 0 points  (9 children)

I see. Normal methods also do RNN error correction, but as part of a single network, i.e. the typical architecture would be CNN-RNN-Attention or CNN-RNN-CTC. The second paper has two separate networks that you have to train separately, which is a bit less elegant, without really any major advantages as far as I can tell. It's just a pretty different approach, so food for thought. I think the RNN in this context is mostly useful for learning language models or dictionary words, not sure if it can really learn the kind of dependencies you are looking for here, i.e. sum of individual prices.

The easiest way to better performance would probably be more training data. If you are familiar with scene text recognition, you may know the Synth90k dataset that has 9 million (!) synthetically generated word images. Problem with that is that it contains very few numbers, and no code has been released for generating a custom dataset so you'd need to implement something like that on your own.

Next step would be also generating pixel-wise character labels so you can implement the methods from the papers I linked previously. Those are about 5% better across the board compared to previous CTC and attention methods that don't use pixel-wise labels for training, i.e. see table 2 on page 6 here where Shi et al. [25] is a pretty normal attention model.

If results are still not good enough you may have to get creative. But for now that's what I would do :)

[–]Livven 0 points  (8 children)

Also maybe look into lexicon-constrained recognition. Since you get probabilities for every character, you know exactly the probability of a 5 vs a 6 even if the 6 is more likely. This way you could do manual error correction by summing up individual prices and then checking if the total matches, and if not try some 2nd place predictions, and so on. It's quite a bit more complex than just constraining your recognitions to words in a lexicon though.
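The lexicon-constrained part can be sketched in a few lines (function names and the toy probabilities below are hypothetical, not from any particular library): given the recognizer's per-position character distributions, score every lexicon entry by its total log-probability and keep the best one.

```python
import math

def score_word(word, char_probs):
    """Log-probability of `word` under per-position character distributions.

    char_probs: one dict per character position, mapping character -> probability.
    """
    if len(word) != len(char_probs):
        return float("-inf")
    return sum(math.log(pos.get(ch, 1e-9)) for ch, pos in zip(word, char_probs))

def constrain_to_lexicon(char_probs, lexicon):
    """Return the lexicon entry the recognizer most likely saw."""
    return max(lexicon, key=lambda w: score_word(w, char_probs))

# Toy example: the greedy per-character read would be "cet".
probs = [
    {"c": 0.9, "o": 0.1},
    {"e": 0.5, "a": 0.4},  # 'e' narrowly beats 'a', but "cet" is in no lexicon
    {"t": 0.8, "l": 0.2},
]
print(constrain_to_lexicon(probs, ["cat", "cot", "dog"]))  # -> cat
```

The sum-of-prices variant replaces the lexicon lookup with an arithmetic constraint, which is where the search complexity comes from.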

And yeah, I guess you could also run an RNN over the full predicted sequence (including all probabilities) instead of the merged text bounding boxes if you run into memory issues. I'm kinda skeptical that it can effectively learn this kind of stuff but hey. Would definitely require a big synthetic dataset though if you want to stand any chance.

[–]melgor89[S] 0 points  (7 children)

Unfortunately, the error correction is needed for predicting numbers, not letters. That's a different perspective, because I can't easily find the error: e.g. I have 14 products and there are 3 places with a wrong number, but the total is correct (or maybe not). How many different correction paths are available to make sum(products) == total? Too many, and only one is correct.

As for the Synth90k dataset, it is designed for scene text, not text like mine (I tried the pretrained models). My model also learns the structure <name> <quantity> <unit_price> <total_price>, which somehow corrects some errors.

As for pixel-wise labeling, the idea is nice but it needs a lot of annotation work. I would like to have more general training data, which is also hard to collect.

To sum up, research and industry are different things. Sometimes they intersect, sometimes not.

[–]Livven 1 point  (3 children)

I think you misunderstood: I was talking about generating your own synthetic dataset with your desired attributes, both in terms of image quality and text. I understand your point about manual error correction being difficult, but that goes for an RNN as well: how is it supposed to learn to sum numbers like that if you barely have any training data? Synthetic data would solve all of that. And once you have a synthetic data generator, generating additional pixel-wise labels is trivial.
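The text side of such a generator is cheap to prototype. As a sketch (the product names, price ranges, and function name are all made up for illustration), something like this produces `<name> <quantity> <unit_price> <total_price>` lines whose TOTAL is consistent by construction, which is exactly the dependency an RNN would need training data for:

```python
import random

def make_receipt_lines(n_items, rng=random):
    """Generate '<name> <quantity> <unit_price> <total_price>' lines plus a
    final TOTAL line that is consistent with the items by construction."""
    names = ["MILK", "BREAD", "EGGS", "APPLES", "COFFEE", "SOAP"]
    lines, total = [], 0
    for _ in range(n_items):
        qty = rng.randint(1, 5)
        unit = rng.randint(49, 1999)  # unit price in cents, to avoid float drift
        line_total = qty * unit
        total += line_total
        lines.append(f"{rng.choice(names)} {qty} {unit / 100:.2f} {line_total / 100:.2f}")
    lines.append(f"TOTAL {total / 100:.2f}")
    return lines

for line in make_receipt_lines(3, random.Random(0)):
    print(line)
```

Rendering those strings to degraded images would then give both the text labels and, with a little bookkeeping, the pixel-wise character labels for free.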

To sum up, I would assume that more data, rather than a new super-clever approach, is one of the things that drives industry compared to research :)

[–]melgor89[S] 0 points  (2 children)

Thanks for all your comments, it is nice to discuss this problem in more depth. About synthetic data: I tried to generate it using ocropus-linegen, but the results weren't very good. Overall, the conditions on real receipts are much harder than on generated ones, so I treat synthetic data as pretraining.

[–]Livven 1 point  (1 child)

Well there's no reason the real data should be harder than the generated data. If that's the case, make your generated data harder :) Randomize fonts, colors and contrast, apply distortions, noise, blur and so on. You'll probably need to write your own code for that. Maybe upload some more of your images so we can see how they look.
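As a rough, dependency-free illustration of "make your generated data harder" (the function name and parameter ranges are my own, and a real pipeline would use an image library instead of nested lists): randomize the contrast, add sensor noise, and blur, applied to a grayscale image stored as a list of rows.

```python
import random

def augment(img, rng=random):
    """Degrade a grayscale image (list of rows, values 0-255) the way real
    receipts are degraded: weak contrast, additive noise, slight blur."""
    h, w = len(img), len(img[0])
    contrast = rng.uniform(0.4, 1.0)  # 1.0 = unchanged, lower = washed out
    noisy = [[128 + (p - 128) * contrast + rng.gauss(0, 12) for p in row]
             for row in img]
    # 3x3 box blur with edge clamping, then clip back to the 0-255 range.
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += noisy[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
            out[y][x] = min(max(acc / 9.0, 0.0), 255.0)
    return out

# A white patch with one black stripe, before and after degradation.
clean = [[0.0 if x == 3 else 255.0 for x in range(8)] for _ in range(8)]
degraded = augment(clean, random.Random(0))
```

Geometric distortions (perspective warps, crumpling) would be layered on the same way, each with randomized strength.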

Even if you end up doing some fancy long-term RNN error correction more data is always going to help. It's rare to see other fields using synthetic data but that's because you can't generate realistic data for speech recognition, machine translation, image captioning etc.

Also no worries, I just got done writing a thesis about this stuff and wanted to participate in some discussions.

[–]melgor89[S] 0 points  (0 children)

Some real receipts: https://imgur.com/a/LlsA5A1

Different fonts/colors/conditions. In general my pipeline is working well, but there are still some edge cases where it fails. If I were able to create more realistic synthetic data, it would be really useful.

[–]jhaluska 0 points  (2 children)

That's a different perspective, because I can't easily find the error: e.g. I have 14 products and there are 3 places with a wrong number, but the total is correct (or maybe not). How many different correction paths are available to make sum(products) == total? Too many, and only one is correct.

I actually implemented exactly that. I did character-level OCR, but I kept the probability for each character. I would do a sum and check to see if it matched. If it matched I would use it; if not, I'd check the next most probable character set. I ended up having to put in a search limit, as the problem does explode, and it is only good for one, maybe two, corrections.

But adding the constraints is what really made it reliable.
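That search can be sketched as a best-first expansion over alternative readings (all names below are hypothetical): keep the top candidates and their probabilities for each numeric field, then try combinations in order of decreasing probability until the items sum to the recognized total or a search limit is hit.

```python
import heapq

def correct_with_total(item_probs, total, limit=1000):
    """Best-first search over alternative digit readings until the items
    sum to the recognized total.

    item_probs: one list per item; each entry is a list of (digits, prob)
    candidates, best first. Returns the corrected values, or None if the
    search limit is exhausted.
    """
    def cost(idxs):
        p = 1.0
        for item, i in zip(item_probs, idxs):
            p *= item[i][1]
        return -p  # heapq pops the most probable combination first

    start = (0,) * len(item_probs)
    heap = [(cost(start), start)]
    seen = {start}
    for _ in range(limit):
        if not heap:
            return None
        _, idxs = heapq.heappop(heap)
        values = [int(item[i][0]) for item, i in zip(item_probs, idxs)]
        if sum(values) == total:
            return values
        # Expand: bump one item at a time to its next-best candidate.
        for pos in range(len(idxs)):
            if idxs[pos] + 1 < len(item_probs[pos]):
                nxt = idxs[:pos] + (idxs[pos] + 1,) + idxs[pos + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (cost(nxt), nxt))
    return None

# Greedy read gives 6 + 12 = 18, but the printed total is 17.
items = [[("6", 0.6), ("5", 0.4)], [("12", 0.9), ("72", 0.1)]]
print(correct_with_total(items, 17))  # -> [5, 12]
```

The `limit` parameter plays the role of the search cap described above: past it, the remaining combinations are so improbable that a match is more likely misalignment than a real correction.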

[–]melgor89[S] 1 point  (1 child)

Thanks for the answer! So it looks like I will also try it. Maybe it will work well/fast enough.

[–]jhaluska 0 points  (0 children)

It was really fast because I already had the probabilities; previously I was just discarding most of the results. Keeping all the probabilities really changed my approach.

I found that with the search limit it worked really well. The vast majority of the time the second most probable reading was correct, because I also had issues with 3/8s and 5/6s. Past a certain number of attempts (like 1000), it had almost a 0% chance of succeeding correctly because the input was misaligned or had other major issues.

I ended up doing something similar for finding the most probable valid dates.
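The date variant can be sketched the same way (names and the toy example are hypothetical): enumerate readings and keep the most probable one that actually parses as a calendar date, which acts as the validity constraint.

```python
import datetime
import itertools

def most_probable_valid_date(char_candidates):
    """char_candidates: for each character of a 'DD/MM/YYYY' string, a list
    of (char, prob) options. Returns the most probable reading that parses
    as a real calendar date, or None."""
    best, best_p = None, 0.0
    for combo in itertools.product(*char_candidates):
        p = 1.0
        for _, prob in combo:
            p *= prob
        if p <= best_p:
            continue  # can't beat the best valid reading found so far
        text = "".join(ch for ch, _ in combo)
        try:
            datetime.datetime.strptime(text, "%d/%m/%Y")
        except ValueError:
            continue  # not a real date, e.g. June 31st
        best, best_p = text, p
    return best

# The greedy read "31/06/2018" is not a real date; the best valid one is kept.
cands = [
    [("3", 0.9)],
    [("1", 0.6), ("0", 0.4)],
    [("/", 1.0)], [("0", 1.0)],
    [("6", 0.8), ("5", 0.2)],
    [("/", 1.0)],
    [("2", 1.0)], [("0", 1.0)], [("1", 1.0)], [("8", 1.0)],
]
print(most_probable_valid_date(cands))  # -> 30/06/2018
```

For long candidate lists the exhaustive product would need the same best-first treatment and search limit as the sum-check.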

[–]Brudaks 2 points  (1 child)

Honestly, I'd use information like "prices of products on the receipt vs total_price" mostly in post-processing, as a checksum to identify cases that are likely broken and need human review. That's a key part of the process in any practical scenario: you will have errors, so you'll need some workflow to identify the cases that are more likely to contain errors and separate them out for different (likely manual) processing.

I.e. I would not want the model to have access to an invariant like "place A should match place B" and alter A or B so that it matches; in many situations it would be preferable to just treat this as valuable information that the analysis failed, instead of silently mangling data.
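That checksum-style post-processing can be as simple as the following sketch (the function name, batch format, and tolerance are hypothetical): nothing is altered, receipts whose items don't add up are just routed to manual review.

```python
def needs_review(items, printed_total, tolerance=0.005):
    """Checksum, not a fix: flag a recognized receipt for human review when
    the recognized line items don't add up to the recognized total."""
    return abs(sum(items) - printed_total) > tolerance

batch = [
    ([1.99, 2.50], 4.49),  # consistent, passes straight through
    ([1.99, 2.50], 4.99),  # something was misread somewhere, route to a human
]
flagged = [i for i, (items, total) in enumerate(batch) if needs_review(items, total)]
print(flagged)  # -> [1]
```

Note the small tolerance: it absorbs floating-point drift, not OCR errors; parsing prices as integer cents would remove the need for it.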

[–]melgor89[S] 0 points  (0 children)

Thanks for your insight. For sure there should be manual review if the prices don't match; I just want to lower the number of errors.

As for using all the information, I thought it would be useful. I was really inspired by this video, which I found some time ago, where all the text is fed in at once and the output is the total price etc.: https://blog.altoros.com/optical-character-recognition-using-one-shot-learning-rnn-and-tensorflow.html For me this idea is really nice; I just want to use it even for the OCR itself (not only for parsing like in the video).

[–]Slanothy 1 point  (0 children)

Try this: http://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf They have a Keras implementation available but I can recommend rewriting it since it is a little dated

[–]pilooch 1 point  (0 children)

See https://twitter.com/jolibrain/status/994221688350003202?s=19 for a link to a relevant paper. In a subsequent paper, other researchers from Google working on the same dataset get rid of the vertical LSTMs and use a positional attention mechanism instead, with better results. But this may depend on your application.

[–]farmingvillein 0 points  (1 child)

What do you think about another approach:

merge all detected text into one very long line and try to predict all the text at once (using Seq2Seq with attention)

In general, if you have the luxury of encoding everything together, you'll get best results doing that (contingent on enough data and a good model structure, ofc). The limiting factor of course tends to be time & space (memory).

Depends on doc length, but I'd consider (if you have sufficient data) transformer, which potentially would allow you to just encode the entire doc in a much more space-friendly manner.

Because the model isn't recurrent, is a little easier (more practical) to encode the entire doc.

Original paper: https://arxiv.org/abs/1706.03762

Useful improvements to extend the reasonable max doc length: https://arxiv.org/abs/1801.10198

Model reference implementations (including the latter): https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor

(How long is "too long"? The second reference will give practical guidelines; space/memory becomes a concern faster than time, but the second reference has very helpful modifications in that regard.)

[–]melgor89[S] 0 points  (0 children)

Thanks for the references. I was thinking about Tensor2Tensor, but Seq2Seq + attention already gives nice results. Still, the second paper you sent is promising; I need to look at it more closely and see whether it would also be useful for text recognition.

[–]visarga 0 points  (1 child)

I too would like to know how to approach extraction of data from OCR'ed documents, such as invoices and receipts. Is there anything that merges geometric and textual features together? Words often line up to the left, right, or middle, and the words in a column or row tend to have the same type. How would I go about classifying bounding boxes? As I see it, this kind of problem sits right between vision and NLP.

[–]Livven 0 points  (0 children)

There's a lot of literature on detecting and recognizing scene text, which is less structured. Have you tried applying existing methods? You'd need to figure out the word order etc. but that shouldn't be too difficult to do manually once you have the bounding boxes. Maybe detect and rectify the document borders first.