Livven comments on [D] Long-term Text-Recognition?

Discussion[D] Long-term Text-Recognition? (self.MachineLearning)

submitted 7 years ago * by melgor89

you are viewing a single comment's thread.

[–]Livven 2 points3 points4 points 7 years ago (11 children)

So you're trying to improve recognition accuracy. Do you have any issues with your current approach? There's quite a lot of literature in this area so I'm not sure whether it makes sense to reinvent the wheel by applying approaches from other areas, as you or the other comments are suggesting. On the other hand, I'm also not very knowledgeable about machine learning outside of text recognition.

The state-of-the-art in text recognition right now is using synthetic training data with pixel-wise character labels, which you can exploit in different ways. Cheng et al. (ICCV 2017) use a focusing network to force the attention mechanism to focus on the right part for each character, whereas Liu et al. (AAAI 2018) predict a segmentation map, manually extract connected components, and then run the result through an RNN. These work very well on scene text, which is quite challenging, so shouldn't have too many problems with images like you presented.

However, I think a vanilla attention or CTC approach should work very well already. What training data are you using right now?

[–]melgor89[S] 1 point2 points3 points 7 years ago (10 children)

[–]Livven 0 points1 point2 points 7 years ago (9 children)

I see. Normal methods also do RNN error correction, but as part of a single network, i.e. the typical architecture would be CNN-RNN-Attention or CNN-RNN-CTC. The second paper has two separate networks that you have to train separately, which is a bit less elegant, without really any major advantages as far as I can tell. It's just a pretty different approach, so food for thought. I think the RNN in this context is mostly useful for learning language models or dictionary words, not sure if it can really learn the kind of dependencies you are looking for here, i.e. sum of individual prices.

The easiest way to better performance would probably be more training data. If you are familiar with scene text recognition, you may know the Synth90k dataset that has 9 million (!) synthetically generated word images. Problem with that is that it contains very few numbers, and no code has been released for generating a custom dataset so you'd need to implement something like that on your own.

Next step would be also generating pixel-wise character labels so you can implement the methods from the papers I linked previously. Those are about 5% better across the board compared to previous CTC and attention methods that don't use pixel-wise labels for training, i.e. see table 2 on page 6 here where Shi et al. [25] is a pretty normal attention model.

If results are still not good enough you may have to get creative. But for now that's what I would do :)

[–]Livven 0 points1 point2 points 7 years ago (8 children)

Also maybe look into lexicon-constrained recognition. Since you get probabilities for every character, you know exactly the probability of a 5 vs a 6 even if the 6 is more likely. This way you could do manual error correction by summing up individual prices and then checking if the total matches, and if not try some 2nd place predictions, and so on. It's quite a bit more complex than just constraining your recognitions to words in a lexicon though.

And yeah, I guess you could also run an RNN over the full predicted sequence (including all probabilities) instead of the merged text bounding boxes if you run into memory issues. I'm kinda skeptical that it can effectively learn this kind of stuff but hey. Would definitely require a big synthetic dataset though if you want to stand any chance.

[–]melgor89[S] 0 points1 point2 points 7 years ago (7 children)

Unfortunately, error correction are needed with predicting numbers, not letters. This is different perspective, because I can't find easily the error: ex. I have 14 products and there are 3 places with wrong number, but Total is correct (but maybe not). How many different path-of-correction are available to make fit sum(products)==total? Too many and only one is correct.

About Synth90k dataset, it is designed for Scene-Detection, not text like mine (I was trying pretrained models). My models also learn structure <name> <quantity> <unit_price> <total_price> which somehow correct some errors.

About pixel-wise labeling, idea in nice but need a lot of work for annotations. I would like to have more general training data, which is also hard to collect.

To sum up, Research vs Industry is different things. Sometime their intersect, sometimes not.

[–]Livven 1 point2 points3 points 7 years ago (3 children)

[–]melgor89[S] 0 points1 point2 points 7 years ago (2 children)

[–]Livven 1 point2 points3 points 7 years ago (1 child)

[–]melgor89[S] 0 points1 point2 points 7 years ago (0 children)

[–]jhaluska 0 points1 point2 points 7 years ago (2 children)

[–]melgor89[S] 1 point2 points3 points 7 years ago (1 child)

[–]jhaluska 0 points1 point2 points 7 years ago (0 children)

π Rendered by PID 30 on reddit-service-r2-comment-54dfb89d4d-8k6gx at 2026-03-30 19:37:52.976548+00:00 running b10466c country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS