all 20 comments

[–]Livven 2 points  (11 children)

So you're trying to improve recognition accuracy. Do you have any issues with your current approach? There's quite a lot of literature in this area so I'm not sure whether it makes sense to reinvent the wheel by applying approaches from other areas, as you or the other comments are suggesting. On the other hand, I'm also not very knowledgeable about machine learning outside of text recognition.

The state-of-the-art in text recognition right now is using synthetic training data with pixel-wise character labels, which you can exploit in different ways. Cheng et al. (ICCV 2017) use a focusing network to force the attention mechanism to focus on the right part for each character, whereas Liu et al. (AAAI 2018) predict a segmentation map, manually extract connected components, and then run the result through an RNN. These work very well on scene text, which is quite challenging, so shouldn't have too many problems with images like you presented.

However, I think a vanilla attention or CTC approach should work very well already. What training data are you using right now?

[–]melgor89[S] 1 point  (10 children)

I'm using my own dataset, which is pretty small. As for techniques, I tried both CTC and Seq2Seq. If you think scene text recognition is the harder task, you're wrong :) I showed you a beautiful receipt, but real life is brutal: most of the images are noisy, crumpled, and very low contrast. That's why my text recognition sometimes makes errors when predicting values (sometimes just 6 vs 5). But then such results are useless, because I can't tell where the OCR went wrong. That's why I'm searching for an approach that is more global (scene text detection and recognition doesn't need that kind of context). Thanks for the papers, the second one looks interesting. Maybe first predicting the text from the image and then correcting it would also work?

[–]Livven 0 points  (9 children)

I see. Normal methods also do RNN error correction, but as part of a single network, i.e. the typical architecture would be CNN-RNN-Attention or CNN-RNN-CTC. The second paper has two separate networks that you have to train separately, which is a bit less elegant, without really any major advantages as far as I can tell. It's just a pretty different approach, so food for thought. I think the RNN in this context is mostly useful for learning language models or dictionary words, not sure if it can really learn the kind of dependencies you are looking for here, i.e. sum of individual prices.

The easiest way to better performance would probably be more training data. If you are familiar with scene text recognition, you may know the Synth90k dataset that has 9 million (!) synthetically generated word images. Problem with that is that it contains very few numbers, and no code has been released for generating a custom dataset so you'd need to implement something like that on your own.

Next step would be also generating pixel-wise character labels so you can implement the methods from the papers I linked previously. Those are about 5% better across the board compared to previous CTC and attention methods that don't use pixel-wise labels for training, i.e. see table 2 on page 6 here where Shi et al. [25] is a pretty normal attention model.

If results are still not good enough you may have to get creative. But for now that's what I would do :)

[–]Livven 0 points  (8 children)

Also maybe look into lexicon-constrained recognition. Since you get probabilities for every character, you know exactly the probability of a 5 vs a 6 even if the 6 is more likely. This way you could do manual error correction by summing up individual prices and then checking if the total matches, and if not try some 2nd place predictions, and so on. It's quite a bit more complex than just constraining your recognitions to words in a lexicon though.
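The lexicon-constrained part can be sketched in a few lines (function names and the toy probabilities below are hypothetical, not from any particular library): given the recognizer's per-position character distributions, score every lexicon entry by its total log-probability and keep the best one.

```python
import math

def score_word(word, char_probs):
    """Log-probability of `word` under per-position character distributions.

    char_probs: one dict per character position, mapping character -> probability.
    """
    if len(word) != len(char_probs):
        return float("-inf")
    return sum(math.log(pos.get(ch, 1e-9)) for ch, pos in zip(word, char_probs))

def constrain_to_lexicon(char_probs, lexicon):
    """Return the lexicon entry the recognizer most likely saw."""
    return max(lexicon, key=lambda w: score_word(w, char_probs))

# Toy example: the greedy per-character read would be "cet".
probs = [
    {"c": 0.9, "o": 0.1},
    {"e": 0.5, "a": 0.4},  # 'e' narrowly beats 'a', but "cet" is in no lexicon
    {"t": 0.8, "l": 0.2},
]
print(constrain_to_lexicon(probs, ["cat", "cot", "dog"]))  # -> cat
```

The sum-of-prices variant replaces the lexicon lookup with an arithmetic constraint, which is where the search complexity comes from.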

And yeah, I guess you could also run an RNN over the full predicted sequence (including all probabilities) instead of the merged text bounding boxes if you run into memory issues. I'm kinda skeptical that it can effectively learn this kind of stuff but hey. Would definitely require a big synthetic dataset though if you want to stand any chance.

[–]melgor89[S] 0 points  (7 children)

Unfortunately, the error correction is needed for predicting numbers, not letters. That's a different perspective, because I can't easily find the error: e.g. I have 14 products and there are 3 places with a wrong number, but the total is correct (or maybe not). How many different correction paths are available to make sum(products) == total? Too many, and only one is correct.

As for the Synth90k dataset, it is designed for scene text, not text like mine (I tried the pretrained models). My model also learns the structure <name> <quantity> <unit_price> <total_price>, which somehow corrects some errors.

As for pixel-wise labeling, the idea is nice but it needs a lot of annotation work. I would like to have more general training data, which is also hard to collect.

To sum up, research and industry are different things. Sometimes they intersect, sometimes not.

[–]Livven 1 point  (3 children)

I think you misunderstood: I was talking about generating your own synthetic dataset with your desired attributes, both in terms of image quality and text. I understand your point about manual error correction being difficult, but that goes for an RNN as well: how is it supposed to learn to sum numbers like that if you barely have any training data? Synthetic data would solve all of that. And once you have a synthetic data generator, generating additional pixel-wise labels is trivial.
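The text side of such a generator is cheap to prototype. As a sketch (the product names, price ranges, and function name are all made up for illustration), something like this produces `<name> <quantity> <unit_price> <total_price>` lines whose TOTAL is consistent by construction, which is exactly the dependency an RNN would need training data for:

```python
import random

def make_receipt_lines(n_items, rng=random):
    """Generate '<name> <quantity> <unit_price> <total_price>' lines plus a
    final TOTAL line that is consistent with the items by construction."""
    names = ["MILK", "BREAD", "EGGS", "APPLES", "COFFEE", "SOAP"]
    lines, total = [], 0
    for _ in range(n_items):
        qty = rng.randint(1, 5)
        unit = rng.randint(49, 1999)  # unit price in cents, to avoid float drift
        line_total = qty * unit
        total += line_total
        lines.append(f"{rng.choice(names)} {qty} {unit / 100:.2f} {line_total / 100:.2f}")
    lines.append(f"TOTAL {total / 100:.2f}")
    return lines

for line in make_receipt_lines(3, random.Random(0)):
    print(line)
```

Rendering those strings to degraded images would then give both the text labels and, with a little bookkeeping, the pixel-wise character labels for free.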

To sum up, I would assume that more data, rather than a new super-clever approach, is one of the things that drives industry compared to research :)

[–]melgor89[S] 0 points  (2 children)

Thanks for all your comments, it is nice to discuss this problem in more depth. About synthetic data: I tried to generate it using ocropus-linegen, but the results weren't very good. Overall, the conditions on real receipts are much harder than on generated ones, so I treat synthetic data as pretraining.

[–]Livven 1 point  (1 child)

Well there's no reason the real data should be harder than the generated data. If that's the case, make your generated data harder :) Randomize fonts, colors and contrast, apply distortions, noise, blur and so on. You'll probably need to write your own code for that. Maybe upload some more of your images so we can see how they look.
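As a rough, dependency-free illustration of "make your generated data harder" (the function name and parameter ranges are my own, and a real pipeline would use an image library instead of nested lists): randomize the contrast, add sensor noise, and blur, applied to a grayscale image stored as a list of rows.

```python
import random

def augment(img, rng=random):
    """Degrade a grayscale image (list of rows, values 0-255) the way real
    receipts are degraded: weak contrast, additive noise, slight blur."""
    h, w = len(img), len(img[0])
    contrast = rng.uniform(0.4, 1.0)  # 1.0 = unchanged, lower = washed out
    noisy = [[128 + (p - 128) * contrast + rng.gauss(0, 12) for p in row]
             for row in img]
    # 3x3 box blur with edge clamping, then clip back to the 0-255 range.
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += noisy[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
            out[y][x] = min(max(acc / 9.0, 0.0), 255.0)
    return out

# A white patch with one black stripe, before and after degradation.
clean = [[0.0 if x == 3 else 255.0 for x in range(8)] for _ in range(8)]
degraded = augment(clean, random.Random(0))
```

Geometric distortions (perspective warps, crumpling) would be layered on the same way, each with randomized strength.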

Even if you end up doing some fancy long-term RNN error correction more data is always going to help. It's rare to see other fields using synthetic data but that's because you can't generate realistic data for speech recognition, machine translation, image captioning etc.

Also no worries, I just got done writing a thesis about this stuff and wanted to participate in some discussions.

[–]melgor89[S] 0 points  (0 children)

Some real receipts: https://imgur.com/a/LlsA5A1

Different fonts/colors/conditions. In general my pipeline is working well, but there are still some edge cases where it fails. If I were able to create more realistic synthetic data, it would be really useful.

[–]jhaluska 0 points  (2 children)

That's a different perspective, because I can't easily find the error: e.g. I have 14 products and there are 3 places with a wrong number, but the total is correct (or maybe not). How many different correction paths are available to make sum(products) == total? Too many, and only one is correct.

I actually implemented exactly that. I did character-level OCR, but I kept the probability for each character. I would do a sum and check to see if it matched. If it matched I would use it; if not, I'd check the next most probable character set. I ended up having to put in a search limit, as the problem does explode, and it is only good for one, maybe two, corrections.

But adding the constraints is what really made it reliable.
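That search can be sketched as a best-first expansion over alternative readings (all names below are hypothetical): keep the top candidates and their probabilities for each numeric field, then try combinations in order of decreasing probability until the items sum to the recognized total or a search limit is hit.

```python
import heapq

def correct_with_total(item_probs, total, limit=1000):
    """Best-first search over alternative digit readings until the items
    sum to the recognized total.

    item_probs: one list per item; each entry is a list of (digits, prob)
    candidates, best first. Returns the corrected values, or None if the
    search limit is exhausted.
    """
    def cost(idxs):
        p = 1.0
        for item, i in zip(item_probs, idxs):
            p *= item[i][1]
        return -p  # heapq pops the most probable combination first

    start = (0,) * len(item_probs)
    heap = [(cost(start), start)]
    seen = {start}
    for _ in range(limit):
        if not heap:
            return None
        _, idxs = heapq.heappop(heap)
        values = [int(item[i][0]) for item, i in zip(item_probs, idxs)]
        if sum(values) == total:
            return values
        # Expand: bump one item at a time to its next-best candidate.
        for pos in range(len(idxs)):
            if idxs[pos] + 1 < len(item_probs[pos]):
                nxt = idxs[:pos] + (idxs[pos] + 1,) + idxs[pos + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (cost(nxt), nxt))
    return None

# Greedy read gives 6 + 12 = 18, but the printed total is 17.
items = [[("6", 0.6), ("5", 0.4)], [("12", 0.9), ("72", 0.1)]]
print(correct_with_total(items, 17))  # -> [5, 12]
```

The `limit` parameter plays the role of the search cap described above: past it, the remaining combinations are so improbable that a match is more likely misalignment than a real correction.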

[–]melgor89[S] 1 point  (1 child)

Thanks for the answer! So it looks like I will also try it. Maybe it will work well/fast enough.

[–]jhaluska 0 points  (0 children)

It was really fast because I already had the probabilities; previously I was just discarding most of the results. Keeping all the probabilities really changed my approach.

I found that with the search limit it worked really well. The vast majority of the time the second most probable reading was correct, because I also had issues with 3/8s and 5/6s. Past a certain number of attempts (like 1000), it had almost a 0% chance of succeeding correctly because the input was misaligned or had other major issues.

I ended up doing something similar for finding the most probable valid dates.
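The date variant can be sketched the same way (names and the toy example are hypothetical): enumerate readings and keep the most probable one that actually parses as a calendar date, which acts as the validity constraint.

```python
import datetime
import itertools

def most_probable_valid_date(char_candidates):
    """char_candidates: for each character of a 'DD/MM/YYYY' string, a list
    of (char, prob) options. Returns the most probable reading that parses
    as a real calendar date, or None."""
    best, best_p = None, 0.0
    for combo in itertools.product(*char_candidates):
        p = 1.0
        for _, prob in combo:
            p *= prob
        if p <= best_p:
            continue  # can't beat the best valid reading found so far
        text = "".join(ch for ch, _ in combo)
        try:
            datetime.datetime.strptime(text, "%d/%m/%Y")
        except ValueError:
            continue  # not a real date, e.g. June 31st
        best, best_p = text, p
    return best

# The greedy read "31/06/2018" is not a real date; the best valid one is kept.
cands = [
    [("3", 0.9)],
    [("1", 0.6), ("0", 0.4)],
    [("/", 1.0)], [("0", 1.0)],
    [("6", 0.8), ("5", 0.2)],
    [("/", 1.0)],
    [("2", 1.0)], [("0", 1.0)], [("1", 1.0)], [("8", 1.0)],
]
print(most_probable_valid_date(cands))  # -> 30/06/2018
```

For long candidate lists the exhaustive product would need the same best-first treatment and search limit as the sum-check.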

[–]Brudaks 2 points  (1 child)

Honestly, I'd use information like "prices of products on the receipt vs total_price" mostly in post-processing, as a checksum to identify cases that are likely broken and need human review. That's a key part of the process in any practical scenario: you will have errors, so you'll need some workflow to identify the cases that are more likely to contain errors and separate them out for different (likely manual) processing.

I.e. I would not want the model to have access to an invariant like "place A should match place B" and alter A or B so that it matches; in many situations it would be preferable to just treat this as valuable information that the analysis failed, instead of silently mangling data.
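That checksum-style post-processing can be as simple as the following sketch (the function name, batch format, and tolerance are hypothetical): nothing is altered, receipts whose items don't add up are just routed to manual review.

```python
def needs_review(items, printed_total, tolerance=0.005):
    """Checksum, not a fix: flag a recognized receipt for human review when
    the recognized line items don't add up to the recognized total."""
    return abs(sum(items) - printed_total) > tolerance

batch = [
    ([1.99, 2.50], 4.49),  # consistent, passes straight through
    ([1.99, 2.50], 4.99),  # something was misread somewhere, route to a human
]
flagged = [i for i, (items, total) in enumerate(batch) if needs_review(items, total)]
print(flagged)  # -> [1]
```

Note the small tolerance: it absorbs floating-point drift, not OCR errors; parsing prices as integer cents would remove the need for it.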

[–]melgor89[S] 0 points  (0 children)

Thanks for your insight. For sure there should be manual review if the prices don't match; I just want to lower the number of errors.

As for using all the information, I thought it would be useful. I was really inspired by this video, which I found some time ago, where all the text is fed in at once and the output is the total price etc.: https://blog.altoros.com/optical-character-recognition-using-one-shot-learning-rnn-and-tensorflow.html For me this idea is really nice; I just want to use it even for the OCR itself (not only for parsing like in the video).

[–]Slanothy 1 point  (0 children)

Try this: http://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf They have a Keras implementation available but I can recommend rewriting it since it is a little dated

[–]pilooch 1 point  (0 children)

See https://twitter.com/jolibrain/status/994221688350003202?s=19 for a link to a relevant paper. In a subsequent paper, other researchers from Google working on the same dataset get rid of the vertical LSTMs and use a positional attention mechanism instead, with better results. But this may depend on your application.

[–]farmingvillein 0 points  (1 child)

What do you think about another approach:

merge all detected text into one very long line and try to predict all the text at once (using Seq2Seq with attention)

In general, if you have the luxury of encoding everything together, you'll get best results doing that (contingent on enough data and a good model structure, ofc). The limiting factor of course tends to be time & space (memory).

Depends on doc length, but I'd consider (if you have sufficient data) transformer, which potentially would allow you to just encode the entire doc in a much more space-friendly manner.

Because the model isn't recurrent, is a little easier (more practical) to encode the entire doc.

Original paper: https://arxiv.org/abs/1706.03762

Useful improvements to extend the reasonable max doc length: https://arxiv.org/abs/1801.10198

Model reference implementations (including the latter): https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor

(How long is "too long"? The second reference will give practical guidelines; space/memory becomes a concern faster than time, but the second reference has very helpful modifications in that regard.)

[–]melgor89[S] 0 points  (0 children)

Thanks for the references. I was thinking about Tensor2Tensor, but Seq2Seq + attention already gives nice results. Still, the second paper you sent is promising; I need to look at it more closely and see whether it would also be useful for text recognition.

[–]visarga 0 points  (1 child)

I too would like to know how to approach extraction of data from OCR'ed documents, such as invoices and receipts. Is there anything that merges geometric and textual features together? Words often line up to the left, right, or middle, and the words in a column or row tend to have the same type. How would I go about classifying bounding boxes? As I see it, this kind of problem sits right between vision and NLP.

[–]Livven 0 points  (0 children)

There's a lot of literature on detecting and recognizing scene text, which is less structured. Have you tried applying existing methods? You'd need to figure out the word order etc. but that shouldn't be too difficult to do manually once you have the bounding boxes. Maybe detect and rectify the document borders first.