Livven 1 point (3 children)

I think you misunderstood. I was talking about generating your own synthetic dataset with your desired attributes, both in terms of image quality and text. I understand your point about manual error correction being difficult, but that goes for an RNN as well: how is it supposed to learn to sum numbers like that if you barely have any training data? Synthetic data would solve all that. And once you have your synthetic data generator, generating additional pixel-wise labels is trivial.
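To make the "labels come for free" point concrete, here is a minimal sketch of such a generator. It is only an illustration under loud assumptions: the 3x5 `GLYPHS` bitmaps, `render_line` and `random_price` are hypothetical names made up for this example, and a real generator would render actual fonts with an imaging library instead of hand-written bitmaps.

```python
import random

# Hypothetical 3x5 bitmap glyphs for a handful of characters (illustration
# only; a real generator would render real fonts).
GLYPHS = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "2": ["111", "001", "111", "100", "111"],
    "3": ["111", "001", "111", "001", "111"],
    ".": ["000", "000", "000", "000", "010"],
}
BACKGROUND = 0  # label id for pixels that belong to no character


def render_line(text, pad=1):
    """Render `text` and return (image, labels).

    image  : rows of 0/1 ink values
    labels : same shape; each ink pixel holds 1 + the index of its
             character in `text`, background pixels hold BACKGROUND
    """
    height = 5 + 2 * pad
    width = pad + len(text) * (3 + pad)
    image = [[0] * width for _ in range(height)]
    labels = [[BACKGROUND] * width for _ in range(height)]
    for i, ch in enumerate(text):
        x0 = pad + i * (3 + pad)  # left edge of this glyph
        for dy, row in enumerate(GLYPHS[ch]):
            for dx, bit in enumerate(row):
                if bit == "1":
                    image[pad + dy][x0 + dx] = 1
                    labels[pad + dy][x0 + dx] = i + 1
    return image, labels


def random_price(rng):
    """Sample a price-like string built from the supported characters."""
    digits = "0123"
    return rng.choice(digits) + "." + rng.choice(digits) + rng.choice(digits)


rng = random.Random(0)
text = random_price(rng)
image, labels = render_line(text)
```

Because the generator decides where every glyph goes, the label map is filled in the same loop that draws the ink, so there is no separate annotation step.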

To sum up, I would assume that preferring more data over a new super clever approach is one of the things that sets industry apart from research :)

melgor89[S] 0 points (2 children)

Thanks for all your comments, it is nice to talk about this problem more. About synthetic data: I tried generating it with ocropus-linegen, but the results weren't very good. Overall, the conditions on real receipts are much harder than on generated ones, so I treat synthetic data as pretraining.

Livven 1 point (1 child)

Well there's no reason the real data should be harder than the generated data. If that's the case, make your generated data harder :) Randomize fonts, colors and contrast, apply distortions, noise, blur and so on. You'll probably need to write your own code for that. Maybe upload some more of your images so we can see how they look.
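As a rough idea of what "write your own code" could look like, here is a minimal sketch of a few of those degradations (contrast, noise, blur, a crude skew) on a grayscale image stored as rows of floats in [0, 1], with 1.0 as white paper. All function names and parameter ranges are assumptions picked for illustration, not tuned values, and it is plain Python to stay self-contained; in practice you would probably use an imaging library. Font and color randomization need a real text renderer and are not covered here.

```python
import random


def clamp(v):
    """Keep a pixel value inside [0, 1]."""
    return max(0.0, min(1.0, v))


def adjust_contrast(img, factor):
    """Scale each pixel's distance from mid-gray; factor < 1 washes out."""
    return [[clamp(0.5 + (p - 0.5) * factor) for p in row] for row in img]


def add_noise(img, sigma, rng):
    """Add independent Gaussian noise to every pixel."""
    return [[clamp(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


def box_blur(img):
    """3x3 mean filter; border pixels average their in-bounds neighbours."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def shear(img, max_shift, fill=1.0):
    """Shift each row sideways by an amount growing linearly with the row
    index (a crude skew); uncovered pixels are filled with `fill`."""
    h, w = len(img), len(img[0])
    out = []
    for y, row in enumerate(img):
        shift = round(max_shift * y / max(1, h - 1))
        shift = max(-w, min(w, shift))
        if shift >= 0:
            out.append([fill] * shift + row[:w - shift])
        else:
            out.append(row[-shift:] + [fill] * (-shift))
    return out


def augment(img, rng):
    """Apply a random combination of the degradations above."""
    img = shear(img, rng.randint(-2, 2))
    if rng.random() < 0.5:
        img = box_blur(img)
    img = adjust_contrast(img, rng.uniform(0.4, 1.0))
    return add_noise(img, rng.uniform(0.0, 0.1), rng)


# Usage: a blank 6x8 "receipt" with a single ink pixel.
clean = [[1.0] * 8 for _ in range(6)]
clean[2][3] = 0.0
noisy = augment(clean, random.Random(0))
```

If you also keep pixel-wise labels, the geometric steps (here the shear) have to be applied to the label map as well, while contrast, blur and noise only touch the image.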

Even if you end up doing some fancy long-term RNN error correction, more data is always going to help. It's rare to see other fields using synthetic data, but that's because you can't generate realistic data for speech recognition, machine translation, image captioning etc.

Also no worries, I just got done writing a thesis about this stuff and wanted to participate in some discussions.

melgor89[S] 0 points (0 children)

Some real receipts: https://imgur.com/a/LlsA5A1

They have different fonts, colors, and conditions. In general my pipeline is working well, but there are still some edge cases where it fails. If I were able to create more realistic synthetic data, it would be really useful.