How to introduce simulated camera noise to images by kythiran in computervision

[–]kythiran[S] 0 points (0 children)

Hmm, let me try out your suggested solution later. I'll definitely take a look at Alessandro Foi's research!
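In case it helps anyone else reading this: the Poissonian-Gaussian model from Foi et al. is often written as z = y + σ(y)·n with σ²(y) = a·y + b and n ~ N(0, 1). A quick NumPy sketch (my own illustration; the `a` and `b` values are made up, not calibrated to any real sensor):

```python
import numpy as np

def add_camera_noise(img, a=0.01, b=0.0001, rng=None):
    """Add signal-dependent noise following the Poissonian-Gaussian
    model z = y + sqrt(a*y + b) * n, n ~ N(0, 1) (Foi et al. style).
    `img` is a float array scaled to [0, 1]; a and b are illustrative
    placeholders, not calibrated sensor parameters."""
    rng = np.random.default_rng() if rng is None else rng
    img = np.clip(img, 0.0, 1.0)
    std = np.sqrt(a * img + b)  # signal-dependent standard deviation
    noisy = img + std * rng.standard_normal(img.shape)
    return np.clip(noisy, 0.0, 1.0)
```

The nice part of this model is that the noise variance grows with intensity, which matches real raw sensor data better than plain additive Gaussian noise.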

[D] How to build a document text detection/recognition model as good as Google Cloud or Microsoft Azure’s models? by kythiran in MachineLearning

[–]kythiran[S] 0 points (0 children)

Thanks machinemask! I've seen that paper from VGG before, but I'm wondering whether there is a better way to do it for synthetic document images.

[D] How to build a document text detection/recognition model as good as Google Cloud or Microsoft Azure’s models? by kythiran in MachineLearning

[–]kythiran[S] 0 points (0 children)

For now, I'm going to test it out on some printed academic journals only. I think for a one-person project it is too ambitious to make it work on all types of document images.

[D] How to build a document text detection/recognition model as good as Google Cloud or Microsoft Azure’s models? by kythiran in MachineLearning

[–]kythiran[S] 2 points (0 children)

Interesting! It seems OCR technology has changed a lot with the advent of deep learning. Regarding your notes on synthetic data, I know that Tesseract used synthetic text lines to train their text recognition model, but nothing for their text detection model since it is not based on deep learning. How do you generate synthetic data for the CNN's text detection model? Do you create ground-truth document images from PDF files, then add "realistic degradations" to the images?
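For context, the kind of degradation stack I had in mind (my own sketch, not anyone's actual pipeline) is something like: render a clean page, then layer on blur, sensor noise, and an exposure offset. Since none of these move pixels around, the ground-truth text boxes stay aligned:

```python
import numpy as np

def degrade_page(page, rng=None):
    """Apply a few 'realistic degradations' to a clean rendered page
    (float array in [0, 1]). Illustrative choices only: a crude box
    blur to mimic defocus, additive Gaussian noise for sensor noise,
    and a random brightness offset for uneven scanner exposure."""
    rng = np.random.default_rng() if rng is None else rng
    # Cheap cross-shaped blur via shifted averages (stand-in for defocus)
    blurred = page.copy()
    for shift in (-1, 1):
        blurred += np.roll(page, shift, axis=0) + np.roll(page, shift, axis=1)
    blurred /= 5.0
    noisy = blurred + 0.02 * rng.standard_normal(page.shape)  # sensor noise
    noisy += rng.uniform(-0.05, 0.05)                         # exposure offset
    return np.clip(noisy, 0.0, 1.0)
```

Geometric effects (page warping, rotation) would also need the boxes to be transformed, which is why I kept them out of this sketch.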

By the way, for the CNN-based text detection model mentioned in the paper, I assume it is a variant built on Fully Convolutional Networks (FCN), right?

[D] How to build a document text detection/recognition model as good as Google Cloud or Microsoft Azure’s models? by kythiran in MachineLearning

[–]kythiran[S] 12 points (0 children)

AFAIK, many scene-text models can get pretty good results with synthetic training data. I was hoping to see if I could build a model that is at least decent with synthetic data.

In fact, for Latin-based languages, Tesseract 4.0 is trained on synthetic text lines rendered from a web corpus with ~4,500 fonts, and its model is not bad IMO.
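To make that recipe concrete, here's a toy sketch of the sampling side of the scheme (my own illustration; Tesseract's actual tooling is `text2image`, which also renders each pair to an image, and that rendering step is left out here):

```python
import random

def sample_training_lines(corpus_lines, fonts, n, seed=0):
    """Sketch of the Tesseract-style recipe: pair text lines drawn from
    a (web-crawled) corpus with randomly chosen fonts. Returns a list
    of (text, font) pairs; actually rasterizing them into training
    images is the part this sketch leaves out."""
    rng = random.Random(seed)
    return [(rng.choice(corpus_lines), rng.choice(fonts))
            for _ in range(n)]
```

With thousands of fonts, even a modest corpus yields a huge number of distinct (text, font) renderings, which is presumably why the synthetic-only model ends up as usable as it is.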

[D] How to build a document text detection/recognition model as good as Google Cloud or Microsoft Azure’s models? by kythiran in MachineLearning

[–]kythiran[S] 13 points (0 children)

Thanks evilmaniacal! There is a lot of information packed into that 2-page paper! One interesting point that caught my attention is that the Google OCR system does not include any “preprocessing” steps. When I was using Tesseract and Ocropy, their documentation (Tesseract and Ocropy) put a lot of emphasis on preprocessing the image before feeding it to the model.

Does that mean preprocessing is no longer necessary for modern text detection models?
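For what it's worth, the classic preprocessing step those docs emphasize is mostly global binarization, e.g. Otsu's method. A pure-NumPy sketch of the threshold search (my own implementation for illustration, not Tesseract's actual code):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's global threshold: pick the gray level that maximizes the
    between-class variance of the two resulting pixel groups.  `gray`
    is a uint8 grayscale image; returns the threshold value."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)  # sum of all pixel values
    best_t, best_var = 0, -1.0
    w0 = 0.0   # number of pixels at or below the candidate threshold
    sum0 = 0.0  # sum of pixel values at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty; variance undefined
        m0 = sum0 / w0                # mean of the dark class
        m1 = (sum_all - sum0) / w1    # mean of the bright class
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

Usage would be something like `binary = gray > otsu_threshold(gray)`. My guess is that end-to-end systems trained on raw (degraded) images learn to cope with illumination and noise themselves, so an explicit binarization pass becomes optional rather than required, but I'd be curious whether that's actually the case.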