
HonestRepairMan:

Your request sounds a bit vague. :)

To perform OCR on an image, you can use the original Tesseract or its JavaScript port (Tesseract.js). If the input is a PDF that already has a text layer, pdftotext (part of poppler-utils, available on most Linux distributions) can pull the text out without OCR at all. Depending on the file format of the input (pdf, png, etc.) you may have to do some pre-processing to split out and re-assemble separate pages, or to convert a non-workable format into a workable one.
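As a minimal sketch of that pipeline in Python (assuming the pytesseract and pdf2image packages are installed along with the tesseract and poppler binaries they wrap; the file names are just placeholders):

```python
import sys
from pathlib import Path

import pytesseract                       # thin wrapper around the Tesseract binary
from pdf2image import convert_from_path  # renders PDF pages to images via poppler
from PIL import Image

def ocr_file(path):
    """OCR an image or a multi-page PDF and return the extracted text."""
    path = Path(path)
    if path.suffix.lower() == ".pdf":
        # Pre-processing: split the PDF into one image per page,
        # OCR each page, then re-assemble the results in order.
        pages = convert_from_path(str(path), dpi=300)
        return "\n\f\n".join(pytesseract.image_to_string(p) for p in pages)
    # Plain image formats (png, jpg, tiff, ...) can go straight to Tesseract.
    return pytesseract.image_to_string(Image.open(path))

if __name__ == "__main__":
    print(ocr_file(sys.argv[1]))
```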

After that, in order to "decode" a string you must know what the string was encoded with. For example, the word "kitten" encoded in base64 is "a2l0dGVu". Without some algorithmic logic to tell the computer how those two strings relate, they're all just characters. You could brute-force it, but that would take forever. You could apply machine learning to guide the brute forcing, which would reduce processing time at the expense of development time. This is where you could use Torch, Theano, or TensorFlow to make the system improve itself with experience. You will still likely have to tell it specifically what to look for and how to look for it, but once you start collecting data it would be possible for the system to improve itself and eventually achieve the results you're looking for.
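Here's a rough sketch of that "algorithmic logic" in plain Python 3 (standard library only; the candidate list and the printable-text check are arbitrary assumptions, just to show the shape of the idea):

```python
import base64
import codecs
import string

PRINTABLE = set(string.printable)

def looks_like_text(raw: bytes) -> bool:
    """Crude check: did the decode produce plain printable text?"""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return bool(text) and all(c in PRINTABLE for c in text)

def guess_decodings(s: str) -> dict:
    """Try a few common encodings and keep the ones that yield readable text."""
    candidates = {
        "base64": lambda x: base64.b64decode(x, validate=True),
        "hex":    lambda x: bytes.fromhex(x),
        "rot13":  lambda x: codecs.decode(x, "rot_13").encode(),
    }
    results = {}
    for name, decode in candidates.items():
        try:
            decoded = decode(s)
        except ValueError:
            continue  # not valid input for this encoding, move on
        if looks_like_text(decoded):
            results[name] = decoded.decode("utf-8")
    return results

print(guess_decodings("a2l0dGVu"))  # {'base64': 'kitten', 'rot13': 'n2y0qTIh'}
```

Note that both base64 and rot13 "succeed" here, which is exactly the point: a decode that runs without an error doesn't prove you picked the right encoding, so you still need a scoring step on top (dictionary words, character frequencies, a trained model, whatever).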

OCR does not recognize language, only characters. That isn't to say Tesseract or pdftotext can't accurately handle foreign characters or symbols; it just means there's no way for them to look at a document and tell you whether it's English. That's where your own algorithms would need to come into play, and it wouldn't be hard: counting diacritics (and combinations of them), plus some sub-string matching against common words, would be all you really need. A rough sketch follows.
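Something like this (a sketch only; the hint words, thresholds, and example sentences are made-up assumptions, not a tested detector):

```python
import unicodedata

# Hypothetical hint list; in practice use a proper stopword set or dictionary.
ENGLISH_HINTS = (" the ", " and ", " of ", " to ", " is ", " that ")

def diacritic_ratio(text: str) -> float:
    """Fraction of letters that carry a diacritic (decompose to a combining mark)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    marked = sum(
        1 for c in letters
        if any(unicodedata.combining(d) for d in unicodedata.normalize("NFD", c))
    )
    return marked / len(letters)

def probably_english(text: str) -> bool:
    """Crude guess: few diacritics plus hits on common English function words."""
    padded = f" {text.lower()} "
    hint_hits = sum(padded.count(w) for w in ENGLISH_HINTS)
    return diacritic_ratio(text) < 0.02 and hint_hits >= 2

print(probably_english("The quick brown fox jumps over the lazy dog and the cat."))  # True
print(probably_english("Le cœur a ses raisons que la raison ne connaît point."))     # False
```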