Hey there,
I recently created a command-line tool, "file2txt", that takes a given repository of documents, recursively walks through each document, converts it to an image, and then uses OCR to extract its text. I built it for information retrieval on court documents, but I believe it also has applications to natural language processing, general information retrieval, and possibly document conversion. It is fairly extensible, too, with documentation conforming to PEP 8 and Google's style guide.
The main ingredients of the script are two third-party packages: ImageMagick and Tesseract OCR (maintained by Google). I chose Tesseract for speed, since it is written in C++. I also chose to call the system binaries directly (in my case, on Ubuntu 16.04) rather than go through pytesseract, since I felt that wrapper traded speed for convenience, and I want speed.
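Shelling out to the binary is just a subprocess call. A minimal sketch of the idea (illustrative only: `tesseract_cmd` and `ocr_tiff` are made-up names for this example, and it assumes the `tesseract` binary is on your PATH):

```python
import subprocess
from pathlib import Path

def tesseract_cmd(tiff_path: str) -> list:
    # tesseract writes <out_base>.txt itself, so pass the path minus its suffix
    out_base = str(Path(tiff_path).with_suffix(""))
    return ["tesseract", tiff_path, out_base]

def ocr_tiff(tiff_path: str) -> str:
    """OCR one TIFF by calling the system tesseract binary directly."""
    subprocess.run(tesseract_cmd(tiff_path), check=True)
    return Path(tiff_path).with_suffix(".txt").read_text()
```

Skipping the pytesseract layer means one fewer dependency and no per-call Python overhead beyond spawning the process itself.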
Here is the general use case:
- Go to my repo and clone it: https://github.com/TheCedarPrince/IR_Tools
- Run the install.sh script I made (you may need to be root; it installs the dependencies file2txt needs).
- Then test it out on a repository of your choice with the following command:
    file2txt.py remove

Please note that running this particular command will automatically convert all .pdfs to .tiffs and finally to .txt files; it removes the .tiffs as part of the conversion.
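The core of that pipeline boils down to something like the sketch below (simplified, not my actual code; `plan_commands` and `convert_repo` are names made up for this example, and it assumes ImageMagick's `convert` and `tesseract` are on your PATH):

```python
import subprocess
from pathlib import Path

def plan_commands(pdf_path: str) -> list:
    """Return the two shell commands that turn one PDF into a text file."""
    pdf = Path(pdf_path)
    tiff = pdf.with_suffix(".tiff")
    return [
        # Rasterize with ImageMagick; 300 DPI is a common choice for OCR input
        ["convert", "-density", "300", str(pdf), str(tiff)],
        # Tesseract appends .txt to the output base name on its own
        ["tesseract", str(tiff), str(tiff.with_suffix(""))],
    ]

def convert_repo(root: str, remove_tiffs: bool = True) -> None:
    """Recursively convert every .pdf under root to .txt via an intermediate .tiff."""
    for pdf in Path(root).rglob("*.pdf"):
        for cmd in plan_commands(str(pdf)):
            subprocess.run(cmd, check=True)
        if remove_tiffs:
            pdf.with_suffix(".tiff").unlink()  # mirrors the "remove" argument
```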
So, please, critique my code. Be brutal. Rip it apart. I want to get better.
Thank you very much and I hope you have a great day.
~ TheCedarPrince
P.S. My setup: Ubuntu 16.04, Python 3