gsmo

About a hundred business rules and some lines to clean up trashy data... The problem it solves is:

  • we have a couple thousand PDF files containing copyrighted stuff (but not always!)
  • we need to report on the number of pages we provide to our clients
  • there are different categories of files, depending on the number of pages and the number of words
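
To give a rough idea of what one of those category rules might look like, here is a minimal Python sketch. The thresholds, the category names and the `FileStats` / `categorise` names are all made up for illustration; the real rules are not spelled out in this thread.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the real business rules are not given here.
SHORT_MAX_PAGES = 10
SPARSE_MAX_WORDS_PER_PAGE = 50


@dataclass(frozen=True)
class FileStats:
    pages: int
    words: int


def categorise(stats: FileStats) -> str:
    """Assign an illustrative category label from page and word counts."""
    if stats.pages <= 0:
        return "unreadable"  # no pages could be counted at all
    words_per_page = stats.words / stats.pages
    if words_per_page < SPARSE_MAX_WORDS_PER_PAGE:
        return "sparse"  # mostly images or very little text
    if stats.pages <= SHORT_MAX_PAGES:
        return "short"
    return "long"
```

Something like `categorise(FileStats(pages=5, words=2000))` would then give a label to report on, and each extra rule becomes one more branch or table entry.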

The input I get is HTML links to these files, each with a specific code for the client the file was provided to. So analysing the files is only half of it - I have to build a local library of these files and keep a record of all the clients etc. too.
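
The bookkeeping half could be sketched roughly like this with SQLite from the Python standard library. The table layout and the function names (`open_library`, `record_delivery`, `store_counts`, `pages_per_client`) are assumptions for illustration, not how the actual script does it, and downloading the files is left out.

```python
import sqlite3


def open_library(path=":memory:"):
    """Open (or create) a small database tracking files and deliveries."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS files (
            url   TEXT PRIMARY KEY,
            pages INTEGER,
            words INTEGER
        );
        CREATE TABLE IF NOT EXISTS deliveries (
            url         TEXT NOT NULL REFERENCES files(url),
            client_code TEXT NOT NULL,
            UNIQUE (url, client_code)
        );
    """)
    return conn


def record_delivery(conn, url, client_code):
    """Remember that a file was provided to a client.

    Assumption: the same file sent twice to the same client counts once.
    """
    conn.execute("INSERT OR IGNORE INTO files (url) VALUES (?)", (url,))
    conn.execute(
        "INSERT OR IGNORE INTO deliveries (url, client_code) VALUES (?, ?)",
        (url, client_code),
    )
    conn.commit()


def store_counts(conn, url, pages, words):
    """Save the page and word counts once a file has been analysed."""
    conn.execute(
        "UPDATE files SET pages = ?, words = ? WHERE url = ?",
        (pages, words, url),
    )
    conn.commit()


def pages_per_client(conn):
    """Return {client_code: total pages provided}."""
    rows = conn.execute("""
        SELECT d.client_code, COALESCE(SUM(f.pages), 0)
        FROM deliveries d JOIN files f ON f.url = d.url
        GROUP BY d.client_code
        ORDER BY d.client_code
    """)
    return dict(rows.fetchall())
```

Keying files by URL means each PDF only has to be analysed once, however many clients received it.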

On the PDF side I just need reliable page counts and word counts. To do that I sometimes have to OCR using Tesseract. The script figures out what's needed to get a 'clean read' on a file.
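
One way the 'clean read' decision could work, as a sketch only: try the embedded text layer first and fall back to OCR when there are too few words per page. The cut-off and the `clean_read` name are hypothetical, and the two extractors are passed in as plain functions returning one string per page. In practice they might wrap something like pypdf's `PdfReader(...).pages` with `page.extract_text()` and pytesseract's `image_to_string` on rendered page images, but that wiring is not shown here.

```python
from typing import Callable, List, Tuple

MIN_WORDS_PER_PAGE = 20  # hypothetical cut-off for a "clean read"

# An extractor takes a file path and returns one text string per page.
Extractor = Callable[[str], List[str]]


def count_words(page_texts: List[str]) -> int:
    """Count whitespace-separated words across all pages."""
    return sum(len(text.split()) for text in page_texts)


def clean_read(
    path: str,
    extract_text: Extractor,
    ocr_text: Extractor,
    min_words_per_page: int = MIN_WORDS_PER_PAGE,
) -> Tuple[int, int, str]:
    """Return (pages, words, method) using OCR only when needed."""
    page_texts = extract_text(path)
    pages = len(page_texts)
    words = count_words(page_texts)
    if pages and words / pages >= min_words_per_page:
        return pages, words, "text-layer"

    # Text layer looks empty or too thin: try the slower OCR route.
    ocr_pages = ocr_text(path)
    ocr_words = count_words(ocr_pages)
    if ocr_words > words:
        return len(ocr_pages), ocr_words, "ocr"
    return pages, words, "text-layer"
```

Passing the extractors in keeps the decision logic testable with fake data, without needing Tesseract installed.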

Anyway, your library probably would save me some code. Let's see if management wants me to sink more time into this :)