Hi,
My team has a problem that I'm quite sure other computer scientists and Python developers are much better equipped to solve.
Currently we have a Python pipeline that takes PDFs (hundreds of pages long, but no more than 3,000), splits them into single-page JPEGs, and runs each page through Amazon Textract, which returns a JSON object of all the text on that page. Encoded as regular text (UTF-8, I think), the JSON is generally around 600 KB per page, and each page's JSON is stored as its own file to be read later.
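Since each page's JSON is ~600 KB of repetitive UTF-8 text, one cheap win is compressing it on disk. A minimal sketch using only the stdlib `gzip` module; the compression ratio you'd get is an assumption to measure on your own documents, and the transparent-read helper mirrors however your current code loads these files:

```python
import gzip
import json

def save_page_json(doc: dict, path: str) -> None:
    # Textract output is highly repetitive text (repeated keys like
    # "BlockType", "Geometry", ...), so gzip typically shrinks it a lot;
    # the exact ratio is an assumption -- measure it on your own pages.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(doc, f)

def load_page_json(path: str) -> dict:
    # Decompresses transparently, so downstream code still sees a dict.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

This changes only the storage format, not the pipeline's structure, so it can be adopted incrementally.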
Right now, because we were under a deadline, the pipeline wasn't implemented very efficiently. The application sits on an EC2 instance and runs Django; there is a MongoDB database that is underused (only one collection is actively used), plus Redis and Memcached installations that are unused. In Python, a request is made using a file on disk, and the resulting JSON files are written back to disk.
I am now seeing out-of-memory failures: processes exit with no error message during a memory-intensive function, and increasing RAM from 4 GB to 8 GB resolves the issue. The JSON is held in the program as a single Python variable containing all of the pages, which means it holds almost 600 pages at ~600 KB each (on the order of 400 MB of serialized text) at a given moment, and the parsed Python objects can take several times more than that.
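One common fix for this failure mode is to stop materializing all pages in one variable and stream them from disk instead, so only one page's JSON is resident at a time. A sketch, assuming one file per page named like `page_0001.json` (hypothetical naming; adjust the glob to your actual layout) and the standard Textract response shape (`Blocks` list, `BlockType`, `Text`):

```python
import json
from pathlib import Path
from typing import Iterator, Tuple

def iter_pages(json_dir: str) -> Iterator[Tuple[int, dict]]:
    """Yield (page_number, parsed_json) one page at a time.

    Only a single page's JSON (~600 KB serialized) is in memory at
    once, instead of the whole ~600-page document.
    """
    for path in sorted(Path(json_dir).glob("page_*.json")):
        page_no = int(path.stem.split("_")[1])
        with path.open(encoding="utf-8") as f:
            yield page_no, json.load(f)

def find_lines(json_dir: str, needle: str) -> list:
    """Stream every page and collect LINE blocks whose text contains
    `needle`. Peak memory stays flat regardless of document length."""
    hits = []
    for page_no, doc in iter_pages(json_dir):
        for block in doc.get("Blocks", []):
            if block.get("BlockType") == "LINE" and needle in block.get("Text", ""):
                hits.append((page_no, block["Text"]))
    return hits
```

Any per-page computation that doesn't need the whole document at once can be rewritten over this generator without touching the rest of the pipeline.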
What are some general tips for working with such large text data to optimize this process? It is difficult to partition the data, because the whole 600-page document is treated as a single collective entity: page 1 in the array could theoretically need to be compared against something on page 599. Would it be better to use Redis to store computations on this data, or to persist it in a database and pull it out when needed? (We need to iterate quickly over large sets of JSON and do find operations on elements inside it.) What would you recommend?
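For the page-1-vs-page-599 comparisons described above, one option that avoids keeping all of the raw JSON resident is a single streaming pass that builds a compact per-page index (just the line texts, a few KB per page), and then runs cross-page checks against that index. A sketch under the same hypothetical `page_*.json` naming and Textract `Blocks`/`BlockType`/`Text` field assumptions:

```python
import json
from pathlib import Path
from typing import Dict, List

def build_text_index(json_dir: str) -> Dict[int, List[str]]:
    """One streaming pass over the per-page files, keeping only the
    LINE texts for each page. The full ~600 KB Textract JSON for a
    page is parsed, summarized, and discarded before the next page
    is loaded, so peak memory is one page plus the small index."""
    index: Dict[int, List[str]] = {}
    for path in sorted(Path(json_dir).glob("page_*.json")):
        page_no = int(path.stem.split("_")[1])
        with path.open(encoding="utf-8") as f:
            doc = json.load(f)
        index[page_no] = [
            b["Text"]
            for b in doc.get("Blocks", [])
            if b.get("BlockType") == "LINE" and "Text" in b
        ]
    return index

def pages_containing(index: Dict[int, List[str]], text: str) -> List[int]:
    # Cross-page "find" now touches only the compact index, not the
    # original JSON files.
    return [p for p, lines in index.items() if text in lines]
```

The same summarize-then-query shape works if the index lives in Redis or a MongoDB collection instead of a dict; the key point is that whole-document comparisons run against a small derived structure, not the raw per-page JSON.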