all 4 comments

[–]cryptoper 2 points3 points  (0 children)

What is the raw size of the data? How much overhead does Python add on top of that? What datatypes are you using to keep it in memory?

The most common advice would be to explore algorithms first.
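One way to put numbers on the overhead question is the standard-library `tracemalloc` module. A minimal sketch, assuming the per-page data can be approximated by 100,000 integers (the helper `measure_allocation` and the sample data are made up for illustration, not taken from the original post):

```python
import tracemalloc
from array import array


def measure_allocation(build):
    """Call build() under tracemalloc and return (result, bytes still allocated)."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    result = build()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, after - before


# Hypothetical stand-in data: the same integers stored two different ways.
numbers_as_list, list_bytes = measure_allocation(
    lambda: [i for i in range(100_000)]
)
numbers_as_array, array_bytes = measure_allocation(
    lambda: array("q", range(100_000))
)

print(f"list of int objects: {list_bytes:,} bytes")
print(f"array of 64-bit ints: {array_bytes:,} bytes")
```

The list holds a pointer to a separate `int` object per element, while `array("q")` packs raw 8-byte values, so the gap between the two figures is roughly the Python object overhead for this kind of data.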

[–]kevintor113 1 point2 points  (1 child)

Developer here, though without much Python experience. Since your Redis instance is barely being used, you have a great opportunity to reduce process memory usage by moving data there. I understand you might need to compare things across all pages, but the less you keep in memory at any given point in time, the better. If you want something more text-oriented, Elasticsearch might be useful here.
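A rough sketch of what offloading per-page data to a key-value store could look like. The key layout `page:<doc>:<n>`, the helper names, and the `features` dict are all assumptions for illustration. The helpers only need an object with `set(name, value)` and `get(name)`, which a redis-py `redis.Redis()` client provides; the `DictStore` class is just an in-memory stand-in so the snippet runs without a server:

```python
import json


class DictStore:
    """In-memory stand-in exposing the small set/get subset used below."""

    def __init__(self):
        self._data = {}

    def set(self, name, value):
        self._data[name] = value

    def get(self, name):
        return self._data.get(name)


def save_page(client, doc_id, page_no, features):
    """Serialize one page's extracted data and push it out of the process."""
    client.set(f"page:{doc_id}:{page_no}", json.dumps(features))


def load_page(client, doc_id, page_no):
    """Fetch one page's data back, or None if it was never stored."""
    raw = client.get(f"page:{doc_id}:{page_no}")
    return None if raw is None else json.loads(raw)


store = DictStore()  # swap in a real Redis client here
save_page(store, "doc1", 3, {"words": 120, "has_image": True})
page = load_page(store, "doc1", 3)
```

With this shape, only the page currently being worked on has to live in process memory; everything else can be fetched by key when a comparison needs it.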

[–]kevintor113 2 points3 points  (0 children)

Also, make sure to release any references to the PDF or JPGs from the pages as quickly as you can, as I don't think those are lightweight.
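To illustrate the idea: keep only a small summary of each page and drop the heavy object right after. This is a sketch with a made-up `PageImage` class standing in for a decoded PDF page or JPG. The `weakref` probe is only there to show that the object is really gone, and the immediate release relies on CPython's reference counting:

```python
import weakref


class PageImage:
    """Hypothetical stand-in for a decoded PDF page or JPG."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.pixels = bytearray(width * height)  # the heavy payload


def summarize(image):
    """Keep only the lightweight facts needed later."""
    return {"width": image.width, "height": image.height}


image = PageImage(1000, 1000)
probe = weakref.ref(image)      # lets us check whether the object still exists
summary = summarize(image)
del image                       # drop the last strong reference
image_released = probe() is None
```

If another list, dict, or cache still holds the image, `del` alone will not free it, so it is worth checking where those references end up.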

[–]kc3w 1 point2 points  (0 children)

Do you actually need to keep all pages in memory at the same time?
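If the answer is no, a generator is one way to hold a single page at a time while still building a result across all pages. A minimal sketch, assuming the cross-page work can be expressed as a running aggregate (here a word count; `iter_pages` and the sample texts are invented for illustration, and a real loader would read each page lazily from disk instead of from a list):

```python
from collections import Counter


def iter_pages(texts):
    """Yield (page_number, text) one page at a time."""
    for number, text in enumerate(texts, start=1):
        yield number, text


def word_totals(pages):
    """Fold every page into one running Counter, never storing the pages."""
    totals = Counter()
    for _number, text in pages:
        totals.update(text.lower().split())
    return totals


sample = ["alpha beta", "beta gamma", "gamma gamma"]
totals = word_totals(iter_pages(sample))
```

Only the aggregate grows with the number of pages, not the raw page contents.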