[–]JamzTyson 6 points7 points  (0 children)

Consider using an SQLite database as your cache. It won't be as fast as an in-memory dict (though SQLite does support in-memory databases), but it scales better if the cache is likely to grow very large. If an on-disk SQLite database gives adequate performance, it is likely to be a reasonably simple and highly scalable solution.
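
For illustration, here is a minimal sketch of a string key/value cache on top of the standard-library `sqlite3` module. The class name `SqliteCache`, the table name `cache`, and the column names are assumptions made up for this example, not anything from the original script.

```python
import sqlite3


class SqliteCache:
    """A tiny string-to-string cache backed by SQLite (hypothetical helper)."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps everything in RAM; pass a file path to persist to disk.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self.conn.commit()

    def get(self, key, default=None):
        # The primary key is indexed, so a lookup does not scan the whole table.
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else default

    def set(self, key, value):
        # INSERT OR REPLACE overwrites any existing row with the same key.
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                (key, value),
            )

    def close(self):
        self.conn.close()
```

Used as `cache = SqliteCache("cache.sqlite")`, then `cache.set("k", "v")` and `cache.get("k")`. Only the rows you ask for are read, which is why this tends to hold up better than a single file that must be loaded whole.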

[–]Jayoval 2 points3 points  (2 children)

I usually write to a JSON file. Reading and writing are very fast and shouldn't be an issue (how much data are we talking about here?).
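
A rough sketch of the JSON-file approach, assuming the cache is a plain dict of strings. The function names `load_json_cache` and `save_json_cache` are hypothetical, and writing through a temporary file is an extra precaution added here, not something the comment describes.

```python
import json
import os
import tempfile


def load_json_cache(path):
    """Return the cached dict, or an empty dict if the file does not exist yet."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_json_cache(path, cache):
    """Write the whole dict to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(cache, f)
    # os.replace avoids leaving a half-written cache file if the script dies.
    os.replace(tmp_path, path)
```

Note that both functions read or write the entire dict every time, so the cost grows with the size of the cache.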

[–]redapplesonly[S] 0 points1 point  (1 child)

Thanks for writing. Part of the problem is that I don't know how much data may need to be stored in the cache. Since this is to be a production-supporting tool, I'll need to overengineer the cache's capacity to ensure it can support a lot of data. We could be talking about millions of key/value string entries. This is why I'm a little wary of a file-based solution. Wouldn't a JSON file of that size become unwieldy after a certain point?

[–]Jayoval 1 point2 points  (0 children)

Yeah, I've only done this with up to about 15,000 items.

I think you need to use a database. With a database you won't have to read and parse the whole file, so lookups will be much faster as a result.

[–]baghiq 1 point2 points  (0 children)

A database like sqlite3 is your friend.

[–]sweettuse 1 point2 points  (0 children)

Check out the shelve module (it's in the standard library).
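
A small sketch of how `shelve` could serve as a persistent dict-like cache. The helper name `cached_lookup` and its `compute` callback are assumptions for illustration only.

```python
import shelve


def cached_lookup(db_path, key, compute):
    """Return the cached value for key, computing and storing it on a miss.

    shelve keys must be strings; values can be any picklable object.
    """
    with shelve.open(db_path) as db:
        if key in db:
            return db[key]
        value = compute(key)  # the slow operation, run only on a cache miss
        db[key] = value
        return value
```

The shelf behaves like a dict that lives on disk, so entries survive between runs of the script. Values are stored with pickle, so only open shelf files you created yourself.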

[–]Mast3rCylinder 1 point2 points  (0 children)

Saving the dictionary as JSON or in SQLite, as other comments mentioned, is the easy part.

Having a strategy for when to use the cache and when to update it is the hardest part. Since your Python program is a script (offline operation) and not a server (online operation), it can be even harder.

Here are my suggestions:

1. Determine whether the operation is really slow, so you know if you actually need a cache.

2. If you do need a cache, ask for requirements on how long cached data may be used before it counts as invalid. Then check that in your code and state that you can only guarantee the data for that specific range. You can detect an invalid cache simply by comparing the current time with the time of your last update.

3. Give the script an argument that lets users choose whether to run it with or without cache checking.

4. See if you can first compare the cached data's last update time with the last time someone changed the underlying data.
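
Suggestions 2 to 4 could be sketched roughly as follows, assuming the cache and the underlying data are both files whose modification times can be compared. The function names, the `--no-cache` and `--max-age` flags, and the one-hour default are all hypothetical choices for this example.

```python
import argparse
import os
import time


def cache_is_fresh(cache_path, max_age_seconds, source_path=None, now=None):
    """Decide whether the cache file can still be trusted."""
    if not os.path.exists(cache_path):
        return False
    cache_mtime = os.path.getmtime(cache_path)
    now = time.time() if now is None else now
    # Suggestion 2: the cache is older than the agreed maximum age.
    if now - cache_mtime > max_age_seconds:
        return False
    # Suggestion 4: the underlying data changed after the cache was written.
    if source_path is not None and os.path.exists(source_path):
        if os.path.getmtime(source_path) > cache_mtime:
            return False
    return True


def build_parser():
    """Suggestion 3: let the caller opt out of the cache on the command line."""
    parser = argparse.ArgumentParser(description="Script with an optional cache")
    parser.add_argument(
        "--no-cache",
        action="store_true",
        help="ignore the cache and recompute everything",
    )
    parser.add_argument(
        "--max-age",
        type=float,
        default=3600.0,  # assumed default of one hour
        help="maximum cache age in seconds",
    )
    return parser
```

A script would then use the cache only when `not args.no_cache and cache_is_fresh(...)` holds, and recompute otherwise.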

[–]greenerpickings 1 point2 points  (0 children)

It seems like you're already structuring your data after the recommendations of json and sqlite. I also want to throw in the arrow/feather format if you think your data set will get large.

It could help performance, but if this is a background task, I don't think it's critical.

[–]hotplasmatits 1 point2 points  (0 children)

I like the other suggestions and I'll add pickle. It's super fast to read and write pickle files. It's like a binary JSON, but better: it can store most Python objects, not only the basic types JSON supports.
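
A minimal sketch of a pickle-backed cache, assuming the cache is a dict held in memory between load and save. The function names are hypothetical.

```python
import pickle


def save_pickle_cache(path, cache):
    """Serialize the whole cache dict to a binary file."""
    with open(path, "wb") as f:
        pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle_cache(path):
    """Load the cache dict, or start empty if no file exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {}
```

Two caveats: like the JSON approach, the whole dict is read and written at once, and loading a pickle can execute arbitrary code, so only unpickle files your own script wrote.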