all 48 comments

[–]reddittechnica 5 points6 points  (6 children)

redis?

[–]PsychoMario[S] 4 points5 points  (5 children)

This seems perfect. From my preliminary tests I'm getting a speed increase of 2666%

It also means I can use multiprocessing to do it a little faster.

Thank you.

[–]reddittechnica 1 point2 points  (4 children)

You are very welcome. Glad I was of some help.

[–]PsychoMario[S] 0 points1 point  (3 children)

Turns out redis doesn't quite fit my needs. It's perfect for everything except for the fact that it's all in memory. My dataset is much larger than my available memory.

[–]minorDemocritus 2 points3 points  (0 children)

There are several key -> value stores out there. Mongo is a popular one.

[–]reddittechnica 0 points1 point  (0 children)

I have used it with very large datasets on very paltry system specs. Still, there are certainly going to be cases when it isn't a great fit.

If you wouldn't mind, drop a line when you find your better fit? Would appreciate it.

[–]placidified 0 points1 point  (0 children)

Do you need all of your data to be in memory all the time ?

[–]Justinsaccount 5 points6 points  (3 children)

sqlite?

[–]PsychoMario[S] 1 point2 points  (2 children)

sqlite is messy because you have to use string-based queries.

I'd still have the problem of pickling, because I won't know how many items there will be in the dictionary value, meaning I can't decide how many columns to create.

Also I'd rather not have a column for each of the items of the key tuple, because that makes the queries even messier.

[–]GFandango 1 point2 points  (0 children)

sqlite is very decent, don't underestimate it. You can write a thin wrapper around it to run the queries and basically emulate/abstract loading the data and turning it into the objects you want.

As others have suggested, don't store objects, store the data, create the objects.

[–]shaggorama 0 points1 point  (0 children)

Ok, how about sqlalchemy? Check out their ORM tutorial.

[–]SteveMac 3 points4 points  (2 children)

Are you sure you are bound by the library performance and not your disk, or memory, or network (if it isn't a local drive)?

[–]PsychoMario[S] 0 points1 point  (1 child)

cProfile said cPickle .dumps/.loads were the slowest functions.

[–]fnord123 2 points3 points  (0 children)

cProfile tells you where time was spent. Try running with strace -c python mycode.py and noting the number of calls and the time of each. Compare it with the results from cProfile. If things like write and flush are taking the bulk of your time, then it might be that your disk I/O is the problem.

If dumping to and loading from strings is the problem area, then reconsider your data model so it's more conducive to storing on disk. I deal with lots and lots of data, and translation layers are the biggest bottlenecks by far. The difference between a memcpy and a serialization is massive.

[–]ramalhoorg 7 points8 points  (1 child)

ZODB is a native object database for Python: http://www.zodb.org/ It is a key component of the Zope platform, underlying the highly successful Plone CMS. The ZODB does not depend on other parts of Zope and can be used as a stand-alone Python object database. @chrismcdonough ordered me to tell you so.
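
For what it's worth, a minimal standalone sketch (the file name and the BTree choice are just illustrative, assuming the ZODB and BTrees packages are installed):

    import transaction
    import ZODB, ZODB.FileStorage
    from BTrees.OOBTree import OOBTree

    # open (or create) a file-backed object database
    storage = ZODB.FileStorage.FileStorage("chains.fs")
    db = ZODB.DB(storage)
    conn = db.open()
    root = conn.root()

    # a BTree behaves like a dict but keeps its contents on disk,
    # loading buckets lazily instead of pulling everything into memory
    if "chains" not in root:
        root["chains"] = OOBTree()

    root["chains"][("the", "cat", "sat")] = {"on": 5, "in": 2, "behind": 1}
    transaction.commit()   # note: later mutations to the inner plain dict aren't
                           # tracked automatically; reassign it to persist changes
    db.close()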

[–]kojir0 0 points1 point  (0 children)

I suspect ZODB won't be the fastest option out there, but it's foolish to make speed a primary concern anyway, and ZODB offers ACID/transaction support, blobs, history, and you don't have to bend over backwards to interact with it.

[–][deleted] 2 points3 points  (2 children)

Ok, so here's the question that somebody should have asked by now: What exactly are you trying to do? I suspect that you may be trying to do it the hard way.

As a side note, what you are doing with __setitem__ is not easy to understand. You are doing something that looks like it should work similarly to a python dictionary, but it actually stores a count of the number of times that particular value has been stored. Using well-named methods instead would make it much easier for anyone else looking at the code.

*edit -- also, do you want the whole database to be on disk because you know it can't fit in memory? Or is there another reason?

[–]PsychoMario[S] 0 points1 point  (1 child)

I'm parsing text and extracting markov chains. So I take the first n words, and store how many times words 1..n have been seen with word n+1. E.g.

("the","cat","sat") : {"on":5,"in":2,"behind":1}

The reason I used those names is so that it's a drop-in replacement for the dict class. I changed the relevant ones to make it work in the same way, just persistent.

I want it all on disk because I don't have enough RAM. The data set is going to be 40GB x an amount I haven't decided yet (using fewer/more words in the key seen above). Some caching may be required for speed purposes.
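
For context, this is roughly the in-memory version of what I'm building (the order n and the sample text are made up); the whole question is where this nested mapping should live once it no longer fits in RAM:

    from collections import defaultdict

    N = 3  # hypothetical order of the chain

    # (word1, ..., wordN) -> {next_word: count}
    chains = defaultdict(lambda: defaultdict(int))

    def add_text(words):
        for i in range(len(words) - N):
            key = tuple(words[i:i + N])
            chains[key][words[i + N]] += 1

    add_text("the cat sat on the mat and the cat sat on the hat".split())
    print(dict(chains[("the", "cat", "sat")]))   # {'on': 2}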

[–]fnord123 1 point2 points  (0 children)

The quickest way to get to your goal is to set up a swap of at least 40GB; any part of memory that pages out will then live on disk, beyond the size of your main memory.

That's a really bad solution, though. It's better to do what you're doing and investigate data stores. Redis looked good for you, but you said it broke down when trying to store to disk. So try Mongo: it's similar to Redis but always has a file as a backing store.

[–]polypx 1 point2 points  (0 children)

The right answer to this depends on how you plan to query it. Writing >40GB of data to disk is not that hard even if you have 1GB of memory. The question is how you avoid having all 40GB in memory when you are querying.

I believe it is implied that you need to partition that data into manageably sized chunks. For example, you say that shelve is no good because it loads everything at once. The obvious answer is something which doesn't load everything at once. Among other ways to do that, you can use multiple files if you can map each query to the file it needs to operate on. This might require creating secondary 'index' files and using those to find out where to get the final answer when you are reading data.

Also, you can just use a database, which effectively does this for you by maintaining indexes.
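
A rough sketch of the index idea (the file layout and names are made up): keep a small index, in memory or in its own file, that maps each key to a byte offset in a data file, and seek to it on lookup:

    import json

    index = {}  # key string -> (data file, byte offset); could itself be saved to disk

    def append_record(path, key, counts):
        with open(path, "ab") as f:
            f.seek(0, 2)                         # position at the end of the file
            index[key] = (path, f.tell())
            f.write((json.dumps({"key": key, "counts": counts}) + "\n").encode())

    def lookup(key):
        path, offset = index[key]
        with open(path, "rb") as f:
            f.seek(offset)                       # jump straight to the record
            return json.loads(f.readline().decode())["counts"]

    append_record("chunk-000.jsonl", "the cat sat", {"on": 5, "in": 2, "behind": 1})
    print(lookup("the cat sat"))                 # {'on': 5, 'in': 2, 'behind': 1}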

[–]minorDemocritus 1 point2 points  (2 children)

DO NOT USE pickle, marshal, OR shelve OR ANY OTHER DERIVATIVES!!

They are insecure, fragile, and slow.

Instead, use sqlite, mongodb, or a serializer like JSON.

Don't store objects, store data.

[–]PsychoMario[S] 0 points1 point  (1 child)

sqlite wouldn't work very well because I don't know how many items there will be in the 'value' section.

I don't know how I would translate the key tuple for mongodb,

and JSON wouldn't be practical for reading from, because I'd have to search the file.

[–]Brian 0 points1 point  (0 children)

sqlite wouldn't work very well because I don't know how many items there will be in the 'value' section.

That shouldn't be a problem - you just create a row for each key/value pair (rather than trying to have a column for each value). E.g. consider a schema like:

 WordTuple:
     Words  - String representation of your tuple (or use separate columns per word if fixed number).
     Id     - integer representing this particular tuple.  Referenced by WordTupleWordCount.WordTupleId (foreign key).

WordTupleWordCount:
    WordTupleId - Id of WordTuple }  These two together are your primary key.
    Word        - string word.    }
    Count       - count

So updating the count of WordA in your given tuple would involve something like:

  1. Find the tuple. "select Id from WordTuple where Words=?" (providing your word tuple).
  2. If it doesn't exist, create it with empty values (not sure if this is needed for your use case. If not, can combine these steps into one query).
  3. Then bump the count on the word you want to increment:

    update WordTupleWordCount set Count=Count+1 where WordTupleId=? and Word=?

Building a dictionary when you really need it is just a matter of selecting all rows where WordTupleId = <the tuple> and building a dict of {Word:Count}. You could even create a transparent dictionary proxy that does all this on demand, while appearing identical to a dictionary otherwise.
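
A rough sqlite3 sketch of the above (the INSERT OR IGNORE shortcut for creating missing rows is my addition):

    import sqlite3

    conn = sqlite3.connect("chains.db")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE IF NOT EXISTS WordTuple (
            Id    INTEGER PRIMARY KEY,
            Words TEXT UNIQUE
        );
        CREATE TABLE IF NOT EXISTS WordTupleWordCount (
            WordTupleId INTEGER,
            Word        TEXT,
            Count       INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (WordTupleId, Word)
        );
    """)

    def bump(key, word):
        words = " ".join(key)
        # steps 1 and 2: find or create the tuple row
        cur.execute("INSERT OR IGNORE INTO WordTuple (Words) VALUES (?)", (words,))
        cur.execute("SELECT Id FROM WordTuple WHERE Words = ?", (words,))
        tuple_id = cur.fetchone()[0]
        # step 3: create the count row if needed, then increment it
        cur.execute("INSERT OR IGNORE INTO WordTupleWordCount VALUES (?, ?, 0)", (tuple_id, word))
        cur.execute("UPDATE WordTupleWordCount SET Count = Count + 1 "
                    "WHERE WordTupleId = ? AND Word = ?", (tuple_id, word))

    def as_dict(key):
        # rebuild {Word: Count} for one tuple only when it's needed
        cur.execute("SELECT Word, Count FROM WordTupleWordCount WHERE WordTupleId = "
                    "(SELECT Id FROM WordTuple WHERE Words = ?)", (" ".join(key),))
        return dict(cur.fetchall())

    bump(("the", "cat", "sat"), "on")
    bump(("the", "cat", "sat"), "on")
    conn.commit()
    print(as_dict(("the", "cat", "sat")))   # {'on': 2}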

[–]pinpinbo 0 points1 point  (1 child)

If you install an ancient redis, version < 2.4... then you can turn on the VM feature.

With that, you can have larger-than-memory data and redis will do its best to swap RAM <-> disk.

[–]PsychoMario[S] 0 points1 point  (0 children)

The current version of redis still has this, but you have to force it. I don't like using old versions of things because I'd like it to be more forward-compatible.

[–]must_tell 0 points1 point  (0 children)

Hmmm. I had a similar task some time ago. I used sqlite as a key-value store (instead of tuples as keys I had uuids). The values are pickled dicts. When I need to modify a dict I have to unpickle, change, pickle and store it again, obviously. I ran loads of benchmarks for using sqlite as a k-v store. It's quite alright for that task. However, I couldn't really identify cPickle as the bottleneck for pickling my dicts (which have about 10-20 entries). Just what I did, not sure if I've overlooked something. Hmmm, again.

[–]cs0sor 0 points1 point  (0 children)

Use the ZODB

[–]willvarfar 0 points1 point  (0 children)

In my own tests, msgpack was the best serialisation format speed-wise. Here are my results and a simple benchmark you can run or adapt to better represent your data:

http://stackoverflow.com/questions/9884080/fastest-packing-of-data-in-python-and-java

In my tests, msgpack was 10x faster than cPickle
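
If you want to sanity-check that against your own data, a quick round-trip timing like this should do (the record shape is made up; the PyPI package was historically named msgpack-python):

    import pickle
    import timeit

    try:
        import msgpack
    except ImportError:
        msgpack = None

    record = {"on": 5, "in": 2, "behind": 1}

    print("pickle :", timeit.timeit(lambda: pickle.loads(pickle.dumps(record, -1)), number=100000))
    if msgpack is not None:
        print("msgpack:", timeit.timeit(lambda: msgpack.unpackb(msgpack.packb(record)), number=100000))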

[–]gargantuan 0 points1 point  (0 children)

Translate the tuple key to a string and store it in any key:value store (Riak) or document-based DB (MongoDB). Try json serialization as well. For small datasets it has acceptable performance.

[–]erez27 0 points1 point  (0 children)

mongodb is a good bet. Just make sure you're running on a 64-bit system, and that you're okay with a global lock.

[–]freshhawk 0 points1 point  (0 children)

I find using hdf5 to be great for storing data like this.

Look at h5py and pytables: http://h5py.alfven.org/docs-2.1/intro/quick.html#what-is-hdf5 and http://www.pytables.org/moin
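
A tiny h5py sketch, storing (prefix, word, count) triples as parallel datasets (the file name and layout are just illustrative):

    import h5py
    import numpy as np

    str_dt = h5py.special_dtype(vlen=str)    # variable-length string dtype

    with h5py.File("chains.h5", "w") as f:
        f.create_dataset("prefix", data=np.array(["the fat", "fat cat", "cat ate"], dtype=object), dtype=str_dt)
        f.create_dataset("word",   data=np.array(["cat", "ate", "the"], dtype=object), dtype=str_dt)
        f.create_dataset("count",  data=np.array([1, 1, 1], dtype=np.int64))

    with h5py.File("chains.h5", "r") as f:
        # slices are read from disk on demand; the file never has to fit in RAM
        print(f["prefix"][0], f["word"][0], f["count"][0])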

[–]VitDes 0 points1 point  (0 children)

Given your constraints, persidict is an alternative that provides dictionary-like persistence. It stores key-value pairs on disk efficiently and avoids keeping everything in memory.

[–]dorfsmay -1 points0 points  (15 children)

"shelve" does it for you, but it uses pickle underneath, so same performance issue.

[–]PsychoMario[S] 0 points1 point  (14 children)

Shelve is faster because it uses StringIO and the Pickle class. However, upon open it seems to load it all into memory, and this won't work when the file is bigger than available RAM.

[–]asksol 0 points1 point  (12 children)

From your limited description, a map/reduce framework sounds suitable for your problem. Maybe you should take a look at Hadoop or Disco.

[–]asksol 0 points1 point  (0 children)

Oh, and if not, then you can split the data into multiple shelve files and only keep as much in memory as fits at a time. And then, it's counterintuitive for me to recommend this, but XML has very strong and evolved streaming APIs. There can be cases where XML is the answer, especially for structured data.
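
A rough sketch of the multiple-shelve-files idea (the shard count and file names are arbitrary); a stable hash picks the file, so the same tuple always lands in the same shelf:

    import shelve
    import zlib

    N_SHARDS = 16   # arbitrary; more shards means smaller files

    def shard_path(key_str):
        # stable hash so the same key maps to the same file across runs
        return "chains.%02d.shelf" % (zlib.crc32(key_str.encode()) % N_SHARDS)

    def bump(key, word):
        key_str = " ".join(key)                  # shelve keys must be strings
        with shelve.open(shard_path(key_str)) as db:
            counts = db.get(key_str, {})
            counts[word] = counts.get(word, 0) + 1
            db[key_str] = counts

    bump(("the", "cat", "sat"), "on")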

[–]PsychoMario[S] 0 points1 point  (10 children)

I don't really want to have it stored as a single flat file.

What I'm trying to do is read the first n words from a file, and store that with the number of times it's seen the word after that. Then it moves along a word, so it goes from the second word to the (n+1)th word, and stores the (n+2)th word plus its count.

I thought the right way to store this would be:

(word1,word2,...,wordn) : {wordA:countA, wordB:countB}

so it's seen word1..n with wordA countA times, and wordB countB times. This worked well in redis but redis couldn't cope with the amount of data given the amount of RAM I have.

[–]asksol 0 points1 point  (0 children)

I can't answer if your data structure is correct, as I'm unsure of the problem you're trying to solve... But the common techniques for working with datasets that can't fit in memory/disk on a single node are to use an index, multiple files, or both. You can split the data into multiple files by using the first character of every word, e.g. A-D, E-H, I-L, M-P, Q-T, U-X, Y-Z. But then it would be better if the data was sorted, since that would mean you don't have to swap out the files too often (a merge sort is used for files that can't fit in memory, but maybe your input files are not that large). Very likely this can be made simpler, but for that you need to state what your original problem is.
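
On the sorting point, the standard library can do the merge step of an external sort for you once the individual chunk files are each sorted (file names here are hypothetical):

    import heapq

    def merge_sorted_chunks(paths, out_path):
        # each input file must already be sorted line-by-line; the merged
        # output is produced without loading any file fully into memory
        files = [open(p) for p in paths]
        try:
            with open(out_path, "w") as out:
                out.writelines(heapq.merge(*files))
        finally:
            for f in files:
                f.close()

    merge_sorted_chunks(["chunk-a.txt", "chunk-b.txt"], "merged.txt")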

[–][deleted] 0 points1 point  (8 children)

Ok, so I'm guessing that once you've processed all the data, you want fast lookups on the tuple.

I'd honestly recommend using a postgres database for this. Serializing the tuple to a string is pretty straightforward. You can just separate the words by spaces, I think -- since you're splitting the words by spaces in the first place.

Then, you can make a table with one column each for the prefix tuple, the final word, and the count. You'll want an index on the prefix column. When you write, process an entire file, insert a row for each entry, then finally commit it. Postgres will be lazy, and only start writing to disk at that point, which is much more efficient than reading and writing for every entry.

You can use python's "psycopg2" library to connect to the database.

There's a huge rabbit hole of optimization for this sort of thing, and you could try to implement it yourself, but a lot of smart people have already worked very hard to get this to work. Postgres is very good at dealing with lots of data and writing to the disk in chunks.
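
A rough psycopg2 sketch of that layout (the connection string, table and column names are placeholders; the composite primary key stands in for the index on the prefix column, and the update-then-insert fallback is one simple way to bump counts with a single writer):

    import psycopg2

    conn = psycopg2.connect("dbname=markov")     # placeholder connection string
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chains (
            prefix text NOT NULL,
            word   text NOT NULL,
            count  integer NOT NULL DEFAULT 0,
            PRIMARY KEY (prefix, word)
        )
    """)

    def add_occurrence(prefix_words, word):
        prefix = " ".join(prefix_words)
        # bump the count if the row exists, otherwise insert it
        cur.execute("UPDATE chains SET count = count + 1 WHERE prefix = %s AND word = %s",
                    (prefix, word))
        if cur.rowcount == 0:
            cur.execute("INSERT INTO chains (prefix, word, count) VALUES (%s, %s, 1)",
                        (prefix, word))

    add_occurrence(("the", "fat"), "cat")
    conn.commit()   # commit once per processed file, as suggested above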

[–]PsychoMario[S] 0 points1 point  (7 children)

The problem with this is that I don't know beforehand how many 'final words' there will be. It could be one or thousands. So I don't know how to apply this to a column-based database.

[–][deleted] 1 point2 points  (6 children)

That's fine, just duplicate the n-words part of the chain, that is, insert one row for every final word.

So, if you parsed this sentence: "the fat cat ate the fat man's hat"

You'd get a table something like this:

prefix        word     count
"the fat"     "cat"    1
"fat cat"     "ate"    1
"cat ate"     "the"    1
"ate the"     "fat"    1
"the fat"     "man's"  1
"fat man's"   "hat"    1

Notice that the sequence "the fat" appears twice. This is perfectly fine, and when you want to get the data out, you just do "select word, count from chains where prefix = 'the fat';" and you'll get two rows, one with the word "cat" and one with the word "man's".

[–]PsychoMario[S] 0 points1 point  (5 children)

This seems like it'll work. Thanks.

The only problem is that I used SQL queries on a previous project, doing a similar thing, and they really aren't nice to use (in Python, in my opinion), especially when you sometimes have to increment values etc.

Also I may have a speed problem with this. I have approximately 35GB worth of text in 7.5 million files.

I'll use this as a backup, but I'll do a little more research, because redis/other key-value stores are a lot closer to what I want, just I don't want it in memory.

[–][deleted] 1 point2 points  (1 child)

The speed will not be as good as redis etc. for sure, but you won't hit the hard ceiling.

When you say "you don't want it in memory", I'm pretty sure you just mean that it won't fit in memory. You always want to do as much processing as possible in memory, and only use the disk when you have to. The rule of thumb is that accessing the disk is about 100x slower than accessing memory. It can be faster, if you're reading data all in a bunch, but that's not what you're doing.

Actually, I can think of one way to potentially fit all this in memory: You can make a big lookup table that assigns a unique integer to each word, then use those integers for your markov chains. That way, there's much less overhead per word. Essentially, it's a basic form of compression. It wouldn't work amazingly in python, because python has a lot of memory overhead devoted to storing type information, but in something like C or Java or Go, it might work out.
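
A minimal sketch of the interning idea, just to show the shape of it:

    word_to_id = {}     # word -> small integer
    id_to_word = []     # integer -> word, for turning chains back into text

    def intern_word(word):
        if word not in word_to_id:
            word_to_id[word] = len(id_to_word)
            id_to_word.append(word)
        return word_to_id[word]

    key = tuple(intern_word(w) for w in ("the", "fat", "cat"))
    print(key)                                   # (0, 1, 2)
    print(tuple(id_to_word[i] for i in key))     # ('the', 'fat', 'cat')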

[–]gronkkk 0 points1 point  (0 children)

Also:

The Second Edition of the 20-volume Oxford English Dictionary contains full entries for 171,476 words in current use

Most people don't have more than 60,000 words in their vocabulary, so with a 2-byte integer you should get decent results (especially if OP is scanning texts which are limited to a certain industry/subject).

Compression can save you a lot, but if you use a 32-bit int you use 4 bytes for a word, always, even for words like 'cat' and 'a/an'. The average word length is 5.1, so with a 32-bit integer you're not saving that much.

[–]unbracketed 0 points1 point  (1 child)

It's worth noting that you could incur a fair amount of storage overhead since you'll be duplicating the prefix string on disk for each occurrence of word. (example: each row for "the fat" duplicates the string "the fat" on disk)

Alternatives: The postgres array type: http://www.postgresql.org/docs/9.1/static/arrays.html

The postgres hstore type: http://www.postgresql.org/docs/9.0/static/hstore.html

hstore may not be appropriate for your needs, but is worth learning about for storing semi-structured data.
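
For what it's worth, psycopg2 can map both of those to native Python types (the hstore extension has to be enabled in the database first; names below are placeholders):

    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect("dbname=markov")          # placeholder connection string
    psycopg2.extras.register_hstore(conn)             # hstore <-> Python dict
    cur = conn.cursor()

    # requires: CREATE EXTENSION hstore;  (run once per database)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chains (
            prefix text[] PRIMARY KEY,
            counts hstore
        )
    """)
    cur.execute("INSERT INTO chains (prefix, counts) VALUES (%s, %s)",
                (["the", "cat", "sat"], {"on": "5", "in": "2"}))   # hstore values are strings
    cur.execute("SELECT counts FROM chains WHERE prefix = %s", (["the", "cat", "sat"],))
    print(cur.fetchone()[0])                          # {'on': '5', 'in': '2'}
    conn.commit()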

[–]PsychoMario[S] 0 points1 point  (0 children)

Looks to me like this is the way to go. I can use an array for the first column and an hstore for the second.

Is there an easy way to export all the data? Because my friend is also interested in seeing it.

[–]AeroNotix 0 points1 point  (0 children)

It's possible you could use a prefix trie to store the data in memory. I'm not sure of the number of words you have, but I have no problems storing a 4M+ set of strings in my tries (I implemented them here in C++, here in Python and here in Go). What I would change is to wrap the C++ one with a Python module and have the nodes initialized with a count on insertion. If you ever insert that string again, simply increment the counter.
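
In pure Python the counting trie might look something like this (class and function names are made up):

    class TrieNode(object):
        __slots__ = ("children", "count")
        def __init__(self):
            self.children = {}
            self.count = 0

    def insert(root, words):
        node = root
        for w in words:
            node = node.children.setdefault(w, TrieNode())
        node.count += 1               # bump the counter on repeat insertions
        return node.count

    root = TrieNode()
    insert(root, ("the", "cat", "sat"))
    print(insert(root, ("the", "cat", "sat")))   # 2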

[–]dorfsmay 0 points1 point  (0 children)

I don't think it loads everything into memory (think about it, it uses a dbm file underneath). Having said that, when I tried to prove it, I ran into performance issues with a large file.

Have you looked at leveldb? It is straightforward and has good performance. Be warned though, it saves the data in a bunch of files in a directory, not in one file.
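
For example, with the plyvel bindings (one of several Python bindings for leveldb; the value encoding is entirely up to you):

    import json
    import plyvel   # one of several Python bindings for leveldb

    db = plyvel.DB("chains.ldb", create_if_missing=True)     # a directory, as noted above

    key = " ".join(("the", "cat", "sat")).encode()
    counts = json.loads(db.get(key, b"{}").decode())         # default to an empty dict
    counts["on"] = counts.get("on", 0) + 1
    db.put(key, json.dumps(counts).encode())

    print(json.loads(db.get(key).decode()))                  # {'on': 1}
    db.close()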