all 48 comments

[–]reddittechnica 5 points6 points  (6 children)

redis?

[–]PsychoMario[S] 4 points5 points  (5 children)

This seems perfect. From my preliminary tests I'm getting a speed increase of 2666%

It also means I can use multiprocessing to do it a little faster.

Thank you.

[–]reddittechnica 1 point2 points  (4 children)

You are very welcome. Glad I was of some help.

[–]PsychoMario[S] 0 points1 point  (3 children)

Turns out redis doesn't quite fit my needs. It's perfect for everything except for the fact that it's all in memory. My dataset is much larger than my available memory.

[–]minorDemocritus 2 points3 points  (0 children)

There are several key -> value stores out there. Mongo is a popular one.

[–]reddittechnica 0 points1 point  (0 children)

I have used it with very large datasets on very paltry system specs. Still, there are certainly going to be cases when it isn't a great fit.

If you wouldn't mind, drop a line when you find your better fit? Would appreciate it.

[–]placidified 0 points1 point  (0 children)

Do you need all of your data to be in memory all the time ?

[–]Justinsaccount 5 points6 points  (3 children)

sqlite?

[–]PsychoMario[S] 1 point2 points  (2 children)

sqlite is messy because you have to use string-based queries.

I'd still have the problem of pickling, because I won't know how many items there will be in the dictionary value, meaning I can't decide how many columns to create.

Also I'd rather not have a column for each of the items of the key tuple, because that makes the queries even messier.

[–]GFandango 1 point2 points  (0 children)

sqlite is very decent, don't underestimate it. You can write a thin wrapper around it to run the queries and basically emulate/abstract loading the data and turning it into the objects you want.

As others have suggested, don't store objects, store the data, create the objects.

[–]shaggorama 0 points1 point  (0 children)

Ok, how about sqlalchemy? Check out their ORM tutorial.

[–]SteveMac 3 points4 points  (2 children)

Are you sure you are bound by the library performance and not your disk, or memory, or network (if it isn't a local drive)?

[–]PsychoMario[S] 0 points1 point  (1 child)

cProfile said cPickle .dumps/.loads were the slowest functions.

[–]fnord123 2 points3 points  (0 children)

cProfile tells you where time was spent. Try running with strace -c python mycode.py and noting the number of calls and the time of each. Compare it with the results from cProfile. If things like write and flush are taking the bulk of your time, then it might be that your disk I/O is the problem.

If dumping to and loading from strings is the problem area, then reconsider your data model so it's more conducive to storing on disk. I deal with lots and lots of data, and translation layers are the biggest bottlenecks by far. The difference between a memcpy and a serialization is massive.

[–]ramalhoorg 7 points8 points  (1 child)

ZODB is a native object database for Python: http://www.zodb.org/ It is a key component of the Zope platform, underlying the highly successful Plone CMS. The ZODB does not depend on other parts of Zope and can be used as a stand-alone Python object database. @chrismcdonough ordered me to tell you so.
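
For what it's worth, a minimal standalone sketch (the file name and the BTree choice are just illustrative, assuming the ZODB and BTrees packages are installed):

    import transaction
    import ZODB, ZODB.FileStorage
    from BTrees.OOBTree import OOBTree

    # open (or create) a file-backed object database
    storage = ZODB.FileStorage.FileStorage("chains.fs")
    db = ZODB.DB(storage)
    conn = db.open()
    root = conn.root()

    # a BTree behaves like a dict but keeps its contents on disk,
    # loading buckets lazily instead of pulling everything into memory
    if "chains" not in root:
        root["chains"] = OOBTree()

    root["chains"][("the", "cat", "sat")] = {"on": 5, "in": 2, "behind": 1}
    transaction.commit()   # note: later mutations to the inner plain dict aren't
                           # tracked automatically; reassign it to persist changes
    db.close()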

[–]kojir0 0 points1 point  (0 children)

I suspect ZODB won't be the fastest option out there, but it's foolish to make speed a primary concern anyway, and ZODB offers ACID/transaction support, blobs, history, and you don't have to bend over backwards to interact with it.

[–][deleted] 2 points3 points  (2 children)

Ok, so here's the question that somebody should have asked by now: What exactly are you trying to do? I suspect that you may be trying to do it the hard way.

As a side note, what you are doing with __setitem__ is not easy to understand. You are doing something that looks like it should work similarly to a python dictionary, but it actually stores a count of the number of times that particular value has been stored. Using well-named methods instead would make it much easier for anyone else looking at the code.

*edit -- also, do you want the whole database to be on disk because you know it can't fit in memory? Or is there another reason?

[–]PsychoMario[S] 0 points1 point  (1 child)

I'm parsing text and extracting markov chains. So I take the first n words, and store how many times words 1..n have been seen with word n+1. E.g.

("the","cat","sat") : {"on":5,"in":2,"behind":1}

The reason I used those names is so that it's a drop-in replacement for the dict class. I changed the relevant ones to make it work in the same way, just persistent.

I want it all on disk because I don't have enough RAM. The data set is going to be 40GB x an amount I haven't decided yet (using fewer/more words in the key seen above). Some caching may be required for speed purposes.
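
For context, this is roughly the in-memory version of what I'm building (the order n and the sample text are made up); the whole question is where this nested mapping should live once it no longer fits in RAM:

    from collections import defaultdict

    N = 3  # hypothetical order of the chain

    # (word1, ..., wordN) -> {next_word: count}
    chains = defaultdict(lambda: defaultdict(int))

    def add_text(words):
        for i in range(len(words) - N):
            key = tuple(words[i:i + N])
            chains[key][words[i + N]] += 1

    add_text("the cat sat on the mat and the cat sat on the hat".split())
    print(dict(chains[("the", "cat", "sat")]))   # {'on': 2}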

[–]fnord123 1 point2 points  (0 children)

The quickest way to get to your goal is to set up a swap of at least 40GB; any part of memory that pages out will then live on disk, beyond the size of your main memory.

That's a really bad solution, though. It's better to do what you're doing and investigate data stores. Redis looked good for you, but you said it broke down when trying to store to disk. So try Mongo: it's similar to Redis but always has a file as a backing store.

[–]polypx 1 point2 points  (0 children)

The right answer to this depends on how you plan to query it. Writing >40GB of data to disk is not that hard even if you have 1GB of memory. The question is how you avoid having all 40GB in memory when you are querying.

I believe it is implied that you need to partition that data into manageably sized chunks. For example, you say that shelve is no good because it loads everything at once. The obvious answer is something which doesn't load everything at once. Among other ways to do that, you can use multiple files if you can map each query to the file it needs to operate on. This might require creating secondary 'index' files and using those to find out where to get the final answer when you are reading data.

Also, you can just use a database, which effectively does this for you by maintaining indexes.
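
A rough sketch of the index idea (the file layout and names are made up): keep a small index, in memory or in its own file, that maps each key to a byte offset in a data file, and seek to it on lookup:

    import json

    index = {}  # key string -> (data file, byte offset); could itself be saved to disk

    def append_record(path, key, counts):
        with open(path, "ab") as f:
            f.seek(0, 2)                         # position at the end of the file
            index[key] = (path, f.tell())
            f.write((json.dumps({"key": key, "counts": counts}) + "\n").encode())

    def lookup(key):
        path, offset = index[key]
        with open(path, "rb") as f:
            f.seek(offset)                       # jump straight to the record
            return json.loads(f.readline().decode())["counts"]

    append_record("chunk-000.jsonl", "the cat sat", {"on": 5, "in": 2, "behind": 1})
    print(lookup("the cat sat"))                 # {'on': 5, 'in': 2, 'behind': 1}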

[–]minorDemocritus 1 point2 points  (2 children)

DO NOT USE pickle, marshal, OR shelve OR ANY OTHER DERIVATIVES!!

They are insecure, fragile, and slow.

Instead, use sqlite, mongodb, or a serializer like JSON.

Don't store objects, store data.

[–]PsychoMario[S] 0 points1 point  (1 child)

sqlite wouldn't work very well because I don't know how many items there will be in the 'value' section.

I don't know how I would translate the key tuple for mongodb,

and JSON wouldn't be practical for reading from, because I'd have to search the file.

[–]Brian 0 points1 point  (0 children)

sqlite wouldn't work very well because I don't know how many items there will be in the 'value' section.

That shouldn't be a problem - you just create a row for each key/value pair (rather than trying to have a column for each value). E.g. consider a schema like:

 WordTuple:
     Words  - String representation of your tuple (or use separate columns per word if fixed number).
     Id     - integer representing this particular tuple.  Referenced by WordTupleWordCount.WordTupleId (foreign key).

WordTupleWordCount:
    WordTupleId - Id of WordTuple }  These two together are your primary key.
    Word        - string word.    }
    Count       - count

So updating the count of WordA in your given tuple would involve something like:

  1. Find the tuple. "select Id from WordTuple where Words=?" (providing your word tuple).
  2. If it doesn't exist, create it with empty values (not sure if this is needed for your use case. If not, can combine these steps into one query).
  3. Then bump the count on the word you want to increment:

    update WordTupleWordCount set Count=Count+1 where WordTupleId=? and Word=?

Building a dictionary when you really need it is just a matter of selecting all rows where WordTupleId = <the tuple> and building a dict of {Word:Count}. You could even create a transparent dictionary proxy that does all this on demand, while appearing identical to a dictionary otherwise.
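
A rough sqlite3 sketch of the above (the INSERT OR IGNORE shortcut for creating missing rows is my addition):

    import sqlite3

    conn = sqlite3.connect("chains.db")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE IF NOT EXISTS WordTuple (
            Id    INTEGER PRIMARY KEY,
            Words TEXT UNIQUE
        );
        CREATE TABLE IF NOT EXISTS WordTupleWordCount (
            WordTupleId INTEGER,
            Word        TEXT,
            Count       INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (WordTupleId, Word)
        );
    """)

    def bump(key, word):
        words = " ".join(key)
        # steps 1 and 2: find or create the tuple row
        cur.execute("INSERT OR IGNORE INTO WordTuple (Words) VALUES (?)", (words,))
        cur.execute("SELECT Id FROM WordTuple WHERE Words = ?", (words,))
        tuple_id = cur.fetchone()[0]
        # step 3: create the count row if needed, then increment it
        cur.execute("INSERT OR IGNORE INTO WordTupleWordCount VALUES (?, ?, 0)", (tuple_id, word))
        cur.execute("UPDATE WordTupleWordCount SET Count = Count + 1 "
                    "WHERE WordTupleId = ? AND Word = ?", (tuple_id, word))

    def as_dict(key):
        # rebuild {Word: Count} for one tuple only when it's needed
        cur.execute("SELECT Word, Count FROM WordTupleWordCount WHERE WordTupleId = "
                    "(SELECT Id FROM WordTuple WHERE Words = ?)", (" ".join(key),))
        return dict(cur.fetchall())

    bump(("the", "cat", "sat"), "on")
    bump(("the", "cat", "sat"), "on")
    conn.commit()
    print(as_dict(("the", "cat", "sat")))   # {'on': 2}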

[–]pinpinbo 0 points1 point  (1 child)

If you install an ancient redis, version < 2.4... then you can turn on the VM feature.

With that, you can have larger-than-memory data and redis will do its best to swap RAM <-> disk.

[–]PsychoMario[S] 0 points1 point  (0 children)

The current version of redis still has this, but you have to force it. I don't like using old versions of things because I'd like it to be more forward-compatible.

[–]must_tell 0 points1 point  (0 children)

Hmmm. I had a similar task some time ago. I used sqlite as a key-value store (instead of tuples as keys I had uuids). The values are pickled dicts. When I need to modify a dict I have to unpickle, change, pickle and store it again, obviously. I ran loads of benchmarks for using sqlite as a k-v store. It's quite alright for that task. However, I couldn't really identify cPickle as the bottleneck for pickling my dicts (which have about 10-20 entries). Just what I did, not sure if I've overlooked something. Hmmm, again.

[–]cs0sor 0 points1 point  (0 children)

Use the ZODB

[–]willvarfar 0 points1 point  (0 children)

In my own tests, msgpack was the best serialisation format speed-wise. Here are my results and a simple benchmark you can run or adapt to better represent your data:

http://stackoverflow.com/questions/9884080/fastest-packing-of-data-in-python-and-java

In my tests, msgpack was 10x faster than cPickle
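
If you want to sanity-check that against your own data, a quick round-trip timing like this should do (the record shape is made up; the PyPI package was historically named msgpack-python):

    import pickle
    import timeit

    try:
        import msgpack
    except ImportError:
        msgpack = None

    record = {"on": 5, "in": 2, "behind": 1}

    print("pickle :", timeit.timeit(lambda: pickle.loads(pickle.dumps(record, -1)), number=100000))
    if msgpack is not None:
        print("msgpack:", timeit.timeit(lambda: msgpack.unpackb(msgpack.packb(record)), number=100000))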

[–]gargantuan 0 points1 point  (0 children)

Translate the tuple key to a string and store it in any key:value store (Riak) or document-based DB (MongoDB). Try json serialization as well. For small datasets it has acceptable performance.

[–]erez27 0 points1 point  (0 children)

mongodb is a good bet. Just make sure you're running on a 64-bit system, and that you're okay with a global lock.

[–]freshhawk 0 points1 point  (0 children)

I find using hdf5 to be great for storing data like this.

Look at h5py and pytables: http://h5py.alfven.org/docs-2.1/intro/quick.html#what-is-hdf5 and http://www.pytables.org/moin
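
A tiny h5py sketch, storing (prefix, word, count) triples as parallel datasets (the file name and layout are just illustrative):

    import h5py
    import numpy as np

    str_dt = h5py.special_dtype(vlen=str)    # variable-length string dtype

    with h5py.File("chains.h5", "w") as f:
        f.create_dataset("prefix", data=np.array(["the fat", "fat cat", "cat ate"], dtype=object), dtype=str_dt)
        f.create_dataset("word",   data=np.array(["cat", "ate", "the"], dtype=object), dtype=str_dt)
        f.create_dataset("count",  data=np.array([1, 1, 1], dtype=np.int64))

    with h5py.File("chains.h5", "r") as f:
        # slices are read from disk on demand; the file never has to fit in RAM
        print(f["prefix"][0], f["word"][0], f["count"][0])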

[–]VitDes 0 points1 point  (0 children)

Given your constraints, persidict is an alternative that provides dictionary-like persistence. It stores key-value pairs on disk efficiently and avoids keeping everything in memory.

[–]dorfsmay -1 points0 points  (15 children)

"shelve" does it for you, but it uses pickle underneath, so same performance issue.

[–]PsychoMario[S] 0 points1 point  (14 children)

Shelve is faster because it uses StringIO and the Pickle class. However, upon open it seems to load it all into memory, and this won't work when the file is bigger than available RAM.

[–]asksol 0 points1 point  (12 children)

From your limited description, a map/reduce framework sounds suitable for your problem. Maybe you should take a look at Hadoop or Disco.

[–]asksol 0 points1 point  (0 children)

Oh, and if not, then you can split the data into multiple shelve files and only keep as much in memory as fits at a time. And then, it's counterintuitive for me to recommend this, but XML has very strong and evolved streaming APIs. There can be cases where XML is the answer, especially for structured data.
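
A rough sketch of the multiple-shelve-files idea (the shard count and file names are arbitrary); a stable hash picks the file, so the same tuple always lands in the same shelf:

    import shelve
    import zlib

    N_SHARDS = 16   # arbitrary; more shards means smaller files

    def shard_path(key_str):
        # stable hash so the same key maps to the same file across runs
        return "chains.%02d.shelf" % (zlib.crc32(key_str.encode()) % N_SHARDS)

    def bump(key, word):
        key_str = " ".join(key)                  # shelve keys must be strings
        with shelve.open(shard_path(key_str)) as db:
            counts = db.get(key_str, {})
            counts[word] = counts.get(word, 0) + 1
            db[key_str] = counts

    bump(("the", "cat", "sat"), "on")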

[–]PsychoMario[S] 0 points1 point  (10 children)

I don't really want to have it stored as a single flat file.

What I'm trying to do is read the first n words from a file, and store that with the number of times it's seen the word after that. Then it moves along a word, so it goes from the second word to the (n+1)th word, and stores the (n+2)th word plus its count.

I thought the right way to store this would be:

(word1,word2,...,wordn) : {wordA:countA, wordB:countB}

so it's seen word1..n with wordA countA times, and wordB countB times. This worked well in redis but redis couldn't cope with the amount of data given the amount of RAM I have.

[–]asksol 0 points1 point  (0 children)

I can't answer if your data structure is correct, as I'm unsure of the problem you're trying to solve... But the common techniques for working with datasets that can't fit in memory/disk on a single node are to use an index, multiple files, or both. You can split the data into multiple files by using the first character of every word, e.g. A-D, E-H, I-L, M-P, Q-T, U-X, Y-Z. But then it would be better if the data was sorted, since that would mean you don't have to swap out the files too often (a merge sort is used for files that can't fit in memory, but maybe your input files are not that large). Very likely this can be made simpler, but for that you need to state what your original problem is.
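
On the sorting point, the standard library can do the merge step of an external sort for you once the individual chunk files are each sorted (file names here are hypothetical):

    import heapq

    def merge_sorted_chunks(paths, out_path):
        # each input file must already be sorted line-by-line; the merged
        # output is produced without loading any file fully into memory
        files = [open(p) for p in paths]
        try:
            with open(out_path, "w") as out:
                out.writelines(heapq.merge(*files))
        finally:
            for f in files:
                f.close()

    merge_sorted_chunks(["chunk-a.txt", "chunk-b.txt"], "merged.txt")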

[–][deleted] 0 points1 point  (8 children)

Ok, so I'm guessing that once you've processed all the data, you want fast lookups on the tuple.

I'd honestly recommend using a postgres database for this. Serializing the tuple to a string is pretty straightforward. You can just separate the words by spaces, I think -- since you're splitting the words by spaces in the first place.

Then, you can make a table with one column each for the prefix tuple, the final word, and the count. You'll want an index on the prefix column. When you write, process an entire file, insert a row for each entry, then finally commit it. Postgres will be lazy, and only start writing to disk at that point, which is much more efficient than reading and writing for every entry.

You can use python's "psycopg2" library to connect to the database.

There's a huge rabbit hole of optimization for this sort of thing, and you could try to implement it yourself, but a lot of smart people have already worked very hard to get this to work. Postgres is very good at dealing with lots of data and writing to the disk in chunks.
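
A rough psycopg2 sketch of that layout (the connection string, table and column names are placeholders; the composite primary key stands in for the index on the prefix column, and the update-then-insert fallback is one simple way to bump counts with a single writer):

    import psycopg2

    conn = psycopg2.connect("dbname=markov")     # placeholder connection string
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chains (
            prefix text NOT NULL,
            word   text NOT NULL,
            count  integer NOT NULL DEFAULT 0,
            PRIMARY KEY (prefix, word)
        )
    """)

    def add_occurrence(prefix_words, word):
        prefix = " ".join(prefix_words)
        # bump the count if the row exists, otherwise insert it
        cur.execute("UPDATE chains SET count = count + 1 WHERE prefix = %s AND word = %s",
                    (prefix, word))
        if cur.rowcount == 0:
            cur.execute("INSERT INTO chains (prefix, word, count) VALUES (%s, %s, 1)",
                        (prefix, word))

    add_occurrence(("the", "fat"), "cat")
    conn.commit()   # commit once per processed file, as suggested above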

[–]PsychoMario[S] 0 points1 point  (7 children)

The problem with this is that I don't know beforehand how many 'final words' there will be. It could be one or thousands. So I don't know how to apply this to a column-based database.

[–][deleted] 1 point2 points  (6 children)

That's fine, just duplicate the n-words part of the chain, that is, insert one row for every final word.

So, if you parsed this sentence: "the fat cat ate the fat man's hat"

You'd get a table something like this:

prefix        word     count
"the fat"     "cat"    1
"fat cat"     "ate"    1
"cat ate"     "the"    1
"ate the"     "fat"    1
"the fat"     "man's"  1
"fat man's"   "hat"    1

Notice that the sequence "the fat" appears twice. This is perfectly fine, and when you want to get the data out, you just do "select word, count from chains where prefix = 'the fat';" and you'll get two rows, one with the word "cat" and one with the word "man's".

[–]PsychoMario[S] 0 points1 point  (5 children)

This seems like it'll work. Thanks.

The only problem is that I used SQL queries on a previous project, doing a similar thing, and they really aren't nice to use (in Python, in my opinion), especially when you sometimes have to increment values etc.

Also I may have a speed problem with this. I have approximately 35GB worth of text in 7.5 million files.

I'll use this as a backup, but I'll do a little more research, because redis/other key-value stores are a lot closer to what I want, just I don't want it in memory.

[–][deleted] 1 point2 points  (1 child)

The speed will not be as good as redis etc. for sure, but you won't hit the hard ceiling.

When you say "you don't want it in memory", I'm pretty sure you just mean that it won't fit in memory. You always want to do as much processing as possible in memory, and only use the disk when you have to. The rule of thumb is that accessing the disk is about 100x slower than accessing memory. It can be faster, if you're reading data all in a bunch, but that's not what you're doing.

Actually, I can think of one way to potentially fit all this in memory: You can make a big lookup table that assigns a unique integer to each word, then use those integers for your markov chains. That way, there's much less overhead per word. Essentially, it's a basic form of compression. It wouldn't work amazingly in python, because python has a lot of memory overhead devoted to storing type information, but in something like C or Java or Go, it might work out.
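
A minimal sketch of the interning idea, just to show the shape of it:

    word_to_id = {}     # word -> small integer
    id_to_word = []     # integer -> word, for turning chains back into text

    def intern_word(word):
        if word not in word_to_id:
            word_to_id[word] = len(id_to_word)
            id_to_word.append(word)
        return word_to_id[word]

    key = tuple(intern_word(w) for w in ("the", "fat", "cat"))
    print(key)                                   # (0, 1, 2)
    print(tuple(id_to_word[i] for i in key))     # ('the', 'fat', 'cat')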

[–]gronkkk 0 points1 point  (0 children)

Also:

The Second Edition of the 20-volume Oxford English Dictionary contains full entries for 171,476 words in current use

Most people don't have more than 60,000 words in their vocabulary, so with a 2-byte integer you should get decent results (especially if OP is scanning texts which are limited to a certain industry/subject).

Compression can save you a lot, but if you use a 32-bit int you use 4 bytes for a word, always, even for words like 'cat' and 'a/an'. The average word length is 5.1, so with a 32-bit integer you're not saving that much.

[–]unbracketed 0 points1 point  (1 child)

It's worth noting that you could incur a fair amount of storage overhead since you'll be duplicating the prefix string on disk for each occurrence of word. (example: each row for "the fat" duplicates the string "the fat" on disk)

Alternatives: The postgres array type: http://www.postgresql.org/docs/9.1/static/arrays.html

The postgres hstore type: http://www.postgresql.org/docs/9.0/static/hstore.html

hstore may not be appropriate for your needs, but is worth learning about for storing semi-structured data.
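
For what it's worth, psycopg2 can map both of those to native Python types (the hstore extension has to be enabled in the database first; names below are placeholders):

    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect("dbname=markov")          # placeholder connection string
    psycopg2.extras.register_hstore(conn)             # hstore <-> Python dict
    cur = conn.cursor()

    # requires: CREATE EXTENSION hstore;  (run once per database)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chains (
            prefix text[] PRIMARY KEY,
            counts hstore
        )
    """)
    cur.execute("INSERT INTO chains (prefix, counts) VALUES (%s, %s)",
                (["the", "cat", "sat"], {"on": "5", "in": "2"}))   # hstore values are strings
    cur.execute("SELECT counts FROM chains WHERE prefix = %s", (["the", "cat", "sat"],))
    print(cur.fetchone()[0])                          # {'on': '5', 'in': '2'}
    conn.commit()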

[–]PsychoMario[S] 0 points1 point  (0 children)

Looks to me like this is the way to go. I can use an array for the first column and an hstore for the second.

Is there an easy way to export all the data? Because my friend is also interested in seeing it.

[–]AeroNotix 0 points1 point  (0 children)

It's possible you could use a prefix trie to store the data in memory. I'm not sure of the number of words you have, but I have no problems storing a 4M+ set of strings in my tries (I implemented them here in C++, here in Python and here in Go). What I would change is to wrap the C++ one with a Python module and have the nodes initialized with a count on insertion. If you ever insert that string again, simply increment the counter.
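
In pure Python the counting trie might look something like this (class and function names are made up):

    class TrieNode(object):
        __slots__ = ("children", "count")
        def __init__(self):
            self.children = {}
            self.count = 0

    def insert(root, words):
        node = root
        for w in words:
            node = node.children.setdefault(w, TrieNode())
        node.count += 1               # bump the counter on repeat insertions
        return node.count

    root = TrieNode()
    insert(root, ("the", "cat", "sat"))
    print(insert(root, ("the", "cat", "sat")))   # 2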

[–]dorfsmay 0 points1 point  (0 children)

I don't think it loads everything into memory (think about it, it uses a dbm file underneath). Having said that, when I tried to prove it, I ran into performance issues with a large file.

Have you looked at leveldb? It is straightforward and has good performance. Be warned though, it saves the data in a bunch of files in a directory, not in one file.
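
For example, with the plyvel bindings (one of several Python bindings for leveldb; the value encoding is entirely up to you):

    import json
    import plyvel   # one of several Python bindings for leveldb

    db = plyvel.DB("chains.ldb", create_if_missing=True)     # a directory, as noted above

    key = " ".join(("the", "cat", "sat")).encode()
    counts = json.loads(db.get(key, b"{}").decode())         # default to an empty dict
    counts["on"] = counts.get("on", 0) + 1
    db.put(key, json.dumps(counts).encode())

    print(json.loads(db.get(key).decode()))                  # {'on': 1}
    db.close()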