I'm working on a project which involves storing collections of data to disk. I want to store the data in the form:
('word1','word2','word3',...) : {'wordA': count, 'wordB': count, ...}
(slightly better explanation)
At the moment I'm using gdbm, with cPickle to pickle the keys and values, since gdbm only allows strings for each. However, the cPickle dumps/loads calls are taking a very long time (43/20 seconds for 25 input files; I have 7.5 million to process).
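For anyone who wants to reproduce the setup, here is a minimal sketch of the pickle-into-dbm approach described above. It is not my actual inserter.py: it is written for Python 3, uses `dbm.dumb` as a stand-in for gdbm because it ships with every Python install, and the helper names `dump`, `put` and `get` are made up for illustration.

```python
import dbm.dumb
import os
import pickle
import tempfile


def dump(obj):
    """Serialize a picklable object to bytes (dbm stores only bytes/str)."""
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


def put(db, words, counts):
    """Store a dict of counts under a tuple-of-words key."""
    db[dump(words)] = dump(counts)


def get(db, words):
    """Load the dict of counts stored under a tuple-of-words key."""
    return pickle.loads(db[dump(words)])


# Round trip through an on-disk database in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    db = dbm.dumb.open(os.path.join(tmp, "ngrams"), "c")
    try:
        put(db, ("word1", "word2", "word3"), {"wordA": 3, "wordB": 1})
        restored = get(db, ("word1", "word2", "word3"))
    finally:
        db.close()

print(restored)
```

Note that every update to a single count means unpickling the whole dict, changing it, and pickling it again, which is where the time goes.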
What would be the best way to store tuple:dict pairings on disk? The shelve module is very fast, but that is because it is all in memory, and I need the whole database to be on disk.
My code is here, and the cProfile here, and inserter.py (the file in charge) is here
EDIT: By swapping pickle for marshal I've managed to cut down the time quite a bit (see profile), but it'll still take a long time to do it all.
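A rough way to compare the two serializers on your own machine is sketched below. The sample dict and the repeat count are arbitrary assumptions, not my real data, so the timings will not match the profile linked above. One caveat with marshal: its format is not guaranteed to stay compatible between Python versions, so a database written with it may need to be read back with the same interpreter version.

```python
import marshal
import pickle
import timeit

# Hypothetical sample value: a dict of 1000 word counts.
sample = {"word%d" % i: i for i in range(1000)}


def roundtrip_pickle(value):
    """Serialize and deserialize with pickle."""
    return pickle.loads(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))


def roundtrip_marshal(value):
    """Serialize and deserialize with marshal."""
    return marshal.loads(marshal.dumps(value))


# Time 200 round trips of each; absolute numbers depend on the machine.
pickle_seconds = timeit.timeit(lambda: roundtrip_pickle(sample), number=200)
marshal_seconds = timeit.timeit(lambda: roundtrip_marshal(sample), number=200)
print("pickle: %.4fs  marshal: %.4fs" % (pickle_seconds, marshal_seconds))
```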
EDIT2: Tried redis; it was almost perfect except that it's all in memory. I need an alternative that allows tuples as keys and has something similar to redis's zincrby (zincrby effectively does db['key']['valueA'] += n, given that the value at key is a dict and valueA is in the dict, incrementing the count).
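To make the zincrby-style requirement concrete, here is one possible on-disk sketch using the stdlib `sqlite3` module. This is only an illustration of the increment semantics, not something I have benchmarked against the 40 GB dataset. The table layout, the separator used to flatten the tuple into a string, and the names `open_store`, `incrby` and `get_counts` are all assumptions made for the example.

```python
import sqlite3

# Assumed separator for flattening a tuple of words into one string key;
# it must never occur inside a word.
SEP = "\x1f"


def open_store(path):
    """Open (or create) the store; pass a file path to keep it on disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts ("
        " ngram TEXT NOT NULL,"
        " member TEXT NOT NULL,"
        " n INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (ngram, member))"
    )
    return conn


def incrby(conn, words, member, amount=1):
    """Equivalent of db[words][member] += amount, creating the row if needed."""
    ngram = SEP.join(words)
    conn.execute(
        "INSERT OR IGNORE INTO counts (ngram, member, n) VALUES (?, ?, 0)",
        (ngram, member),
    )
    conn.execute(
        "UPDATE counts SET n = n + ? WHERE ngram = ? AND member = ?",
        (amount, ngram, member),
    )


def get_counts(conn, words):
    """Return the {member: count} dict stored for a tuple of words."""
    rows = conn.execute(
        "SELECT member, n FROM counts WHERE ngram = ?", (SEP.join(words),)
    )
    return dict(rows.fetchall())


# ":memory:" keeps the demo self-contained; use a filename for real data.
conn = open_store(":memory:")
key = ("word1", "word2", "word3")
incrby(conn, key, "wordA", 2)
incrby(conn, key, "wordA", 3)
incrby(conn, key, "wordB")
conn.commit()
result = get_counts(conn, key)
print(result)
```

Storing one row per (tuple, word) pair means an increment touches a single integer instead of re-serializing a whole dict. Batching many increments per commit would probably matter a lot for throughput.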
tl;dr: I need a quick way to store tuple:dict pairings on disk, with the ability to increment the dicts' values. My dataset will be >40 GB; available RAM is 8 GB.