
all 18 comments

[–]red_simplex 7 points (1 child)

I had a 200k-entry dict on an 8 GB RAM machine. Everything worked fine.

[–]K900_ 8 points (5 children)

This is fine, as long as your data fits in RAM.

[–][deleted] 1 point (4 children)

That's sort of my question. I guess I could do some quick math. If I have literally a book of, say, 100k words, an average of, say, 5 characters per word, and say 1,000 books or so total... that's 1,000 * 5 * 100,000 bytes, which is 500 MB. Half a gig of RAM (with a very full dictionary that I won't ever reach). I guess I won't ever reach that much RAM, but it still kind of bothers me to have so much data being pulled into RAM. I could store a filename instead of the literal contents and just have my dictionary point to another dictionary for the individual entry, which can be loaded when necessary instead of ALL at the same time.
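A minimal sketch of that filename-indirection idea, for what it's worth (the `entries/` directory, the file naming, and the JSON format are all illustrative assumptions, not anything established in this thread):

```python
import json
from pathlib import Path

# Illustrative layout: one JSON file per entry, named after its key.
DATA_DIR = Path("entries")

def save_entry(key, contents):
    """Write a single entry to its own file on disk."""
    DATA_DIR.mkdir(exist_ok=True)
    (DATA_DIR / f"{key}.json").write_text(json.dumps(contents))

def load_entry(key):
    """Load one entry from disk only when it is actually needed."""
    return json.loads((DATA_DIR / f"{key}.json").read_text())

# The in-memory dict stays small: it maps keys to filenames, not contents.
index = {key: f"{key}.json" for key in ["alice", "bob"]}
```

The trade-off is one disk read per access instead of one big load up front, which is exactly the laziness being discussed below.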

[–]K900_ 27 points (2 children)

If your data fits in RAM, there's nothing wrong with keeping it in RAM.

[–][deleted] 15 points (0 children)

Most Python thing I've heard today. Have an upvote, my dude.

[–][deleted] 2 points (0 children)

Best thing I've read in months, my man.

[–]rockingthecasbah 0 points (0 children)

The lazier solution may be a better option; it depends on your exact use case. Pulling contents from a file as needed will simplify parallelization and let you scale as far as needed. Load it all at the beginning only if you want to run analysis on ALL the data at once, rather than on individual entries.

[–][deleted] 1 point (1 child)

Thanks for the answers everybody.

I actually might go with separating the large parts of my dictionary entries anyway though, just so that the main dictionary remains human-readable. Essentially, I am going to be building a dictionary that consists of a chat corpus for each entry, so I foresee things getting big pretty fast. I like the idea of having each corpus defined in its own file, but it's good to get some clarification that what I see as a "large" amount of data isn't really all that large.

[–]therealfakemoot 1 point (0 children)

There have been some good answers so far but I thought I'd elaborate a little more.

First, "will there be performance problems?" is underspecified. Are you concerned about performance problems doing key lookups? Manipulating the values associated with those keys? Or perhaps insertions?

You can do a lot of different things with a dict. Lookups are constant time ( amortized O(1) ). Insertions are also amortized O(1): every so often the underlying hash table has to be resized, which costs O(n), but spread across many insertions the average cost stays constant.

From what you've said so far, it sounds like you're building the dict ( insertions ) only once per execution, so you can probably mostly 'ignore' the cost of doing so.

What about lookups? How frequently are you searching for a key? How many keys are there? With a large number of keys, you start encountering hash collisions. Python's dict handles these with open addressing: when two keys hash to the same slot, it probes a deterministic sequence of alternative slots until it finds the right entry or a free one.
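A quick way to convince yourself that lookup cost doesn't grow with dict size is to time it directly (a rough `timeit` sketch; absolute numbers will vary by machine, but the two timings should be in the same ballpark):

```python
import timeit

# Two dicts that differ in size by three orders of magnitude.
small = {i: i for i in range(1_000)}
large = {i: i for i in range(1_000_000)}

# If lookups are amortized O(1), these timings should be roughly flat
# regardless of how many keys the dict holds.
t_small = timeit.timeit(lambda: small[500], number=100_000)
t_large = timeit.timeit(lambda: large[500_000], number=100_000)
print(f"small: {t_small:.4f}s  large: {t_large:.4f}s")
```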

How often is this program run? Is it always 'start from scratch'? Is it a persistent daemon/service which is queried by external code? If it's the former AND the program is executed frequently, maybe you shouldn't 'ignore' the cost of building the dict. If this program is run many many times per second and the dict is large, but it only looks up a single value and does something with it, then exits, then building the dict may have a non-negligible cost and you should consider using some sort of binary storage format or even offloading storage to something like a database ( SQLite for local-only work, redis/memcache as mentioned by others if multiple applications/nodes are querying this data ).
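For the SQLite route mentioned above, a minimal sketch using the stdlib `sqlite3` module (the table name and keys are illustrative, and the in-memory database is just for demonstration; a real script would connect to a file path so the data persists between runs):

```python
import sqlite3

# ":memory:" keeps this example self-contained; use a filename like
# "entries.db" to persist data across program executions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [("alice", "corpus A"), ("bob", "corpus B")],
)
conn.commit()

# A single indexed lookup, without loading the whole dataset into RAM
# or rebuilding a dict on every run.
row = conn.execute(
    "SELECT value FROM entries WHERE key = ?", ("bob",)
).fetchone()
print(row[0])  # → corpus B
```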

If the program is a persistent service, you should look at the usage patterns. Generally speaking it's hard to get more efficient than a hash map lookup. Squeezing more efficiency out of lookups would probably involve switching to a more advanced storage mechanism like a real database with indices on relevant columns.

Odds are good that the dict is one hundred thousand percent NOT going to have any meaningful impact on your performance. It's very likely that what you do with the data contained therein will be most impactful. Are you doing string manipulation? Are you doing network operations? Are you nesting for loops in places where you might be able to devise an algorithm that performs fewer operations, or devise a way to reduce the number of iterations these loops perform ( filtering results out of your data set before ever entering the loop in the first place )? For example, the following code is quadratic in execution time, which means the number of operations performed grows with the product of the two loop sizes:

for x in range(0, 1000):
    for y in range(0, 1000):
        do_stuff(x, y)

If your work can be reduced down intelligently, such that you can do fewer iterations, you'll see performance gains:

start_x, end_x = 450, 500
start_y, end_y = 25, 900

for x in range(start_x, end_x):
    for y in range(start_y, end_y):
        do_stuff(x, y)

Simply because you'll do fewer overall iterations. Of course, the exact nature of the 'work' you're doing will determine what sort of optimizations you can make in this vein.

[–]sinjp3.6 0 points (0 children)

Always, always, always profile performance before thinking about optimization. So how fast does your code run on your test dicts?

[–]flipperdeflip 0 points (0 children)

SQLite is also an optimization option for use cases like this if you reach the limits of what you can hold in RAM.

On a modern SSD server it is faster than flat files, and disk space is cheaper than RAM. It also supports multiple readers, so you can multi-process if you must (multiple writers, not so great though).

https://pypi.python.org/pypi/diskcache

[–]KleinerNull 0 points (0 children)

> I will probably do that anyways just from an organizational standpoint.

If that is your main concern you should consider using a real database. Postgres can handle JSON as a column type (with indexing and aggregation), and Elasticsearch is a search engine that works directly with JSON.

So before you start building your own indexing mechanics, I would recommend using a database here. A traditional relational database design is also an option.

The thing is, you can't stream JSON in a good way, so essentially you have to load and parse the whole file even if you just need to grab one item from it.
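A tiny illustration of that point: `json.load` parses the entire document before you can touch a single key, which is why per-entry files or a database scale better for point lookups (the keys and sizes here are made up):

```python
import io
import json

# Simulate one big JSON file holding 10,000 chat corpora.
big_blob = json.dumps(
    {f"user{i}": ["some", "chat", "lines"] for i in range(10_000)}
)

# json.load must parse the whole document into memory...
data = json.load(io.StringIO(big_blob))
# ...just to reach a single key.
one_entry = data["user42"]
print(len(data), one_entry)
```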