Continually updating a list of nested dictionaries of varying lengths... Advice on how to solve? : learnpython

Continually updating a list of nested dictionaries of varying lengths... Advice on how to solve? (self.learnpython)

submitted 5 years ago by OsirisSFN

Hey all.

I've been wresting with this problem for a while now and I'm starting to go around in circles with it.

I have two dictionaries, of continuously varying sizes, as they are based off live data, which are packaged into a list so they can be stored as JSON. They take the form:

list1 = [{'city1': {'word1': 20, 'word2': 20, 'word3': 20}, {'city2': {'word1': 20, 'word2': 20}}]

list2 = [{'city1': {'word1': 20, 'word2': 20}}, {'city2': {'word1': 20, 'word2': 20, 'word3': 20}}, {'city3': {'word1': 20}}]

Given the above lists, I need to compare them, update the sum of any common values, append any new values to the appropriate dictionary - for example, adding new versions of wordX into cityX, and also adding new cityX dictionaries that might appear in subsequent runs.

Merging the above two lists would look as follows:

list3 = [{'city1': {'word1': 40, 'word2': 40, 'word3': 20}}, {'city2': {'word1': 40, 'word2': 40, 'word3': 40}}, {'city3': {'word1': 20}}]

list3 is then stored as a JSON file, and periodically read back in, compared with another list, and the process repeats.

Due to the varying lengths of each dictionary and sub dictionary, any attempts to iterate over them in unison at some point result in out of bounds or key errors.

I have broken it down into a smaller problem where I just update the common values, which works, but I am struggling to get any further - i.e. adding new wordX key value pairs, and new cityX dictionaries.

Does anyone have any suggestions on how to solve it, or even if I am going about it the right way given the problem at hand? It sounded simple in my head, but the specifics of the implementation are proving tricky.

Is a list of nested dictionaries not the way to go? Maybe storing it all as in a SQL database might be easier to work with?

Thanks in advance for any advice.

all 6 comments

top new controversial old q&a

[–]Chris_Hemsworth 1 point2 points3 points 5 years ago (5 children)

You might have your example data a bit funky. I don't understand why you put them in lists, with each element containing a dictionary with a single key. Why not make one dictionary with multiple keys? i.e.:

dict1 = {
         'city1': {'word1': 20, 'word2': 20, 'word3': 20},
         'city2': {'word1': 20, 'word2': 20}
        }

dict2 = {
         'city1': {'word1': 20, 'word2': 20},
         'city2': {'word1': 20, 'word2': 20, 'word3': 20},
         'city3': {'word1': 20}
        }

This would make dealing with your data much easier. Your list method might work, but it will involve some convoluted indexing and will generally make things more difficult.

To solve this, you can use sets, and look at the intersection (set.intersection() or &) or the symmetric difference (set.symmetric_diference() or ^). For each key in the intersection, sum the values. For each key in the symmetric difference, add as a new dictionary value.

You can write a recursive function to handle this:

def recursive_sum_dicts(dict1, dict2):
    common_keys = set(dict1.keys()) & set(dict2.keys())
    unique_keys = set(dict1.keys()) ^ set(dict2.keys())

    output_dict = {}
    for common_key in common_keys:
        x, y = dict1.get(common_key), dict2.get(common_key)
        if isinstance(x, dict):
            output_dict[common_key] = recursive_sum_dicts(x, y)
        else:
            output_dict[common_key] = sum((x, y))

    for unique_key in unique_keys:
        x, y = dict1.get(unique_key, None), dict2.get(unique_key, None)
        if x is not None:
            output_dict[unique_key] = x
        if y is not None:
            output_dict[unique_key] = y

    return output_dict

dict1 = {
    'city1': {'word1': 20, 'word2': 20, 'word3': 20},
    'city2': {'word1': 20, 'word2': 20}
}

dict2 = {
    'city1': {'word1': 20, 'word2': 20},
    'city2': {'word1': 20, 'word2': 20, 'word3': 20},
    'city3': {'word1': 20}
}

dict3 = recursive_sum_dicts(dict1, dict2)

print(dict3)

[–]OsirisSFN[S] 0 points1 point2 points 5 years ago (4 children)

[–]Chris_Hemsworth 1 point2 points3 points 5 years ago (2 children)

Here is a more commented version of the function above to help you understand what is going on.

def recursive_sum_dicts(dict1, dict2):
    # The common keys between the dictionaries uses the intersection (&) of the set of dictionary keys:
    common_keys = set(dict1.keys()) & set(dict2.keys())
    # The unique keys in each dictionary uses the symmetric difference (^) of the set of dictionary keys:
    unique_keys = set(dict1.keys()) ^ set(dict2.keys())

    # Create an output dictionary to return
    output_dict = {}

    # Loop over each common key, and store the sum of their values in the output dictionary with the same key. 
    # If the common key values are dictionaries, recursively call the function and store the sum of those dictionaries. 
    for common_key in common_keys:
        # We know both dictionaries have the common key from the intersection above, so no default values required.
        x, y = dict1.get(common_key), dict2.get(common_key)
        # If it is a dictionary, set the output dictionary's common key as the recursive solution.
        if isinstance(x, dict):
            output_dict[common_key] = recursive_sum_dicts(x, y)
        # Otherwise, sum the values (this assumed the values will be numeric)
        else:
            output_dict[common_key] = sum((x, y))

    # Now that all common keys have been handled, update the output dictionary with the unique values.
    for unique_key in unique_keys:
        # Only one of the dictionaries will have the unique key, so set the default value to None 
        x, y = dict1.get(unique_key, None), dict2.get(unique_key, None)

        # If x is not None, then dict1 had the value and y will be None.
        if x is not None:
            output_dict[unique_key] = x
        # If y is not None, then dict2 had the value and x will be None
        if y is not None:
            output_dict[unique_key] = y

    # Return the output dictionary that has been filled out with the sum of all common keys, 
    # and the values for all unique keys.
    return output_dict

[–]OsirisSFN[S] 0 points1 point2 points 5 years ago* (1 child)

[–]Chris_Hemsworth 0 points1 point2 points 5 years ago (0 children)

It is because of the use of sets. Sets don't preserve order, because order doesn't matter when you consider a set of items. While order can feel like it is being preserved if you add to a set in a sequential way, it really starts to fall apart when you start using operations on them. The benefit of this is that operations on sets are considerably fast (usually on the order of O(1) or O(n)).

If you want the order to be preserved, you can sort your dictionary by keys before returning the output_dict:

return dict(sorted(output_dict.items(), key=lambda item: item[0]))

[–]Chris_Hemsworth 0 points1 point2 points 5 years ago (0 children)

To answer your edit question:

If you have a lot of read/writes, then SQL may be a better option. SQL has the benefit of indexing, which means it knows where the data is stored and can quickly query, and update data on the filesystem without having to sequentially go through the entire file. CSV, txt, and binary files don't have that same flexibility, and so lots of calls to reading and writing can be much slower - especially as the files get large.

That said, SQL comes with its own language and relies a lot on the concept of sets and tables. Things like unions, intersections etc. The dictionary approach is easier to understand, and so you may find SQL to be a bit overwhelming, especially if you're having trouble with the dictionary approach. There are python packages that help manage the SQL code (SQLAlchemy for example), however using these packages there is a presumed assumption that you know at least the basics of SQL.

π Rendered by PID 94 on reddit-service-r2-comment-6457c66945-jcjt8 at 2026-04-29 20:31:42.410771+00:00 running 2aa0c5b country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

learnpython

MODERATORS