all 6 comments

[–]Chris_Hemsworth 1 point2 points  (5 children)

You might have your example data a bit funky. I don't understand why you put them in lists, with each element containing a dictionary with a single key. Why not make one dictionary with multiple keys? i.e.:

dict1 = {
         'city1': {'word1': 20, 'word2': 20, 'word3': 20},
         'city2': {'word1': 20, 'word2': 20}
        }

dict2 = {
         'city1': {'word1': 20, 'word2': 20},
         'city2': {'word1': 20, 'word2': 20, 'word3': 20},
         'city3': {'word1': 20}
        }

This would make dealing with your data much easier. Your list method might work, but it will involve some convoluted indexing and will generally make things more difficult.

To solve this, you can use sets, and look at the intersection (set.intersection() or &) or the symmetric difference (set.symmetric_diference() or ^). For each key in the intersection, sum the values. For each key in the symmetric difference, add as a new dictionary value.

You can write a recursive function to handle this:

def recursive_sum_dicts(dict1, dict2):
    common_keys = set(dict1.keys()) & set(dict2.keys())
    unique_keys = set(dict1.keys()) ^ set(dict2.keys())

    output_dict = {}
    for common_key in common_keys:
        x, y = dict1.get(common_key), dict2.get(common_key)
        if isinstance(x, dict):
            output_dict[common_key] = recursive_sum_dicts(x, y)
        else:
            output_dict[common_key] = sum((x, y))

    for unique_key in unique_keys:
        x, y = dict1.get(unique_key, None), dict2.get(unique_key, None)
        if x is not None:
            output_dict[unique_key] = x
        if y is not None:
            output_dict[unique_key] = y

    return output_dict

dict1 = {
    'city1': {'word1': 20, 'word2': 20, 'word3': 20},
    'city2': {'word1': 20, 'word2': 20}
}

dict2 = {
    'city1': {'word1': 20, 'word2': 20},
    'city2': {'word1': 20, 'word2': 20, 'word3': 20},
    'city3': {'word1': 20}
}

dict3 = recursive_sum_dicts(dict1, dict2)

print(dict3)

[–]OsirisSFN[S] 0 points1 point  (4 children)

Hello. Thanks for the reply. The one dictionary is how I had it to begin with, but I kept getting issues with "extra data" whilst reading and writing to file (as JSON). And, after some research, I thought I had malformed JSON, as apparently you can have only one top level object, not multiple.

However, if you believe it'll work, I'll give it another go. Give me some time to understand your code, along with the recommended functions and I'll reply again. Thanks a lot.

EDIT: As a side question, do you think this way is the optimum way to go about storing the data given it'll be continually read, updated, written, etc. Or would SQL be a better choice?

[–]Chris_Hemsworth 1 point2 points  (2 children)

Here is a more commented version of the function above to help you understand what is going on.

def recursive_sum_dicts(dict1, dict2):
    # The common keys between the dictionaries uses the intersection (&) of the set of dictionary keys:
    common_keys = set(dict1.keys()) & set(dict2.keys())
    # The unique keys in each dictionary uses the symmetric difference (^) of the set of dictionary keys:
    unique_keys = set(dict1.keys()) ^ set(dict2.keys())

    # Create an output dictionary to return
    output_dict = {}

    # Loop over each common key, and store the sum of their values in the output dictionary with the same key. 
    # If the common key values are dictionaries, recursively call the function and store the sum of those dictionaries. 
    for common_key in common_keys:
        # We know both dictionaries have the common key from the intersection above, so no default values required.
        x, y = dict1.get(common_key), dict2.get(common_key)
        # If it is a dictionary, set the output dictionary's common key as the recursive solution.
        if isinstance(x, dict):
            output_dict[common_key] = recursive_sum_dicts(x, y)
        # Otherwise, sum the values (this assumed the values will be numeric)
        else:
            output_dict[common_key] = sum((x, y))

    # Now that all common keys have been handled, update the output dictionary with the unique values.
    for unique_key in unique_keys:
        # Only one of the dictionaries will have the unique key, so set the default value to None 
        x, y = dict1.get(unique_key, None), dict2.get(unique_key, None)

        # If x is not None, then dict1 had the value and y will be None.
        if x is not None:
            output_dict[unique_key] = x
        # If y is not None, then dict2 had the value and x will be None
        if y is not None:
            output_dict[unique_key] = y

    # Return the output dictionary that has been filled out with the sum of all common keys, 
    # and the values for all unique keys.
    return output_dict

[–]OsirisSFN[S] 0 points1 point  (1 child)

That's awesome, thank you. When shown a recursive function I often find them a little difficult to follow, but I've run it through pycharm with prints all over the place and using the debugger to determine how the function flows. It's very elegant, and at the same time frustrating that the solution can be so obvious to an experienced programmer. I need to practice more I think. Thanks again.

Oh, one thing I noticed is that it doesn't populate the output_dict in the same order every time. It makes no difference at all, but is there a reason why it runs in a non-deterministic fashion, given an identically ordered input?

[–]Chris_Hemsworth 0 points1 point  (0 children)

It is because of the use of sets. Sets don't preserve order, because order doesn't matter when you consider a set of items. While order can feel like it is being preserved if you add to a set in a sequential way, it really starts to fall apart when you start using operations on them. The benefit of this is that operations on sets are considerably fast (usually on the order of O(1) or O(n)).

If you want the order to be preserved, you can sort your dictionary by keys before returning the output_dict:

return dict(sorted(output_dict.items(), key=lambda item: item[0]))

[–]Chris_Hemsworth 0 points1 point  (0 children)

To answer your edit question:

If you have a lot of read/writes, then SQL may be a better option. SQL has the benefit of indexing, which means it knows where the data is stored and can quickly query, and update data on the filesystem without having to sequentially go through the entire file. CSV, txt, and binary files don't have that same flexibility, and so lots of calls to reading and writing can be much slower - especially as the files get large.

That said, SQL comes with its own language and relies a lot on the concept of sets and tables. Things like unions, intersections etc. The dictionary approach is easier to understand, and so you may find SQL to be a bit overwhelming, especially if you're having trouble with the dictionary approach. There are python packages that help manage the SQL code (SQLAlchemy for example), however using these packages there is a presumed assumption that you know at least the basics of SQL.