I'm trying to create a nested dictionary in Python so that, given a list of strings, the dictionary records the number of occurrences of that string order.
For example, if the string list is:
["hey", "my", "name", "is"]
I would want the nested dictionary to look like:
{"hey": {"my": {"name": {"is": 1}}}}
I know I could probably use as key the whole list but I specifically want to separate the strings in the dictionary.
I also would want to approach this problem with a defaultdict
dictionary, not the Python ones, and preferably use a recursively defined defaultdict
.
This is what I tried:
from collections import defaultdict
nested_dict = lambda: defaultdict(nested_dict)
# Initialize ngrams as a nested defaultdict
ngrams = nested_dict()
# Function to update the nested defaultdict with the list of words
def update_ngrams(ngrams, words):
current_dict = ngrams
for word in words[:-1]:
current_dict = current_dict[word]
current_dict[words[-1]] += 1
# Example usage
update_ngrams(ngrams, ["My", "big", "cat"])
update_ngrams(ngrams, ["My", "big", "dog"])
but it gives me this error:
TypeError: unsupported operand type(s) for +=: 'collections.defaultdict' and 'int'
The expected output should be a map like this:
{"My": {"big": {"cat": 1, "dog": 1}}}
Here is a solution just using the standard dictionary. It creates a new dictionary for every layer except the last one, where it increments an integer.
def update_ngrams(ng: dict, words: list[str]):
n = len(words)
d = ng
for i, w in enumerate(words,1):
if i == n:
d.setdefault(w, 0)
d[w] += 1
else:
d = d.setdefault(w, {})
ngrams = {}
update_ngrams(ngrams, ["My", "big", "cat"])
update_ngrams(ngrams, ["My", "big", "dog"])
ngrams
# returns:
{'My': {'big': {'cat': 1, 'dog': 1}}}
The issue you are going to run into is that it does not handle cases where a leaf ('cat' or 'dog') key is also a branch. I.e.: ["My", "big", "dog", "Noodle"]
.
My suggestion would be to store the leaf occurrence as a dictionary as well with a specific key to indicate it is the leaf count. Here is an example using the underscore as the leaf counter.
def update_ngrams(ng: dict, words: list[str]):
n = len(words)
d = ng
for i, w in enumerate(words,1):
d = d.setdefault(w, {})
if i == n:
d.setdefault('_', 0)
d['_'] += 1
ngrams = {}
update_ngrams(ngrams, ["My", "big", "cat"])
update_ngrams(ngrams, ["My", "big", "dog"])
update_ngrams(ngrams, ["My", "big", "dog", "Noodle"])
ngrams
# returns:
{'My': {'big': {'cat': {'_': 1}, 'dog': {'_': 1, 'Noodle': {'_': 1}}}}}