Run a query against all values within nested lists of a multi-valued dictionary


I have a 'collections.defaultdict' (see x below) that is a multi-valued dictionary. All values associated with each unique key are stored in a list.

    >>> x
    defaultdict(<type 'list'>, {'a': ['aa', 'ab', 'ac'], 'b': ['ba', 'bc'], 'c': ['ca', 'cb', 'cc', 'cd']})

I want to use the Python fuzzywuzzy package in order to search a target string against all the values nested in the multi-valued dictionary and return the top 5 matches based on fuzzywuzzy's built-in edit distance formula.

    from fuzzywuzzy import fuzz
    from fuzzywuzzy import process
    query = 'bc'
    choices = x
    result = process.extract(query, choices, limit=5)

I then want to take the closest match (the value with the highest fuzz ratio score) and identify which key it is associated with. In this example, the closest match is of course 'bc' and the associated key is 'b'.

My question is: How do I run the fuzzywuzzy query against all of the values within the nested lists of the dictionary? When I run the fuzzywuzzy process above, I get a TypeError: expected string or buffer.


Solution

  • Passing the dictionary itself fails because its values are lists, not strings. To collect all the values from those lists into one flat sequence, add
    from itertools import chain to your imports and change the line

    choices = x
    

    to

    choices = chain.from_iterable(x.values())
    

    If your real data has overlapping values, consider converting that to a set so each value is scored only once.

    result:

    [('bc', 100), ('ba', 50), ('ca', 50), ('cb', 50), ('cc', 50)]
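
For the second half of the question (finding which key the closest match belongs to), one option is to build a reverse index from each value back to its keys while flattening. The following is only a minimal sketch: the names value_to_keys and score are made up for illustration, and score uses difflib from the standard library as a stand-in for fuzzywuzzy's ratio so the snippet runs without extra packages.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import chain

x = defaultdict(list, {'a': ['aa', 'ab', 'ac'],
                       'b': ['ba', 'bc'],
                       'c': ['ca', 'cb', 'cc', 'cd']})

# Flatten every list of values into one list of candidate strings.
choices = list(chain.from_iterable(x.values()))

# Reverse index (hypothetical helper): value -> keys whose list contains it.
# A list is used because the same value may appear under several keys.
value_to_keys = defaultdict(list)
for key, values in x.items():
    for value in values:
        value_to_keys[value].append(key)


def score(a, b):
    # Stand-in scorer (assumption): difflib similarity scaled to 0-100,
    # used here in place of a fuzzywuzzy scorer.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


query = 'bc'
# Pick the candidate with the highest score, then look up its key(s).
best_value = max(choices, key=lambda choice: score(query, choice))
best_keys = value_to_keys[best_value]
print(best_value, best_keys)
```

With fuzzywuzzy installed, the max(...) line could presumably be replaced by process.extractOne(query, choices), which returns the best choice together with its score; the reverse lookup stays the same.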