pythondictionary

python dictionary count of unique values


I have a problem with counting distinct values for each key in Python.

I have a dictionary d like

[{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]

I need to print number of distinct values per each key individually.

That means I would want to print

abc 3
xyz 1
pqr 4

Please help.

Thank you


Solution

  • Over 6 years after answering, someone pointed out to me I misread the question. While my original answer (below) counts unique keys in the input sequence, you actually have a different count-distinct problem; you want to count values per key.

    To count unique values per key, exactly, you'd have to collect those values into sets first:

    values_per_key = {}
    for d in iterable_of_dicts:
        for k, v in d.items():
            values_per_key.setdefault(k, set()).add(v)
    counts = {k: len(v) for k, v in values_per_key.items()}
    

    which for your input, produces:

    >>> values_per_key = {}
    >>> for d in iterable_of_dicts:
    ...     for k, v in d.items():
    ...         values_per_key.setdefault(k, set()).add(v)
    ...
    >>> counts = {k: len(v) for k, v in values_per_key.items()}
    >>> counts
    {'abc': 3, 'xyz': 1, 'pqr': 4}
    

    We can still wrap that object in a Counter() instance if you want to make use of the additional functionality this class offers, see below:

    >>> from collections import Counter
    >>> Counter(counts)
    Counter({'pqr': 4, 'abc': 3, 'xyz': 1})
    

    The downside is that if your input iterable is very large the above approach can require a lot of memory. In case you don't need exact counts, e.g. when orders of magnitude suffice, there are other approaches, such as a hyperloglog structure or other algorithms that 'sketch out' a count for the stream.

    This approach requires you install a 3rd-party library. As an example, the datasketch project offers both HyperLogLog and MinHash. Here's a HLL example (using the HyperLogLogPlusPlus class, which is a recent improvement to the HLL approach):

    from collections import defaultdict
    from datasketch import HyperLogLogPlusPlus
    
    counts = defaultdict(HyperLogLogPlusPlus)
    
    for d in iterable_of_dicts:
        for k, v in d.items():
            counts[k].update(v.encode('utf8'))
    

    In a distributed setup, you could use Redis to manage the HLL counts.


    My original answer:

    Use a collections.Counter() instance, together with some chaining:

    from collections import Counter
    from itertools import chain
    
    counts = Counter(chain.from_iterable(e.keys() for e in d))
    

    This ensures that dictionaries with more than one key in your input list are counted correctly.

    Demo:

    >>> from collections import Counter
    >>> from itertools import chain
    >>> d = [{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]
    >>> Counter(chain.from_iterable(e.keys() for e in d))
    Counter({'pqr': 5, 'abc': 3, 'xyz': 1})
    

    or with multiple keys in the input dictionaries:

    >>> d = [{"abc":"movies", 'xyz': 'music', 'pqr': 'music'}, {"abc": "sports", 'pqr': 'movies'}, {"abc": "music", 'pqr': 'sports'}, {"pqr":"news"}, {"pqr":"sports"}]
    >>> Counter(chain.from_iterable(e.keys() for e in d))
    Counter({'pqr': 5, 'abc': 3, 'xyz': 1})
    

    A Counter() has additional, helpful functionality, such as the .most_common() method that lists elements and their counts in reverse sorted order:

    for key, count in counts.most_common():
        print '{}: {}'.format(key, count)
    
    # prints
    # 5: pqr
    # 3: abc
    # 1: xyz