pythontriedefaultdict

Python `collections.defaultdict` for the same class


I try to use a Trie data structure for some coding problem. For each node in a trie, you typically put a a list of reference of its children. So, I thought about using defaultdict to create a default empty trie node if some children does not exists in a lookup. However, I don't know how to use defaultdict to refer to the class that enclose it.

I tried two methods, which were both failed. The following is what I tried.

from dataclasses import dataclass
from collections import defaultdict

@dataclass   
class TrieNode():
    is_word = False
    children = defaultdict("TrieNode")

The code above produce

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 4, in TrieNode
TypeError: first argument must be callable or None
@dataclass   
class TrieNode():
    is_word = False
    children = defaultdict(TrieNode)

The above will produce

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 4, in TrieNode
NameError: name 'TrieNode' is not defined

My question is about how do you use defaultdict to implement this elegantly. Thank you very much in advance.


Solution

  • Your second approach with children = defaultdict(TrieNode) is closer to correct, since defaultdict needs the constructor for TrieNode in order to populate it with TrieNodes - the other approach passes a string where a callable is expected. Your problem is being caused by the fact that you are accessing the name TrieNode before the class has finished being created, giving the NameError. To fix this you can use children = defaultdict(lambda: TrieNode()). This way, the name TrieNode is only looked up when the lambda function is called.

    However, this leads us to another bug! Take this code:

    class Thingy:
      pass
      
    @dataclass
    class MyDataclass:
      my_thingy = Thingy()
    

    A line my_thingy = Thingy() inside a dataclass definition will create an instance of Thingy, and use it as the default value for the my_thingy attribute - all new instances of MyDataclass will use this same instance of Thingy unless they specify a different value in the constructor:

    class Thingy:
      def __init__(self):
        print("a Thingy is born")
    
    print("Thingy has been defined")
      
    @dataclass
    class MyDataclass:
      my_thingy = Thingy()
    
    print("MyDataclass has been defined, but no instances")
    a = MyDataclass()
    b = MyDataclass()
    print("Are they the same instance?", a.my_thingy is b.my_thingy)
    

    Output:

    Thingy has been defined
    a Thingy is born
    MyDataclass has been defined, but no instances
    Are they the same instance? True
    

    Notice how "a Thingy is born" is only printed once, during the definition of the MyDataclass class, not when a new instance of MyDataclass is created.

    For a trie, you want every node to have its own dictionary of children, and with this approach, modifying the children dictionary for one node will modify it for them all because all their dictionaries would be the same object. I would reccomend you use dataclass.field to create a new dictionary for each TrieNode, like so:

    from dataclasses import dataclass, field
    from collections import defaultdict
    
    @dataclass
    class TrieNode:
      is_word = False
      children: defaultdict[str, 'TrieNode'] = field(
          default_factory=lambda: defaultdict(TrieNode)
      )