pythondata-structures

Handling complex nested data structures with recursion – performance issues with deep nesting


I’m working on a Python project where I need to process a nested data structure. The structure consists of lists and dictionaries, and the nesting level can vary from a few levels to potentially hundreds. I need to flatten this data structure into a single list while preserving the values. However, I am facing performance issues when dealing with deep nesting.

Here is the simplified data structure I’m working with:

data = {
    "name": "John",
    "contacts": [
        {
            "type": "email",
            "value": "john@example.com",
        },
        {
            "type": "phone",
            "value": [
                {
                    "country": "US",
                    "number": "123-456-7890"
                },
                {
                    "country": "UK",
                    "number": "987-654-3210"
                }
            ]
        }
    ],
    "address": {
        "city": "New York",
        "postal_code": "10001",
        "coordinates": [
            {
                "lat": 40.7128,
                "lon": -74.0060
            }
        ]
    }
}

I need to create a function that will flatten this structure such that all values are extracted into a single list. The output for the above input would look something like:

["John", "email", "john@example.com", "phone", "123-456-7890", "US", "987-654-3210", "UK", "New York", "10001", 40.7128, -74.0060]

I’ve tried using recursion, but I’m running into issues with handling very deep structures. Here is my initial attempt:

def flatten(data):
    flat_list = []
    
    if isinstance(data, dict):
        for key, value in data.items():
            flat_list.extend(flatten(value))
    elif isinstance(data, list):
        for item in data:
            flat_list.extend(flatten(item))
    else:
        flat_list.append(data)
    
    return flat_list

flattened_data = flatten(data)
print(flattened_data)

This works fine for small and medium-sized structures, but when the nesting gets deeper (hundreds of levels deep), I run into recursion depth issues and performance bottlenecks.

What I’ve Tried:
Questions:
  1. How can I improve the recursion or refactor this code to handle much deeper structures efficiently?
  2. Is there an iterative way to flatten this data structure without running into recursion depth limitations?
  3. Are there any known libraries or patterns that can handle very deep and complex data structures like this more efficiently

The structure is dynamic and may not always follow the same pattern (dictionaries may not always contain the same keys, lists may not always contain the same types of data), so the function should be as generic as possible.


Solution

  • First of all, if you have nested lists with mostly 2 entries, and dicts with mostly 2 keys, and your nesting has an average depth of about one hundred levels deep, you have a number of values in the order of 2100, i.e. ~1030. Even if we only count 4 bytes per collected (leaf) value, that represents more volume than today's computers can store, and even more than the whole internet holds at the time of writing.

    Either your nesting is not really that deep, but your program suffers from infinite recursion because the data has cyclic references, or your hierarchy is really narrow where the number of long root-to-leaf paths is not that huge.

    You could avoid some allocation by using generators instead of collecting all values in a list.

    If the size of the call stack is still a problem, even when you have ensured your data does not have cyclic references, then you can always go for an iterative version:

    # helper function
    def get_iterator(data):
        if isinstance(data, list):
            return iter(data)
        elif isinstance(data, dict):
            return iter(data.values())
    
    def flatten(data):
        stack = [iter([data])]
        while stack:
            try:
                value = next(stack[-1])
                iterator = get_iterator(value)
                if iterator:
                    stack.append(iterator)
                else:
                    yield value
            except StopIteration:
                stack.pop()