python pandas dataframe dictionary networkx

Efficient way to create a dict of dicts from a pandas DataFrame


I have a pandas dataframe of the following structure:

import pandas as pd
d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1], 'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d, columns=['I', 'X', 'Y', 'Z', 'W'])
df.set_index('I', inplace=True, drop=True)

I need to create a dict of dicts that records every existing edge between nodes (an edge exists wherever the value is nonzero):

{'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'Y': {1}, 'X': {1}}}

I need it to build a network graph with the NetworkX library and run some calculations on it. Obviously, I could loop over every cell in the DataFrame, but my data is quite large and that would be inefficient. I'm looking for a better way, possibly using vectorization and/or a list comprehension. I've tried a list comprehension but I'm stuck and cannot make it work. Can anyone suggest a more efficient way to do this?


Solution

  • You can do this by combining df.iterrows() with a dictionary comprehension. iterrows() is not vectorized (it builds a Series for every row), but it avoids hand-written nested loops and is usually adequate for moderately sized frames. For example, you could write:

    edge_dictionary = {
        node: {attribute: {weight} for attribute, weight in attributes.items() if weight != 0}
        for node, attributes in df.iterrows()
    }
    

    If your DataFrame is very large and you’re concerned about performance, another approach is to first convert it into a plain dictionary of dictionaries using df.to_dict(orient='index') and then filter out the zeros. That would look like this:

    data_dictionary = df.to_dict(orient='index')
    edge_dictionary = {
        node: {attribute: {weight} for attribute, weight in connections.items() if weight != 0}
        for node, connections in data_dictionary.items()
    }
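
If you want the zero filtering itself to happen inside pandas, one more option is to reshape the frame with df.stack() and apply a boolean mask before building the dictionary. This is only a sketch on the sample data from the question. The names stacked, nonzero and weighted_edges are mine, and I have not benchmarked it against the two versions above, so time it on your own data before relying on it.

```python
import pandas as pd

# Sample data from the question.
d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1],
     'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d).set_index('I')

# stack() turns the frame into a Series keyed by (node, attribute),
# so the zero filter becomes a single boolean mask.
stacked = df.stack()
nonzero = stacked[stacked != 0]

# Rebuild the nested structure; int() converts NumPy integers to plain ints.
edge_dictionary = {}
for (node, attribute), weight in nonzero.items():
    edge_dictionary.setdefault(node, {})[attribute] = {int(weight)}

# The same Series also gives a flat list of (source, target, weight) tuples.
weighted_edges = [(node, attribute, int(weight))
                  for (node, attribute), weight in nonzero.items()]
```

Since the end goal is a NetworkX graph, the flat weighted_edges list may be the more convenient shape: (u, v, weight) tuples are what Graph.add_weighted_edges_from accepts, whereas the set-valued nested dict would need converting first.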