Tags: python, pandas, dataframe, performance, adjacency-matrix

Efficiently create adjacency matrix from lists of neighbors from multiple CSVs


I have a large collection of CSVs, each containing an id column and a neighbors column, where each neighbors entry is a list of strings.

Values from id do not occur in any neighbors list. Values in id are unique per CSV, no CSV contains all the id values, and if two CSVs share an id value, then the rows refer to the same object.

I would like to create something akin to a bipartite adjacency matrix from these CSVs, with a row for each id value i, a column for each neighbors string j, and a 1 in cell (i, j) if any of the CSVs contains a row with id value i whose neighbors list contains string j.

The code below does what it's supposed to, but it takes a very long time.

Is there a more efficient way of getting the job done?

import pandas as pd

adjacency_matrix = pd.DataFrame()
list_of_csv_files = [csv_file1, csv_file2, ...]

for file in list_of_csv_files:
    df = pd.read_csv(file)
    df.set_index('id', inplace=True)

    # write a 1 into the matrix for every (id, neighbor) pair, cell by cell
    for i in df.index:
        for j in df.at[i, 'neighbors']:
            adjacency_matrix.at[i, j] = 1

Example:

Given that the CSVs are loaded into the list list_of_dataframes = [df1, df2] with

df1 = pd.DataFrame(data={'id': ['11', '12'], 'neighbors': [['a'], ['a', 'b']]})

    id  neighbors
0   11  [a]
1   12  [a, b]

df2 = pd.DataFrame(data={'id': ['11', '13'], 'neighbors': [['c'], ['d']]})

    id  neighbors
0   11  [c]
1   13  [d]

I seek the dataframe

    a   b   c   d
11  1   NaN 1   NaN
12  1   1   NaN NaN
13  NaN NaN NaN 1

Solution

  • Yes, there is a cleaner and more efficient way using concat, explode and get_dummies. You can try this:

    import pandas as pd

    list_of_csv_files = [csv_file1, csv_file2, ...]
    list_of_dfs = [pd.read_csv(file) for file in list_of_csv_files]

    # stack all CSVs, then get one row per (id, neighbor) pair
    out = pd.concat(list_of_dfs)
    out = out.explode('neighbors').drop_duplicates(ignore_index=True)
    out.set_index('id', inplace=True)

    # one indicator column per neighbor string, then collapse to one row per id
    out = pd.get_dummies(out, prefix='', prefix_sep='')
    out = out.groupby('id').sum()
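
    As a quick check, here is a sketch of the same steps run on the df1/df2 frames from the question. Note that explode needs the neighbors column to hold actual Python lists; if the lists were written to the CSVs as strings, they may need to be parsed back first, e.g. with ast.literal_eval.

    import pandas as pd

    # example frames from the question, with neighbors as real lists
    df1 = pd.DataFrame(data={'id': ['11', '12'], 'neighbors': [['a'], ['a', 'b']]})
    df2 = pd.DataFrame(data={'id': ['11', '13'], 'neighbors': [['c'], ['d']]})

    out = pd.concat([df1, df2])
    out = out.explode('neighbors').drop_duplicates(ignore_index=True)
    out.set_index('id', inplace=True)
    out = pd.get_dummies(out, prefix='', prefix_sep='')
    out = out.groupby('id').sum()
    print(out)
    #     a  b  c  d
    # id
    # 11  1  0  1  0
    # 12  1  1  0  0
    # 13  0  0  0  1

    The only difference from the desired output is that missing pairs come out as 0 instead of NaN; if NaN is really needed, something like out.where(out != 0) can be applied afterwards.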