pandasdataframedata-sciencefeature-selectionyellowbrick

Pandas dataframe divide features to group of high correlation


I have a dataframe with over 280 features. I ran correlation map to detect groups of features that are highly correlated: enter image description here Now, I want to divide the features to groups, such that each group will be a "red zone", meaning each group will have features that are all have correlation >0.5 with each other.

How can it be done?

Thanks


Solution

  • Disclaimer:

    1. Visualization is not addressed in this solution. Only groups were found.
    2. The solution is known to be NP-hard, so mind efficiency problems.

    Theory

    The problem is essentially a clique problem in graph theory, which means finding all the complete subgraphs in a given graph (with nodes > 2).

    Imagine a graph that all the features are nodes and pairs of features satisfying corr > 0.5 are edges. Then the task of finding all "groups" requested can simply translates into "finding all complete subgraphs in the graph".

    Code

    The code uses networkx.algorithms.find_cliques for the search task, which implements Bron–Kerbosch algorithm according to the docs.

    The code conprises of two parts. The first part extract the edges using np.triu (modified from this post) and the second part feeds the edge list into networkx.

    The Coorelation Matrix

    Feature [A,B,C] and [C,D,E] are closely correlated respectively, but not between [A,B] and [D,E].

    np.random.seed(111)  # reproducibility
    x = np.random.normal(0, 1, 100)
    y = np.random.normal(0, 1, 100)
    a = x
    b = x + np.random.normal(0, .5, 100)
    c = x + y
    d = y + np.random.normal(0, .5, 100)
    e = y + np.random.normal(0, .5, 100)
    
    df = pd.DataFrame({"A":a, "B":b, "C":c, "D":d, "E":e})
    corr = df.corr()
    
    corr
    Out[24]: 
              A         B         C         D         E
    A  1.000000  0.893366  0.677333 -0.078369 -0.090510
    B  0.893366  1.000000  0.577459 -0.072025 -0.079855
    C  0.677333  0.577459  1.000000  0.587695  0.579891
    D -0.078369 -0.072025  0.587695  1.000000  0.777803
    E -0.090510 -0.079855  0.579891  0.777803  1.000000
    

    Part 1

    # keep only upper triangle elements (excluding diagonal elements)
    mask_keep = np.triu(np.ones(corr.shape), k=1).astype('bool').reshape(corr.size)
    # melt (unpivot) the dataframe and apply mask
    sr = corr.stack()[mask_keep]
    # filter and get names
    edges = sr[sr > 0.5].reset_index().values[:, :2]
    
    edges
    Out[25]: 
    array([['A', 'B'],
           ['A', 'C'],
           ['B', 'C'],
           ['C', 'D'],
           ['C', 'E'],
           ['D', 'E']], dtype=object)
    

    Part 2

    import networkx as nx
    g = nx.from_edgelist(edges)
    ls_cliques = []
    for clique in nx.algorithms.find_cliques(g):
        ls_cliques.append(clique)
    
    # result
    ls_cliques
    Out[26]: [['C', 'A', 'B'], ['C', 'D', 'E']]