pythondataframeloopsnested-loops

Finding Numerical Relationships between Columns


I have selected a subset of numerical columns from a database and I want to iterate through the columns selecting a target_column and comparing it with the result of a numerical operation between two other columns in the dataframe. However, I am unsure as to how to compare the result (e.g. col1 * col2 = target_column).

# For all possible combinations of numeric columns
for col1, col2 in combinations(numeric_cols, 2):
    # For a target column in numeric_columns
    for target_column in numeric_cols:
        # Skip if the target column is one of the relationship columns
        if target_column in (col1, col2):
            continue

Edit: I have worked something out, but I'm still unsure if this is the most efficient way to do it

def analyse_relationships(df):

numeric_cols = df.select_dtypes(include=[np.number])
threshold = 0.001
relationships = []

# For all possible combinations of numeric columns
for col1, col2 in combinations(numeric_cols, 2):
    # For a target column in numeric_columns
    for target_column in numeric_cols:
        # Skip if the target column is one of the relationship columns
        if target_column in (col1, col2):
            continue

        # Calculate different operations
        product = numeric_cols[col1] * numeric_cols[col2]
        sum_cols = numeric_cols[col1] + numeric_cols[col2]
        diff = numeric_cols[col1] - numeric_cols[col2]

        if np.allclose(product, numeric_cols[target_column], rtol=threshold):
            relationships.append(f"{col1} * {col2} = {target_column}")
        elif np.allclose(sum_cols, numeric_cols[target_column], rtol=threshold):
            relationships.append(f"{col1} + {col2} = {target_column}")
        elif np.allclose(diff, numeric_cols[target_column], rtol=threshold):
            relationships.append(f"{col1} - {col2} = {target_column}")

Solution

  • To solve your problem I strongly suggest you to vectorize your data and use as few Pandas operations as possible, since a lot of operations are required and, consequently, the more we can rely solely on NumPy the faster the code will run (NumPy's core is written in C).

    Since there are multiple operations to take in account (+, -, *, /) we need to calculate each result and compare it with target_column, but we can be more space efficient by using a boolean mask (i.e. a NumPy array), that is a column that represents, as a boolean value, the expression target_column == col1 operation col2. Note that, having possible float in the df, it's actually better to use numpy.isclose() to which we can give a treshold for the float operations.

    Your code could be something along these lines (I included a little example at the end):

    import numpy as np
    import pandas as pd
    from itertools import combinations
    
    def detect_relations(df, numeric_cols, float_tolerance=1e-8):
        results = []
        ops = {
            "+": [(lambda x, y: x + y, "a + b")],   # No need to have `b + a` since sum is commutative
            "-": [
                (lambda x, y: x - y, "a - b"),
                (lambda x, y: y - x, "b - a")
            ],
            "*": [(lambda x, y: x * y, "a * b")],   # Same as sum, `a * b = b * a`
            "/": [
                # We need to ensure that we don't divide by 0
                (lambda x, y: np.divide(x, y, out=np.full_like(x, np.nan, dtype=float), where=(y != 0)), "a / b"),
                (lambda x, y: np.divide(y, x, out=np.full_like(x, np.nan, dtype=float), where=(x != 0)), "b / a")
            ],
        }
        # Iterating trough every combination
        for col1, col2 in combinations(numeric_cols, 2):
            a = df[col1].values
            b = df[col2].values
            # Iterating trough each possible operation
            for _, functions in ops.items():
                for func, op_name in functions:
                    # Calculating the result of the operation `func` between col1 and col2
                    val = func(a, b)
                    # Confronting the result of `col1 operation col2` to the other numeric columns (avoiding col1 and col2)
                    for target in numeric_cols:
                        if target in (col1, col2):
                            continue
                        c = df[target].values
                        # We get a boolean mask with the comparison between `col1 operation col2` and `target_column`
                        mask = np.isclose(val, c, atol=float_tolerance, equal_nan=False)
                        # Counting how many relations we found
                        matches = int(mask.sum())
                        total = len(df)
                        
                        results.append({
                            "col1": col1,
                            "col2": col2,
                            "operation": op_name,
                            "target": target,
                            "matches": matches,
                            "total": total,
                            "pct_match": matches / total
                        })
                    
        return pd.DataFrame(results)
    
    
    # --- Example---
    df = pd.DataFrame({
        "a": [1, 2, 3, 4],
        "b": [2, 2, 2, 2],
        "c": [2, 4, 6, 8],
        "d": [2, 0, 1, 8],
    })
    
    numeric_cols = ["a", "b", "c", "d"]
    res = detect_relations(df, numeric_cols)
    # Avoid to print combinations with no relation
    print(res[res["matches"] > 0].sort_values("pct_match", ascending=False))
    

    Output:

       col1 col2 operation target  matches  total  pct_match
    6     a    b     a * b      c        4      4       1.00
    22    a    c     b / a      b        4      4       1.00
    46    b    c     b / a      a        4      4       1.00
    7     a    b     a * b      d        2      4       0.50
    48    b    d     a + b      a        2      4       0.50
    26    a    d     a - b      b        2      4       0.50
    58    b    d     b / a      a        2      4       0.50
    34    a    d     b / a      b        2      4       0.50
    3     a    b     a - b      d        2      4       0.50
    5     a    b     b - a      d        1      4       0.25
    0     a    b     a + b      c        1      4       0.25
    19    a    c     a * b      d        1      4       0.25
    18    a    c     a * b      b        1      4       0.25
    16    a    c     b - a      b        1      4       0.25
    11    a    b     b / a      d        1      4       0.25
    10    a    b     b / a      c        1      4       0.25
    30    a    d     a * b      b        1      4       0.25
    24    a    d     a + b      b        1      4       0.25
    23    a    c     b / a      d        1      4       0.25
    40    b    c     b - a      a        1      4       0.25
    35    a    d     b / a      c        1      4       0.25
    31    a    d     a * b      c        1      4       0.25
    44    b    c     a / b      a        1      4       0.25
    50    b    d     a - b      a        1      4       0.25
    56    b    d     a / b      a        1      4       0.25
    68    c    d     a / b      a        1      4       0.25
    70    c    d     b / a      a        1      4       0.25
    

    Complexity analysis

    I'd like to point out that the code above has a certain time complexity and could be slow when the number of elements is large.

    The code above has to go trough a number of cycles:

    Considering those cycles, it's easy to see how the code has time complexity of O(m^2 * m * n) = O(n * m^3), so the time required will increase cubically w.r.t. the number of numerical columns and linearly w.r.t. the number of elements in each column.