I have selected a subset of numerical columns from a database and I want to iterate through the columns, selecting a target_column and comparing it with the result of a numerical operation between two other columns in the dataframe. However, I am unsure how to compare the result (e.g. col1 * col2 = target_column).
# For all possible combinations of numeric columns
for col1, col2 in combinations(numeric_cols, 2):
    # For a target column in numeric_cols
    for target_column in numeric_cols:
        # Skip if the target column is one of the relationship columns
        if target_column in (col1, col2):
            continue
Edit: I have worked something out, but I'm still unsure whether this is the most efficient way to do it:
import numpy as np
from itertools import combinations

def analyse_relationships(df):
    # select_dtypes returns a DataFrame containing only the numeric columns
    numeric_cols = df.select_dtypes(include=[np.number])
    threshold = 0.001
    relationships = []
    # For all possible combinations of numeric columns
    for col1, col2 in combinations(numeric_cols, 2):
        # For a target column in numeric_cols
        for target_column in numeric_cols:
            # Skip if the target column is one of the relationship columns
            if target_column in (col1, col2):
                continue
            # Calculate different operations
            product = numeric_cols[col1] * numeric_cols[col2]
            sum_cols = numeric_cols[col1] + numeric_cols[col2]
            diff = numeric_cols[col1] - numeric_cols[col2]
            if np.allclose(product, numeric_cols[target_column], rtol=threshold):
                relationships.append(f"{col1} * {col2} = {target_column}")
            elif np.allclose(sum_cols, numeric_cols[target_column], rtol=threshold):
                relationships.append(f"{col1} + {col2} = {target_column}")
            elif np.allclose(diff, numeric_cols[target_column], rtol=threshold):
                relationships.append(f"{col1} - {col2} = {target_column}")
    return relationships
To solve your problem I strongly suggest vectorizing your data and using as few Pandas operations as possible: a lot of operations are required, so the more we can rely solely on NumPy, the faster the code will run (NumPy's core is written in C).
Since there are multiple operations to take into account (+, -, *, /), we need to calculate each result and compare it with target_column, but we can be more space efficient by using a boolean mask (i.e. a NumPy array): a column that represents, as a boolean value, the expression target_column == col1 operation col2. Note that, since the df may contain floats, it is better to use numpy.isclose(), to which we can pass a tolerance for the float comparisons.
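As a quick toy illustration (not part of the solution below) of why an exact == comparison is fragile on floats while numpy.isclose() produces the mask we want:

```python
import numpy as np

# Toy data: two columns and a target that (almost) equals their product
col1 = np.array([1.0, 2.0, 3.0])
col2 = np.array([2.0, 2.0, 2.0])
target = np.array([2.0, 4.0, 6.0000001])

# Exact equality misses the last row because of the tiny float difference
mask_exact = (col1 * col2) == target
# numpy.isclose tolerates small float differences
mask_close = np.isclose(col1 * col2, target, atol=1e-6)

print(mask_exact)  # [ True  True False]
print(mask_close)  # [ True  True  True]
```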
Your code could be something along these lines (I included a little example at the end):
import numpy as np
import pandas as pd
from itertools import combinations

def detect_relations(df, numeric_cols, float_tolerance=1e-8):
    results = []
    ops = {
        "+": [(lambda x, y: x + y, "a + b")],  # No need to have `b + a` since sum is commutative
        "-": [
            (lambda x, y: x - y, "a - b"),
            (lambda x, y: y - x, "b - a")
        ],
        "*": [(lambda x, y: x * y, "a * b")],  # Same as sum, `a * b = b * a`
        "/": [
            # We need to ensure that we don't divide by 0
            (lambda x, y: np.divide(x, y, out=np.full_like(x, np.nan, dtype=float), where=(y != 0)), "a / b"),
            (lambda x, y: np.divide(y, x, out=np.full_like(x, np.nan, dtype=float), where=(x != 0)), "b / a")
        ],
    }
    # Iterating through every combination
    for col1, col2 in combinations(numeric_cols, 2):
        a = df[col1].values
        b = df[col2].values
        # Iterating through each possible operation
        for _, functions in ops.items():
            for func, op_name in functions:
                # Calculating the result of the operation `func` between col1 and col2
                val = func(a, b)
                # Comparing the result of `col1 operation col2` to the other numeric columns (skipping col1 and col2)
                for target in numeric_cols:
                    if target in (col1, col2):
                        continue
                    c = df[target].values
                    # Boolean mask comparing `col1 operation col2` with `target_column`
                    mask = np.isclose(val, c, atol=float_tolerance, equal_nan=False)
                    # Counting how many relations we found
                    matches = int(mask.sum())
                    total = len(df)
                    results.append({
                        "col1": col1,
                        "col2": col2,
                        "operation": op_name,
                        "target": target,
                        "matches": matches,
                        "total": total,
                        "pct_match": matches / total
                    })
    return pd.DataFrame(results)
# --- Example ---
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 2, 2, 2],
    "c": [2, 4, 6, 8],
    "d": [2, 0, 1, 8],
})
numeric_cols = ["a", "b", "c", "d"]
res = detect_relations(df, numeric_cols)
# Avoid printing combinations with no relation
print(res[res["matches"] > 0].sort_values("pct_match", ascending=False))
Output:
col1 col2 operation target matches total pct_match
6 a b a * b c 4 4 1.00
22 a c b / a b 4 4 1.00
46 b c b / a a 4 4 1.00
7 a b a * b d 2 4 0.50
48 b d a + b a 2 4 0.50
26 a d a - b b 2 4 0.50
58 b d b / a a 2 4 0.50
34 a d b / a b 2 4 0.50
3 a b a - b d 2 4 0.50
5 a b b - a d 1 4 0.25
0 a b a + b c 1 4 0.25
19 a c a * b d 1 4 0.25
18 a c a * b b 1 4 0.25
16 a c b - a b 1 4 0.25
11 a b b / a d 1 4 0.25
10 a b b / a c 1 4 0.25
30 a d a * b b 1 4 0.25
24 a d a + b b 1 4 0.25
23 a c b / a d 1 4 0.25
40 b c b - a a 1 4 0.25
35 a d b / a c 1 4 0.25
31 a d a * b c 1 4 0.25
44 b c a / b a 1 4 0.25
50 b d a - b a 1 4 0.25
56 b d a / b a 1 4 0.25
68 c d a / b a 1 4 0.25
70 c d b / a a 1 4 0.25
I'd like to point out that the code above could be slow when the number of elements is large, since it has to go through several nested cycles:

- the combinations of numeric columns, that is C(m, 2) = m(m-1)/2 ~ O(m^2);
- the target columns, i.e. O(m - 2) = O(m);
- numpy.isclose(), which cycles through every one of the n elements, so it has complexity O(n).

Considering those cycles, it's easy to see that the code has time complexity O(m^2 * m * n) = O(n * m^3), so the time required will increase cubically w.r.t. the number of numerical columns and linearly w.r.t. the number of elements in each column.
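If that becomes a problem, one way to shave off a constant factor is to drop the innermost Python loop and compare each result against all candidate targets in a single broadcast call. The asymptotic complexity is unchanged, but the per-target comparison then runs entirely inside NumPy. This is only a sketch, under the assumption that every numeric column can be cast to float; count_product_matches is a hypothetical helper covering just the product operation:

```python
import numpy as np
import pandas as pd
from itertools import combinations

def count_product_matches(df, numeric_cols, float_tolerance=1e-8):
    X = df[numeric_cols].to_numpy(dtype=float)  # shape (n, m)
    results = []
    for i, j in combinations(range(len(numeric_cols)), 2):
        val = X[:, i] * X[:, j]                 # shape (n,)
        # Broadcast the (n,) result against every column at once -> (n, m) mask
        mask = np.isclose(val[:, None], X, atol=float_tolerance)
        matches = mask.sum(axis=0)              # per-target match counts
        for k, target in enumerate(numeric_cols):
            if k in (i, j):
                continue
            results.append((numeric_cols[i], numeric_cols[j], target, int(matches[k])))
    return results

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 2, 2, 2],
    "c": [2, 4, 6, 8],
    "d": [2, 0, 1, 8],
})
print(count_product_matches(df, ["a", "b", "c", "d"]))
# ('a', 'b', 'c', 4) appears in the output, matching the `a * b = c` row above
```

The same broadcasting idea extends to the other operations by stacking their results before the comparison.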