python, statistics, analytics, hypothesis-test, kolmogorov-smirnov

Ways to determine the statistical significance between two independent datasets


Suppose A and B are two independent datasets, each with (say) 100 features. How do I perform hypothesis testing on these datasets to check whether they differ significantly?

I tried to write some code in Python. I have preprocessed both datasets: the data are tabular with continuous values, the columns are normalized, and the categorical features are one-hot encoded. Using the scipy.stats library, I performed Student's t-test on a single numerical column from each dataset, but I can't figure out how to run the test over the entire dataset.


Solution

  • The Kolmogorov-Smirnov test is a non-parametric statistical test that can be used to determine if two samples come from the same distribution.

    One approach you can take is, for each feature (column) of the datasets A and B, to perform a KS test checking whether the two columns come from the same distribution (using the scipy.stats.ks_2samp() function).

    The following code shows an example using a pair of 2-column datasets, A and B. The first feature (column) of A and B is sampled from the same (standard normal) distribution, but the second feature is sampled from different normal distributions (with different parameters).

    import numpy as np
    from scipy.stats import ks_2samp
    
    n = 100  # number of samples per dataset
    
    # both columns of A are drawn from the standard normal distribution N(0, 1)
    A = np.hstack((np.random.normal(loc=0, scale=1, size=n).reshape(-1, 1),
                   np.random.normal(loc=0, scale=1, size=n).reshape(-1, 1)))
    
    # column 0 of B matches A's distribution; column 1 is drawn from N(20, 5)
    B = np.hstack((np.random.normal(loc=0, scale=1, size=n).reshape(-1, 1),
                   np.random.normal(loc=20, scale=5, size=n).reshape(-1, 1)))
    

    If you plot histograms of the two features for both datasets, you will obtain a figure like the following:

    [Figure: histograms of feature columns 0 and 1 for datasets A and B]
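
    As a minimal sketch, such a figure can be produced with matplotlib as follows (the bin count, layout, and labels are my own choices; A and B are the arrays from the snippet above):

    import matplotlib.pyplot as plt
    
    # one subplot per feature column, overlaying the histograms of A and B
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for i, ax in enumerate(axes):
        ax.hist(A[:, i], bins=20, alpha=0.5, label='A')
        ax.hist(B[:, i], bins=20, alpha=0.5, label='B')
        ax.set_title(f'feature column {i}')
        ax.legend()
    plt.tight_layout()
    plt.show()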

    Clearly, the second feature's samples look as if they were drawn from different distributions. Let's verify this with the KS test.

    # run a two-sample KS test on each pair of corresponding columns
    for i in range(A.shape[1]):
        print(f'Kolmogorov-Smirnov test for feature column {i}')
        statistic, pvalue = ks_2samp(A[:,i], B[:,i])
        print(f"Test statistic: {statistic}")
        print(f"P-value: {pvalue}")
    
    # Kolmogorov-Smirnov test for feature column 0
    # Test statistic: 0.13
    # P-value: 0.36818778606286096  # can't reject H0
    
    # Kolmogorov-Smirnov test for feature column 1
    # Test statistic: 1.0
    # P-value: 2.2087606931995054e-59 # reject H0
    

    As can be seen from the above output, the KS test fails to reject the null hypothesis (that the samples come from the same distribution) for the first feature, but rejects it for the second feature, as expected.

    You can use the same approach on your 100-column datasets by comparing the corresponding columns pairwise, as sketched below. Note that when running 100 tests, a few small p-values can occur by chance alone, so you may want to adjust the significance level accordingly (e.g., with a Bonferroni correction).
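
    As a minimal sketch of that pairwise loop, assuming both datasets are held in pandas DataFrames with matching column names (the function name compare_datasets and the Bonferroni-adjusted threshold are my own assumptions, not part of the original answer):

    import pandas as pd
    from scipy.stats import ks_2samp
    
    def compare_datasets(df_a, df_b, alpha=0.05):
        """Run a two-sample KS test on each column shared by df_a and df_b."""
        columns = df_a.columns.intersection(df_b.columns)
        # Bonferroni correction: guard against false positives across many tests
        threshold = alpha / len(columns)
        results = []
        for col in columns:
            statistic, pvalue = ks_2samp(df_a[col].dropna(), df_b[col].dropna())
            results.append({'feature': col,
                            'statistic': statistic,
                            'pvalue': pvalue,
                            'reject_H0': pvalue < threshold})
        return pd.DataFrame(results)
    
    # example usage with the arrays A and B from above
    report = compare_datasets(pd.DataFrame(A), pd.DataFrame(B))
    print(report)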