python, statistics, analytics, hypothesis-test, kolmogorov-smirnov

Ways to determine the statistical significance between two independent datasets


Suppose A and B are two independent datasets, each with (say) 100 features. How do I perform hypothesis testing on these datasets to check whether they differ significantly?

I tried to write some code in Python. I have preprocessed both datasets: the data are tabular with continuous values, the columns are normalized, and the categorical features are one-hot encoded. Using the scipy.stats library, I performed Student's t-test on a single numerical column from each dataset, but I can't figure out how to run the test over the entire dataset.


Solution

  • The Kolmogorov-Smirnov test is a non-parametric statistical test that can be used to determine if two samples come from the same distribution.

    One approach you can take is, for each feature (column) of the datasets A and B, to perform a KS test checking whether the two columns come from the same distribution (using the scipy.stats.ks_2samp() function).

    The following code shows an example using a pair of 2-column datasets, A and B. The first feature (column) of A and B is sampled from the same (standard normal) distribution, but the second feature is sampled from different normal distributions (with different parameters).

    import numpy as np
    from scipy.stats import ks_2samp
    
    n = 100  # number of samples per dataset
    
    # both columns of A are drawn from the standard normal distribution N(0, 1)
    A = np.hstack((np.random.normal(loc=0, scale=1, size=n).reshape(-1, 1),
                   np.random.normal(loc=0, scale=1, size=n).reshape(-1, 1)))
    
    # column 0 of B matches A's distribution; column 1 is drawn from N(20, 5)
    B = np.hstack((np.random.normal(loc=0, scale=1, size=n).reshape(-1, 1),
                   np.random.normal(loc=20, scale=5, size=n).reshape(-1, 1)))
    

    If you plot histograms of the two features for both datasets, you will obtain a figure like the following:

    [Figure: histograms of feature columns 0 and 1 for datasets A and B]
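
    As a minimal sketch, such a figure can be produced with matplotlib as follows (the bin count, layout, and labels are my own choices; A and B are the arrays from the snippet above):

    import matplotlib.pyplot as plt
    
    # one subplot per feature column, overlaying the histograms of A and B
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for i, ax in enumerate(axes):
        ax.hist(A[:, i], bins=20, alpha=0.5, label='A')
        ax.hist(B[:, i], bins=20, alpha=0.5, label='B')
        ax.set_title(f'feature column {i}')
        ax.legend()
    plt.tight_layout()
    plt.show()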

    Clearly, the second feature's samples look as if they were drawn from different distributions. Let's verify this with the KS test.

    # run a two-sample KS test on each pair of corresponding columns
    for i in range(A.shape[1]):
        print(f'Kolmogorov-Smirnov test for feature column {i}')
        statistic, pvalue = ks_2samp(A[:,i], B[:,i])
        print(f"Test statistic: {statistic}")
        print(f"P-value: {pvalue}")
    
    # Kolmogorov-Smirnov test for feature column 0
    # Test statistic: 0.13
    # P-value: 0.36818778606286096  # can't reject H0
    
    # Kolmogorov-Smirnov test for feature column 1
    # Test statistic: 1.0
    # P-value: 2.2087606931995054e-59 # reject H0
    

    As can be seen from the above output, the KS test fails to reject the null hypothesis (that the samples come from the same distribution) for the first feature, but rejects it for the second feature, as expected.

    You can use the same approach on your 100-column datasets by comparing the corresponding columns pairwise, as sketched below. Note that when running 100 tests, a few small p-values can occur by chance alone, so you may want to adjust the significance level accordingly (e.g., with a Bonferroni correction).
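
    As a minimal sketch of that pairwise loop, assuming both datasets are held in pandas DataFrames with matching column names (the function name compare_datasets and the Bonferroni-adjusted threshold are my own assumptions, not part of the original answer):

    import pandas as pd
    from scipy.stats import ks_2samp
    
    def compare_datasets(df_a, df_b, alpha=0.05):
        """Run a two-sample KS test on each column shared by df_a and df_b."""
        columns = df_a.columns.intersection(df_b.columns)
        # Bonferroni correction: guard against false positives across many tests
        threshold = alpha / len(columns)
        results = []
        for col in columns:
            statistic, pvalue = ks_2samp(df_a[col].dropna(), df_b[col].dropna())
            results.append({'feature': col,
                            'statistic': statistic,
                            'pvalue': pvalue,
                            'reject_H0': pvalue < threshold})
        return pd.DataFrame(results)
    
    # example usage with the arrays A and B from above
    report = compare_datasets(pd.DataFrame(A), pd.DataFrame(B))
    print(report)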