python, pandas, statistics, simulation, ab-testing

A/B testing using a t-test


I am running some simulations to test changes to the current model; let's call the changed version the new algorithm. My model predicts the route for a particular transaction between two possible routes. The success rate is defined as total successes / total transactions. The dataframe below has two columns, old and new: old contains daily success rates for 14 days under the old algorithm, and new contains daily success rates for 14 days under the new algorithm.

Q1. I want to reach a conclusion as to whether the new algorithm is better than the old one. I could simply compare the means over the 14 days, but I want to back this up with a statistical test. I have written the code below, but if I interchange the new and old columns it still yields the same p-value. I want to conclude that new is better than old, whereas this test seems to tell me only whether the results from the two algorithms differ significantly from each other. I need some help reaching a conclusion.

Q2. Can I get a confidence interval within which my result (the difference between old and new) is likely to lie?

import pandas as pd
from scipy import stats

data = pd.DataFrame({
    'old': [74.9254,73.7721,73.6018,68.6855,63.4666,63.9204,70.6977,62.6488,67.8088,70.2274,71.1197,64.8925,73.1113,70.7065],  # Replace with your old algorithm results
    'new': [74.8419,73.7548,73.6677,68.9352,63.8387,64.1143,70.9533,62.6026,67.9586,70.7,71.1263,65.1053,72.9996,70.5899],
})

# Perform a paired t-test
t_statistic, p_value = stats.ttest_rel(data['new'], data['old'])

# Define your significance level (alpha)
alpha = 0.05

# Print the t-statistic and p-value
print(f"Paired t-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Compare p-value to the significance level
if p_value < alpha:
    print("Reject the null hypothesis. The new algorithm is performing significantly better.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the algorithms.")

Solution

  • You've correctly identified that you should use a paired t-test for this comparison since you're comparing two related groups. I'll address your questions in order:

    Q1: You're getting the same p-value when you swap 'new' and 'old' because ttest_rel runs a two-sided test by default: the p-value tests the null hypothesis that the mean difference is zero, in either direction. Swapping the columns flips the sign of the t-statistic, but the p-value stays the same. What matters for your question is the sign of the t-statistic: if it's positive, the new algorithm has a higher mean than the old one; if it's negative, the old algorithm has the higher mean.

    So, modify the decision logic in this way:

    if p_value < alpha:
        if t_statistic > 0:
            print("Reject the null hypothesis. The new algorithm is performing significantly better.")
        else:
            print("Reject the null hypothesis. The old algorithm is performing significantly better.")
    else:
        print("Fail to reject the null hypothesis. There is no significant difference between the algorithms.")
    

    Q2: Yes, you can compute a confidence interval for the mean difference. For example:

    # Compute the mean difference and its standard error
    difference = data['new'] - data['old']
    se = difference.std() / (len(difference) ** 0.5)

    # Critical t value for a two-sided (1 - alpha) confidence interval
    t_crit = stats.t.ppf(1 - alpha / 2, len(difference) - 1)
    ci_low = difference.mean() - t_crit * se
    ci_high = difference.mean() + t_crit * se

    print(f"Confidence Interval ({(1 - alpha) * 100:.0f}%): ({ci_low:.4f}, {ci_high:.4f})")
    

    In this case, with alpha = 0.05, you've calculated a 95% confidence interval. If this interval doesn't contain 0, that provides the same evidence as the p-value test for rejecting the null hypothesis. Moreover, the interval gives you a range of plausible values for the difference between the two algorithms' mean success rates.
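
    Finally, one assumption worth checking with only 14 pairs: the paired t-test assumes the daily differences are approximately normally distributed. A quick informal check, sketched here with the Shapiro-Wilk test from scipy.stats:

    # Informal normality check on the paired differences (n = 14)
    diffs = data['new'] - data['old']
    shapiro_stat, shapiro_p = stats.shapiro(diffs)
    print(f"Shapiro-Wilk p-value: {shapiro_p:.4f}")
    # A large p-value (e.g. > 0.05) gives no evidence against normality,
    # which supports using the paired t-test on this sample.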