data-science, ab-testing, hypothesis-test

What is the most conclusive way to evaluate an n-way split test where n > 2?


I have plenty of experience designing, running and evaluating two-way split tests (A/B Tests). Those are by far the most common in digital marketing, where I do most of my work.

However, I'm wondering if anything about the methodology needs to change when more variants are introduced into an experiment (creating, say, a 3-way test (A/B/C Test)).

My instinct tells me I should just run n-1 evaluations against the control group.

If I run a 3-way split test, for example, instinct says I should test for significance (at adequate power) twice:

  1. Treatment A vs Control
  2. Treatment B vs Control

So, in that case, I'm finding out which, if any, treatment performed better than the control (a one-tailed test with alternative hypothesis treatment − control > 0, the basic marketing hypothesis).
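
For a conversion-rate metric, that instinct boils down to something like the sketch below. The counts are made up, and the two-sample proportion z-test from statsmodels is just one reasonable choice of test:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: (conversions, visitors) for each group
control     = (200, 5000)
treatment_a = (215, 5000)
treatment_b = (260, 5000)

for name, (conv, n) in [("A", treatment_a), ("B", treatment_b)]:
    # One-tailed test of the basic marketing hypothesis: treatment - control > 0
    stat, p = proportions_ztest(count=[conv, control[0]],
                                nobs=[n, control[1]],
                                alternative="larger")
    print(f"Treatment {name} vs Control: z = {stat:.2f}, p = {p:.4f}")
```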

But, I'm doubting my instinct. It's occurred to me that running a third test contrasting Treatment A with Treatment B could yield confusing results.

For example, what if there's not enough evidence to reject the null hypothesis that Treatment B = Treatment A?

That would lead to a goofy conclusion like this:

  1. Treatment A = Control

  2. Treatment B > Control

  3. Treatment B = Treatment A

If the difference between Treatments A and B is likely just random chance, how could only one of them outperform the control?

And that's making me wonder if there's a more statistically sound way to evaluate split tests with more than one treatment. Is there?


Solution

  • Your instincts are correct, and you can feel less goofy by rewording your statements:

    1. We could find no statistically significant difference between Treatment A and Control.
    2. Treatment B is significantly better than Control.
    3. However, it remains inconclusive whether Treatment B is better than Treatment A.

    This would be enough to declare Treatment B the winner, with a possible follow-up of retesting A vs B. But depending on your specific situation, you may have a business need to be sure Treatment B actually beats Treatment A before moving forward, and your data can't support that decision. In that case you must gather more data and/or run a new test.
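
    If you do go the retest route, a quick power calculation gives a feel for how much extra data the A-vs-B contrast would need. Here's a rough sketch, assuming a conversion-rate metric and purely hypothetical observed rates, using statsmodels' normal-approximation power solver:

    ```python
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    rate_a = 0.051   # observed rate for Treatment A (hypothetical)
    rate_b = 0.053   # observed rate for Treatment B (hypothetical)

    # Cohen's h for the difference the follow-up test should resolve
    effect = proportion_effectsize(rate_b, rate_a)

    # Visitors needed per arm for a one-tailed test at alpha = 0.05, power = 0.8
    n_per_arm = NormalIndPower().solve_power(effect_size=effect,
                                             alpha=0.05,
                                             power=0.8,
                                             alternative="larger")
    print(f"~{n_per_arm:,.0f} visitors per arm to resolve B vs A")
    ```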

    A far more common scenario, in my experience, is that Treatment A and Treatment B both soundly beat control (they're often closely related and built on related hypotheses), but there is no statistically significant difference between Treatment A and Treatment B. This is an interesting case: if you are required to pick a winner, it's okay to set significance aside and pick the one with the strongest observed effect. The reason is that the significance level (e.g. 95%) is set to avoid false positives and unnecessary changes; it assumes there are switching costs. Here, though, you must pick A or B and throw out control anyway, so in my opinion it's fine to pick the better performer until you have more data.
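
    In code, that decision rule is nothing more than comparing observed lifts (counts are again hypothetical):

    ```python
    # Both treatments beat control; the A-vs-B contrast is inconclusive,
    # so pick the larger observed lift over control.
    control_rate = 200 / 5000                      # 4.0%
    rates = {
        "Treatment A": 255 / 5000,                 # 5.1%
        "Treatment B": 265 / 5000,                 # 5.3%
    }

    lifts = {name: rate - control_rate for name, rate in rates.items()}
    winner = max(lifts, key=lifts.get)
    print(f"Ship {winner}: observed lift {lifts[winner]:.2%} over control")
    ```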