I have plenty of experience designing, running and evaluating two-way split tests (A/B Tests). Those are by far the most common in digital marketing, where I do most of my work.
However, I'm wondering if anything about the methodology needs to change when more variants are introduced into an experiment (creating, say, a 3-way test (A/B/C Test)).
My instinct tells me I should just run n-1 evaluations against the control group.
If I run a 3-way split test, for example, instinct says I should find significance and power twice:
So, in that case, I'm finding out which, if any, treatment performed better than the control (1-tailed test, alt: treatment - control > 0, the basic marketing hypothesis).
But, I'm doubting my instinct. It's occurred to me that running a third test contrasting Treatment A with Treatment B could yield confusing results.
For example, what if there's not enough evidence to reject a null that treatment B = treatment A?
That would lead to a goofy conclusion like this:
Treatment A = Control
Treatment B > Control
Treatment B = Treatment A
If treatments A and B are likely only different due to random chance, how could only one of them outperform the control?
And that's making me wonder if there's a more statistically sound way to evaluate split tests with more than one treatment variable. Is there?
Your instincts are correct, and you can feel less goofy by rewording your statements:
This would be enough to declare Treatment B a winner, with the possible followup of retesting A vs B. But depending on your specific situation, you may have a business need to actually make sure Treatment B is better than Treatment A before moving forward and you can make no such decision with your data. You must gather more data and/or restart a new test.
What I've found is a far more common scenario is Treatment A and Treatment B both soundly beat control (as they're often closely related and have related hypotheses), but there is no statistically significant difference between Treatment A or Treatment B. This is an interesting scenario where if you are required to pick a winner, it's okay throwing significance out the window and picking the one that has the strongest effect. The reason why is that the significance level (eg. 95%) is set to avoid false positives and making unnecessary changes. There's an assumption that there are switching costs. In this case, you must pick A or B and throw out control, so in my opinion it's okay picking the best one until you have more data.