The “gold standard” has failure modes that seem to be ignored.
E.g.: making UI elements jump around unpredictably after a page load may increase the number of ad clicks simply because users can’t reliably click on what they actually wanted.
I see A/B testing turning into a religion where it can’t be argued with. “The number went up! It must be good!”
That’s generally because the metrics you’re looking at don’t represent what users care about. Choosing the right metric is a separate problem from the testing methodology itself; it’s often overlooked, and a lot more important.
I’ve argued that A/B testing training should focus on that skill far more than on the statistics (Welch’s t-test and its relatives), but I had to record my own classes to make that happen.
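For a sense of proportion, the statistical machinery really is the easy part: in practice it often reduces to a single Welch's t-test call. A minimal sketch with made-up conversion data (the numbers and the "conversion" framing are illustrative, not from any real experiment):

```python
# Minimal sketch: the "statistics" part of an A/B test is one library call.
# All data below is simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.binomial(1, 0.10, size=5000)   # ~10% conversion rate
variant = rng.binomial(1, 0.11, size=5000)   # ~11% conversion rate

# Welch's t-test: two-sample t-test without assuming equal variances.
t, p = stats.ttest_ind(variant, control, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```

The p-value is mechanical; whether "conversion" is a metric users actually care about is the judgment call the test can't make for you.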
But those metrics are hard to move, so you target secondary (proxy) metrics instead. The problem with that strategy becomes obvious once you spell it out: measurably improving the product is hard, so you measure something else and hope product improvements follow.
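To spell it out with numbers, here's a hedged sketch (all data invented) of a variant that "wins" on a proxy metric like ad clicks while quietly regressing on a primary one like return visits:

```python
# Hypothetical sketch: a variant that wins on a proxy metric but loses
# on the primary one. Every number here is invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20_000

# Proxy metric: ad clicks per session (UI shuffle harvests misclicks).
clicks_control = rng.poisson(0.30, n)
clicks_variant = rng.poisson(0.33, n)        # +10% "improvement"

# Primary metric: did the user return within 7 days?
return_control = rng.binomial(1, 0.40, n)
return_variant = rng.binomial(1, 0.38, n)    # quiet regression

for name, a, b in [("ad clicks", clicks_variant, clicks_control),
                   ("7-day return", return_variant, return_control)]:
    t, p = stats.ttest_ind(a, b, equal_var=False)
    print(f"{name}: lift = {a.mean() - b.mean():+.3f}, p = {p:.4f}")
```

Both differences come out statistically significant: the test faithfully reports that the proxy went up and, if anyone bothers to look, that the thing users care about went down.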