You probably don’t need A/B testing

The best way to optimise your website is usually the simplest.

March 22, 2021

Illustration by @popiumworks

More than ever, I speak with businesses that have ambitions of setting up their own optimisation programs. They’ve heard the success stories, swallowed the hype and are ready to strap their conversion rate on a rocket to the outer reaches of the stratosphere.

Most of the time, though, I have to break it to them that A/B testing is probably not the best solution. Running an effective A/B testing program requires a significant investment in process, people and structure. Achieving a good ROI demands scale above all else. When businesses get these fundamentals wrong, they’re liable to flounder about for a year or two without achieving much at all. Worse still, these initiatives can forever tarnish A/B testing as “something we tried once that didn’t work for us”.

So, how can you tell if A/B testing is right for you? In this article, I’m going to outline the criteria that you need to set up an effective program. If your business doesn’t fit the bill, I’ll suggest a powerful alternative that you can use instead.

Optimisation starts with research

There is a popular misconception that A/B testing tools are miraculous money printing machines that turn average websites into winners, no human intervention required.

If you believe the breathless case studies promoted by enthusiastic practitioners and software vendors alike, A/B testing involves slicing your site into a zillion pieces then dynamically recombining it into every different conceivable combination.

This game of algorithmic bingo comes to an end when the software magically finds the winning combination. Green buttons have become blue, hero images are swapped around and now the conversion rate has increased by 8000%. WRRRONG!

This misconception of ‘pure’ algorithmic optimisation cues periodic outrage from UX designers who see themselves competing against a faceless optimisation machine. “You can’t incrementally A/B test your way to perfection!” they argue. And they are right. Good A/B testing doesn’t work like this at all.

The greatest A/B tests almost invariably have their origins in qualitative insight. When someone strikes conversion gold by changing a button colour, it’s not by accident. It’s because they addressed an existing problem. More than likely, this was identified by sitting down and closely observing the behaviour of real users. They observed that people didn’t notice the existing button (normally due to insufficient size or contrast), fixed it with an A/B test and graduated to the next stage of having a less awful website. The colour change was the means to fix the problem, not the problem itself.

Immature A/B testing programs often waste months or years cargo culting these apocryphal tests from Christmases past without seeing any proper results. All the while, they’re missing the veritable gold mine of insights right beneath their noses in plain old, boring qualitative user research.

Sitting down and watching real users try to make sense of your website is the quickest, easiest, most powerful way to unlock transformative insights. Once you combine these insights with hypothesis-driven optimisation, validated at scale… Well, that’s when the magic happens. However, A/B testing comes with many caveats. It’s not always the best way to optimise your website and depending on your circumstances, may even be an enormous waste of time.

You need a lot of traffic to A/B test

Statistical significance is one of the most frequently misunderstood aspects of experimentation. I’m one of the few people I know who didn’t have to endure a Stats class at uni (I was too busy writing cultural studies essays about the intersection of graffiti and third-wave feminist theory) so I probably know less about statistics than you, but humour me for a moment.

Statistical significance is a way to measure the chance that what we have observed in a test is, well, chance. We use a statistical significance score to ensure the validity of the result. I won’t go deeply into the details here, but statistical significance is a product of a handful of intersecting factors:

  • Sample size and duration
  • Baseline conversion rate
  • Minimum detectable effect

Let’s say you’re an eCommerce store and your existing conversion rate is 2.5%.

If you were to deploy an A/B test that increased the conversion rate by 5% (minimum detectable effect) and you wanted to achieve 99% significance (a 1/100 chance that the observed result was due to other factors aside from your test), you would need 330,000 visitors per variation (sample size).

That’s 660k visitors to test just a single variation against a Control. That’s a lot!
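If you want to play with these numbers yourself, the standard two-proportion sample-size formula is easy to sketch. This is a minimal Python illustration, not the exact calculator behind the figures above; it assumes a two-sided test at 80% power, so it lands in the same ballpark rather than matching the 330,000 exactly:

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_mde, alpha=0.01, power=0.8):
    """Visitors needed per variation for a two-proportion z-test.

    baseline     -- control conversion rate, e.g. 0.025 for 2.5%
    relative_mde -- relative lift to detect, e.g. 0.05 for a 5% uplift
    alpha        -- significance level (0.01 is "99% significance")
    power        -- chance of detecting the effect if it really exists
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    delta = p2 - p1
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


# A 5% relative uplift on a 2.5% baseline at 99% significance:
print(sample_size_per_variation(0.025, 0.05))  # hundreds of thousands per variation

# A 2% relative uplift pushes the requirement into the millions:
print(sample_size_per_variation(0.025, 0.02))
```

The exact figure shifts with your power and one- vs two-sided assumptions, but the shape of the curve doesn’t: halving the detectable effect roughly quadruples the traffic you need.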

You need a lot of traffic to pull it off in any reasonable sort of time.

After you’ve fixed all the seriously broken stuff and plucked the ‘low hanging fruit’, a 5% conversion rate uplift is fairly rare.

So, you’re going to need more traffic still.

What about a 2% uplift?

This is still a great result!

But in that case, you’d need almost 5m visitors to test a single variation.

If you’re saying, “But that would take forever! And all for a measly 2% uplift!” then you’re beginning to understand something important: you need to have significant scale to A/B test efficiently.

If you don’t have enough traffic, you’ll make the classic mistake of letting your test run too long with no chance of ever getting a good result. For good A/B testing, you need so much traffic that putting half a million users through a test barely makes a dent.

Most tests are not going to give you a big uplift, so you need to keep refining, iterating and running new tests. But when you do have that much traffic, your revenues are significant enough that a 2% uplift isn’t trivial. For example, one of my clients recently achieved a 2.7% conversion uplift, which they project will make them about $20 million over the next 12 months.

For the websites that do have enormous fire-hoses of traffic, A/B testing is a truly magical proposition. For everyone else, you need a better solution.

Without scale, use qualitative research

Just because you don’t have the scale for A/B testing, it doesn’t mean that you can’t optimise your website. The best A/B testing programs find their insights in qualitative research with 5-10 users and then validate these findings at scale with thousands or millions of visitors. If you don’t have that kind of scale, you can still undertake the research and get some transformative results.

One of the most appealing aspects of user research is that it can be as simple or as complex as you want it to be. You don’t need to have a dedicated usability ‘lab’ with eye-tracking sensors and a team of researchers taking notes from the other side of two-way glass. Sitting down with a friend or colleague and asking them to verbalise their thoughts as they complete a task on your website is a surprisingly powerful method of unearthing cringe-worthy, why-didn’t-I-think-of-that-sooner insights.

Some of the most transformative optimisations of my career have been achieved with learnings from remote, unmoderated user research tools. These are quick, cheap and well suited to the middle of a pandemic. You set some criteria for the users you want to test with (demographic screeners or Q&As), set them a task, and within an hour or two you will receive a screencast video of users completing the task while verbalising their thoughts.

If you don’t have the traffic for A/B testing, you can just make the changes and watch the results. The problems that you uncover initially will be so significant that you’ll probably see a step-change in your conversion rate, no testing required.

Don’t be disheartened if you don’t have the scale for A/B testing just yet. Optimising your website through user testing can be just as powerful, interesting and rewarding. You’ll have the benefit of being deeply focused on your users and their needs without the layer of abstraction brought by some of the popular A/B testing misconceptions.

Thanks to Karthee Madasamy, Ergest Xheblati and Sarah Ramsey for reviewing this post in draft form.