A/B testing: the 7 pitfalls that invalidate your tests

A/B testing is reassuring. You launch two versions, wait for the numbers, then pick the winner. Scientific. Data-driven. Objective. Except that, in practice, the majority of A/B tests are invalid - not because the teams are incompetent, but because very specific errors creep into the process. The result: bad decisions made with false confidence. Here are the seven most common pitfalls, and how to avoid them in practice.

1. Design errors that distort your results from the outset

Even before looking at your data, the damage is often already done. Two structural errors can invalidate a test as soon as it is set up.

Stopping the test too soon (peek & stop)

You launch your test on a Monday. On Wednesday the dashboard shows +14% in conversions with a p-value of 0.03. You stop the test and deploy. Classic mistake.

During a test, results naturally fluctuate. If you check frequently and stop as soon as the threshold is crossed, your false positive rate (Type I errors) explodes. Simulations show that this rate can exceed 50% - even with a theoretical confidence level of 95%.

The solution: before launching, define the test's duration and sample size, then stick to them. If you need to keep an eye on things along the way, use methods designed for it, such as sequential testing or always-valid inference.
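To see how much peeking inflates the false positive rate, here is a minimal simulation sketch (illustrative traffic and conversion figures, not the sequential methods mentioned above). Both variants share the same conversion rate, so every "significant" result is a false positive by construction:

```python
# Sketch: false positive rate with daily peeking vs. a fixed horizon.
# Both variants have the SAME conversion rate, so any "win" is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_rate = 0.05          # identical conversion rate for A and B (illustrative)
daily_visitors = 1_000    # per variant, per day (illustrative)
n_days = 28
n_simulations = 2_000

def z_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value on cumulative counts."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - stats.norm.cdf(abs(z)))

peeking_fp, fixed_fp = 0, 0
for _ in range(n_simulations):
    a = rng.binomial(daily_visitors, true_rate, n_days).cumsum()
    b = rng.binomial(daily_visitors, true_rate, n_days).cumsum()
    n = daily_visitors * np.arange(1, n_days + 1)
    pvals = [z_test_pvalue(a[d], n[d], b[d], n[d]) for d in range(n_days)]
    peeking_fp += any(p < 0.05 for p in pvals)   # stop at the first "significant" day
    fixed_fp += pvals[-1] < 0.05                 # look only once, at the planned end

print(f"False positive rate with daily peeking: {peeking_fp / n_simulations:.1%}")
print(f"False positive rate with a fixed horizon: {fixed_fp / n_simulations:.1%}")
```

Run as-is, the fixed-horizon rate stays close to the nominal 5%, while the peeking rate climbs far above it; the more often you peek, the worse it gets.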

Testing too many variants simultaneously

Testing five variants at the same time seems efficient. In reality, each additional comparison adds its own chance of a false positive. With five comparisons at p < 0.05, the probability of obtaining at least one false positive exceeds 22% (1 - 0.95^5 ≈ 0.23). This is the multiple comparisons problem: the family-wise Type I error rate grows with every variant you add.

The solution: test one or two variants at a time if your traffic is limited. If you test several, apply a statistical correction such as Bonferroni or Benjamini-Hochberg. And above all: formulate a clear hypothesis before testing. A test without a hypothesis is exploration disguised as experimentation.
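The correction itself is a one-liner with statsmodels. A minimal sketch, with invented p-values purely for illustration:

```python
# Sketch: correcting p-values from a multi-variant test (hypothetical values).
from statsmodels.stats.multitest import multipletests

# Raw p-values from comparing five variants against the control (illustrative numbers).
raw_pvalues = [0.04, 0.20, 0.03, 0.60, 0.45]

# Bonferroni: very conservative, effectively divides the threshold by the number of comparisons.
reject_bonf, p_bonf, _, _ = multipletests(raw_pvalues, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate, less conservative.
reject_bh, p_bh, _, _ = multipletests(raw_pvalues, alpha=0.05, method="fdr_bh")

for raw, bonf, bh in zip(raw_pvalues, reject_bonf, reject_bh):
    print(f"raw p={raw:.2f}  significant after Bonferroni: {bonf}  after Benjamini-Hochberg: {bh}")
```

Notice how the two raw "wins" at p = 0.03 and p = 0.04 no longer survive Bonferroni once five comparisons are taken into account.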

2. Statistical traps that inflate your results

Even with a well-designed test, statistical errors can lead you to believe in gains that don't exist - or miss those that do.

A sample that is too small or miscalculated

Statistical power measures the probability of detecting a real effect if it exists. A test with 50% power misses half of the real effects - it's a coin toss with extra steps.

To calculate the right sample size, you need three parameters: the baseline rate of your metric, the Minimum Detectable Effect (MDE) and the power level (generally 80%). An undersized test has two perverse effects: it misses real positive effects (false negatives), and when it does detect something, the estimated effect is often inflated - this is the winner's curse.

The solution: calculate the sample size before launching, using tools such as Evan Miller's calculator or the Python library statsmodels.
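As a sketch of that calculation with statsmodels (the 5% baseline rate and 10% relative MDE below are illustrative assumptions, not recommendations):

```python
# Sketch: sample size per variant for a conversion-rate test with statsmodels.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.05                      # current conversion rate (illustrative)
mde_relative = 0.10                       # smallest relative lift worth detecting (10%, illustrative)
target_rate = baseline_rate * (1 + mde_relative)

effect_size = proportion_effectsize(target_rate, baseline_rate)   # Cohen's h

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,       # significance threshold
    power=0.80,       # probability of detecting the effect if it is real
    ratio=1.0,        # equal-sized groups
)
print(f"Visitors needed per variant: {int(round(n_per_variant)):,}")
```

Dividing that figure by your weekly traffic per variant tells you immediately whether the test is even feasible before you build anything.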

Confusing statistical significance with practical importance

A result can be statistically significant and practically useless. A p-value of 0.01 on a gain of +0.3% in conversion rate: is it worth the development cost and the associated technical debt? Probably not.

Statistical significance answers: “Is this effect likely due to chance?” It does not answer: “Is this effect large enough to matter?” For that, use measures of effect size - Cohen's d, relative lift, the absolute impact projected onto your user base. Define a Minimum Business Impact: the smallest result that would justify implementing the change. That bar, not the p-value alone, should guide your decisions.
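To make that concrete, a small sketch that projects a “significant” lift onto a user base and compares it with a business threshold. Every figure here (traffic, order value, threshold) is a hypothetical illustration:

```python
# Sketch: translating a statistically significant result into projected business impact.
# All numbers below are hypothetical illustrations.
baseline_rate = 0.040           # control conversion rate
variant_rate = 0.040 * 1.003    # variant with a +0.3% relative lift (p = 0.01 in this scenario)
monthly_visitors = 200_000
revenue_per_conversion = 30.0   # average order value

relative_lift = (variant_rate - baseline_rate) / baseline_rate
extra_conversions = (variant_rate - baseline_rate) * monthly_visitors
extra_revenue = extra_conversions * revenue_per_conversion

minimum_business_impact = 10_000.0   # smallest monthly gain that justifies shipping

print(f"Relative lift: {relative_lift:.1%}")
print(f"Projected extra conversions per month: {extra_conversions:.0f}")
print(f"Projected extra revenue per month: {extra_revenue:,.0f}")
print("Worth shipping?", extra_revenue >= minimum_business_impact)
```

With these numbers the lift is real but the projected revenue falls well short of the bar - exactly the “significant but not worth it” scenario described above.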

3. Behavioural and measurement biases that lead to misinterpretation

A test can be technically valid and statistically correct - and still mislead you. This is where user behaviour biases and measurement choices come in.

The novelty effect and Sample Ratio Mismatch

Two distinct phenomena can corrupt your conclusions at this stage. The first is the novelty effect: a variant performs better simply because it is different. Users explore out of curiosity and click more - but the effect is transitory. If you measure too early, your apparent gain will disappear once the novelty wears off. Relatedly, the Hawthorne effect causes users who feel they are being observed to behave differently. To distinguish a real gain from an artefact, monitor the evolution of the metric over time and wait at least two weeks before drawing conclusions on a major interface change.

The second phenomenon is Sample Ratio Mismatch (SRM): the ratio of users between the variants does not match the expected ratio. If your test is designed as a 50/50 split and you observe 52/48, all your conclusions are invalid. Before any analysis, run a chi-square test on your group sizes. Common causes include tracking issues, misconfigured redirects, bots filtered after assignment, or cross-contamination between groups.
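The SRM check itself takes a couple of lines. A minimal sketch with scipy, using illustrative observed counts:

```python
# Sketch: Sample Ratio Mismatch check with a chi-square goodness-of-fit test.
from scipy.stats import chisquare

observed = [52_000, 48_000]            # users actually assigned to A and B (illustrative)
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # the 50/50 split the test was designed for

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:   # a strict threshold is commonly used for SRM checks
    print(f"SRM detected (p = {p_value:.2e}): investigate before analysing results.")
else:
    print(f"No SRM detected (p = {p_value:.3f}).")
```

On a 52/48 split over 100,000 users the p-value is vanishingly small: the imbalance is not noise, and the results should not be analysed until the cause is found.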

Poorly chosen metrics

This is the most strategic trap. A perfectly executed test can lead to the wrong decision if you measure the wrong thing. Classic example: you optimise the click-through rate on an “add to basket” button. It goes up. But the purchase completion rate falls. You've created friction downstream without realising it.

Every good test should include guardrail metrics - metrics you are not trying to improve but do not want to degrade: revenue per user, retention rate, NPS, loading time. If a guardrail metric deteriorates, the variant that “wins” on the main metric may not be a win at all. Before you launch, ask yourself: “If this metric increases by X%, am I sure that's good for the business?” If the answer is “not necessarily”, look for a better metric.
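One way to operationalise a guardrail is to test it for degradation rather than improvement. A sketch assuming revenue per user as the guardrail and a one-sided comparison; the simulated data and threshold are illustrative assumptions:

```python
# Sketch: checking whether a guardrail metric (revenue per user) degraded in the variant.
# Simulated data stands in for real per-user revenue; all parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
revenue_control = rng.exponential(scale=12.0, size=5_000)
revenue_variant = rng.exponential(scale=11.5, size=5_000)

# One-sided Welch t-test: is the variant's mean revenue LOWER than the control's?
stat, p_value = stats.ttest_ind(revenue_variant, revenue_control,
                                equal_var=False, alternative="less")

difference = revenue_variant.mean() - revenue_control.mean()
print(f"Observed difference in revenue per user: {difference:+.2f}")
if p_value < 0.05:
    print(f"Guardrail breached (p = {p_value:.3f}): the 'winning' variant may not be a win.")
else:
    print(f"No significant degradation detected (p = {p_value:.3f}).")
```

The same pattern applies to any guardrail: you are not looking for an improvement, only for evidence that the change is not quietly costing you elsewhere.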

Author
Rodolphe Balay
Rodolphe Balay is co-founder of iterates, a web agency specialising in the development of web and mobile applications. He works with businesses and start-ups to create customised, easy-to-use digital solutions tailored to their needs.
