{"id":1005534,"date":"2026-04-28T13:43:38","date_gmt":"2026-04-28T11:43:38","guid":{"rendered":"https:\/\/www.iterates.be\/?p=1005534"},"modified":"2026-04-17T14:08:06","modified_gmt":"2026-04-17T12:08:06","slug":"the-7-pitfalls-that-invalidate-your-a-b-tests-and-how-to-avoid-them","status":"publish","type":"post","link":"https:\/\/www.iterates.be\/en\/the-7-pitfalls-that-invalidate-your-a-b-tests-and-how-to-avoid-them\/","title":{"rendered":"A\/B testing: the 7 pitfalls that invalidate your tests"},"content":{"rendered":"<div class=\"vgblk-rw-wrapper limit-wrapper\">\n<p>L\u2019<strong>A\/B testing<\/strong> is reassuring. We launch two versions, wait for the figures, then choose the winner. Scientific. <em>Data-driven<\/em>. Objective. Except that, in practice, the majority of A\/B tests are invalid - not because the teams are incompetent, but because very specific errors creep into the process. The result: bad decisions made with false confidence. Here are the seven most common pitfalls, and how to avoid them in practice.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Design errors that distort your results from the outset<\/h2>\n\n\n\n<p>Even before looking at your data, the damage is often already done. Two structural errors can invalidate a test as soon as it is set up.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Stop the test too soon (Peek &amp; Stop)<\/h3>\n\n\n\n<p>You launch your test on a Monday. On Wednesday <strong>dashboard<\/strong> posted +14 % in conversions with a <strong>p-value<\/strong> of 0.03. You stop and deploy. Classic mistake.<\/p>\n\n\n\n<p>During a test, results naturally fluctuate. If you check frequently and stop as soon as the threshold is crossed, your <strong>false positive rate<\/strong> (errors in <strong>type I<\/strong>) is exploding. Simulations show that this rate can exceed 50 % - even with a theoretical threshold of 95 %.<\/p>\n\n\n\n<p>The solution: before launching the test, define the duration of the test and the <strong>sample size<\/strong> then stick to it. If you need to keep an eye on things along the way, use suitable methods such as the <strong>Sequential Testing<\/strong> or the\u2019<strong>Always Valid Inference<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Testing too many variants simultaneously<\/h3>\n\n\n\n<p>Testing five variants at the same time seems effective. In reality, each additional comparison creates its own probability of <strong>false positive<\/strong>. With five comparisons at p &lt; 0.05, the probability of obtaining at least one false positive exceeds 22 %. This is the problem with <strong>multiple comparisons<\/strong> - also known as <strong>Type I error rate<\/strong>.<\/p>\n\n\n\n<p>The solution: test one or two variants at a time if your traffic is limited. If you test several, apply a statistical correction such as the <strong>Bonferroni correction<\/strong> or the <strong>Benjamini-Hochberg<\/strong>. And above all: formulate a clear hypothesis <em>before<\/em> to test. A test without a hypothesis is exploration disguised as experimentation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Statistical traps that inflate your results<\/h2>\n\n\n\n<p>Even with a well-designed test, statistical errors can lead you to believe in gains that don't exist - or miss those that do.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">A sample that is too small or miscalculated<\/h3>\n\n\n\n<p>La <strong>statistical power<\/strong> measures the probability of detecting a real effect if it exists. 
<h3 class="wp-block-heading">Confusing statistical significance with practical importance</h3>

<p>A result can be statistically <strong>significant</strong> and practically useless. p = 0.01 with a gain of +0.3% in conversion rate: is that worth the development cost and the associated technical debt? Probably not.</p>

<p><strong>Statistical significance</strong> answers: “Is this effect due to chance?” It does not answer: “Is this effect large enough to matter?” For that, use <strong>effect size</strong> measures - <strong>Cohen's d</strong>, <strong>relative lift</strong>, or the absolute impact projected onto your user base. Define a <strong>Minimum Business Impact</strong>: the smallest result that would justify implementing the change. That bar, not the <strong>p-value</strong> alone, should guide your decisions.</p>

<h2 class="wp-block-heading">3. Behavioural and measurement biases that lead to misinterpretation</h2>

<p>A test can be technically valid and statistically correct - and still mislead you. This is where user biases and metric choices come in.</p>

<h3 class="wp-block-heading">The novelty effect and Sample Ratio Mismatch</h3>

<p>Two distinct phenomena can corrupt your conclusions at this stage. The first is the <strong>novelty effect</strong>: a variant performs better simply because it is <em>different</em>. Users explore out of curiosity and click more - but the effect is transitory. If you measure too early, your apparent gain will evaporate once the novelty wears off. In the same vein, the <strong>Hawthorne effect</strong> causes users who feel observed to behave differently. To distinguish a real gain from an artefact, monitor how the metric evolves over time and wait at least two weeks before concluding on a major interface change.</p>

<p>The second phenomenon is <strong>Sample Ratio Mismatch (SRM)</strong>: the ratio of users between the variants does not match the expected ratio. If your test is designed on a 50/50 split and you observe 52/48, all your conclusions are suspect. Before any analysis, run a <strong>chi-square test</strong> on your group sizes. Common causes include broken <strong>tracking</strong>, misconfigured redirects, <strong>bots</strong> filtered after assignment, or <strong>cross-contamination</strong> between groups.</p>
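<p>Here is a minimal sketch of that SRM check using SciPy's chi-square goodness-of-fit test. The group counts are made-up numbers, and SciPy itself is an assumption on our part - any chi-square implementation will do.</p>

<pre class="wp-block-code"><code># SRM check: compare the observed group sizes to the expected 50/50 split.
# The counts below are illustrative; plug in your own assignment totals.
from scipy.stats import chisquare

observed = [51_874, 47_902]              # users actually assigned to A and B
total = sum(observed)
expected = [total * 0.5, total * 0.5]    # what a true 50/50 split predicts

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p-value = {p_value:.2e}")

# A very small p-value (commonly p below 0.001) signals a Sample Ratio
# Mismatch: investigate tracking, redirects, bot filtering and assignment
# logic before reading any conversion numbers.
</code></pre>

<p>If the check fires, fix the assignment pipeline rather than the analysis: no statistical correction rescues a broken randomisation.</p>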
<h3 class="wp-block-heading">Poorly chosen metrics</h3>

<p>This is the most strategic trap. A perfectly executed test can lead to the wrong decision if you measure the wrong thing. Classic example: you optimise the <strong>click-through rate</strong> on an add-to-basket button. It goes up. But the <strong>purchase completion rate</strong> falls. You've created friction downstream without realising it.</p>

<p>Every good test should include <strong>guardrail metrics</strong> - metrics you are not trying to improve but do not want to degrade: revenue per user, <strong>retention rate</strong>, NPS, loading time. If a guardrail metric deteriorates, the “winning” variant on the <strong>main metric</strong> may not be a win at all. Before you launch, ask yourself: “If this metric increases by X%, am I sure that's good for the business?” If the answer is “not necessarily”, look for a better metric.</p>
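<p>If you report guardrails routinely, a small post-test summary can flag regressions next to the main metric. The sketch below is purely illustrative: the metric names, values and tolerated thresholds are assumptions, not part of any particular tool.</p>

<pre class="wp-block-code"><code># Illustrative guardrail report: flag any guardrail metric that degrades
# beyond a tolerated threshold, even if the main metric "wins".
# Metric names, values and thresholds are made-up examples.
guardrails = {
    # metric: (control value, variant value, max tolerated relative drop)
    "revenue_per_user": (8.40, 8.10, 0.02),
    "retention_rate_d7": (0.310, 0.309, 0.01),
    "page_load_time_s": (1.8, 1.9, 0.05),    # here, higher is worse
}
higher_is_worse = {"page_load_time_s"}

for name, (control, variant, tolerance) in guardrails.items():
    delta = (variant - control) / control
    degradation = delta if name in higher_is_worse else -delta
    if degradation - tolerance > 0:
        print(f"GUARDRAIL ALERT: {name} degraded by {degradation:.1%} "
              f"(tolerance {tolerance:.0%}) - the winning variant may not be a win.")
    else:
        print(f"{name}: OK ({delta:+.1%})")
</code></pre>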