Why I like A/B tests and why most teams shouldn't use them
Yes, the title is provocative. And it did its job: you’re here. In practice, the decision is rarely binary; it depends on context, as most decisions do. This article is about how to make better data-driven decisions.
TL;DR
- An A/B test is a real experiment, not a number comparison.
- It works when the situation fits and the test runs the way it was designed.
- A few common traps quietly distort the result: peeking, cherry-picking, hypothesizing after the fact.
- When it does not fit, UX tests, surveys, or expert judgment often serve better.
Being analytical is a core part of how I think. Facts and measurable results usually beat assumptions, claims, and cargo-cult habits.
A/B tests are a tool for exactly that. But as so often, you have to find the right tool for the right situation. And you have to use the tool correctly. A/B tests are no exception.
What an A/B test actually is
Say a team wants to ship a new checkout button. The old one converts fine, but someone is sure a bolder design will get more people through. They could argue about it in a meeting. Instead they run an A/B test: half the users see the old button (variant A), half see the new one (variant B), assigned at random.
That last part, the random assignment, is what makes it an experiment. Because users are split at random, the two groups are alike on average, so a difference in conversion can be attributed to the button[1] rather than to who happened to see it. Without it, the new button might simply have been shown to more ready-to-buy users, and the team would be looking at correlation[1] while calling it a result. Randomization is what makes that causal reading possible, but not automatic: the rest of this article is about the conditions needed to actually trust the result. This is also why an A/B test draws on the scientific method[2], not a plain number comparison.
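One common way to get a stable random split is to hash each user ID into a bucket. The sketch below assumes string user IDs; `assign_variant` and the experiment name `checkout-button` are illustrative names, not taken from any particular tool.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "checkout-button") -> str:
    """Return 'A' or 'B' for a user; the same user always gets the same answer."""
    # Mixing in the experiment name keeps splits of different tests independent.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "A" if bucket < 50 else "B"


print(assign_variant("user-42"))
```

The hash knows nothing about who is ready to buy, so the split does not depend on user traits, and a returning user keeps seeing the same button.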
Underneath, it is a hypothesis test[3]. Two statements compete:
- H₀ (null hypothesis): the new button changes nothing, both convert the same
- H₁ (alternative hypothesis): there is a measurable difference
Notice what the test does not do. It does not prove the new button is better. It collects evidence against H₀ and rejects it only when that evidence is strong enough, at an error rate decided in advance.
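For a conversion metric, one standard way to run such a test is a two-proportion z-test. This is a minimal sketch, assuming a two-sided test and samples large enough for the normal approximation; the function name and the counts in the example are invented for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both variants share one conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # no conversions (or only conversions): nothing to compare
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Invented example: 500 of 10,000 users convert on A, 600 of 10,000 on B.
z, p_value = two_proportion_z_test(500, 10_000, 600, 10_000)
alpha = 0.05  # error rate fixed before the test
print(f"z = {z:.2f}, p = {p_value:.4f}, reject null: {p_value < alpha}")
```

Rejecting the null hypothesis here means "a gap this large would be unusual if the buttons were identical", not "B is proven better".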
And this is not a product-analytics invention. The controlled, randomized comparison is one of the core building blocks of the scientific method, and the gold standard for showing cause and effect[1]. It is the same logic that tells you whether a drug works or a fertilizer helps a field: hold everything constant, change one thing, compare two otherwise identical groups. A/B testing is the same method pointed at a product instead of a lab.
What the team wants out of it is a statement about cause and effect, together with how confident they can be in it. A good result reads almost like a line you could drop into a decision document:
If we ship the new button, checkout conversion goes up by about 2%, and we are 95% confident that effect is real and not noise.
Three things make that sentence useful. It names a cause (the new button), it puts a size on the effect (conversion up by about 2%), and it attaches a degree of confidence to it. Now compare it to what usually gets reported:
The new button’s numbers look better in the analytics dashboard.
This says nothing about cause: maybe B simply got shown to a different mix of users. It says nothing about size: “better” could be 0.1% or 10%. And it says nothing about confidence: the gap could be gone tomorrow. The first sentence is something you can defend in a meeting; the second is a number with an opinion attached.
Flavors of A/B testing
This post focuses on the classical frequentist test, built on H₀ vs. H₁, a p-value[4], and a significance level (α)[5]. It asks how surprising the data would be if the button changed nothing. The Bayesian approach asks a different question: starting from an assumption you state up front (a prior), it uses the data to estimate how likely it is that B actually wins. Two methods, same foundations: every flavor works best with a real hypothesis, enough data, and a stopping rule set in advance.
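To make the Bayesian question concrete, here is a rough sketch that estimates the probability that B converts better than A. It assumes a uniform Beta(1, 1) prior for both buttons and uses simple Monte Carlo sampling; the function name and the counts are invented, and real tools may choose priors and methods differently.

```python
import random


def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 20_000, seed: int = 0) -> float:
    """Estimate P(rate_B > rate_A) by sampling from the two Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta(1, 1) prior updated with observed conversions and non-conversions.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Invented example: 500 of 10,000 convert on A, 600 of 10,000 on B.
print(f"P(B beats A) is roughly {prob_b_beats_a(500, 10_000, 600, 10_000):.3f}")
```

The output is a direct probability statement about B, which is exactly what the frequentist p-value does not give you.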
What usually happens instead
So the checkout team ships both buttons, ready to keep whichever one pulls ahead. That is the intuitive thing to do, and most of us have done it. Comparing the two is the right idea; what is missing is the method around it. Without a hypothesis, a planned sample size[6], and an agreed stopping point, the risk is not just a weaker answer but a wrong one: watching and stopping the moment B looks good can hand you a winner that was never real. The good news is that the missing pieces are easy to add.
The first place this tends to go wrong is checking the test too soon. Variant B jumps ahead on day two, someone calls it, and the test is over. Peeking at a running test and stopping the moment it looks good feels responsible, like staying on top of things. The trouble is that an A/B test is noisy by nature, and if you keep looking and stop on the first good-looking moment, you will eventually find one even when both buttons are truly identical. The more often you peek, the easier it is to be misled by chance.
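A small simulation makes the peeking effect visible. The sketch below runs many A/A tests, where both buttons share the same true conversion rate, and compares two habits: calling a winner at the first look with p < 0.05, or checking only once at the planned end. All parameters (10 looks, 200 users per variant per look, a 10% conversion rate) are arbitrary assumptions chosen to keep the run short.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value of a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(runs: int = 1000, looks: int = 10, users_per_look: int = 200,
                      rate: float = 0.10, alpha: float = 0.05, seed: int = 1):
    """Return (false-winner share with peeking, false-winner share at the end only)."""
    rng = random.Random(seed)
    winners_with_peeking = 0
    winners_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        looked_significant = False
        for _ in range(looks):
            # Both variants draw from the SAME true rate: any "winner" is noise.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                looked_significant = True  # a peeking team would have stopped here
        winners_with_peeking += looked_significant
        winners_at_end += p < alpha  # p from the final, planned look
    return winners_with_peeking / runs, winners_at_end / runs


peeking_rate, final_rate = simulate_aa_tests()
print(f"false winners with peeking: {peeking_rate:.1%}, at planned end: {final_rate:.1%}")
```

With a single planned look, the share of false winners should stay near the chosen 5%. With ten looks it typically lands well above that, even though nothing about the buttons differs.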
Now suppose conversion comes out flat. A dozen other metrics are still being tracked: time on page, add-to-cart rate, average order value, bounce. Check enough of them and one is bound to come out ahead for variant B. It is tempting to report that one and set the rest aside. The catch is statistical: the more metrics you check, the more likely it is that at least one looks like a winner by pure chance. There are standard ways to correct for this, and the simplest one is to decide which metric counts before the test, not after.
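The arithmetic behind this is short. The sketch below assumes the metrics are independent and that variant B truly changes none of them; real metrics are usually correlated, so treat the figure as a rough guide. It also shows the Bonferroni correction, one of the standard fixes, which simply divides the significance level by the number of metrics checked.

```python
def chance_of_false_winner(num_metrics: int, alpha: float = 0.05) -> float:
    """Probability that at least one of several null metrics looks significant."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_threshold(num_metrics: int, alpha: float = 0.05) -> float:
    """Stricter per-metric threshold that keeps the overall error rate near alpha."""
    return alpha / num_metrics


print(f"12 metrics: {chance_of_false_winner(12):.0%} chance of a false winner")
print(f"Bonferroni threshold per metric: {bonferroni_threshold(12):.4f}")
```

Under these assumptions, a dozen metrics give a chance of about 46% that at least one looks like a winner by luck alone.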
Then there is the story told after the fact. Once someone, for example a decision maker, has a number they like, it is surprisingly easy to invent the hypothesis that fits: of course the bolder button won, users wanted a clearer call to action. The explanation feels like it was the plan all along. This has a name, HARKing, hypothesizing after the results are known, and it is not a product-team failing in particular; trained researchers slip into it too. A hypothesis written after seeing the data is not a prediction. It is a description wearing a prediction’s clothes.
And there's more where that came from
These are not the only ways an A/B test can mislead you. There are more traps in the same spirit, and they share one thing: none of them announce themselves, so you only catch them if you go looking.
When an A/B test fits, and when it doesn’t
An A/B test is worth running when a few things line up. The checkout button is a good candidate: a tiny shop might not get enough orders, but a busy store sees enough to detect a realistic change in conversion within a sensible window. There is a real hypothesis behind it: a bolder button lifts conversion, not a vague “let’s see what happens.” The change is easy to roll back or release gradually, so a bad variant costs little. And the team genuinely wants the answer and will act on it, even if the answer is that the new button does nothing. If you would ship the new button either way, there is nothing to test.
If the shop barely gets orders, there is not enough traffic to tell a real effect from noise, and the test either runs forever or concludes nothing. It is also a poor fit when the decision is already made and the test exists only to put a number behind it. Some questions are simply too big for a single metric: “should we enter market X” or “should we rewrite the app” are strategy, not button design, and no conversion rate will settle them. And sometimes the math would work but the stakes are too high, when running the worse variant on half your users for two weeks would do real damage. A test only helps if the team is willing to be wrong and the cost of finding out is bearable.
When an A/B test does not fit, that is not a dead end, just a different set of tools. Usability tests, user interviews, surveys, plain qualitative observation, or a careful before-and-after comparison can all answer questions a randomized test cannot. For a small or new product, where there is not enough traffic anyway, leaning on an expert’s experience often beats any test. Each comes with its own caveats and deserves its own discussion.
Doing it correctly: the parts that matter
Let’s follow the checkout team through a second test, now that they know how to set it up beforehand. The walkthrough below follows the frequentist form this post focuses on. A Bayesian test takes a genuinely different route, but it stands on the same groundwork: a real hypothesis, enough data, and a stopping rule fixed in advance.
It starts with writing the hypothesis[3] down before anything ships. H₀ says the new button changes nothing; H₁ says it moves a specific metric in a specific direction. The team commits, on paper, to exactly what is being tested, on which metric, and which way they expect it to move.
Next they pick exactly one primary metric, checkout conversion, and treat everything else as secondary. The other numbers are still worth watching, but they do not get a vote in the decision. One metric decides the outcome, chosen before the data arrives, so there is nothing left to cherry-pick.
Before the test starts, they decide how much they can afford to be wrong. A test can mislead in two ways. It can declare the bolder button a winner when both buttons really perform the same, so the team ships a change that does nothing. Or it can overlook a button that genuinely converts better, so a real improvement gets dropped. They pick a small acceptable risk for each up front. The first, the risk of falsely calling a winner, is the significance level (α)[5], usually set to 5%: a one-in-twenty chance of being fooled by noise when the buttons truly perform the same. The second, the risk of missing a real improvement, is commonly set to 20%, which corresponds to a test with 80% power. The tighter they set either risk, the more data the test will need.
How big the test needs to be follows above all from how small an effect they want to be able to detect. The smaller that target, the more users it takes, and the cost climbs fast. How long it needs to run then follows from that sample size and how fast users arrive.
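A back-of-the-envelope version of that calculation, using the common normal-approximation formula for comparing two conversion rates. The 5% baseline, the 6% target, the 80% power, and the 2,000 users per day are assumptions picked for illustration, and the function names are made up.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, target: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH variant to detect a move from baseline to target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # chance to catch a real effect
    variance = baseline * (1 - baseline) + target * (1 - target)
    n = (z_alpha + z_power) ** 2 * variance / (target - baseline) ** 2
    return math.ceil(n)


def days_to_run(n_per_variant: int, daily_users: int) -> int:
    """Days until both variants are full, with traffic split evenly."""
    return math.ceil(2 * n_per_variant / daily_users)


n = sample_size_per_variant(baseline=0.05, target=0.06)
print(f"{n} users per variant, about {days_to_run(n, daily_users=2000)} days")
print(f"half the lift: {sample_size_per_variant(0.05, 0.055)} users per variant")
```

Under these assumptions the test needs roughly eight thousand users per variant. Halving the lift you want to detect roughly quadruples that number, which is why small effects get expensive so quickly.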
Then they let it run until it reaches that planned sample size, instead of stopping the moment it looks good. This is the peeking trap from before: ending early because the numbers look good still raises the chance of a fake winner. If continuous monitoring is genuinely needed, there are methods built for it[7]. The default is to decide the finish line in advance and hold it.
Finally they read the result correctly, and that hinges on what the p-value[4] actually means. It answers one narrow question: if the new button really changed nothing, how often would chance alone still produce a difference at least this big? The smaller it is, the worse chance works as an explanation. They compare it against the 5% line they fixed at the start: come in below it and the result counts as significant, good evidence the difference is real. Even then, the lift can be so small it is not worth shipping.
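One way to keep "significant" and "worth shipping" apart is to look at a confidence interval for the lift and compare it with the smallest lift the team would act on. The sketch below uses the usual normal-approximation interval; the counts and the 0.5 percentage point threshold are invented for illustration.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(conv_a: int, n_a: int, conv_b: int, n_b: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """Confidence interval for the absolute difference rate_B - rate_A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    std_err = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err


# Invented example: 500 of 10,000 convert on A, 600 of 10,000 on B.
low, high = lift_confidence_interval(500, 10_000, 600, 10_000)
min_worthwhile_lift = 0.005  # hypothetical business threshold, set before the test

statistically_significant = low > 0                  # interval excludes "no change"
clearly_worth_shipping = low >= min_worthwhile_lift  # even the low end pays off
print(f"lift between {low:.2%} and {high:.2%}")
print(f"significant: {statistically_significant}, clearly worth it: {clearly_worth_shipping}")
```

In this made-up case the difference is significant, but the lower end of the interval sits below the threshold, so the data cannot yet rule out a lift too small to matter.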
Conclusion
So, do I like A/B tests? Genuinely, yes, for the situations where the method actually fits and gets used correctly. Run well, an A/B test turns an argument about the checkout button into an answer you can act on.
For everyone else, the clearest move is often not to test at all. A decision made openly on judgment stands on firmer ground than the same decision dressed up in a test that never fit. And not testing rarely means flying blind: a usability test, a handful of user interviews, or a quick survey often answers the question better than a forced A/B test would have.
Before the next test, it is worth asking plainly whether you even can: enough traffic, a real hypothesis, and the willingness to accept an unwelcome answer. If any of those is missing, you’ll learn more by skipping the test.
Sometimes the real obstacle is not statistics but corporate reality. A “data-backed” decision is easier to sell than a judgment call, and that pressure is real. But there are stronger ways to make that case than leaning on a test that never fit, and credibility borrowed from a result that does not hold up is credibility you have to earn back later.
None of this means you have to master the statistics yourself. The deep math has its experts, and there are tools that run the calculations for you. What you do need is enough grasp of the concepts to pick the right tool and trust the answer it gives you.
Which is the whole point. A tool only works when it fits the problem and gets used correctly.
A/B tests are no exception.
Disclaimer
This is only a general introduction to the methodology behind A/B tests. A/B testing itself, the product decisions around it, tracking, feature flags, and the math underneath all make up a wide field that one short article cannot cover. I also have to admit that I have not fully mastered that math myself. In everyday work I lean on tools instead of running everything by hand, and the tooling is its own story: you rarely run a single test, you run many, and that takes a whole pipeline and ecosystem around it.
Footnotes
1. Correlation vs. causation: two things happening together does not mean one causes the other. Rain and a wet street show up at the same time, but the wet street does not cause the rain. The rain causes the wet street. Correlation alone does not tell you which way the arrow points, or whether some third factor drives both. The fallacy of reading causation into mere co-occurrence has a name: cum hoc ergo propter hoc.
2. The scientific method: the systematic, evidence-based way of building knowledge. You form a hypothesis, test it against data from a controlled experiment, and accept or reject it based on what the data shows rather than on intuition or authority.
3. Hypothesis: a proposed explanation, stated in advance and specific enough that an experiment could prove it wrong. It is the claim a test is built to check, not a conclusion reached afterwards.
4. p-value: if there were no real effect, how often would random variation alone produce a difference at least this large? That probability is the p-value, and the smaller it is, the weaker the case that there is no real effect.
5. Significance level (α): the line you draw in advance for how much surprise counts as real. A common choice is 5%: if a result this extreme would happen less than 5% of the time by pure chance, you call it significant. It also fixes how often you accept being wrong.
6. Planned sample size: the number of users, worked out in advance, that a test needs before it can reliably detect an effect of the size you care about.
7. Sequential testing: a family of statistical methods that let a running test be checked continuously without inflating the error rate, so looking early is valid by design.