A/B Test Sample Size: The Complete Guide to MDE, Power, and Significance
Choosing the wrong sample size is the single most common reason A/B tests fail to produce actionable results. This guide explains the three statistical parameters that determine how many visitors your test requires.
The Three Parameters That Determine Sample Size
To accurately design an A/B test, practitioners must master the interplay between three critical inputs: Confidence Level (significance), Power, and Expected Lift (also commonly called the Minimum Detectable Effect, or MDE). These three values form the mathematical foundation for calculating the required sample size of any experiment. Set them carelessly, and your results (positive or negative) are statistically untrustworthy.
The confidence level is the predetermined probability of correctly retaining the null hypothesis when it is actually true. Its complement, the significance level (α), is the probability of a false positive: the statistical engine declares a winning variation when, in reality, the treatment provides no true benefit over the control. This is known as a Type I error.
By default, the industry standard is α = 0.05, meaning the organization accepts a 5% risk of deploying a false positive. In the ABTestPro calculator, this corresponds to a Confidence setting of 95%, while 90% is often used for lower-risk changes. The critical value associated with this threshold is denoted Z_α.
Power is the probability of correctly rejecting the null hypothesis when a true effect actually exists. Its complement, the Type II error rate (β), is the probability of a false negative, meaning the test fails to detect a genuinely winning variation, leaving potential revenue on the table. A test with low statistical power has a high probability of committing a Type II error.
Power is typically set at 80% (β = 0.20), establishing a 20% risk of a false negative. The critical value associated with statistical power is denoted Z_β. Higher power requires a larger sample size; the tradeoff is test duration.
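As a concrete illustration, these two settings translate into critical values through the inverse normal CDF. Here is a minimal Python sketch, assuming a two-sided test and `scipy` available:

```python
from scipy.stats import norm

confidence = 0.95  # the calculator's Confidence setting
power = 0.80       # the calculator's Power setting

alpha = 1 - confidence

# Two-sided critical value for the significance threshold (Z_alpha)
z_alpha = norm.ppf(1 - alpha / 2)
# Critical value associated with statistical power (Z_beta)
z_beta = norm.ppf(power)

print(f"Z_alpha = {z_alpha:.3f}")  # 1.960 at 95% confidence
print(f"Z_beta  = {z_beta:.3f}")   # 0.842 at 80% power
```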
The Expected Lift (also widely referred to in the literature as the Minimum Detectable Effect, or MDE) is the smallest change in the primary metric that the experiment is calibrated to detect reliably. In ABTestPro, this is entered as the percentage change you want to detect between control and treatment.
Because the MDE is squared in the denominator of the sample size formula, reducing the MDE by half mathematically means the required traffic increases by 4X. This makes the Expected Lift the single most consequential input in the calculator.
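To spell out the arithmetic behind that 4× figure: since required sample size scales with the inverse square of the effect size, halving the MDE quadruples n:

$$n \propto \frac{1}{\text{MDE}^2} \quad\Longrightarrow\quad \frac{n_{\text{MDE}/2}}{n_{\text{MDE}}} = \left(\frac{\text{MDE}}{\text{MDE}/2}\right)^{2} = 4$$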
The Mathematics Behind the Test Statistic
These three parameters are mathematically linked through a set of standard statistical formulas. Understanding their structure explains why certain calculator inputs produce dramatically longer test durations than others.
The Z-Test Statistic
To determine whether the observed difference between a control group and a treatment group is statistically significant, analysts compute a Z-score, a standardized measure of how many standard deviations the observed difference lies from zero:
Formula 1: Test Statistic

$$Z = \frac{\hat{p}_t - \hat{p}_c}{SE}$$

where $\hat{p}_c$ and $\hat{p}_t$ are the observed conversion rates of the control and treatment groups.
The resulting Z-score is compared against the critical value Z_α derived from the chosen confidence level. If |Z| > Z_α, the result crosses the significance threshold.
Standard Error for Two Proportions
For binary proportion metrics such as conversion rates, the standard error is calculated using each group's observed proportion independently, using the unpooled formulation. This approach treats each group's variance separately rather than assuming a shared underlying rate:
Formula 2: Unpooled Standard Error

$$SE = \sqrt{\frac{\hat{p}_c(1-\hat{p}_c)}{n_c} + \frac{\hat{p}_t(1-\hat{p}_t)}{n_t}}$$
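Here is a minimal Python sketch of how Formulas 1 and 2 combine in practice; the visitor and conversion counts are invented for illustration:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical observed data
n_c, conv_c = 10_000, 500   # control: visitors, conversions
n_t, conv_t = 10_000, 580   # treatment: visitors, conversions

p_c = conv_c / n_c  # 0.050
p_t = conv_t / n_t  # 0.058

# Formula 2: unpooled standard error
se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)

# Formula 1: test statistic
z = (p_t - p_c) / se

# Two-sided p-value; significant at 95% confidence if |Z| > 1.96
p_value = 2 * (1 - norm.cdf(abs(z)))
print(f"Z = {z:.3f}, p-value = {p_value:.4f}")
```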
Required Sample Size Per Variation
When using ABTestPro to determine the required sample size before an experiment, the calculator solves for n, the minimum required audience size per variation:
Formula 3: Sample Size Per Variation

$$n = \frac{(Z_\alpha + Z_\beta)^2\,\bigl[p_c(1-p_c) + p_t(1-p_t)\bigr]}{(p_t - p_c)^2}$$

where $p_t$ is the treatment conversion rate implied by the Expected Lift. Because the effect size $(p_t - p_c)$ appears squared in the denominator, small lifts are disproportionately expensive to detect.
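A minimal Python sketch of Formula 3 makes the MDE-squared penalty easy to verify. The function name is hypothetical, and it assumes a relative lift and a two-sided Z_α, not necessarily ABTestPro's exact conventions:

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variation(p_c: float, mde: float,
                              confidence: float = 0.95,
                              power: float = 0.80) -> int:
    """Formula 3: minimum n per variation for a relative MDE.

    Hypothetical helper for illustration, not the ABTestPro implementation.
    """
    p_t = p_c * (1 + mde)                         # treatment rate implied by the lift
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value
    z_beta = norm.ppf(power)
    variance = p_c * (1 - p_c) + p_t * (1 - p_t)  # unpooled variance sum
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_t - p_c) ** 2)

# Halving the MDE roughly quadruples the required traffic:
print(sample_size_per_variation(0.05, 0.10))  # 10% relative lift
print(sample_size_per_variation(0.05, 0.05))  # 5% relative lift -> ~4x larger
```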
This reality forces organizations to balance their desire to detect small, iterative improvements against the financial and opportunity cost of running protracted experiments. The ABTestPro calculator quantifies this tradeoff directly so you can make an informed decision before committing to a test.
Why a 50/50 Split Is Non-Negotiable
A common instinct in experimentation is to protect against a risky variation by giving it less traffic. Product managers frequently request 80/20 or 90/10 allocations when testing something disruptive, reasoning that a smaller exposed audience limits downside risk while still producing useful data. This reasoning feels conservative. It is not. It sacrifices the statistical efficiency of the entire experiment.
Given a fixed total audience size, the standard error of the difference between two proportions is minimized only when both groups are exactly equal in size. Any imbalance increases the standard error, which means more traffic is required to achieve the same level of statistical confidence. The formula makes this visible:
Formula 4: Standard Error of the Difference

$$SE_{\text{diff}} = \sqrt{\frac{p_c(1-p_c)}{n_c} + \frac{p_t(1-p_t)}{n_t}}$$

For a fixed total $n_c + n_t$, this expression is minimized when $n_c = n_t$.
An unequal split concentrates variance in the smaller group. That group's estimates are noisier, and that noise propagates through the entire significance calculation. The test statistic must work harder to cross the Z_α threshold, which either requires more total traffic or produces a less reliable result at the same traffic level. An 80/20 split does not halve the risk. It substantially inflates the cost of reaching a trustworthy conclusion.
| Traffic Split | Illustrative Effect on Power | Consequence |
|---|---|---|
| 50 / 50 | Optimal | Maximum statistical efficiency; shortest test duration |
| 70 / 30 | Reduced | Noticeable power loss; requires more total traffic |
| 80 / 20 | Significantly Reduced | Substantially longer test duration to compensate |
| 90 / 10 | Severely Degraded | Power can drop dramatically, making most tests near-unusable |
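The pattern in the table can be made concrete with a minimal Python sketch using the standard normal-approximation power formula for a two-proportion z-test; the audience size, baseline rate, and lift below are invented, and this is not ABTestPro's exact engine:

```python
from math import sqrt
from scipy.stats import norm

def power_at_split(total_n: int, p_c: float, lift: float,
                   control_share: float, confidence: float = 0.95) -> float:
    """Approximate power of a two-proportion z-test at a given traffic split."""
    n_c = total_n * control_share
    n_t = total_n * (1 - control_share)
    p_t = p_c * (1 + lift)
    # Formula 4: standard error of the difference
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)
    # Normal approximation: probability of crossing the threshold
    # under the alternative hypothesis
    return norm.cdf((p_t - p_c) / se - z_alpha)

# Same total audience, same true lift; only the split changes
for c in (50, 70, 80, 90):
    p = power_at_split(100_000, 0.05, 0.05, c / 100)
    print(f"{c}/{100 - c} split: power = {p:.2f}")
```

Running this shows power declining monotonically as the split grows more lopsided, exactly the pattern summarized above.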
How to Use the Statistical Power Sensitivity Table
The ABTestPro calculator is engineered to move beyond producing a single sample-size figure by offering an advanced Statistical Power Sensitivity Table. This matrix shows the achievable statistical power at the intersection of every Expected Lift level and test duration, giving practitioners a full picture of the tradeoff space rather than a single answer.
Reading the Table
The y-axis (rows) represents different Expected Lift values, ranging from small incremental changes to large, disruptive effects. The x-axis (columns) represents test duration, which may be expressed in days, weeks, or other units depending on your audience frequency setting. Each cell shows the statistical power achievable at that specific lift-and-duration combination, given your entered audience size and baseline conversion rate.
- Rows: each row represents a different Expected Lift threshold, ranging from small gains that require more time to large effects detectable quickly.
- Columns: each column represents a duration increment; the unit (days, weeks, etc.) matches the audience frequency you entered in the calculator.
- Cells: each cell shows the statistical power you would achieve if you ran the test for that duration targeting that lift. The higher the value, the more likely the test is to detect a real effect of that size.
Find the row matching your target lift, then scan across to the column where power first reaches an acceptable threshold. That is your minimum required test duration.
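That scan is mechanical enough to automate. In the minimal Python sketch below, the matrix values are placeholders shaped like the business example that follows, not real calculator output:

```python
# Hypothetical power matrix: rows = Expected Lift, columns = duration in days
durations = [7, 14, 30, 60]
power_table = {
    "2% lift": [0.18, 0.31, 0.55, 0.81],
    "3% lift": [0.35, 0.58, 0.82, 0.97],
    "5% lift": [0.62, 0.85, 0.98, 0.99],
}

def min_duration(lift: str, threshold: float = 0.80) -> int | None:
    """First duration at which power reaches the acceptable threshold."""
    for days, power in zip(durations, power_table[lift]):
        if power >= threshold:
            return days
    return None  # never reaches the threshold within the table

print(min_duration("5% lift"))  # -> 14
```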
A Practical Business Example
An analyst at a mid-size e-commerce company inputs their weekly audience size and baseline checkout conversion rate into ABTestPro. Reading across the Sensitivity Table reveals a picture like this:
| Expected Lift (Y-Axis Row) | Duration to Reach 80% Power | Strategic Read |
|---|---|---|
| 2% lift | ~60 days | Requires nearly 9 weeks of uninterrupted traffic to reach adequate power |
| 3% lift | ~30 days | One month, feasible within a standard sprint cycle |
| 5% lift | ~14 days | Two weeks, ideal for high-cadence experimentation programs |
The business must then make a strategic determination: does the organization wait an extra six weeks to detect a minor incremental gain, or does it launch a bolder, more disruptive test immediately to search for the 5% lift? Neither answer is universally correct. The right choice depends on traffic cost, sprint length, and revenue targets.
The Sensitivity Table bridges the gap between abstract statistical theory and practical project management. In an enterprise environment, traffic is a finite resource and testing bandwidth is limited. This matrix allows data scientists and product managers to align their testing cadence directly with revenue targets and business cycles, quantifying the risk-reward calculus that is otherwise left to intuition alone.
Ready to Calculate Your Sample Size?
Put these concepts to work. Enter your audience size, baseline conversion rate, and Expected Lift in the free ABTestPro calculator.
Open the Free Calculator