Statistical Methods

A/B Test Sample Size: The Complete Guide to MDE, Power, and Significance

Choosing the wrong sample size is the single most common reason A/B tests fail to produce actionable results. This guide explains the three statistical parameters that determine how many visitors your test requires.

Open the Calculator

The Three Parameters That Determine Sample Size

To accurately design an A/B test, practitioners must master the interplay between three critical inputs: Confidence Level (significance), Power, and Expected Lift (also commonly called the Minimum Detectable Effect, or MDE). These three values form the mathematical foundation for calculating the required sample size of any experiment. Set them carelessly, and your results (positive or negative) are statistically untrustworthy.

Parameter 01
Confidence Level (Significance / Type I Error)
α / Alpha

The confidence level is the predetermined probability of correctly retaining the null hypothesis when it is actually true. Its complement, the significance level (α), is the probability of a false positive: the statistical engine declares a winning variation when, in reality, the treatment provides no true benefit over the control. This is known as a Type I error.


By default, the industry standard is α = 0.05, meaning the organization accepts a 5% baseline risk of deploying a false positive. In the ABTestPro calculator, this corresponds to a Confidence setting of 95%, while 90% is often used for lower-risk changes. The critical value associated with this threshold is denoted Zₐ.

Parameter 02
Statistical Power (Type II Error)
1 − β / Beta

Power is the probability of correctly rejecting the null hypothesis when a true effect actually exists. Its complement, the Type II error rate (β), is the probability of a false negative: the test fails to detect a genuinely winning variation, leaving potential revenue on the table. A test with low statistical power has a high probability of committing a Type II error.


Power is typically set at 80% (β = 0.20), accepting a 20% risk of a false negative. The critical value associated with statistical power is denoted Zᵦ. Higher power requires a larger sample size; the tradeoff is test duration.
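Both critical values can be recovered from the confidence and power settings with the inverse normal CDF. A minimal Python sketch using only the standard library (the function name and defaults are illustrative, not part of ABTestPro):

```python
from statistics import NormalDist

def critical_values(confidence=0.95, power=0.80, two_sided=True):
    """Return (z_alpha, z_beta) for the given confidence and power settings."""
    nd = NormalDist()  # standard normal distribution
    alpha = 1 - confidence
    # For a two-sided test, alpha is split across both tails
    z_alpha = nd.inv_cdf(1 - alpha / 2) if two_sided else nd.inv_cdf(1 - alpha)
    z_beta = nd.inv_cdf(power)  # power is a one-tailed quantity
    return z_alpha, z_beta

z_a, z_b = critical_values()
print(f"z_alpha = {z_a:.3f}, z_beta = {z_b:.3f}")  # 1.960 and 0.842
```

These are exactly the 1.96 and 0.842 values that appear in the sample size formula below.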

Parameter 03
Expected Lift (Minimum Detectable Effect)
MDE / δ (Delta)

The Expected Lift (also widely referred to in the literature as the Minimum Detectable Effect, or MDE) is the smallest change in the primary metric that the experiment is calibrated to detect reliably. In ABTestPro, this is entered as the percentage change you want to detect between control and treatment.


Because the MDE is squared in the denominator of the sample size formula, halving the MDE roughly quadruples the required traffic. This makes the Expected Lift the single most consequential input in the calculator.

The Mathematics Behind the Test Statistic

These three parameters are mathematically linked through a set of standard statistical formulas. Understanding their structure explains why certain calculator inputs produce dramatically longer test durations than others.

The Z-Test Statistic

To determine whether the observed difference between a control group and a treatment group is statistically significant, analysts compute a Z-score, a standardized measure of how many standard deviations the observed difference lies from zero:

Formula 1: Test Statistic

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{SE}$$
$\hat{p}_1, \hat{p}_2$: Observed conversion rates for the control and treatment groups
$SE$: Standard error of the difference between the two proportions

The resulting Z-score is compared against the critical value Zₐ derived from the chosen confidence level. If |Z| > Zₐ, the result crosses the significance threshold.

Standard Error for Two Proportions

For binary proportion metrics such as conversion rates, the standard error is calculated using each group's observed proportion independently, using the unpooled formulation. This approach treats each group's variance separately rather than assuming a shared underlying rate:

Formula 2: Unpooled Standard Error

$$SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
$\hat{p}_1$: Observed conversion rate of the control group
$\hat{p}_2$: Observed conversion rate of the treatment group
$n_1, n_2$: Sample sizes of the control and treatment groups
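Together, Formulas 1 and 2 reduce to a few lines of code. A sketch in Python, assuming raw conversion counts as inputs (the helper name is hypothetical):

```python
from math import sqrt

def z_statistic(conversions_1, n1, conversions_2, n2):
    """Two-proportion z-score using the unpooled standard error (Formulas 1 and 2)."""
    p1, p2 = conversions_1 / n1, conversions_2 / n2
    # Unpooled SE: each group's variance is estimated separately
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

# 5.0% control vs 5.7% treatment on 10,000 visitors each
z = z_statistic(500, 10_000, 570, 10_000)
print(f"Z = {z:.2f}")  # |Z| > 1.96, so this crosses the 95% two-sided threshold
```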

Required Sample Size Per Variation

When using ABTestPro to determine the required sample size before an experiment, the calculator solves for n, the minimum required audience size per variation:

Formula 3: Sample Size Per Variation

$$n = \frac{(Z_{\alpha} + Z_{\beta})^2 \cdot (p_1(1-p_1) + p_2(1-p_2))}{\delta^2}$$
$Z_{\alpha}$: Critical value for the chosen confidence level (e.g., 1.645 for 95% one-sided; 1.96 for 95% two-sided)
$Z_{\beta}$: Critical value for the chosen power level (e.g., 0.842 for 80% power)
$p_1$: Baseline conversion rate (entered as the Baseline Conversion Rate in the calculator)
$p_2$: Expected conversion rate of the treatment group ($p_1 + \delta$)
$\delta$: Absolute minimum detectable effect; the Expected Lift expressed as an absolute difference in proportions ($p_2 - p_1$)
⚠️
The MDE Quadrupling Rule: Because the Expected Lift (δ) appears squared in the denominator, halving the MDE roughly quadruples the required traffic. A test designed to detect a 1% lift requires roughly four times as many visitors as one designed to detect a 2% lift, all else being equal.

This reality forces organizations to balance their desire to detect small, iterative improvements against the financial and opportunity cost of running protracted experiments. The ABTestPro calculator quantifies this tradeoff directly so you can make an informed decision before committing to a test.
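Formula 3, and the quadrupling rule with it, can be checked directly. A minimal sketch (the function name is illustrative; the defaults assume 95% two-sided confidence and 80% power):

```python
from math import ceil

def sample_size_per_variation(p1, delta, z_alpha=1.96, z_beta=0.842):
    """Minimum sample size per variation (Formula 3) for an absolute lift delta."""
    p2 = p1 + delta
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / delta ** 2)

# Halving the absolute MDE roughly quadruples the required traffic
n_2pct = sample_size_per_variation(0.05, 0.02)  # detect 5% -> 7%
n_1pct = sample_size_per_variation(0.05, 0.01)  # detect 5% -> 6%
print(n_2pct, n_1pct)  # the smaller MDE needs roughly 3.7x the visitors here
```

The ratio is not exactly 4 because the variance sum also shrinks slightly with the smaller $p_2$; the squared δ in the denominator dominates.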

Why a 50/50 Split Is Non-Negotiable

A common instinct in experimentation is to protect against a risky variation by giving it less traffic. Product managers frequently request 80/20 or 90/10 allocations when testing something disruptive, reasoning that a smaller exposed audience limits downside risk while still producing useful data. This reasoning feels conservative. It is not. It undermines the statistical validity of the entire experiment.

Given a fixed total audience size, the standard error of the difference between two proportions is minimized when both groups are equal in size (exactly so when the two conversion rates are equal, and very nearly so in practice). Any imbalance increases the standard error, which means more traffic is required to achieve the same level of statistical confidence. The formula makes this visible:

Formula 4: Standard Error of the Difference

$$SE_{diff} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$
$n_1, n_2$: Sample sizes of control and treatment groups. To minimize $SE$ given a fixed total $N = n_1 + n_2$, the values $n_1$ and $n_2$ must be equal.

An unequal split concentrates variance in the smaller group. That group's estimates are noisier, and that noise propagates through the entire significance calculation. The test statistic must work harder to cross the Zₐ threshold, which either requires more total traffic or produces a less reliable result at the same traffic level. An 80/20 split does not halve the risk. It substantially inflates the cost of reaching a trustworthy conclusion.
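The optimality claim can be verified with one line of calculus. Writing $n_1 = tN$ and $n_2 = (1-t)N$ for a fixed total $N$, the squared standard error is

$$SE^2(t) = \frac{p_1(1-p_1)}{tN} + \frac{p_2(1-p_2)}{(1-t)N}$$

Setting the derivative with respect to $t$ to zero gives

$$-\frac{p_1(1-p_1)}{t^2 N} + \frac{p_2(1-p_2)}{(1-t)^2 N} = 0 \quad\Rightarrow\quad \frac{t}{1-t} = \sqrt{\frac{p_1(1-p_1)}{p_2(1-p_2)}}$$

At design time the two rates are assumed close ($p_2 \approx p_1$), so the right-hand side is approximately 1 and the optimum is $t = 1/2$: an even split.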

| Traffic Split | Illustrative Effect on Power | Consequence |
| --- | --- | --- |
| 50 / 50 | Optimal | Maximum statistical efficiency; shortest test duration |
| 70 / 30 | Reduced | Noticeable power loss; requires more total traffic |
| 80 / 20 | Significantly Reduced | Substantially longer test duration to compensate |
| 90 / 10 | Severely Degraded | Power can drop dramatically, making most tests near-unusable |
The rule: A 50/50 split is not a stylistic convention. It is the mathematically optimal allocation for two-group experiments. It minimizes the required sample size, maximizes the reliability of the result, and ensures that any observed difference between groups reflects the variation being tested rather than an artifact of unequal variance. The ABTestPro calculator defaults to a 50/50 split and flags deviations in the Test Setup panel.
ℹ️
The correct way to de-risk a disruptive test is not to reduce the treatment's traffic allocation. Instead, define guardrail metrics that will automatically pause the experiment if a key business metric degrades beyond an acceptable threshold. This preserves statistical integrity while managing real-world risk.
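The cost of an unequal split can be quantified directly from Formula 4. A short sketch, assuming both groups sit near the same baseline rate (the audience size and rate below are illustrative):

```python
from math import sqrt

def se_diff(p, n1, n2):
    """Unpooled SE of the difference when both groups are near rate p (Formula 4)."""
    return sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)

N, p = 20_000, 0.05  # fixed total audience, 5% baseline conversion rate
baseline_se = se_diff(p, N // 2, N // 2)
for share in (0.5, 0.7, 0.8, 0.9):
    n1 = int(N * share)
    inflation = se_diff(p, n1, N - n1) / baseline_se
    print(f"{share:.0%} / {1 - share:.0%} split: SE inflated {inflation:.2f}x")
```

Because required traffic scales with SE², the roughly 1.67x inflation at 90/10 translates to nearly 2.8x the visitors to reach the same significance threshold.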

How to Use the Statistical Power Sensitivity Table

The ABTestPro calculator goes beyond producing a single sample-size number by offering an advanced Statistical Power Sensitivity Table. This matrix shows the achievable statistical power at the intersection of every Expected Lift level and test duration, giving practitioners a full picture of the tradeoff space rather than a single answer.

Reading the Table

The y-axis (rows) represents different Expected Lift values, ranging from small incremental changes to large, disruptive effects. The x-axis (columns) represents test duration, which may be expressed in days, weeks, or other units depending on your audience frequency setting. Each cell shows the statistical power achievable at that specific lift-and-duration combination, given your entered audience size and baseline conversion rate.

Y-Axis (Rows)
Expected Lift

Each row represents a different lift threshold, ranging from small gains that require more time to large effects detectable quickly.

X-Axis (Columns)
Test Duration

Each column represents a duration increment. The unit (days, weeks, etc.) matches the audience frequency you entered in the calculator.

Each Cell
Power %

The statistical power you would achieve if you ran the test for that duration targeting that lift. The higher the value, the more confident you can be in detecting a real effect.

Your Goal
Navigate

Find the row matching your target lift, then scan across to the column where power first reaches an acceptable threshold. That is your minimum required test duration.

A Practical Business Example

An analyst at a mid-size e-commerce company inputs their weekly audience size and baseline checkout conversion rate into ABTestPro. Reading across the Sensitivity Table reveals a picture like this:

| Expected Lift (Y-Axis Row) | Duration to Reach 80% Power | Strategic Read |
| --- | --- | --- |
| 2% lift | ~60 days | Requires nearly 9 weeks of uninterrupted traffic to reach adequate power |
| 3% lift | ~30 days | One month; feasible within a standard sprint cycle |
| 5% lift | ~14 days | Two weeks; ideal for high-cadence experimentation programs |

The business must then make a strategic determination: does the organization wait an extra six weeks to detect a minor incremental gain, or does it launch a bolder, more disruptive test immediately to search for the 5% lift? Neither answer is universally correct. The right choice depends on traffic cost, sprint length, and revenue targets.
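A sensitivity grid of this kind can be approximated with the standard normal power formula. A sketch assuming a 50/50 split and weekly traffic as the duration unit; all input values are hypothetical, not ABTestPro output:

```python
from math import sqrt
from statistics import NormalDist

def achievable_power(p1, relative_lift, weekly_visitors, weeks, z_alpha=1.96):
    """Approximate power for a relative lift after `weeks` of traffic, split 50/50."""
    n = weekly_visitors * weeks / 2                   # visitors per variation
    p2 = p1 * (1 + relative_lift)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)    # unpooled SE at size n
    return NormalDist().cdf((p2 - p1) / se - z_alpha)

# Rows: relative lift; columns: weeks of traffic (hypothetical 50,000 weekly visitors)
for lift in (0.05, 0.10, 0.20):
    cells = "  ".join(f"{achievable_power(0.05, lift, 50_000, w):4.0%}" for w in (1, 2, 4, 8))
    print(f"{lift:4.0%} lift | {cells}")
```

Scanning each row for the first cell at or above 80% reproduces the "minimum required duration" reading described above.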

💡
Pro tip: Use the Sensitivity Table before finalizing your hypothesis, not after. If your test hypothesis predicts a 1% lift but the table shows power barely reaches 40% after months of traffic, you have a resourcing problem, not a testing problem. The table empowers you to either sharpen the hypothesis or adjust the test's scope before a single visitor is exposed.

The Sensitivity Table bridges the gap between abstract statistical theory and practical project management. In an enterprise environment, traffic is a finite resource and testing bandwidth is limited. This matrix allows data scientists and product managers to align their testing cadence directly with revenue targets and business cycles, quantifying the risk-reward calculus that is otherwise left to intuition alone.

Ready to Calculate Your Sample Size?

Put these concepts to work. Enter your audience size, baseline conversion rate, and Expected Lift in the free ABTestPro calculator.

Open the Free Calculator