A/B Testing Statistical Significance: Peeking, Test Direction, and Inconclusive Results
Your sample size is set and your test is live. Now comes the part where most experimentation programs quietly undermine themselves by checking results too early, choosing the wrong test direction, or misreading a flat result. This guide covers all three.
The Peeking Problem
Frequentist A/B testing is built on a single, non-negotiable premise: the significance threshold produces valid results only when evaluated at one predetermined moment, after a fixed sample size has been fully collected. This is what statisticians call a fixed-horizon design, and it has a critical implication that many practitioners ignore entirely.
When an analyst watches live results and stops the test as soon as the numbers look favorable, they are not running the experiment they designed. Early data is inherently noisy. Metrics swing considerably before enough observations have accumulated for estimates to stabilize, and every check made with intent to stop introduces another opportunity for random variance to be misread as a real effect. The cumulative result is a Type I error rate far higher than the confidence level displayed on the dashboard. This is the peeking problem, and its consequences are concrete: teams ship changes that do nothing, or actively hurt conversion performance, while their tools report a statistically significant win.
What Peeking Actually Does to Your False Positive Rate
Each additional check of a live experiment compounds the risk of a false positive. If an analyst stops the moment the standard 5% significance threshold is crossed, the true false positive rate climbs with every interim look: with five evenly spaced checks it reaches roughly 14%, and with ten it approaches 20%.
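The inflation is easy to reproduce. The sketch below is a minimal Monte Carlo simulation in plain Python (the function name and default parameters are illustrative, not part of any tool): it runs repeated A/A tests, where both arms share the same true conversion rate, so every "significant" result is by construction a false positive. It then compares an analyst who stops at the first significant interim look against one who evaluates only once, at the fixed horizon.

```python
import random

random.seed(42)

def peeking_vs_fixed(n_users=1000, looks=10, true_rate=0.10,
                     z_crit=1.96, trials=1000):
    """A/A simulation: both arms convert at the same true rate, so any
    significant result is a false positive. Returns the false positive
    rate for a peeking analyst vs. a fixed-horizon analyst."""
    peeking_fp = fixed_fp = 0
    step = n_users // looks
    for _ in range(trials):
        conv_a = conv_b = 0
        stopped_early = False
        for i in range(1, n_users + 1):
            conv_a += random.random() < true_rate
            conv_b += random.random() < true_rate
            if i % step == 0:  # an interim (or final) look
                p_pool = (conv_a + conv_b) / (2 * i)
                se = (2 * p_pool * (1 - p_pool) / i) ** 0.5
                z = 0.0 if se == 0 else (conv_a - conv_b) / (i * se)
                if not stopped_early and abs(z) > z_crit:
                    peeking_fp += 1   # stopped at first "significant" look
                    stopped_early = True
                if i == n_users and abs(z) > z_crit:
                    fixed_fp += 1     # evaluated once, at the horizon
    return peeking_fp / trials, fixed_fp / trials

peek_rate, fixed_rate = peeking_vs_fixed()
print(f"false positive rate with peeking: {peek_rate:.1%}")
print(f"false positive rate at fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon analyst sees a false positive rate near the nominal 5%; the peeking analyst, evaluating the same data, sees several times that.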
The solution is straightforward to state and genuinely difficult to enforce: commit to the sample size calculated by ABTestPro before traffic flows, and do not make a stopping decision until that number is reached. Monitoring results mid-experiment is not inherently harmful, but any decision to stop or extend the test based on what is seen invalidates the statistical framework entirely. The target sample size is not a suggestion. It is the boundary condition that makes the significance calculation meaningful.
One-Sided vs. Two-Sided Tests: Which Should You Use?
Every A/B test in ABTestPro requires you to select a test direction. The choice between a one-sided and two-sided test has a direct, substantial impact on how much traffic your experiment requires and how quickly it can reach significance.
One-Sided Test
Tests only whether Variation B is better than Variation A. Concentrates the entire significance threshold (α) in one tail, giving the test more power to detect a positive lift.
- Reaches significance with 20–30% less data than a two-sided test
- Enables shorter test durations, meaning more experiments per quarter
- Ideal when the business would keep the control regardless of a negative result
- Selected via the “One-sided” option in ABTestPro's Test Setup panel
Two-Sided Test
Tests whether Variation B is better or worse than Variation A. Splits the significance threshold (α) equally across both tails, reserving sensitivity for a negative effect as well.
- Requires 20–30% more data to reach the same confidence level
- Appropriate when a negative result would trigger a different action than simply reverting
- Required for regulatory, compliance, or safety-critical experiments
- Selected via the “Two-sided” option in ABTestPro's Test Setup panel
Why One-Sided Tests Are Superior for Conversion Optimization
A two-sided test allocates the Type I error threshold (α) equally across both tails of the distribution. Specifically, it tests whether Variation B is significantly better or significantly worse than Variation A. In a standard conversion optimization context, this is statistically wasteful. If a tested variation is significantly worse, the organization simply reverts to the control. The negative direction produces no different business outcome than an inconclusive result.
A one-sided test focuses its entire evidential threshold on the direction that drives conversion decisions: the positive lift. This efficiency means teams can run experiments faster, ship winning variations sooner, and increase the total throughput of the experimentation program without requiring additional traffic.
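The traffic savings falls directly out of the standard normal-approximation sample size formula for comparing two proportions. The sketch below is a minimal illustration of that formula, not ABTestPro's actual calculation; the function name and defaults are hypothetical:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_control, absolute_lift,
                        alpha=0.05, power=0.80, two_sided=True):
    """Normal-approximation sample size per arm to detect an absolute
    lift in conversion rate at the given significance level and power."""
    nd = NormalDist()
    # A two-sided test splits alpha across both tails; one-sided spends
    # it all on the positive direction, so its critical z is smaller.
    z_alpha = nd.inv_cdf(1 - alpha / 2) if two_sided else nd.inv_cdf(1 - alpha)
    z_power = nd.inv_cdf(power)
    p_variant = p_control + absolute_lift
    variance = (p_control * (1 - p_control)
                + p_variant * (1 - p_variant))
    return math.ceil(variance * ((z_alpha + z_power) / absolute_lift) ** 2)

# Detecting a lift from 5.0% to 6.0% conversion at 80% power:
n_two = sample_size_per_arm(0.05, 0.01, two_sided=True)
n_one = sample_size_per_arm(0.05, 0.01, two_sided=False)
print(n_two, n_one, f"savings: {1 - n_one / n_two:.0%}")
```

At α = 0.05 and 80% power, the one-sided test needs roughly 21% fewer visitors per arm, which is where the 20–30% range quoted above comes from (the exact figure shifts with α and power).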
What Do I Do If My Results Are Inconclusive?
An inconclusive test (one that fails to reject the null hypothesis) is not a failure. It is a critical, highly valuable data point. A flat result tells you that, at your current traffic levels and chosen Expected Lift threshold, no statistically reliable difference was detected. That is useful information. The mistake is treating it as a dead end rather than a diagnostic starting point.
When a test flatlines, execute this three-step protocol before drawing any conclusions:
Examine Guardrail Metrics
Before declaring a test truly flat, confirm the variation did not inadvertently degrade a secondary system-health metric. An experiment can appear neutral on its primary metric while quietly slowing page loads, increasing support ticket volume, or elevating cart abandonment rate. If a guardrail metric has moved significantly in the negative direction, the test is not inconclusive. It is a net-negative result that should be killed.
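For a guardrail expressed as an event rate, this check is a one-sided two-proportion test in the harmful direction. The sketch below assumes a hypothetical cart-abandonment guardrail; the function name, threshold, and numbers are illustrative:

```python
def guardrail_flag(events_a, n_a, events_b, n_b, z_crit=1.645):
    """One-sided check: has a 'bad' event rate (e.g. cart abandonment)
    risen significantly in the variation (B) relative to control (A)?
    z_crit = 1.645 corresponds to a one-sided 5% significance level."""
    p_a, p_b = events_a / n_a, events_b / n_b
    p_pool = (events_a + events_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    return (p_b - p_a) / se > z_crit  # True means the guardrail is breached

# Primary metric flat, but abandonment rose from 20.0% to 21.5%:
print(guardrail_flag(2_000, 10_000, 2_150, 10_000))
```

If the flag fires, the "inconclusive" test is actually a loser on a dimension the primary metric never measured.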
Segment the Data
Break the inconclusive aggregate data down by device type, traffic source, and user type (new vs. returning visitors). A test that appears flat overall can conceal a large, statistically significant victory on one segment that is being masked by an equally large failure on another. A checkout redesign, for instance, may show no overall lift while driving a 15% conversion improvement on mobile that is entirely offset by a degradation on desktop. Segmented results can directly inform the next, more targeted hypothesis.
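The masking effect is easy to see with a standard pooled two-proportion z-test. The numbers below are hypothetical, chosen to mirror the checkout-redesign example: a flat aggregate hiding a significant mobile win offset by a desktop decline.

```python
def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-statistic for B vs. A (positive = B ahead)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    return (p_b - p_a) / se

# Hypothetical checkout-redesign results as (conversions, visitors):
segments = {
    "mobile":  ((400, 10_000), (460, 10_000)),   # A: 4.0%  B: 4.6% (+15%)
    "desktop": ((600, 10_000), (540, 10_000)),   # A: 6.0%  B: 5.4% (-10%)
}

# Aggregated, both arms convert 1,000 of 20,000: a dead-flat z of 0.
print(f"overall: z = {two_proportion_z(1_000, 20_000, 1_000, 20_000):.2f}")
for name, ((ca, na), (cb, nb)) in segments.items():
    print(f"{name}: z = {two_proportion_z(ca, na, cb, nb):.2f}")
```

Mobile clears the 1.96 two-sided threshold on its own while the aggregate sits at exactly zero, which is precisely why segmentation belongs in the post-flat-result protocol. (Segment-level looks are exploratory; confirm any segment win with a fresh, targeted test.)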
Evaluate for Non-Inferiority
If the tested variation relies on cleaner underlying code, cheaper server infrastructure, or streamlined backend maintenance, it does not need to outperform the control on conversion to be highly valuable. If the data shows the new variation is no worse than the control, meaning it falls within a predefined non-inferiority margin (for example, conversion rate did not decline by more than 1%), the variation should be deployed to capture the operational benefits without sacrificing conversion performance.
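A non-inferiority check is itself a one-sided test, aimed at the margin rather than at zero. The sketch below is a minimal illustration with a hypothetical function name and example numbers; the 1% margin matches the example above:

```python
from statistics import NormalDist

def is_non_inferior(conv_a, n_a, conv_b, n_b, margin=0.01, alpha=0.05):
    """One-sided test of H0: the variation (B) is worse than control (A)
    by more than `margin` in absolute conversion rate. Rejecting H0
    supports non-inferiority."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = (p_b - p_a + margin) / se          # shift the null by the margin
    return z > NormalDist().inv_cdf(1 - alpha)  # critical z ~ 1.645

# Variation converts at 4.95% vs. the control's 5.00%: slightly behind,
# but well inside a 1-point margin at this traffic level.
print(is_non_inferior(5_000, 100_000, 4_950, 100_000, margin=0.01))
```

Note that the margin must be chosen before looking at the data, for the same reason the sample size must be: picking a margin that happens to fit the observed result is peeking by another name.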
Run Tests You Can Trust
Set your sample size before the test starts. Commit to running it to completion. ABTestPro calculates exactly what you need.
Open the Free Calculator