A/B Testing Statistical Significance: Peeking, Test Direction, and Inconclusive Results
Your sample size is set and your test is live. Now comes the part where most experimentation programs quietly undermine themselves by checking results too early, choosing the wrong test direction, or misreading a flat result. This guide covers all three.
The Peeking Problem
Frequentist A/B testing is built on a single, non-negotiable premise: the significance threshold produces valid results only when evaluated at one predetermined moment, after a fixed sample size has been fully collected. This is what statisticians call a fixed-horizon design, and it has a critical implication that many practitioners ignore entirely.
When an analyst watches live results and stops the test as soon as the numbers look favorable, they are not running the experiment they designed. Early data is inherently noisy. Metrics swing considerably before enough observations have accumulated for estimates to stabilize, and every check made with intent to stop introduces another opportunity for random variance to be misread as a real effect. The cumulative result is a Type I error rate far higher than the confidence level displayed on the dashboard. This is the peeking problem, and its consequences are concrete: teams ship changes that do nothing, or actively hurt conversion performance, while their tools report a statistically significant win.
What Peeking Actually Does to Your False Positive Rate
Each additional check of a live experiment compounds the risk of a false positive. If an analyst stops the moment the standard 5% significance threshold is crossed, the true false positive rate climbs with every interim look: with five evenly spaced checks it reaches roughly 14%, and with ten it approaches 20%.
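The inflation is easy to reproduce. The sketch below is a minimal Monte Carlo simulation in plain Python (the function name and default parameters are illustrative, not part of any tool): it runs repeated A/A tests, where both arms share the same true conversion rate, so every "significant" result is by construction a false positive. It then compares an analyst who stops at the first significant interim look against one who evaluates only once, at the fixed horizon.

```python
import random

random.seed(42)

def peeking_vs_fixed(n_users=1000, looks=10, true_rate=0.10,
                     z_crit=1.96, trials=1000):
    """A/A simulation: both arms convert at the same true rate, so any
    significant result is a false positive. Returns the false positive
    rate for a peeking analyst vs. a fixed-horizon analyst."""
    peeking_fp = fixed_fp = 0
    step = n_users // looks
    for _ in range(trials):
        conv_a = conv_b = 0
        stopped_early = False
        for i in range(1, n_users + 1):
            conv_a += random.random() < true_rate
            conv_b += random.random() < true_rate
            if i % step == 0:  # an interim (or final) look
                p_pool = (conv_a + conv_b) / (2 * i)
                se = (2 * p_pool * (1 - p_pool) / i) ** 0.5
                z = 0.0 if se == 0 else (conv_a - conv_b) / (i * se)
                if not stopped_early and abs(z) > z_crit:
                    peeking_fp += 1   # stopped at first "significant" look
                    stopped_early = True
                if i == n_users and abs(z) > z_crit:
                    fixed_fp += 1     # evaluated once, at the horizon
    return peeking_fp / trials, fixed_fp / trials

peek_rate, fixed_rate = peeking_vs_fixed()
print(f"false positive rate with peeking: {peek_rate:.1%}")
print(f"false positive rate at fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon analyst sees a false positive rate near the nominal 5%; the peeking analyst, evaluating the same data, sees several times that.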
The solution is straightforward to state and genuinely difficult to enforce: commit to the sample size calculated by ABTestPro before traffic flows, and do not make a stopping decision until that number is reached. Monitoring results mid-experiment is not inherently harmful, but any decision to stop or extend the test based on what is seen invalidates the statistical framework entirely. The target sample size is not a suggestion. It is the boundary condition that makes the significance calculation meaningful.
One-Sided vs. Two-Sided Tests: Which Should You Use?
Every A/B test in ABTestPro requires you to select a test direction. The choice between a one-sided and two-sided test has a direct, substantial impact on how much traffic your experiment requires and how quickly it can reach significance.
One-Sided Test
Tests only whether Variation B is better than Variation A. Concentrates the entire significance threshold (α) in one tail, giving the test more power to detect a positive lift.
- Reaches significance with 20–30% less data than a two-sided test
- Enables shorter test durations, meaning more experiments per quarter
- Ideal when the business would keep the control regardless of a negative result
- Selected via the “One-sided” option in ABTestPro's Test Setup panel
Two-Sided Test
Tests whether Variation B is better or worse than Variation A. Splits the significance threshold (α) equally across both tails, reserving sensitivity for a negative effect as well.
- Requires 20–30% more data to reach the same confidence level
- Appropriate when a negative result would trigger a different action than simply reverting
- Required for regulatory, compliance, or safety-critical experiments
- Selected via the “Two-sided” option in ABTestPro's Test Setup panel
Why One-Sided Tests Are Superior for Conversion Optimization
A two-sided test allocates the Type I error threshold (α) equally across both tails of the distribution. Specifically, it tests whether Variation B is significantly better or significantly worse than Variation A. In a standard conversion optimization context, this is statistically wasteful. If a tested variation is significantly worse, the organization simply reverts to the control. The negative direction produces no different business outcome than an inconclusive result.
A one-sided test focuses its entire evidential threshold on the direction that drives conversion decisions: the positive lift. This efficiency means teams can run experiments faster, ship winning variations sooner, and increase the total throughput of the experimentation program without requiring additional traffic.
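The traffic savings falls directly out of the standard normal-approximation sample size formula for comparing two proportions. The sketch below is a minimal illustration of that formula, not ABTestPro's actual calculation; the function name and defaults are hypothetical:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_control, absolute_lift,
                        alpha=0.05, power=0.80, two_sided=True):
    """Normal-approximation sample size per arm to detect an absolute
    lift in conversion rate at the given significance level and power."""
    nd = NormalDist()
    # A two-sided test splits alpha across both tails; one-sided spends
    # it all on the positive direction, so its critical z is smaller.
    z_alpha = nd.inv_cdf(1 - alpha / 2) if two_sided else nd.inv_cdf(1 - alpha)
    z_power = nd.inv_cdf(power)
    p_variant = p_control + absolute_lift
    variance = (p_control * (1 - p_control)
                + p_variant * (1 - p_variant))
    return math.ceil(variance * ((z_alpha + z_power) / absolute_lift) ** 2)

# Detecting a lift from 5.0% to 6.0% conversion at 80% power:
n_two = sample_size_per_arm(0.05, 0.01, two_sided=True)
n_one = sample_size_per_arm(0.05, 0.01, two_sided=False)
print(n_two, n_one, f"savings: {1 - n_one / n_two:.0%}")
```

At α = 0.05 and 80% power, the one-sided test needs roughly 21% fewer visitors per arm, which is where the 20–30% range quoted above comes from (the exact figure shifts with α and power).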
What Do I Do If My Results Are Inconclusive?
An inconclusive test (one that fails to reject the null hypothesis) is not a failure. It is a critical, highly valuable data point. A flat result tells you that, at your current traffic levels and chosen Expected Lift threshold, no statistically reliable difference was detected. That is useful information. The mistake is treating it as a dead end rather than a diagnostic starting point.
When a test flatlines, execute this three-step protocol before drawing any conclusions:
Examine Guardrail Metrics
Before declaring a test truly flat, confirm the variation did not inadvertently degrade a secondary system-health metric. An experiment can appear neutral on its primary metric while quietly slowing page loads, increasing support ticket volume, or elevating cart abandonment rate. If a guardrail metric has moved significantly in the negative direction, the test is not inconclusive. It is a net-negative result that should be killed.
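For a guardrail expressed as an event rate, this check is a one-sided two-proportion test in the harmful direction. The sketch below assumes a hypothetical cart-abandonment guardrail; the function name, threshold, and numbers are illustrative:

```python
def guardrail_flag(events_a, n_a, events_b, n_b, z_crit=1.645):
    """One-sided check: has a 'bad' event rate (e.g. cart abandonment)
    risen significantly in the variation (B) relative to control (A)?
    z_crit = 1.645 corresponds to a one-sided 5% significance level."""
    p_a, p_b = events_a / n_a, events_b / n_b
    p_pool = (events_a + events_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    return (p_b - p_a) / se > z_crit  # True means the guardrail is breached

# Primary metric flat, but abandonment rose from 20.0% to 21.5%:
print(guardrail_flag(2_000, 10_000, 2_150, 10_000))
```

If the flag fires, the "inconclusive" test is actually a loser on a dimension the primary metric never measured.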
Segment the Data
Break the inconclusive aggregate data down by device type, traffic source, and user type (new vs. returning visitors). A test that appears flat overall can conceal a large, statistically significant victory on one segment that is being masked by an equally large failure on another. A checkout redesign, for instance, may show no overall lift while driving a 15% conversion improvement on mobile that is entirely offset by a degradation on desktop. Segmented results can directly inform the next, more targeted hypothesis.
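The masking effect is easy to see with a standard pooled two-proportion z-test. The numbers below are hypothetical, chosen to mirror the checkout-redesign example: a flat aggregate hiding a significant mobile win offset by a desktop decline.

```python
def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-statistic for B vs. A (positive = B ahead)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    return (p_b - p_a) / se

# Hypothetical checkout-redesign results as (conversions, visitors):
segments = {
    "mobile":  ((400, 10_000), (460, 10_000)),   # A: 4.0%  B: 4.6% (+15%)
    "desktop": ((600, 10_000), (540, 10_000)),   # A: 6.0%  B: 5.4% (-10%)
}

# Aggregated, both arms convert 1,000 of 20,000: a dead-flat z of 0.
print(f"overall: z = {two_proportion_z(1_000, 20_000, 1_000, 20_000):.2f}")
for name, ((ca, na), (cb, nb)) in segments.items():
    print(f"{name}: z = {two_proportion_z(ca, na, cb, nb):.2f}")
```

Mobile clears the 1.96 two-sided threshold on its own while the aggregate sits at exactly zero, which is precisely why segmentation belongs in the post-flat-result protocol. (Segment-level looks are exploratory; confirm any segment win with a fresh, targeted test.)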
Evaluate for Non-Inferiority
If the tested variation relies on cleaner underlying code, cheaper server infrastructure, or streamlined backend maintenance, it does not need to outperform the control on conversion to be highly valuable. If the data shows the new variation is no worse than the control, meaning it falls within a predefined non-inferiority margin (for example, conversion rate did not decline by more than 1%), the variation should be deployed to capture the operational benefits without sacrificing conversion performance.
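A non-inferiority check is itself a one-sided test, aimed at the margin rather than at zero. The sketch below is a minimal illustration with a hypothetical function name and example numbers; the 1% margin matches the example above:

```python
from statistics import NormalDist

def is_non_inferior(conv_a, n_a, conv_b, n_b, margin=0.01, alpha=0.05):
    """One-sided test of H0: the variation (B) is worse than control (A)
    by more than `margin` in absolute conversion rate. Rejecting H0
    supports non-inferiority."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = (p_b - p_a + margin) / se          # shift the null by the margin
    return z > NormalDist().inv_cdf(1 - alpha)  # critical z ~ 1.645

# Variation converts at 4.95% vs. the control's 5.00%: slightly behind,
# but well inside a 1-point margin at this traffic level.
print(is_non_inferior(5_000, 100_000, 4_950, 100_000, margin=0.01))
```

Note that the margin must be chosen before looking at the data, for the same reason the sample size must be: picking a margin that happens to fit the observed result is peeking by another name.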
Run Tests You Can Trust
Set your sample size before the test starts. Commit to running it to completion. ABTestPro calculates exactly what you need.
Open the Free Calculator