Test Design

How to Define Your Metrics Before an A/B Test to Avoid Data Dredging

The most damaging mistakes in experimentation do not happen during the test. They happen before it. Teams that skip the measurement plan are most at risk. Without a predefined metric hierarchy, even a technically correct test can be selectively interpreted to confirm whatever the team already believed.


Why a Measurement Plan Is Non-Negotiable

The transition from a theoretical hypothesis to a live, production-ready experiment requires a meticulously documented measurement plan. This means deciding, before a single visitor enters the experiment, exactly which metric determines success, which metrics provide diagnostic context, and which metrics define the boundaries of acceptable risk.

Without this discipline, organizations inevitably fall into one of two traps. The first is data dredging: selectively mining post-test results for any positive signal that can be used to justify deploying a variation, regardless of whether the original hypothesis was validated. The second is institutional amnesia: the failure to document test results in a retrievable format, so that the same failed hypotheses are unknowingly repeated by different teams, or even the same team, months or years later.

What Data Dredging Looks Like in Practice

A team launches a new product page layout with the hypothesis that it will increase Add-to-Cart rate. The test concludes and Add-to-Cart rate shows no significant change. Rather than calling the test inconclusive, the analyst searches the data, finds that session depth increased by 12%, and declares the test a winner on that basis. This is data dredging. Session depth was not the predefined success criterion, and given the number of metrics examined, a positive result by random chance was nearly inevitable. Deploying based on this “win” is statistically invalid.

⚠️ The multiple comparisons problem: If you test enough metrics, one will appear significant by random chance. At a 95% confidence level, approximately 1 in 20 metrics will show a false positive even when nothing real is happening. Defining the primary metric in advance and treating all others as secondary diagnostics is the only statistically valid defense against this.
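The arithmetic behind that 1-in-20 figure is easy to verify with a quick simulation. The sketch below is purely illustrative (it is not ABTestPro code): both groups are drawn from the same distribution, so every metric that clears the 0.05 threshold is a false positive.

```python
# Illustrative sketch: two identical groups compared across 20 metrics.
# With no real effect anywhere, roughly alpha * n_metrics comparisons
# will still look "significant" -- the multiple comparisons problem.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_visitors = 10_000
n_metrics = 20
alpha = 0.05  # 95% confidence level

false_positives = 0
for _ in range(n_metrics):
    control = rng.normal(size=n_visitors)
    variant = rng.normal(size=n_visitors)  # same distribution: no true effect
    _, p_value = stats.ttest_ind(control, variant)
    if p_value < alpha:
        false_positives += 1

print(f"{false_positives} of {n_metrics} null metrics appear 'significant' by chance alone")
```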

The Metric Hierarchy: Primary, Secondary, and Guardrail

An experiment must be judged against a predefined framework of three metric tiers. Each tier serves a distinct function. Defining all three before traffic begins flowing is essential to maintaining statistical integrity and making clean deployment decisions.

Tier 1
Primary Metric: The Overall Evaluation Criterion (OEC)
The single, definitive metric that determines the ultimate success or failure of the experiment. Only one primary metric may be selected per test. This constraint is non-negotiable. Selecting multiple primary metrics invalidates the statistical integrity of the test by reintroducing the multiple comparisons problem through the back door. The primary metric must be directly tied to the core hypothesis and must be identified before the test launches.
Examples: Conversion Rate, Average Order Value (AOV), Revenue Per Visitor (RPV), Sign-up Rate, Checkout Completion Rate
Tier 2
Secondary Metrics: Diagnostic Signals
Metrics that illuminate the mechanism behind the primary metric result. When a primary metric moves, secondary metrics indicate where in the user flow the change occurred. When a primary metric does not move, secondary metrics can reveal whether something else shifted upstream. Secondary metrics are not verdict metrics: they neither confirm nor deny the hypothesis. A primary metric win that is not corroborated by any secondary metric movement deserves scrutiny before deployment.
Examples: Click-Through Rate (CTR), Add-to-Cart Rate, Form Completion Rate, Session Depth, Time on Page, Scroll Depth
Tier 3
Guardrail Metrics: Risk Boundaries
A class of metrics monitored specifically to catch negative consequences in areas of the product or business that the experiment was not designed to affect. Every guardrail carries a predefined tolerance threshold. If results cross that threshold in the wrong direction, the experiment is paused and reviewed regardless of what the primary metric shows. A conversion rate increase that comes at the cost of page load time or customer support volume is not a business win.
Examples: Page Load Speed, Cart Abandonment Rate, Bounce Rate, Support Ticket Volume, Bug Report Rate, Revenue Per Session
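One practical way to enforce this discipline is to write the hierarchy down as a structured, versioned artifact before launch. The sketch below is a hypothetical example of what that might look like in Python; the field names, metric names, and thresholds are assumptions for illustration, not a prescribed schema.

```python
# Hypothetical sketch of a pre-registered metric hierarchy.
# Exactly one primary metric; secondary metrics are diagnostic only;
# every guardrail carries a predefined maximum acceptable degradation.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Guardrail:
    name: str
    max_degradation: float  # e.g. 0.05 = metric may worsen by at most 5%

@dataclass(frozen=True)
class MeasurementPlan:
    hypothesis: str
    primary_metric: str                       # the single OEC
    secondary_metrics: list[str] = field(default_factory=list)
    guardrails: list[Guardrail] = field(default_factory=list)

plan = MeasurementPlan(
    hypothesis="The new product page layout increases Add-to-Cart rate",
    primary_metric="add_to_cart_rate",
    secondary_metrics=["click_through_rate", "session_depth", "scroll_depth"],
    guardrails=[
        Guardrail("page_load_time_p75", max_degradation=0.05),
        Guardrail("bounce_rate", max_degradation=0.10),
    ],
)
```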

Guardrail Metrics in Practice

Guardrail metrics are easiest to understand through a concrete scenario. Consider the following real-world pattern that plays out regularly in e-commerce experimentation:

Example Scenario
An aggressive email capture pop-up tested on the homepage
Primary Metric: Email newsletter sign-up rate increases by 22%. The primary metric is a clear, statistically significant win.
Guardrail Metric 1: Immediate site exit rate increases by 15%. Visitors are closing the tab rather than dismissing the pop-up.
Guardrail Metric 2: Bounce rate increases by 18%. A significant portion of new visitors are not reaching any product page.
Verdict: Despite the primary metric win, both guardrail metrics have breached their predefined thresholds. The test must be flagged as a net-negative result and killed. Deploying this variation would trade short-term email acquisition for long-term damage to conversion funnel health.

This is why guardrail metrics must be defined, with their acceptable thresholds set, before the experiment launches. A team that defines guardrail metrics after seeing a primary metric win will be unconsciously biased toward setting lenient thresholds that justify the result they want to deploy.
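Because the thresholds are fixed in advance, the deployment decision itself can be reduced to a mechanical rule. The sketch below encodes one hypothetical version of that rule and runs it against the pop-up scenario above; the metric names and threshold values are assumptions chosen for illustration.

```python
# Hypothetical decision rule: a guardrail breach kills the test regardless
# of the primary metric; a primary win deploys only with guardrails intact.
def deployment_decision(primary_lift, primary_significant,
                        guardrail_results, guardrail_thresholds):
    # guardrail_results / guardrail_thresholds: metric name -> relative degradation
    # (positive values mean the metric got worse)
    breached = [name for name, observed in guardrail_results.items()
                if observed > guardrail_thresholds.get(name, 0.0)]
    if breached:
        return f"KILL: guardrail threshold breached ({', '.join(breached)})"
    if primary_significant and primary_lift > 0:
        return "DEPLOY: primary metric win with guardrails intact"
    return "INCONCLUSIVE: do not deploy; archive the result"

# The email pop-up scenario: +22% sign-ups, but exits and bounces blow past
# their (hypothetical) 5% tolerance.
print(deployment_decision(
    primary_lift=0.22,
    primary_significant=True,
    guardrail_results={"immediate_exit_rate": 0.15, "bounce_rate": 0.18},
    guardrail_thresholds={"immediate_exit_rate": 0.05, "bounce_rate": 0.05},
))
# -> KILL: guardrail threshold breached (immediate_exit_rate, bounce_rate)
```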

Best practice: Document your complete metric hierarchy (primary metric, secondary metrics, guardrail metrics, and each guardrail's maximum acceptable degradation threshold) in writing before the test is activated. This document becomes the binding decision framework that removes subjective post-hoc interpretation from the deployment decision.

Documenting Results to Prevent Repeated Failures

A well-run experimentation program generates significant organizational knowledge over time, but only if results are systematically archived in a retrievable format. Without this, organizations fall victim to a costly pattern: the same failed hypotheses are repeated by new team members, new product managers, or even the same team after personnel changes. Months of traffic and engineering resources are consumed re-learning lessons the organization already paid to learn.

Every completed experiment, regardless of outcome, should produce a documented record that captures the hypothesis, the metric hierarchy used, the final result, and the key learnings. Inconclusive and negative results are especially important to archive, as they carry the highest risk of being forgotten and repeated.

Treat the test archive as a competitive asset. An organization with 200 documented experiments has a fundamentally different decision-making capability than one with the same volume of experiments but no institutional record. The former can identify patterns across hypotheses, avoid known dead ends, and build a systematic understanding of what actually drives conversion on their specific properties.

💡 Minimum viable test record: For each experiment, document the following: (1) the hypothesis in plain language, (2) the primary metric and its result, (3) whether any guardrail metrics were triggered, (4) the final deployment decision and its rationale, and (5) the next hypothesis this result suggests. Five fields. Consistent archiving of these five fields across every test compounds into an extraordinarily valuable strategic asset over time.
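As a sketch, those five fields map cleanly onto a plain, searchable record. The example below is hypothetical (the field names, values, and archive path are assumptions), reusing the pop-up scenario from earlier:

```python
# Hypothetical five-field test record, archived as plain JSON so it stays
# retrievable long after the team that ran the test has moved on.
import json
import os

test_record = {
    "hypothesis": "An aggressive email capture pop-up on the homepage increases newsletter sign-ups",
    "primary_metric_result": {"metric": "newsletter_signup_rate", "lift": 0.22, "significant": True},
    "guardrails_triggered": ["immediate_exit_rate +15%", "bounce_rate +18%"],
    "decision": {"deployed": False, "rationale": "Guardrail breaches outweigh the primary metric win"},
    "next_hypothesis": "A less intrusive exit-intent trigger may capture emails without raising bounce rate",
}

os.makedirs("experiment_archive", exist_ok=True)
with open("experiment_archive/homepage-email-popup.json", "w") as f:
    json.dump(test_record, f, indent=2)
```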

Ready to Design a Rigorous Experiment?

Start with the right sample size. Define your metrics. Run your test to completion. ABTestPro is built to support every step.

Open the Free Calculator