What is statistical significance in an A/B test?

It is the confidence that the difference between two variants is real rather than chance. The conventional decision bar is 95% confidence, meaning that if the two variants truly performed the same, a difference this large would show up less than 5% of the time.

How is A/B test significance calculated?

This calculator uses a two-proportion z-test: it pools both variants' conversions into a single overall rate, computes a z-score from the difference between the two conversion rates and the sample sizes, and converts that z-score to a two-tailed confidence level.

Is a big lift the same as a significant result?

No. Lift is the size of the difference; significance is the confidence it is real. A large lift on a small sample can still be noise, and a small lift on a huge sample can be highly significant — always read both together.

What if my test is below 95% confidence?

The result is directionally suggestive but not yet reliable. If the test has not reached its planned sample size, keep it running; more data usually pushes a genuine effect across the bar. If the confidence drifts toward no difference instead, treat the variant as inconclusive.

A/B Test Significance Calculator

Measure the statistical confidence behind an A/B result — and decide whether the difference is real enough to ship or still too noisy.

Variant A conversions

count

Variant A visitors

count

Variant B conversions

count

Variant B visitors

count

Statistical confidence

Formula

Two-proportion z-test on A vs B conversion rates
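
Written out, the pooled form of that test looks roughly like the following. The symbols are notation introduced here for illustration (x for conversions, n for visitors, Φ for the standard normal CDF); the calculator's exact internals are not shown on this page, so treat this as the textbook version rather than a transcript of its code.

$$
\hat{p}_A = \frac{x_A}{n_A}, \qquad
\hat{p}_B = \frac{x_B}{n_B}, \qquad
\hat{p} = \frac{x_A + x_B}{n_A + n_B}
$$

$$
z = \frac{\hat{p}_B - \hat{p}_A}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}, \qquad
\text{confidence} = 2\,\Phi(|z|) - 1
$$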

Understanding statistical significance

Reference material — the calculator above stays the primary tool.

What significance measures

Statistical significance is the confidence that a measured difference between two variants reflects a real effect rather than random variation. It answers the only question that should gate a rollout: is this difference trustworthy enough to act on?

This calculator runs a two-proportion z-test — the standard method for comparing two conversion rates — and reports the resulting confidence directly.
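
For readers who want to check a result by hand, here is a minimal Python sketch of a pooled two-proportion z-test with a two-tailed confidence. The function name `ab_confidence` and its parameter names are made up for this example, and the calculator may handle rounding or edge cases differently.

```python
import math


def ab_confidence(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-tailed confidence (0..1) that variants A and B differ.

    Hypothetical helper: a pooled two-proportion z-test using only the
    standard library.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b

    # Pool both variants into one overall conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)

    # Standard error of the difference under "no real difference".
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # no conversions (or all conversions): nothing to test

    z = (rate_b - rate_a) / se

    # 2 * Phi(|z|) - 1 simplifies to erf(|z| / sqrt(2)).
    return math.erf(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    print(f"{ab_confidence(100, 1000, 130, 1000):.1%}")
```

With 100 of 1,000 visitors converting on A and 130 of 1,000 on B, this comes out a little above 96%.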

How to read your result

Higher confidence is better, against the conventional 95% decision bar:

Low: under 90%; the difference is well within what chance could produce.

Average: 90% to just under 95%; suggestive but not yet decision-grade.

Strong: 95% or above; reliable enough to ship.
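
The bands above can be expressed as a small lookup. This is only a sketch: the function name `confidence_band` is hypothetical, and it assumes a result of exactly 95% counts as Strong and exactly 90% counts as Average.

```python
def confidence_band(confidence: float) -> str:
    """Map a confidence in the range 0..1 to the band names used above."""
    if confidence >= 0.95:
        return "Strong"
    if confidence >= 0.90:
        return "Average"
    return "Low"
```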

Significance is not lift

The two measure different things. A variant can post a large observed lift yet sit below 95% confidence because the sample is thin, or post a modest lift that is highly significant on a large sample. Significance tells you whether to believe the result; the conversion lift tool tells you how big it is. Ship on both.
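
A quick way to see the contrast is to compute both numbers for a thin sample and a large one. The sketch below assumes lift means the relative change in conversion rate, (B - A) / A, which may not match the lift tool's exact definition; the function name `lift_and_confidence` and the visitor counts are illustrative only.

```python
import math


def lift_and_confidence(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (relative lift, two-tailed confidence) for B against A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    lift = (rate_b - rate_a) / rate_a  # assumed definition of lift

    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    confidence = math.erf(abs(z) / math.sqrt(2))
    return lift, confidence


# Thin sample: a big lift that is not significant.
small_lift, small_conf = lift_and_confidence(10, 100, 15, 100)

# Large sample: a modest lift that clears the 95% bar.
large_lift, large_conf = lift_and_confidence(10_000, 100_000, 10_300, 100_000)

if __name__ == "__main__":
    print(f"thin sample:  lift {small_lift:.0%}, confidence {small_conf:.1%}")
    print(f"large sample: lift {large_lift:.0%}, confidence {large_conf:.1%}")
```

The first case shows a 50% lift that stays well under 90% confidence; the second shows a 3% lift that lands above 95%.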

Why not stop the moment it crosses 95%

Peeking and stopping at the first significant moment inflates false positives, because confidence wobbles as data arrives and every extra check gives chance another opportunity to cross the bar. Decide the sample size in advance, run to it, then read significance once. The sample-size and duration tools set that plan so this verdict is honest.
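
The effect of peeking can be illustrated with a small simulation of A/A tests, where both variants share the same true conversion rate so that any "significant" result is a false positive. Everything here is an assumption chosen for the demonstration: the 10% base rate, 2,000 visitors per arm, a check every 100 visitors, the seed, and the function names.

```python
import math
import random


def confidence(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-tailed confidence from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erf(abs(z) / math.sqrt(2))


def simulate_aa(rng, rate, visitors_per_arm, look_every):
    """Run one A/A test; report whether any peek, and the final read, hit 95%."""
    conv_a = conv_b = 0
    peeked_significant = False
    for n in range(1, visitors_per_arm + 1):
        conv_a += rng.random() < rate
        conv_b += rng.random() < rate
        if n % look_every == 0 and confidence(conv_a, n, conv_b, n) >= 0.95:
            peeked_significant = True
    final_significant = (
        confidence(conv_a, visitors_per_arm, conv_b, visitors_per_arm) >= 0.95
    )
    return peeked_significant, final_significant


def false_positive_rates(runs=400, rate=0.10, visitors_per_arm=2000,
                         look_every=100, seed=7):
    """Share of A/A tests flagged significant: (with peeking, final read only)."""
    rng = random.Random(seed)
    peeked = final = 0
    for _ in range(runs):
        p, f = simulate_aa(rng, rate, visitors_per_arm, look_every)
        peeked += p
        final += f
    return peeked / runs, final / runs
```

Calling `false_positive_rates()` returns the two rates side by side. The single final read should land near the nominal 5%, while stopping at the first significant peek flags noticeably more tests.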

One-tailed vs two-tailed

This calculator uses a two-tailed test, which asks whether the variants differ in either direction — the conservative, default choice for most experiments. A one-tailed test assumes you only care about improvement and reaches significance sooner, at the cost of missing a variant that quietly performs worse.
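
To make the trade-off concrete, the sketch below computes both readings from the same z-score. It assumes the one-tailed question is "is B better than A?"; the function name `tail_confidences` and the example counts are hypothetical.

```python
import math


def tail_confidences(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (one-tailed confidence that B beats A, two-tailed confidence)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se

    # Standard normal CDF, Phi(z), via the error function.
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))

    one_tailed = phi                                # only "B is better" counts
    two_tailed = math.erf(abs(z) / math.sqrt(2))    # a difference either way
    return one_tailed, two_tailed


# B ahead: the one-tailed reading clears 95%, the two-tailed one does not.
one_up, two_up = tail_confidences(100, 1000, 125, 1000)

# Same gap with B behind: the two-tailed reading is unchanged, while the
# one-tailed reading collapses and gives no warning that B is worse.
one_down, two_down = tail_confidences(125, 1000, 100, 1000)
```

In the first case the one-tailed confidence is about 96% against roughly 92% two-tailed, which is why the one-tailed test appears to reach significance sooner.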