How many users do I need for an A/B test?

It depends on your baseline conversion rate and the smallest effect you want to detect. For a 3.0% baseline and a 1.0-point minimum detectable effect, you need roughly 4,656 users per variant.

How is A/B test sample size calculated?

A common quick estimate is 16 times the baseline rate times one minus the baseline rate, divided by the square of the minimum detectable effect, with both the rate and the effect written as fractions (3.0% as 0.03, 1.0 point as 0.01). The 16 bundles the usual 95% confidence and 80% power assumptions.

Why does a smaller detectable effect need so many more users?

Because required sample scales with the inverse square of the effect. Halving the effect you want to detect roughly quadruples the users you need, so chasing tiny improvements is expensive to prove.

What if I do not have enough traffic?

You have three options: target a larger detectable effect, run the test longer to accumulate the sample, or accept lower confidence. The duration calculator turns a required sample into the days it will take at your traffic.
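As a rough illustration of the sample-to-days conversion, here is a minimal sketch. It is not the duration calculator's actual method: the function name, the even traffic split across two variants, and the figure of 1,000 eligible users per day are all assumptions made for the example.

```python
import math


def days_to_reach_sample(sample_per_variant: int, daily_users: int, variants: int = 2) -> int:
    """Days needed to collect the sample, assuming traffic is split evenly.

    sample_per_variant: required users in each variant.
    daily_users: users entering the test per day, across all variants (assumed constant).
    variants: number of arms, including the control.
    """
    if daily_users <= 0:
        raise ValueError("daily_users must be positive")
    total_needed = sample_per_variant * variants
    # Round up: a partial day still has to be run to completion.
    return math.ceil(total_needed / daily_users)


# 4,656 users per variant (the 3.0% baseline, 1.0-point example) at a
# hypothetical 1,000 users per day: 9,312 users in total.
print(days_to_reach_sample(4656, 1000))  # 10
```

If the answer comes back in months rather than days, that is usually the signal to target a larger effect instead.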

A/B Test Sample Size Calculator

Estimate the users per variant a test needs to detect your target effect — and decide whether the test is worth running at your current traffic.

Baseline conversion rate

Minimum detectable effect (pts)

Minimum detectable effect: the smallest absolute change in conversion rate you want the test to reliably detect.

Formula

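Sample size per variant ≈ 16 × p × (1 − p) / d²

Here p is the baseline conversion rate and d is the minimum detectable effect, both written as fractions: a 3.0% baseline is p = 0.03 and a 1.0-point effect is d = 0.01.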
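The formula can be sketched in a few lines of Python. The function name and the input validation are illustrative choices, not part of the calculator itself; the baseline is passed as a fraction and the effect in percentage points, matching the "pts" input.

```python
import math


def sample_size_per_variant(baseline_rate: float, mde_points: float) -> int:
    """Quick estimate: 16 * p * (1 - p) / d^2, rounded up to whole users.

    baseline_rate: baseline conversion rate as a fraction (0.03 for 3.0%).
    mde_points: minimum detectable effect in percentage points (1.0 for 1.0 pt).
    """
    if not 0.0 < baseline_rate < 1.0:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    if mde_points <= 0.0:
        raise ValueError("mde_points must be positive")
    d = mde_points / 100.0  # percentage points -> absolute fraction
    n = 16.0 * baseline_rate * (1.0 - baseline_rate) / d**2
    # Round away floating-point noise before rounding up to whole users.
    return math.ceil(round(n, 6))


print(sample_size_per_variant(0.03, 1.0))  # 4656
```

The printed value matches the 4,656 users per variant quoted for a 3.0% baseline and a 1.0-point effect.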

Understanding A/B test sample size

Reference material — the calculator above stays the primary tool.

What sample size tells you

Sample size is the number of users each variant needs before a test can reliably tell a real effect from noise. It is set before the test runs, from two things: your baseline conversion rate and the smallest change worth detecting.

Running with too small a sample is the most common testing mistake — it produces confident-looking results that vanish on rollout. Sizing the test first is what makes the eventual verdict trustworthy.

How to read your result

Fewer users is better here: a smaller requirement means an easier, faster test. The rating compares your result against common testing sizes:

Low: far above typical sizes; the effect or baseline makes a reliable test expensive.
Average: near the common benchmark; runnable with planning.
Strong: below typical sizes; the test is cheap to run reliably.

The inverse-square trap

The single most important property of this formula is that sample size scales with the inverse square of the detectable effect. Wanting to detect a 0.5-point change instead of 1.0 point does not double the sample — it quadruples it. This is why ambitious tests chasing small gains so often run out of traffic before they conclude.
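To see the quadrupling numerically, the sketch below evaluates the shortcut at a 3.0% baseline while halving the effect repeatedly. The helper name and the particular effect sizes are chosen for illustration only.

```python
def quick_sample_size(p: float, d: float) -> float:
    """Shortcut estimate 16 * p * (1 - p) / d^2, with p and d as fractions."""
    return 16.0 * p * (1.0 - p) / d**2


baseline = 0.03  # 3.0% baseline conversion rate
for d in (0.02, 0.01, 0.005, 0.0025):  # each effect is half the previous one
    n = quick_sample_size(baseline, d)
    print(f"MDE {d * 100:.2f} pts -> {n:,.0f} users per variant")
```

Each halving of the effect multiplies the requirement by four, so going from 2.0 points to 0.25 points raises it by a factor of 64.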

Levers that change the requirement

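Two inputs move the number. A larger minimum detectable effect shrinks the sample sharply, so test bold changes likely to move the needle, not cosmetic ones. The baseline rate matters too, but less than the effect, and its direction depends on how the effect is stated. For a fixed absolute effect in points, as used here, p × (1 − p) grows as the baseline rises toward 50%, so a higher baseline raises the requirement. A higher baseline lowers the requirement only when the target is a fixed relative lift. If neither input can move, the lever shifts to time: run longer to accumulate the users.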

This is an estimate, not a guarantee

The 16-times shortcut bundles standard assumptions: a two-sided test at 95% confidence with 80% power, where 16 is a rounded 2 × (1.96 + 0.84)² ≈ 15.7. It also uses the baseline variance for both variants, so a full power calculation can differ. Treat the result as the planning floor for users per variant, and pair it with the duration and significance tools to turn it into a timeline and a decision.
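To show where the 16 comes from and how it moves with other settings, the sketch below recomputes the multiplier from normal quantiles using Python's standard library. It assumes a two-sided test, equal-size variants, and the baseline variance p × (1 − p) for both arms; the function names are illustrative, and this is still an approximation, not a full power calculation.

```python
from statistics import NormalDist


def shortcut_factor(alpha: float = 0.05, power: float = 0.80) -> float:
    """Multiplier 2 * (z_{1 - alpha/2} + z_{power})^2 that the '16' rounds up from."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    return 2.0 * (z_alpha + z_beta) ** 2


def approx_sample_size(p: float, d: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Same shape as the shortcut, with the multiplier computed instead of fixed at 16."""
    return shortcut_factor(alpha, power) * p * (1.0 - p) / d**2


print(f"factor at 95% / 80%: {shortcut_factor():.2f}")
print(f"factor at 95% / 90%: {shortcut_factor(power=0.90):.2f}")
print(f"users per variant, 3.0% baseline, 1.0 pt: {approx_sample_size(0.03, 0.01):,.0f}")
```

At the standard settings the computed factor is slightly below 16, so the shortcut errs a little on the high side; asking for 90% power pushes the factor well above 16.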