Jake McMahon
Led by Jake McMahon · 8+ years B2B SaaS · Behavioural Psychology & Big Data

A/B testing for B2B SaaS teams.

A/B testing is only useful when the team can trust the result and act on it. If tests run but nothing changes, the issue is usually the setup, not the idea.

This page is for teams trying to answer:

What makes a test valid? Why do experiments stall? What should we test next?

The point is not more tests. The point is clearer decisions.

A/B Testing, Broken Down

01 — Question: What decision the test is actually supposed to answer
02 — Setup: The metric, segment, and experiment design that make the result trustworthy
03 — Run: How long the test needs to run and how the team watches it
04 — Decide: What the result means and what the team does next

Experimentation program failure rate: 70%

of experimentation programs at Series A–C SaaS companies produce fewer than 2 actionable results per quarter, usually because of design and interpretation problems, not tool problems.

Minimum sample requirement: Often 2,000+

For a B2B SaaS product with a clear binary success metric, most tests require at least 1,000 users per variant for a trustworthy read — traffic most teams underestimate.
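
As a rough illustration of where a figure like 1,000 users per variant comes from, the sketch below applies the standard normal-approximation formula for comparing two conversion rates. The function name, the 20% baseline, and the 25% target are assumptions chosen for the example, not numbers from a specific engagement.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline: float, target: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided two-proportion test."""
    if baseline == target:
        raise ValueError("baseline and target must differ")
    # Critical values for the chosen significance level and power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Average rate, used for the variance under the null hypothesis.
    p_bar = (baseline + target) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(baseline * (1 - baseline) + target * (1 - target))
    ) ** 2
    return ceil(numerator / (target - baseline) ** 2)


# Hypothetical example: 20% baseline conversion, hoping to detect 25%.
print(sample_size_per_variant(0.20, 0.25))
```

Under these assumptions the requirement comes out at roughly 1,100 users per variant. Halving the expected lift pushes it several times higher, which is why tests aimed at small effects are so expensive in B2B traffic terms.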

Most common failure: Interpretation

Peeking at results early and shipping on “directionally positive” numbers is the most common reason a program never produces compounding learning.

Why A/B testing programs stall

"We run tests but they never produce clear decisions"

"We ran 11 A/B tests last quarter. Seven of them ended with 'directionally positive' or 'no clear winner.' One was stopped early because it looked bad. Two were 'positive' but we shipped nothing. The program has no velocity."

Head of Product — B2B SaaS, $25M ARR

"Our event layer is not clean enough to support valid tests"

"We want to run a retention experiment. But our 'retained user' definition uses three different events depending on which engineer built which feature. Any test result we produce is going to be questioned because the underlying metric isn't clean."

Growth Lead — PLG SaaS, Series A

"We don't have enough traffic to run proper experiments"

"We have 400 signups a month. Any A/B test we run would need to run for 6+ months to reach significance. So we either run underpowered tests and ship on bad data, or we run nothing and guess."

VP Product — Vertical SaaS, $12M ARR

"Every experiment resets the learning curve"

"We finish one test, write a doc, and then the next experiment starts from scratch. There's no shared hypothesis library, no compounding knowledge base, no sequencing logic. The team is running isolated tests instead of building a real learning practice."

Director of Product — B2B SaaS, $40M ARR

A/B testing is a way to make one decision with less guesswork.

A test compares one version of something to another so the team can see which change actually improves the metric that matters. That only works when the design, data, and interpretation are all sound.

Good tests are tied to a real product question. They use the right metric, run long enough, and end with a decision the team can use. That is what makes experimentation valuable.

Bad tests answer nothing useful. They are underpowered, poorly measured, or attached to a metric nobody cares about. The result is noise that looks like science.

Most testing problems are decision problems.

If the setup cannot produce a clear decision, the experiment was never useful.

The team launches tests before the measurement layer is ready.

If events, funnels, or properties are missing, the result cannot be trusted.

The metric is chosen because it is easy, not because it matters.

A test can look healthy and still fail to move the business question the team actually cares about.

The team reads the result too early.

Experiments need a clear runtime and a clear decision rule or they just become opinion fights with charts.

The result is inconclusive, so the team moves on.

Inconclusive usually means the setup was weak, the sample was too small, or the hypothesis was not worth testing.

Three signs the test setup is useful.

01 — Valid Question

The test answers one clear decision.

The team knows what it is trying to learn before the test starts, so the result has a purpose.

02 — Trustworthy Result

The setup supports the answer.

The metric, sample, and runtime are strong enough that the team can trust the conclusion.

03 — Clear Decision

The result changes what the team does next.

Ship, kill, or re-run are all valid. "Maybe" is not the finish line.
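
One way to keep "maybe" out of the readout is to write the decision rule down before the test starts. The sketch below is a hypothetical rule, not a fixed standard: it assumes the team reports a confidence interval for the lift and agreed a required sample size up front. The names `decide`, `lift_ci_low`, and `lift_ci_high` are invented for illustration.

```python
def decide(lift_ci_low: float, lift_ci_high: float,
           users_per_variant: int, required_per_variant: int) -> str:
    """Map a pre-agreed rule onto one of the outcomes the team can act on."""
    # No verdict until the planned sample has been reached.
    if users_per_variant < required_per_variant:
        return "keep running"
    # The whole interval sits above zero: the variant is better.
    if lift_ci_low > 0:
        return "ship"
    # The whole interval sits below zero: the variant is worse.
    if lift_ci_high < 0:
        return "kill"
    # The interval straddles zero (assumed policy: redesign and re-run).
    return "re-run"


print(decide(0.01, 0.06, users_per_variant=1200, required_per_variant=1100))
```

The value of a rule like this is less in the code than in the agreement: once the thresholds are fixed in advance, an early "directionally positive" chart cannot be argued into a ship decision.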

Start with the question and work backward.

A test is only useful if the team can trust the result and use it.

ProductQuant starts with the decision the team needs to make. Then the metric, sample, runtime, and instrumentation are set up around that decision. The result is an experiment that is actually worth running.

That means fewer tests that just create noise and more tests that help the team move faster with less debate.

01 — Define

Pick the decision

Know what the team wants to learn before the experiment is designed.

02 — Design

Choose the right metric

The primary metric must reflect the actual business question, not whatever is easiest to measure.

03 — Run

Set the runtime

The experiment needs enough time and enough data to support the conclusion.

04 — Decide

Use the result

The point of the test is a decision the team can actually act on.

If the result cannot change a decision, the test is too weak or too vague.

Go deeper from here.

These are the most relevant ProductQuant assets if you want implementation detail, statistical grounding, or a better experiment setup.

Client work

Healthcare SaaS — Experiment Design
114 clean events providing the foundation for valid experiments

Experiment-Ready Foundation: Clean Events, Trustworthy Results

Built a clean event taxonomy and analytics layer for a healthcare SaaS — creating the instrumentation foundation required to run valid A/B tests on activation, onboarding, and feature adoption flows.

Read the case study →

B2B SaaS — Experimentation Program
1 sprint from diagnosis to first valid experiment running

Experiment Readiness: From Noise to Valid Tests

Assessed a B2B SaaS experiment program that was producing inconclusive results. Fixed the metric definition, cleaned the event layer, and designed the first experiment that produced a clear ship-or-kill decision.

See the readiness audit →

Healthcare SaaS — Experiment Program
47 decisive experiments in 12 months, $272K–$505K annual impact

From Zero to Experiment Pipeline: Built the Full Testing System

Built experiment pipeline from zero. Sample size discipline, pre-registered hypotheses, clear ship-or-kill decisions.

Read the case study →

E-commerce SaaS — Activation Testing
20% → 35% activation improvement, $2.5M+ revenue found

Activation Lift Through Structured Experimentation

40+ missing events discovered. Full experimentation framework built on top of rebuilt analytics.

Read the case study →

Pick the step that matches the gap.

If you want help turning testing into a reliable system, these are the most relevant ProductQuant paths.

Who does this work

Jake McMahon

Founder, ProductQuant · MSc Big Data & Business Analytics · BSc Behavioural Psychology · 8+ years B2B SaaS

Jake has helped B2B SaaS teams build experimentation practices that produce clean decisions rather than activity. The work covers readiness assessment (instrumentation, metric definition, traffic calculation), experiment design, interpretation rules, and the hypothesis library that turns individual tests into compounding product knowledge.

A/B testing · Experiment design · Statistical significance · Feature flags · Experiment readiness · Hypothesis management · PostHog experiments · B2B SaaS

Common questions

A/B testing: what it is and what it should produce

Questions about your specific situation? Book a call →

What is A/B testing in product development?
A/B testing is a randomised experiment comparing two versions of a feature, flow, or copy to measure which produces better outcomes on a defined metric. It requires statistical significance before declaring a winner. The point is not to run more tests — it is to produce decisions the team can act on with confidence.
When should you run an A/B test vs just ship the change?
Test when: the change is significant, the metric is measurable, traffic supports a valid sample size, and you are genuinely uncertain about the outcome. Just ship when: the change is clearly an improvement and the cost of being wrong is low. The decision to test should be proportional to the risk of being wrong and to how hard the change is to reverse.
How do you calculate statistical significance for an A/B test?
Use a significance calculator. The usual threshold is p < 0.05 (95% confidence). Plan the sample size before launch so the test has at least 80% power, rather than checking power after the fact. The test must run for full business cycles: weeks, not hours. Running a test for three days and calling it significant is the most common experimentation mistake in B2B SaaS.
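
For readers who want to see what a significance calculator does under the hood, here is a minimal sketch of a two-sided two-proportion z-test using only the Python standard library. The function name and the example counts are assumptions. In practice most teams would rely on the statistics built into their experimentation tool.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the shared conversion rate assumed under "no difference".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        raise ValueError("cannot test when every user or no user converted")
    z = (rate_b - rate_a) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 200/1,000 conversions on control, 250/1,000 on the variant.
p = two_proportion_p_value(200, 1000, 250, 1000)
print(f"p-value: {p:.4f}", "significant" if p < 0.05 else "not significant")
```

The p-value is only meaningful if it is read once, at the planned sample size. Checking it daily and stopping at the first value under 0.05 inflates the false-positive rate well beyond the nominal 5%.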
What are the most common A/B testing mistakes in B2B SaaS?
The five most common: (1) ending tests too early (peeking at results); (2) testing multiple things at once; (3) using the wrong primary metric; (4) running tests on insufficient traffic; (5) not accounting for novelty effects in power users. Most programs that produce no useful results are failing on one or more of these points.
How do you run A/B tests with low traffic?
Focus on higher-traffic steps in the funnel. Accept longer run times. Consider Bayesian approaches, which can give a usable read on smaller samples but do not remove the need for enough data. Prioritise tests with large expected effects. Avoid testing copy changes that require big samples to detect small differences. For most B2B SaaS products with under 1,000 signups per month, A/B testing should be selective, not a default habit.
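
A quick runtime estimate makes the low-traffic trade-off concrete before anyone builds a variant. The sketch below is a back-of-envelope calculation, not a planning tool: the function name and the two-week floor (`min_weeks`, a stand-in for "full business cycles") are assumptions, and the inputs reuse the 400 signups a month and 1,000 users per variant mentioned elsewhere on this page.

```python
from math import ceil


def weeks_to_run(users_per_variant: int, variants: int,
                 eligible_users_per_week: float, min_weeks: int = 2) -> int:
    """Weeks needed to fill every variant, never shorter than min_weeks."""
    if eligible_users_per_week <= 0:
        raise ValueError("eligible_users_per_week must be positive")
    total_needed = users_per_variant * variants
    weeks_for_sample = ceil(total_needed / eligible_users_per_week)
    return max(min_weeks, weeks_for_sample)


# 400 signups a month is about 92 a week.
weekly_signups = 400 * 12 / 52
print(weeks_to_run(users_per_variant=1000, variants=2,
                   eligible_users_per_week=weekly_signups))
```

With these inputs the estimate is 22 weeks, roughly five months. That is the point at which a team should either move the test to a higher-traffic step or only test changes large enough to need a smaller sample.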
What should you do when an A/B test shows no significant result?
Document it as a null result. Analyse by segment to check for heterogeneous effects: sometimes the overall result is flat but one segment shows a strong signal. Consider whether the hypothesis was wrong or the effect was too small to detect. Do not p-hack by running subgroup analyses until something looks significant. Treat a segment signal as a hypothesis for a follow-up test, not as a win.
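
One common guard when checking several segments is a Bonferroni correction, which divides the significance threshold by the number of segments examined. It is one conservative option among several, and the function name, segment names, and p-values below are made up for illustration.

```python
def bonferroni_significant(p_values: dict[str, float],
                           alpha: float = 0.05) -> dict[str, bool]:
    """Flag segments whose p-value survives a Bonferroni-adjusted threshold."""
    if not p_values:
        return {}
    # Each extra segment checked makes the bar stricter.
    threshold = alpha / len(p_values)
    return {segment: p < threshold for segment, p in p_values.items()}


# Hypothetical segment-level p-values from a flat overall result.
segments = {"smb": 0.03, "mid_market": 0.40, "enterprise": 0.008, "trial": 0.60}
print(bonferroni_significant(segments))
```

With four segments the threshold drops from 0.05 to 0.0125, so the 0.03 result that would have looked like a win on its own no longer counts. Only the 0.008 segment survives, and even that is best treated as the hypothesis for the next test.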
How many A/B tests should a SaaS team run per quarter?
The right number depends on traffic, team size, and product surface area. For most B2B SaaS teams at Series A to C, 6–12 well-designed experiments per quarter is a healthy cadence. The goal is not test volume — it is decision velocity. Fewer tests with clear results beat more tests with ambiguous outcomes every time.
What is the difference between A/B testing and multivariate testing?
A/B testing compares two versions of a single variable. Multivariate testing combines multiple variables to see which combination performs best. A/B tests are simpler, require less traffic, and produce clearer decisions. Multivariate tests need much larger sample sizes and are harder to interpret — most B2B SaaS teams should default to A/B testing and iterate sequentially.
How do you build a hypothesis library for A/B testing?
Start by documenting every test: the hypothesis, the metric, the result, and what the team learned. Tag each test by product area (onboarding, activation, retention, pricing). Over time the library reveals patterns — which levers move which metrics, which hypotheses consistently prove wrong, and where the next highest-value experiments live. A shared hypothesis library is the difference between isolated tests and a real experimentation practice.
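
The fields described above can be captured in a very small data structure. The sketch below is one possible shape, assuming a plain Python list is enough to start with. The class name, field names, and sample records are hypothetical, and a real library would more likely live in a shared doc, spreadsheet, or database.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentRecord:
    hypothesis: str
    metric: str
    product_area: str  # e.g. onboarding, activation, retention, pricing
    result: str        # e.g. "ship", "kill", "re-run"
    learning: str


def results_by_area(library: list[ExperimentRecord]) -> dict[str, Counter]:
    """Count outcomes per product area so patterns across tests become visible."""
    summary: dict[str, Counter] = defaultdict(Counter)
    for record in library:
        summary[record.product_area][record.result] += 1
    return dict(summary)


# Hypothetical entries for illustration only.
library = [
    ExperimentRecord("Shorter signup form lifts activation", "activation rate",
                     "onboarding", "ship", "Fewer fields helped new accounts"),
    ExperimentRecord("Checklist lifts activation", "activation rate",
                     "onboarding", "kill", "Checklist was ignored"),
    ExperimentRecord("Annual discount lifts upgrades", "upgrade rate",
                     "pricing", "re-run", "Sample too small to read"),
]
print(results_by_area(library))
```

Even a summary this simple shows which product areas keep producing decisions and which keep producing re-runs, which is the sequencing signal isolated test docs never provide.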

Good tests end with a decision.

If you are still trying to make the setup trustworthy, start with the guide or the readiness audit before you run another test.