Led by Jake McMahon · 8+ years B2B SaaS · Behavioural Psychology & Big Data

Product experimentation for B2B SaaS teams.

Experimentation is not just A/B testing software. It is the operating habit of turning product questions into tests that produce a clear next decision.

This page is for teams trying to answer:

What should we test first
Why our tests stall
How to make learning build

A tool does not create an experimentation practice. A repeatable decision loop does.

Experimentation, Broken Down

01 — Question
The business decision the team needs to make
02 — Design
The test, metric, sample, and runtime needed for a fair read
03 — Read
A result the team can trust instead of argue about
04 — Repeat
What the next test should be because the previous one taught something real

EXPERIMENT VELOCITY GAP
1–2 per quarter

Most SaaS teams run 1–2 experiments per quarter. Top-performing teams run 3–5 per week.

FALSE POSITIVE RATE
30–60%

Teams that peek at results before reaching sample size inflate their false positive rate from 5% to 30–60%.

UNDERPOWERED TESTS
80%

Of experiments that fail, 80% fail because the sample size was too small to detect a meaningful effect, not because the change didn't work.
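
The peeking figure above can be checked with a small simulation. This is a minimal sketch under stated assumptions, not a model of any particular testing tool: both arms share the same true conversion rate (an A/A test), so every "significant" result is a false positive. The 10% rate, 2,000 users per arm, and 20 evenly spaced looks are illustrative values, and the function names are hypothetical.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold, about 1.96


def z_stat(conv_a: int, conv_b: int, n: int) -> float:
    """Pooled two-proportion z statistic for two arms of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def simulate(runs=400, n_per_arm=2000, looks=20, rate=0.10, seed=7):
    """Return (fixed_horizon_fp_rate, peeking_fp_rate) across A/A tests."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    fixed_hits = peek_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        peeked_significant = False
        for look in range(1, looks + 1):
            # Both arms share the same true rate, so any "win" is false.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            significant = abs(z_stat(conv_a, conv_b, look * step)) > Z_CRIT
            peeked_significant = peeked_significant or significant
        fixed_hits += significant        # only the final look counts
        peek_hits += peeked_significant  # a "win" at any look counts
    return fixed_hits / runs, peek_hits / runs


fixed_rate, peek_rate = simulate()
print(f"fixed horizon: {fixed_rate:.1%}, stopping at any of 20 looks: {peek_rate:.1%}")
```

The fixed-horizon rate should sit near the nominal 5%, while the peeking rate lands well above it. Checking more often than 20 times pushes the peeking rate higher still.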

WHY EXPERIMENTATION PROGRAMS STALL

"We ran 20 experiments last year and shipped maybe 2 changes"

"Every experiment ends with 'inconclusive results.' We don't know if our tests are too small or our MDE is too ambitious. The team is losing faith in experimentation."

Head of Growth — B2B SaaS, Series B

"Our experiment results contradict each other"

"Last quarter we ran two tests on the same page. One showed a 15% lift, the other showed no effect. We shipped the winner and saw zero improvement. Nobody trusts the process anymore."

VP Product — PLG SaaS, $12M ARR

"We don't have enough traffic to run meaningful tests"

"Our signup page gets maybe 500 new users per week. That's not enough to run A/B tests with reasonable sample sizes. We've given up on testing and just go with the team's best guess."

Product Manager — Enterprise SaaS

"Our experiments take too long to get results"

"We calculated that we need 40,000 users per variant to detect a 10% lift. At our traffic that's a 12-week test. By the time we get results, the market has moved on."

Growth Lead — B2B SaaS, Series A

Experimentation is broader than A/B testing.

A/B tests are one format. Experimentation is the larger operating pattern around them: choosing questions worth testing, designing valid tests, reading them correctly, and making each result useful to the next decision.

When experimentation is working, the team learns faster than it ships blind changes. When it is not working, tests pile up without clarity. Some are underpowered. Some measure the wrong outcome. Some end with "directionally positive" and no one knows what to ship.

The point is not more tests. The point is a cleaner learning loop around product, onboarding, pricing, and retention decisions.

Most experimentation programs stall before the first trustworthy result.

The failure is rarely a lack of ideas. It is usually a setup problem, a measurement problem, or an interpretation problem.

The team starts with test ideas, not decision questions.

That leads to "should we try this?" instead of "what exactly are we trying to prove or rule out?" The result is activity without clarity.

The metrics are not ready for valid tests.

Event coverage is incomplete, north-star definitions are fuzzy, and the success metric is chosen too late. That breaks trust in the result before analysis even starts.

Tests are run in isolation.

Without a hypothesis library, shared review rhythm, or sequencing logic, every experiment resets the learning curve instead of building on the last one.

The team cannot interpret ambiguous outcomes.

Many SaaS tests are not clear wins or losses on day one. Without interpretation rules, teams either ship too early or abandon the test too soon.

Winning tests never ship to production.

Even when a test shows a clear lift, the rollout stalls in engineering queues or stakeholder review. The experiment produced an answer, but the organization cannot act on it.

There is no shared hypothesis library.

Each test lives in its own document, owned by one person. When someone leaves, the learning leaves with them. The team has no compounding knowledge, just a trail of disconnected experiments.

Three signs the practice is real.

01 — The Right Inputs

The team knows what deserves a test.

Questions come from bottlenecks, user behavior, or pricing pressure, not from a backlog of random ideas. The experiment exists to resolve a real decision.

02 — The Right Rules

The test can produce a trustworthy answer.

Instrumentation, primary metrics, sample expectations, and runtime logic are set before the test starts, so the team does not improvise the standard after results appear.

03 — The Right Follow-Through

Each result makes the next test sharper.

The practice has memory. Wins get rolled out correctly. Flat results still teach something. Failed ideas narrow the next move instead of disappearing into a doc nobody opens.

Build the learning loop, not just the test.

Most teams do not have an experimentation problem. They have a readiness problem and a decision problem.

ProductQuant approaches experimentation from the system backward. First define what the team needs to decide. Then check whether the current data layer can support a valid read. Then design the test with decision rules that match the reality of SaaS product change.

That is how the work compounds. The next experiment starts with better instrumentation, clearer hypotheses, and a better sense of which levers are worth touching at all.

01 — Pick

Choose the right question

Start with a bottleneck worth resolving, not a test idea looking for a home.

02 — Check

Verify experiment readiness

Make sure the event layer, metric definitions, and runtime expectations can support a valid read.

03 — Run

Design and interpret correctly

Use explicit rules for launch, runtime, and readout so the result does not become another debate.

04 — Sequence

Turn the result into the next move

Document what changed, what was learned, and what should be tested next because of it.

A strong experimentation practice improves because every test leaves the system clearer than it found it.
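
One way to make "explicit rules for launch, runtime, and readout" concrete is to write the plan down as a record before launch and let a small function apply it afterwards. The sketch below is an illustration only: the ExperimentPlan fields, the read_out function, and the three outcome labels are hypothetical names, not part of any ProductQuant tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentPlan:
    """Everything agreed before launch (all field names are illustrative)."""
    question: str            # the decision the test should resolve
    primary_metric: str      # chosen before the test starts
    users_per_variant: int   # pre-calculated sample size
    min_runtime_days: int    # covers at least one full business cycle
    alpha: float = 0.05      # significance threshold


def read_out(plan: ExperimentPlan, users_per_variant: int,
             days_run: int, p_value: float, lift: float) -> str:
    """Apply the pre-agreed rule: 'keep running', 'ship', or 'do not ship'."""
    too_small = users_per_variant < plan.users_per_variant
    too_short = days_run < plan.min_runtime_days
    if too_small or too_short:
        return "keep running"   # no early calls, even if the numbers look good
    if p_value < plan.alpha and lift > 0:
        return "ship"
    return "do not ship"        # flat or negative: document it and move on


plan = ExperimentPlan(
    question="Does a shorter signup form raise activation?",
    primary_metric="activation_rate",
    users_per_variant=1100,
    min_runtime_days=14,
)
print(read_out(plan, users_per_variant=600, days_run=9, p_value=0.01, lift=0.15))
print(read_out(plan, users_per_variant=1150, days_run=21, p_value=0.03, lift=0.08))
```

Freezing the dataclass is a deliberate choice here: the plan cannot be edited after results appear, which is the point of setting the rules first.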

Go deeper from here.

These are the most relevant ProductQuant assets if you want practical experimentation detail, setup guidance, and examples of what real test discipline looks like.

CLIENT WORK

B2B SaaS — Experiment Program Design
2 weeks
from readiness audit to first valid test running

Experiment Readiness: From Scattered Tests to a Learning Loop

Assessed instrumentation, metric definitions, and traffic volumes for a Series B SaaS team. Rebuilt the experiment design process so every test had a pre-defined primary metric, minimum detectable effect, and interpretation rule before launch.

See the audit →

PLG SaaS — Hypothesis Library
6
sequenced experiments producing compounding decisions

Experimentation Program: From Isolated Tests to Compounding Knowledge

Built a hypothesis library and experiment sequencing framework for a PLG SaaS team — connecting each test result to the next question so the program produced cumulative learning instead of isolated data points.

See the velocity program →

HEALTHCARE SaaS — Experiment Pipeline
47
decisive experiments in 12 months, up from 3 the prior year

Experiment Pipeline: From Scattered Tests to a Structured Program

A healthcare SaaS team had run 3 experiments the prior year with no clear decisions. We built an experiment pipeline with a hypothesis library, pre-defined decision rules, and a monthly review cadence. Within 12 months the team shipped 47 decisive experiments with clear outcomes.

Read the case study →

ECOMMERCE SaaS — Activation Experimentation
20% → 35%
activation rate in under 12 months via disciplined testing

Activation 20% to 35% via Experimentation

An ecommerce SaaS team was stuck at 20% activation. By running a sequenced set of experiments across onboarding friction, time-to-value, and first-action design, the team moved activation to 35% in under a year — each test informing the next.

Read the case study →

Pick the step that matches the gap.

If the team wants a real experimentation practice instead of scattered tests, these are the most relevant ProductQuant paths.

WHO DOES THIS WORK

Jake McMahon

Founder, ProductQuant · MSc Big Data & Business Analytics · BSc Behavioural Psychology · 8+ years B2B SaaS

Jake has helped B2B SaaS teams build experimentation practices that produce clean decisions rather than activity. The work covers readiness assessment (instrumentation, metric definition, traffic calculation), experiment design, interpretation rules, and the hypothesis library that turns individual tests into compounding product knowledge.

Experimentation strategy · A/B test design · Statistical significance · Feature flags · Experiment readiness · Hypothesis management · PostHog experiments · B2B SaaS

COMMON QUESTIONS

Product experimentation: what it is and what it should produce

Questions about your specific situation? Book a call →

What is product experimentation?
Product experimentation is running controlled tests — A/B or multivariate — to measure whether a product change improves a target metric. Done correctly it requires a control/variant split, a pre-defined primary metric, a minimum detectable effect, and a sample size and runtime set before the test starts. Experimentation that skips those inputs produces directional noise, not decisions.
How do you run an A/B test in a B2B SaaS product?
Define the hypothesis, primary metric, and minimum detectable effect before launching. Split traffic randomly between control and variant. Run for full business cycles, with a minimum of 2 weeks to capture weekly usage patterns. Analyse on an intent-to-treat basis (all users assigned to a variant, not just those who engaged). Document every result, including null results. A "directionally positive" result without statistical significance is not a valid basis for shipping.
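
A minimal sketch of an intent-to-treat readout, assuming a hypothetical assignment log of (variant, engaged, converted) tuples. The data is invented for illustration, and the pooled two-proportion z-test is one common choice of significance test, not the only valid one.

```python
from math import sqrt
from statistics import NormalDist


def itt_counts(records):
    """Count users and conversions per variant over everyone assigned."""
    counts = {}
    for variant, _engaged, converted in records:
        users, conversions = counts.get(variant, (0, 0))
        counts[variant] = (users + 1, conversions + int(converted))
    return counts


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Illustrative data: 1,000 users per arm; only 400 variant users engaged.
records = (
    [("control", False, True)] * 100 + [("control", False, False)] * 900
    + [("variant", True, True)] * 90 + [("variant", True, False)] * 310
    + [("variant", False, True)] * 40 + [("variant", False, False)] * 560
)
counts = itt_counts(records)
(n_a, conv_a), (n_b, conv_b) = counts["control"], counts["variant"]
p_value = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
print(f"ITT rates: {conv_a / n_a:.1%} vs {conv_b / n_b:.1%}, p = {p_value:.3f}")
```

In this made-up data the engaged-only variant rate would be 90 of 400 (22.5%), which overstates the effect because engaged users are self-selected. The intent-to-treat rate uses all 1,000 assigned users.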
What sample size do you need for a valid A/B test?
Sample size depends on your baseline conversion rate, the minimum detectable effect (MDE, the smallest improvement worth detecting), and the significance level and statistical power you choose. Use a sample size calculator before starting. Most B2B SaaS products need 500–2,000 users per variant for conversion tests. Low-traffic products often cannot run valid tests without traffic pooling or accepting a larger MDE.
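
For readers who want to see what a sample size calculator does, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The function name, the 5% significance level, and the 80% power default are assumptions chosen for illustration. Dedicated calculators may apply slightly different corrections.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed in each arm of a two-sided test.

    baseline      control conversion rate, e.g. 0.20
    relative_mde  smallest relative lift worth detecting, e.g. 0.25 for +25%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Illustrative inputs: 20% baseline, then two different MDEs.
large_effect = users_per_variant(baseline=0.20, relative_mde=0.25)
small_effect = users_per_variant(baseline=0.20, relative_mde=0.10)
print(large_effect, small_effect)
```

Required sample grows roughly with the inverse square of the effect, so a smaller MDE needs several times more users per variant. That trade-off is why low-traffic products often have to accept a larger MDE.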
How do you decide what to test?
Prioritise by: expected impact × confidence in hypothesis ÷ effort. Test activation funnel steps first — they have the highest downstream leverage. Test copy and friction before design. The question to answer is: "If this test confirms our hypothesis, what do we ship and how much does it matter?" If that answer is unclear, the test is not worth running yet.
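
The scoring rule above (impact × confidence ÷ effort) can be sketched in a few lines. The idea names, the 1 to 10 impact scale, and the person-week effort unit are all illustrative assumptions. Any consistent scale works as long as every idea is scored the same way.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: float      # expected impact on the target metric, 1 to 10
    confidence: float  # confidence in the hypothesis, 0 to 1
    effort: float      # person-weeks to build and run, must be > 0

    @property
    def score(self) -> float:
        return self.impact * self.confidence / self.effort


backlog = [
    Idea("Shorten signup form", impact=8, confidence=0.6, effort=2),
    Idea("Redesign dashboard", impact=9, confidence=0.3, effort=8),
    Idea("Rewrite onboarding email copy", impact=5, confidence=0.7, effort=1),
]

# Highest score first: cheap, well-supported ideas outrank expensive long shots.
ranked = sorted(backlog, key=lambda idea: idea.score, reverse=True)
for idea in ranked:
    print(f"{idea.score:5.2f}  {idea.name}")
```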
What is the difference between A/B testing and feature flags?
An A/B test measures the impact of a change with statistical rigor — it tells you whether a variant is better or worse with confidence. A feature flag controls rollout — it lets you show a feature to a subset of users without measuring impact formally. You can combine them: use a feature flag to split traffic, then measure outcomes through your analytics layer to get an A/B test result. Many teams use feature flags without the measurement layer and call it testing.
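
To illustrate the "flag splits traffic, analytics measures outcomes" combination, here is a minimal sketch of deterministic hash-based assignment. It is a generic technique and an assumption for illustration, not the API of PostHog or any other flagging tool. The function name and experiment key are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment_key: str,
                   variants=("control", "variant")) -> str:
    """Map a user to a variant, stable across sessions and servers."""
    # Salting with the experiment key keeps splits independent across tests.
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The same user always lands in the same arm of a given experiment.
first = assign_variant("user-42", "signup-form-v2")
second = assign_variant("user-42", "signup-form-v2")
print(first, first == second)

# Across many users the split is close to even.
arms = [assign_variant(f"user-{i}", "signup-form-v2") for i in range(10_000)]
variant_share = arms.count("variant") / len(arms)
print(f"variant share: {variant_share:.1%}")
```

The assignment alone is only the flag half. It becomes an A/B test once each exposure is logged with its variant and the primary metric is compared across arms.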
How many tests can you run simultaneously?
It depends on traffic volume and whether tests overlap in the funnel. In low-traffic B2B SaaS, the practical maximum is usually 1–2 concurrent tests: more tests dilute the sample per variant and increase the risk of interaction effects (where one test changes the context for another). Running more tests simultaneously is only valid if the tests affect completely separate, non-overlapping user journeys.
What do you do when a test result is inconclusive?
An inconclusive result usually means the test was underpowered, the metric was too noisy, or the change was too small to matter. Check whether you reached the required sample size. If you did and still see no signal, that is useful information: the change likely does not move the needle enough to justify shipping. Document it, move on, and test something with a larger expected effect size.
How do you run experiments with low traffic?
Low-traffic teams can still experiment by: (1) testing higher in the funnel where volumes are larger, (2) accepting a larger minimum detectable effect, (3) pooling traffic across similar user segments, (4) using pre/post analysis with guardrails, or (5) running qualitative tests alongside quantitative ones. The key is matching the experiment design to the traffic reality instead of pretending you have more data than you do.
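
Option (2) above, accepting a larger minimum detectable effect, can be quantified by turning the sample size question around: given the traffic you actually have, what is the smallest lift a test could reliably detect? This is a rough sketch that approximates both arms with the baseline variance. The 10% baseline is an assumption, and the 500 users per week echoes the example quoted earlier on this page.

```python
from math import sqrt
from statistics import NormalDist


def detectable_relative_lift(baseline, users_per_variant,
                             alpha=0.05, power=0.80):
    """Approximate smallest relative lift detectable with this sample."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    absolute = z * sqrt(2 * baseline * (1 - baseline) / users_per_variant)
    return absolute / baseline


weekly_users = 500  # split evenly across two variants
four_weeks = detectable_relative_lift(0.10, weekly_users * 4 // 2)
eight_weeks = detectable_relative_lift(0.10, weekly_users * 8 // 2)
print(f"4 weeks: {four_weeks:.0%} lift, 8 weeks: {eight_weeks:.0%} lift")
```

If the detectable lift is far larger than anything the change could plausibly produce, the A/B test is not worth running in that form, and the other options in the list become the better fit.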
How long should an A/B test run?
Long enough to reach the pre-calculated sample size and cover at least one full business cycle — typically 2–4 weeks for B2B SaaS. Do not stop a test early because the numbers look good. Do not run a test indefinitely because the numbers look flat. Set the runtime before launch based on traffic volume, baseline rate, and the smallest effect worth detecting.
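
Setting the runtime before launch is simple arithmetic once the sample size is known. The sketch below assumes an even traffic split and a two-week floor to cover weekly usage patterns. The function name is hypothetical, and the weekly traffic figures are illustrative.

```python
from math import ceil


def runtime_weeks(users_per_variant, weekly_users, variants=2, min_weeks=2):
    """Whole weeks needed to fill every variant, never below min_weeks."""
    total_needed = users_per_variant * variants
    return max(min_weeks, ceil(total_needed / weekly_users))


# 1,000 users per variant at 500 eligible users per week.
print(runtime_weeks(1_000, 500))
# Small sample, high traffic: the two-week floor still applies.
print(runtime_weeks(200, 500))
# 40,000 per variant at roughly 6,700 eligible users per week.
print(runtime_weeks(40_000, 6_700))
```

Rounding up to whole weeks keeps each weekday represented equally in the sample.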

Experimentation should make the next decision easier.

If your team has test ideas, a tool, and some dashboards but still does not have a real learning rhythm, start with the scorecard or readiness audit.