TL;DR

  • Running an experiment without prerequisites is like running a clinical trial without a protocol. You need a defined event taxonomy, a clear activation metric, and a hypothesis document before the first user is assigned to a variant.
  • PostHog experiments are built on feature flags. When you create an experiment, PostHog automatically creates a feature flag behind the scenes. The flag assigns users to variants; the experiment tracks the impact.
  • Sample size calculation is the most commonly skipped step. For a 3% baseline with a 15% relative MDE, you need roughly 23,000 visitors per variant. For a 5% relative MDE, you need roughly 210,000, about 9× more data.
  • The peeking problem is the #1 cause of false positives. Checking significance repeatedly before reaching your target sample size, and stopping at the first significant result, inflates your false-positive rate from 5% to potentially 20–30%.
  • HogQL is your forensic analysis tool. Use it to segment results by user properties, check guardrail metrics, and join experiment exposure with downstream revenue data.
  • PostHog is sufficient for most teams running fewer than 20 experiments per quarter. Graduate to Statsig or Optimizely only when you need CUPED variance reduction, MutEx isolation, or multi-armed bandits.

What You Need Before You Start

[Figure: Setup PostHog A/B Experiments: The Technical Implementation Guide]

Before you create your first experiment, you need a foundation that most teams skip. Running an experiment without these prerequisites is like running a clinical trial without a protocol — you will generate data, but you will not be able to trust it.

Your event taxonomy needs to be defined before any experiment ships. This means you know exactly which events represent user activation, which events represent revenue, and which events represent friction. If "signed_up" and "account_created" are two different events in your PostHog project, fix that first. The experiment will track whatever events you tell it to track, and ambiguous event definitions produce ambiguous results. If your taxonomy is not yet in this shape, a PostHog consulting engagement covers the taxonomy design and implementation review before any experiment infrastructure is built on top of it.
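One way to keep the taxonomy enforceable is to define it in code and route every capture call through it. The sketch below is illustrative only: the category labels and the captureTyped helper are hypothetical, and the client is assumed to be any object exposing a capture(eventName, properties) method, as posthog-js does.

```javascript
// Hypothetical taxonomy: one canonical event name per business moment.
const EVENT_TAXONOMY = Object.freeze({
  account_created: 'activation', // canonical signup event (no parallel "signed_up")
  onboarding_completed: 'activation',
  subscription_started: 'revenue',
  payment_succeeded: 'revenue',
  api_error: 'friction',
});

// Wraps a client exposing capture(event, properties) and rejects
// event names that are not part of the agreed taxonomy.
function captureTyped(client, eventName, properties = {}) {
  if (!Object.prototype.hasOwnProperty.call(EVENT_TAXONOMY, eventName)) {
    throw new Error(`Unknown event "${eventName}": add it to the taxonomy first`);
  }
  client.capture(eventName, {
    ...properties,
    event_category: EVENT_TAXONOMY[eventName],
  });
}
```

With a wrapper like this, a stray "signed_up" call fails loudly in development instead of silently creating a second signup event.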

You need a clear activation metric that sits upstream of your primary success metric. If your primary metric is "subscription_started," your activation metric might be "onboarding_completed." This tells you whether users who reach the experiment actually have a fair chance of converting. Without an activation metric, you cannot distinguish between "the variant did not work" and "users never reached the variant."

The final prerequisite is a hypothesis document — a short paragraph that states what change you are making, why you expect it to move the needle, what the primary metric is, what the guardrail metrics are, and what sample size you need before you will make a decision. This document is not bureaucracy. It is the single most effective defense against the peeking problem, which is the number one cause of false-positive experiment conclusions.

Step 1: Install and Configure PostHog

The fastest way to get PostHog into your codebase is through the PostHog Wizard:

npx @posthog/wizard@latest

For a manual installation, the PostHog JavaScript SDK is initialized with your project API key and host URL:

    import posthog from 'posthog-js'

    posthog.init('<ph_project_api_key>', {
      api_host: 'https://us.i.posthog.com',
      person_profiles: 'identified_only', // create person profiles only for identified users
      capture_pageview: true,
      capture_pageleave: true,
    })

The person_profiles: 'identified_only' setting means PostHog creates person profiles only for identified users, which keeps your person data clean and your experiment cohorts accurate. Anonymous users (e.g., landing page visitors) are still tracked through an anonymous distinct_id persisted in the browser (cookie plus localStorage under the default persistence setting), so flag assignment persists across sessions in the same browser even before identification.

The critical configuration detail: your API key must be the project API key (public), not the personal API key (private). The project key goes in client-side code. The personal key is for server-to-server API calls only. Mixing these up is a common setup error that either breaks tracking or exposes credentials.
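A cheap guard against that mix-up is to validate the key before initializing the SDK. The sketch below assumes PostHog's key prefixes are phc_ for project keys and phx_ for personal keys; verify this against the keys shown in your own project settings. The helper name assertClientSafeKey is hypothetical.

```javascript
// Assumption: project keys start with "phc_" and personal keys with "phx_".
function assertClientSafeKey(apiKey) {
  if (typeof apiKey !== 'string' || apiKey.length === 0) {
    throw new Error('PostHog API key is missing');
  }
  if (apiKey.startsWith('phx_')) {
    throw new Error('Personal API key detected: never ship this in client-side code');
  }
  if (!apiKey.startsWith('phc_')) {
    throw new Error('Unrecognized key format: expected a project API key');
  }
  return apiKey; // safe to pass to posthog.init(...)
}
```

Calling this in your app's bootstrap code turns a silent credential leak into a build-time or startup failure.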

Step 2: Set Up Feature Flags for A/B Testing

PostHog experiments are built on top of feature flags. When you create an experiment in the PostHog UI, it automatically creates a feature flag behind the scenes. This flag is what assigns each user to a variant — control, test, or any additional variants you define.

To create a new experiment, navigate to the Experiments section in your PostHog app and click New Experiment. The wizard walks you through three screens:

Screen 1: Experiment Description — the experiment name, a written hypothesis, and the feature flag key. PostHog auto-generates a flag key from the experiment name, but you can also link an existing flag if you already have one configured. If you link an existing flag, it must have at least two variants with control as the first variant.

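Once the flag exists, your component branches on the variant key it returns:

    import { useFeatureFlagVariantKey } from 'posthog-js/react'

    export function OnboardingFlow() {
      // 'variant-b' must match a variant key defined on the flag
      const variant = useFeatureFlagVariantKey('new-onboarding-test')

      if (variant === 'variant-b') {
        return <SimplifiedOnboardingFlow />
      }
      // Control, and the fallback while flags are still loading
      return <StandardOnboardingFlow />
    }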

The React hook useFeatureFlagVariantKey returns the variant string for the current user. The key detail is that the flag evaluation is deterministic — the same user ID always gets the same variant, which is essential for maintaining experiment integrity across sessions and devices.
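To see why deterministic assignment needs no stored state, here is a minimal sketch of hash-based bucketing. It is an illustration of the general technique, not PostHog's actual algorithm: the FNV-1a hash, the function names, and the variant list are all assumptions chosen for brevity.

```javascript
// 32-bit FNV-1a: a simple, stable string hash (illustrative choice).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Maps (flagKey, distinctId) to a variant using cumulative rollout shares.
function assignVariant(flagKey, distinctId, variants) {
  const bucket = fnv1a(`${flagKey}.${distinctId}`) / 0x100000000; // in [0, 1)
  let cumulative = 0;
  for (const { key, share } of variants) {
    cumulative += share;
    if (bucket < cumulative) return key;
  }
  return variants[variants.length - 1].key; // guards against rounding gaps
}

const exampleSplit = [
  { key: 'control', share: 0.5 },
  { key: 'variant-b', share: 0.5 },
];
console.log(assignVariant('new-onboarding-test', 'user-123', exampleSplit));
```

Because the result depends only on the flag key and the user ID, the same user lands in the same variant on every call, and including the flag key in the hash input keeps assignments independent across experiments.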

Screen 2: Variant Rollout — you set the traffic split (default is 50/50), the rollout percentage (you can start with 10% of traffic and expand), and the participant type. User-level assignment is the default. Group-level assignment is necessary when you need all users at the same organization to see the same variant — a common requirement in B2B SaaS where multiple users share a workspace.
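For group-level assignment, the SDK has to know which group the current user belongs to before the flag is evaluated. A minimal sketch, assuming a group type named workspace (hypothetical; use whichever group type your project defines) and a client that exposes posthog-js's group(groupType, groupKey, groupProperties) method:

```javascript
// Associates the current user with a workspace so that group-level flags
// resolve to the same variant for everyone in that workspace.
function attachWorkspace(client, workspace) {
  if (!workspace || !workspace.id) {
    throw new Error('A workspace id is required for group-level assignment');
  }
  // 'workspace' is a hypothetical group type name; it must match the group
  // type chosen as the experiment's participant type.
  client.group('workspace', String(workspace.id), { name: workspace.name });
}
```

Call it right after login, once the user's workspace is known, so that subsequent flag evaluations use the group rather than the individual user.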

Screen 3: Metrics — Primary metrics determine experiment success or failure. Secondary metrics help you monitor side effects. The most important rule: define all metrics before you launch. Adding a metric after the experiment has started does not corrupt the data, but it does tempt you to cherry-pick the metric that makes your variant look good.

Step 3: Design the Experiment

An experiment without a hypothesis is just a coin flip with analytics. The hypothesis gives your experiment structure, purpose, and a clear decision criterion.

A proper hypothesis has three components: the change you are making, the causal mechanism you expect, and the measurable outcome.

"We are simplifying the onboarding form from 8 fields to 4 fields because we expect reduced friction will increase activation rate, measured as the percentage of users who complete onboarding and reach the dashboard within 5 minutes."

That hypothesis tells you exactly what to measure, why you are measuring it, and what success looks like.

Your primary metric should be the single most important outcome the experiment is designed to move. This is typically a conversion rate, an activation rate, or a revenue metric. Pick one. Not three. Not a weighted composite. One. If you optimize for multiple primary metrics, you create ambiguity about what "success" means.

Your guardrail metrics are the metrics you need to ensure do not get worse. Common guardrail metrics include churn rate, API error rate, page load time, and support ticket volume. A variant can "win" on the primary metric while destroying performance on a guardrail metric — and shipping it would be a net negative for the business.

Define the minimum detectable effect you care about. If a 2% relative improvement in activation is too small to justify the engineering cost of maintaining a new onboarding flow, then your MDE should be set higher. The MDE is not a statistical parameter — it is a business parameter that tells the sample size calculator how much data you need.
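The design decisions in this step can be captured as a small, reviewable object that is checked before launch. This is a sketch under stated assumptions: the field names and the validatePlan helper are hypothetical and do not correspond to any PostHog schema, and the guardrail event names are placeholders.

```javascript
// Hypothetical plan shape; the field names are illustrative.
const examplePlan = {
  change: 'Simplify the onboarding form from 8 fields to 4',
  mechanism: 'Reduced friction increases activation',
  primaryMetric: 'onboarding_completed',
  guardrailMetrics: ['api_error', 'support_ticket_created'],
  relativeMde: 0.15,
  sampleSizePerVariant: 23000,
};

// Returns a list of problems; an empty list means the plan is ready for launch.
function validatePlan(plan) {
  const problems = [];
  if (!plan.change || !plan.mechanism) {
    problems.push('State the change and the causal mechanism');
  }
  if (typeof plan.primaryMetric !== 'string' || plan.primaryMetric === '') {
    problems.push('Pick exactly one primary metric');
  }
  if (!Array.isArray(plan.guardrailMetrics) || plan.guardrailMetrics.length === 0) {
    problems.push('List at least one guardrail metric');
  }
  if (!(plan.relativeMde > 0)) {
    problems.push('Set a minimum detectable effect');
  }
  if (!Number.isInteger(plan.sampleSizePerVariant) || plan.sampleSizePerVariant <= 0) {
    problems.push('Fix the sample size per variant before launch');
  }
  return problems;
}

console.log(validatePlan(examplePlan));
```

Requiring primaryMetric to be a single string, rather than a list, encodes the "pick one" rule directly in the plan's shape.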

Step 4: Calculate Sample Size

[Figure: PostHog A/B test sample size sensitivity comparison. The Cost of Precision: smaller detectable effects require disproportionately more data.]

Sample size calculation is the most commonly skipped step in the experiment process, and it is the reason most experiments are underpowered. An underpowered experiment has a high probability of missing a real effect — not because the variant does not work, but because you did not collect enough data to detect it.

The sample size depends on four inputs:

  • Baseline conversion rate — your current performance metric
  • Minimum detectable effect — the smallest relative improvement you care about
  • Statistical power — probability of correctly detecting a real effect (standard: 80%)
  • Significance level — your false-positive tolerance (standard: 5%, p < 0.05)

Here are two worked examples that show why this matters:

Parameter                                       Example A          Example B
Baseline conversion rate                        3%                 3%
Minimum detectable effect                       15% relative       5% relative
Statistical power                               80%                80%
Significance level                              5%                 5%
Required sample per variant                     23,000 visitors    210,000 visitors
Duration at 2,000 visitors/day (both variants)  23 days            210 days

The only difference between Example A and Example B is the MDE. Example A detects a 15% relative lift. Example B detects a 5% relative lift. The required sample is roughly 9× larger. This is the fundamental constraint of experimentation: the required sample size grows with roughly the inverse square of the effect you want to detect, so cutting the MDE to a third multiplies the data requirement by about nine.

Use Evan Miller's sample size calculator to compute your numbers. Input your baseline rate, your MDE, your desired power, and your confidence level.
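If you want to sanity-check a calculator's output, the standard two-proportion normal approximation fits in a few lines. This is a sketch, not the formula any particular calculator uses: the z-scores are hard-coded for a two-sided 5% significance level and 80% power, and the function name is hypothetical. Calculators differ slightly in how they pool variance, so expect results in the same range as the table above rather than identical figures.

```javascript
// z-scores for the standard settings (assumed constants, not computed here).
const Z_ALPHA = 1.96; // two-sided significance level of 5%
const Z_BETA = 0.8416; // statistical power of 80%

// Per-variant sample size for detecting a relative lift on a conversion rate:
// n = (z_alpha + z_beta)^2 * (p1(1 - p1) + p2(1 - p2)) / (p2 - p1)^2
function sampleSizePerVariant(baselineRate, relativeMde) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeMde);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((Z_ALPHA + Z_BETA) ** 2 * variance) / delta ** 2);
}

console.log(sampleSizePerVariant(0.03, 0.15)); // roughly 24,000
console.log(sampleSizePerVariant(0.03, 0.05)); // roughly 208,000
```

The squared delta in the denominator is what drives the cost of precision: shrinking the detectable lift by a factor of three increases the required sample by about a factor of nine.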

The critical discipline: you decide the sample size before you launch, and you do not stop the experiment until you reach it. Violating this discipline is called the peeking problem.

Step 5: Run the Experiment

With the experiment launched and sample size calculated, the hardest part begins: waiting. The temptation to check results daily and stop the moment you see significance is enormous. This temptation is what destroys experiment validity.

The peeking problem works like this. If you check your results after 100 users per variant, there is a meaningful chance you will see a "significant" result purely by random noise. If you stop and declare a winner based on that early peek, you have inflated your false-positive rate from 5% to something much higher — potentially 20–30% or more.

PostHog offers both Bayesian and frequentist statistical engines, so confirm which one your experiment is using. A frequentist test assumes a fixed-horizon design: the significance calculation is only valid once the experiment reaches its pre-determined sample size. If you check significance at 50% of the target sample size and the result shows p < 0.05, that p-value is not trustworthy.
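You can convince yourself of the inflation with a small A/A simulation, where both variants share the same true conversion rate and every "significant" result is therefore a false positive. The sketch below is an illustration under assumptions: it uses a pooled two-proportion z-test, ten evenly spaced looks, and a seeded pseudo-random generator (mulberry32) for reproducibility. None of the parameter values come from PostHog.

```javascript
// Small seeded PRNG (mulberry32) so the simulation is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pooled two-proportion z-test: true when |z| exceeds 1.96 (two-sided 5% level).
function isSignificant(convA, convB, n) {
  const pooled = (convA + convB) / (2 * n);
  const se = Math.sqrt((2 * pooled * (1 - pooled)) / n);
  if (se === 0) return false;
  return Math.abs((convB - convA) / n / se) > 1.96;
}

// Runs A/A tests and compares "evaluate once at the end" with
// "declare a winner at the first significant look".
function simulate({ runs, usersPerVariant, looks, rate, seed }) {
  const random = mulberry32(seed);
  const step = Math.floor(usersPerVariant / looks);
  let fixedHorizonHits = 0;
  let peekingHits = 0;
  for (let r = 0; r < runs; r++) {
    let convA = 0;
    let convB = 0;
    let sawSignificance = false;
    let finalSignificant = false;
    for (let look = 1; look <= looks; look++) {
      for (let i = 0; i < step; i++) {
        if (random() < rate) convA++;
        if (random() < rate) convB++;
      }
      const significant = isSignificant(convA, convB, look * step);
      if (significant) sawSignificance = true;
      if (look === looks) finalSignificant = significant;
    }
    if (finalSignificant) fixedHorizonHits++;
    if (sawSignificance) peekingHits++;
  }
  return { fixedHorizon: fixedHorizonHits / runs, peeking: peekingHits / runs };
}

console.log(simulate({ runs: 2000, usersPerVariant: 2000, looks: 10, rate: 0.1, seed: 42 }));
```

In runs like this, the fixed-horizon false-positive rate should land near the nominal 5%, while the stop-at-first-significance rule should come out several times higher, even though nothing about the variants differs.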

A practical approach to running experiments well:

  1. Check technical health daily — are events flowing, are both variants receiving traffic, is there a sample ratio mismatch? Do not check statistical significance until the experiment reaches its target sample size.
  2. Set a calendar reminder for the day the sample size is reached, and evaluate results on that day and not before.
  3. Start with a small rollout — approximately 5% of traffic — for the first 2 to 3 days. This lets you verify events are logging correctly, monitor for crashes, and check session replays for unexpected behavior.
  4. Once technical health checks pass, expand to 100%.

Sample Ratio Mismatch (SRM) is one of the most important technical health indicators. If you configured a 50/50 split but your data shows 60/40, something is broken. SRM typically indicates that the feature flag is being called inconsistently. When SRM is detected, the experiment results are not trustworthy — pause, debug, and relaunch.
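A sample ratio check is a chi-square goodness-of-fit test on the exposure counts. The sketch below is a minimal version for two variants; the function name and the choice of the p < 0.001 threshold (critical value 10.828 at one degree of freedom) are assumptions, picked because SRM alerts are usually set conservatively to avoid false alarms.

```javascript
// Chi-square critical value for 1 degree of freedom at p < 0.001.
// (3.841 would correspond to p < 0.05.)
const SRM_CRITICAL_VALUE = 10.828;

// Compares observed exposure counts with the configured split.
function checkSampleRatio(controlUsers, testUsers, expectedControlShare = 0.5) {
  const total = controlUsers + testUsers;
  const expectedControl = total * expectedControlShare;
  const expectedTest = total - expectedControl;
  const chiSquare =
    (controlUsers - expectedControl) ** 2 / expectedControl +
    (testUsers - expectedTest) ** 2 / expectedTest;
  return { chiSquare, mismatch: chiSquare > SRM_CRITICAL_VALUE };
}

console.log(checkSampleRatio(6000, 4000)); // chiSquare 400, mismatch true
console.log(checkSampleRatio(5050, 4950)); // chiSquare 1, mismatch false
```

Note that the test accounts for sample size: 5,050 versus 4,950 is well within random variation, while 6,000 versus 4,000 is not.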

Step 6: Analyze Results with HogQL

The PostHog experiment results panel gives you the headline numbers — conversion rates for each variant, statistical significance, and recommended duration. But for B2B SaaS products with complex user journeys, the headline numbers rarely tell the full story. This is where HogQL becomes your forensic analysis tool.

Check Guardrail Metrics

Here's a HogQL query that checks whether the winning variant caused a spike in API errors:

    -- Backticks are required because the property name contains "/" and "-"
    SELECT
        properties.`$feature/new-onboarding-test` AS variant,
        count(*) AS error_count,
        count(DISTINCT person_id) AS users_impacted
    FROM events
    WHERE event = 'api_error'
    GROUP BY variant

Segment by User Properties

This query segments experiment results by tenant size, which can reveal Simpson's Paradox — where the overall result looks positive but hides negative effects in specific subgroups:

    SELECT
        properties.`$feature/onboarding_test` AS variant,
        person.properties.tenant_size AS segment,
        count(DISTINCT person_id) AS users,
        count(DISTINCT CASE WHEN event = 'activation' THEN person_id END) * 100.0
            / count(DISTINCT person_id) AS activation_rate
    FROM events
    WHERE event IN ('$pageview', 'activation')
    GROUP BY variant, segment

Revenue Analysis

Join experiment exposure data with payment events to see if the variant that improved activation also improved downstream revenue:

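    -- Aggregate to one row per user on each side before joining, so repeated
    -- flag calls do not multiply the revenue total
    SELECT
        e.variant AS variant,
        count() AS exposed_users,
        countIf(r.revenue > 0) AS paying_users,
        sum(r.revenue) AS total_revenue
    FROM (
        SELECT person_id, any(properties.$feature_flag_response) AS variant
        FROM events
        WHERE event = '$feature_flag_called' AND properties.$feature_flag = 'pricing_test'
        GROUP BY person_id
    ) AS e
    LEFT JOIN (
        SELECT person_id, sum(properties.amount) AS revenue
        FROM events
        WHERE event = 'payment_succeeded'
        GROUP BY person_id
    ) AS r ON e.person_id = r.person_id
    GROUP BY variant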

Performance note: HogQL queries on large date ranges or high-volume events can be slow. Shorten your date range or enable sampling if your query times out.

Step 7: Decision Framework

[Figure: PostHog A/B test decision framework matrix. The Ship/Kill Matrix: a structured framework for deciding experiment outcomes.]

Once the experiment reaches its target sample size and you have run your HogQL analyses, you need a structured framework for deciding whether to ship, iterate, or kill the variant. The worst outcome of an experiment is not a "loss." The worst outcome is an ambiguous result that leads to a decision based on gut feel rather than data.

Primary Metric                   Guardrail Metrics       Decision                  Action
Statistically significant win    No degradation          Ship                      Roll out to 100%, document learnings
Statistically significant win    One or more degraded    Do not ship               Investigate root cause, iterate on variant
No significant difference        No degradation          Iterate or ship neutral   If implementation cost is low, ship. If high, iterate.
No significant difference        Degraded                Kill                      Roll back, document why it failed
Statistically significant loss   Any                     Kill                      Roll back immediately, document learnings
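The matrix is small enough to encode directly, which removes any room for post-hoc reinterpretation. A minimal sketch follows; the function name and the string labels are hypothetical, and the logic simply mirrors the five rows above.

```javascript
// primaryResult: 'win' or 'loss' (statistically significant), or 'neutral'
// (no significant difference). guardrailsDegraded: true if any guardrail got worse.
function decide(primaryResult, guardrailsDegraded) {
  if (primaryResult === 'loss') {
    return 'kill'; // a significant loss is a kill regardless of guardrails
  }
  if (primaryResult === 'win') {
    return guardrailsDegraded ? 'do-not-ship' : 'ship';
  }
  if (primaryResult === 'neutral') {
    return guardrailsDegraded ? 'kill' : 'iterate-or-ship-neutral';
  }
  throw new Error(`Unknown primary result: ${primaryResult}`);
}

console.log(decide('win', false));
```

Writing the rule down before the results arrive serves the same purpose as the hypothesis document: the decision is fixed before the data can tempt you.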

When the decision is "ship," roll out the feature flag to 100% of traffic over a graduated schedule — 50% for 1 day, 75% for 1 day, then 100%. This graduated rollout is your final safety net.

When the decision is "kill," document why the experiment failed. A failed experiment is not wasted effort: it is evidence that this particular change does not produce the expected outcome. The best teams maintain a shared log of both wins and losses so the organization builds cumulative learning.

Common Mistakes

  1. Peeking at results and stopping early. The single most common and most damaging error. Each peek is a new statistical test, inflating your false-positive rate. Decide the sample size before launch and do not evaluate significance until you reach it.
  2. Running underpowered tests. If your sample size calculation shows you need 23,000 users per variant but you stop at 5,000, your experiment is not just inconclusive — it's actively misleading. Underpowered tests that happen to show significance have exaggerated effect sizes (Type M error).
  3. Metric hacking. Changing the primary metric mid-experiment or choosing the metric that makes your variant look best after the fact. The hypothesis document locks in the primary metric before data collection begins.
  4. Ignoring sample ratio mismatch. If your experiment is configured for 50/50 traffic but the data shows 60/40, the results are not trustworthy. SRM indicates a technical bug.
  5. Including ineligible users. If your experiment tests a new pricing page but your metric includes users who never visited the pricing page, you are diluting your results with unaffected users.
  6. Testing during seasonal periods. Running an experiment during Black Friday or your industry's fiscal year-end produces data that does not reflect typical user behavior.

When to Graduate from PostHog to Statsig or Optimizely

PostHog is an excellent experimentation platform for startups and mid-market companies. It consolidates analytics, feature flags, session replays, and experiments into a single pipeline. But there is a point where the limitations become a bottleneck.

Signal 1: You need CUPED variance reduction. CUPED uses pre-experiment data to reduce noise, cutting experiment duration by 30–50%. Statsig and Optimizely support CUPED. PostHog does not.
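To make the idea concrete, here is a minimal sketch of the CUPED adjustment itself: each user's in-experiment metric is corrected by the part that their pre-experiment behavior already predicts. The data is made up, the helper names are hypothetical, and a production implementation would estimate theta on pooled data from all variants before comparing the adjusted means.

```javascript
const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Population covariance; covariance(xs, xs) is the variance of xs.
function covariance(xs, ys) {
  const mx = mean(xs);
  const my = mean(ys);
  return mean(xs.map((x, i) => (x - mx) * (ys[i] - my)));
}

// CUPED: theta = cov(pre, post) / var(pre)
//        adjusted_i = post_i - theta * (pre_i - mean(pre))
function cupedAdjust(pre, post) {
  const theta = covariance(pre, post) / covariance(pre, pre);
  const preMean = mean(pre);
  return post.map((y, i) => y - theta * (pre[i] - preMean));
}

// Made-up per-user values: sessions before vs. during the experiment.
const preSessions = [2, 4, 6, 8, 10];
const postSessions = [3, 5, 6, 9, 12];
const adjustedSessions = cupedAdjust(preSessions, postSessions);

console.log('variance before:', covariance(postSessions, postSessions));
console.log('variance after:', covariance(adjustedSessions, adjustedSessions));
```

The adjustment leaves the mean unchanged and removes the variance explained by the covariate, which is why the same effect can be detected with fewer users.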

Signal 2: You need mutually exclusive experiment groups. If you're running 10+ experiments simultaneously on overlapping user populations, you need MutEx isolation. Statsig has native MutEx support. PostHog excludes users exposed to multiple variants but doesn't provide systematic experiment isolation.

Signal 3: You need multi-armed bandit traffic optimization. A multi-armed bandit dynamically shifts traffic toward the better-performing variant during the experiment, minimizing opportunity cost. Statsig supports this. PostHog uses fixed traffic splits.

Signal 4: Your marketing team needs a no-code experiment builder. PostHog's no-code experiment feature is in beta. Optimizely has a mature visual editor. If your experiment volume is bottlenecked by engineering bandwidth, a no-code tool may be worth the investment.

For most companies running fewer than 20 experiments per quarter, PostHog is more than sufficient. The unified platform, generous free tier, and EU data hosting make it the best choice for engineering-led teams that value transparency and open-source flexibility.

FAQ

How much traffic do I need for a significant experiment?

Traffic requirements depend on your baseline conversion rate and the minimum detectable effect. For a typical B2B SaaS product with a 3% baseline and 15% relative MDE, you need approximately 23,000 visitors per variant. At 2,000 visitors per day split across two variants, that translates to 23 days. If your MDE is only 5% relative, you need 210,000 visitors per variant, or approximately 210 days at the same traffic level.

Can I run experiments on the backend?

Yes. PostHog's Python, Node.js, and Go SDKs support server-side feature flags. Server-side experiments are the gold standard for testing pricing logic, search algorithms, or database-heavy features where client-side flickering would degrade the user experience.
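A server-side check can be sketched as a function that receives the SDK client and falls back to control if evaluation fails. The flag key, variant name, and prices below are hypothetical; the only thing assumed about the client is that it behaves like posthog-node, whose getFeatureFlag(flagKey, distinctId) resolves to the assigned variant key.

```javascript
// Chooses a price server-side, so the browser never renders the wrong one.
async function monthlyPriceFor(client, distinctId) {
  let variant;
  try {
    // Hypothetical flag key; must match the experiment's feature flag.
    variant = await client.getFeatureFlag('pricing_test', distinctId);
  } catch (error) {
    variant = undefined; // fall back to control if flag evaluation fails
  }
  // Hypothetical prices: 'test' sees the new price, everyone else sees control.
  return variant === 'test' ? 39 : 49;
}
```

Passing the client in as a parameter keeps the pricing logic testable with a stub and makes the control fallback explicit.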

What is a Sample Ratio Mismatch and how do I fix it?

A sample ratio mismatch occurs when the actual traffic split between variants does not match the configured split. A 50/50 experiment showing 60/40 traffic indicates a technical bug. The most common causes: the feature flag is called too late, some users are filtered out before flag evaluation, or the SDK fails to evaluate for a subset of users. PostHog flags SRM automatically. When detected, pause, debug, and relaunch.

What's the difference between a feature flag and an A/B experiment?

A feature flag controls which users see a feature variant with no statistical evaluation — it's a deployment tool. An A/B experiment wraps a feature flag with a defined success metric, sample size calculation, and significance testing — it turns a flag into a measured hypothesis test. Every PostHog experiment uses a feature flag, but not every feature flag is part of an experiment.

Should I use Bayesian or frequentist statistics?

PostHog supports both approaches, so the choice is yours. Frequentist statistics calculate p-values based on a fixed-horizon test. Bayesian methods calculate the probability that one variant outperforms another based on prior beliefs and observed data. Bayesian methods are better for teams that want to ship incremental improvements continuously. Frequentist methods are better for teams that need clear, binary decisions with controlled false-positive rates. Neither is universally superior.

How do I prevent experiments from impacting each other?

PostHog's default behavior is to exclude users exposed to more than one variant from the analysis. This is conservative but reduces effective sample size. For more sophisticated isolation, use flag dependencies to ensure only one experiment's flag evaluates for a given user, or segment experiments by user properties. If you need systematic isolation at scale, Statsig's MutEx groups are the industry standard.

About the Author

Jake McMahon builds growth infrastructure for B2B SaaS companies — analytics, experimentation, and predictive modeling that turns product data into revenue decisions. He has designed and analyzed PostHog experiments across multiple engagements, helping teams move from guessing to knowing with statistical discipline. Book a diagnostic call to discuss your experimentation setup.
