A typical A/B testing life cycle


A/B testing (or experimentation) is one of the widely used statistical approaches for improving the quality of content or services.

There are two groups, control and alternative, which are observed in order to detect a statistically significant difference in the selected metrics.

Marketers use the approach to optimize content, gain more conversions, and increase revenue. A/B testing is therefore one of many content optimization approaches, one that has a strong mathematical background and is also easy to understand.

Three main phases can be distinguished in the A/B testing life cycle:

1. A/B test planning and preparation;

2. A/B test implementation;

3. Analysis of the test results and application of changes.

Statistical theory lets us use hypothesis testing to calculate a significance level and make a statistically justified choice among the alternatives.

A hypothesis should be formulated explicitly, for example "The conversion rate of Variant A is equal to the conversion rate of Variant B". This is the so-called null hypothesis.

The alternative hypothesis is the complement of the null hypothesis (the conversion rates differ), so an A/B test lets us reject, or fail to reject, the null hypothesis based on the observed data.

Planning for an experiment

Choose an experiment location

This step answers the question "Where will the experiment be performed?". The experiment location is the URL(s) of the page or group of pages that will be included in the A/B test.

Location

Choose a page and define testing groups.

The original variant is called the baseline.

Formulating an experiment hypothesis

This implicit step means selecting the goals and metrics used to check the experiment results.

All goals should be reachable and all metrics should be calculable for every variation included in the experiment; in other words, all variations should be under equal conditions.

Hypothesis testing requires fully comparable results across the alternatives.

Choose experiment goals and metrics

Goal(s)

A goal usually means that a particular visitor performs some action. The action can be a simple HTML event such as a button click, a visit to a target page, or adding an item to the basket.

The action can also be triggered by a complex condition (page viewing duration), by reaching the end of a funnel, or by many other advanced cases (total amount of products sold, ...).

All tests have goals whose conversion rate you want to increase. A test must have one or several goals.

Metric(s)

The unique conversion rate for a goal is the ratio of the number of converted visitors (those who reached the selected goal(s)) to the number of all tested visitors.

Total revenue is the sum of the revenue from individual visitors.

All metrics are calculated for each variation of the test in order to estimate its efficiency.
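
As a minimal sketch, the two metrics can be computed per variation as follows. The `VariationStats` structure and the `revenue_per_visitor` helper are hypothetical names introduced only for illustration; they are not part of any particular testing tool.

```python
from dataclasses import dataclass


@dataclass
class VariationStats:
    """Aggregated counts for one variation (hypothetical structure)."""
    visitors: int      # unique visitors who were shown the variation
    conversions: int   # unique visitors who reached the goal
    revenue: float     # sum of revenue over those visitors


def conversion_rate(stats: VariationStats) -> float:
    """Converted visitors divided by all tested visitors."""
    if stats.visitors == 0:
        return 0.0  # no traffic yet, so no meaningful rate
    return stats.conversions / stats.visitors


def revenue_per_visitor(stats: VariationStats) -> float:
    """Total revenue normalized by traffic, so variations stay comparable."""
    if stats.visitors == 0:
        return 0.0
    return stats.revenue / stats.visitors


baseline = VariationStats(visitors=200, conversions=30, revenue=450.0)
print(conversion_rate(baseline))      # 0.15
print(revenue_per_visitor(baseline))  # 2.25
```

Normalizing revenue by the number of visitors is a design choice made here so that variations with unequal traffic can still be compared.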

Create alternative content for variations

Alternative content can be created in various ways.

Some sensitive questions of A/B testing


  • When should the test be stopped once a winner has been determined?
  • Is there enough traffic to conduct the test?
  • How long will the test run?
  • Will the conversion rate fall during the test?
  • What if the traffic is nonhomogeneous?


Implementing the experiment

Basics of how a regular A/B test is performed

  1. The tested traffic is randomly split into portions among all variations.

  2. Within the experiment, each visitor interacts with only one variation.

  3. The experiment is modeled as independent Bernoulli trials, where accomplishing the target action within a variation is called a conversion (the successful case).

Traffic splitting flow

For any type of distribution, we should first be able to define how much of our traffic we wish to engage in the test. A general A/B test may need to split traffic across several tiers:

Tier 1 splits traffic between Test and Not Test (many clients expect to test only a share of the total page traffic).

Tier 2 splits among testing segments (it applies only when segmentation is used; some clients want to test traffic from a specific segment, such as 'USA traffic' or 'UK traffic', based on geolocation or other attributes).

Tier 3 splits among testing content (variations).

Tier 4 splits between different methods or strategies of content optimization (for example, ML model vs. Bayesian bandits, or regular approach vs. intelligent conversion approach).

Tier 4 is not applicable to the regular testing approach.
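
The first three tiers can be sketched as a chain of checks. This is only an illustration under stated assumptions: the hash-based `bucket` helper, the salt names, and the `assign` signature are hypothetical, and hashing is just one common way to give a visitor a stable position in the split.

```python
import hashlib


def bucket(visitor_id: str, salt: str) -> float:
    """Map a visitor to a stable pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{visitor_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def assign(visitor_id, segment, test_share, tested_segments, variations):
    """Return the variation to serve, or None if the visitor is not tested."""
    # Tier 1: Test vs. Not Test traffic.
    if bucket(visitor_id, "tier1") >= test_share:
        return None
    # Tier 2: only visitors from the tested segments take part
    # (an empty or missing set means segmentation is not used).
    if tested_segments and segment not in tested_segments:
        return None
    # Tier 3: even split among the content variations.
    index = int(bucket(visitor_id, "tier3") * len(variations))
    return variations[min(index, len(variations) - 1)]


choice = assign("visitor-42", "USA traffic", 0.5, {"USA traffic"}, ["A", "B"])
print(choice)  # "A", "B", or None, always the same for this visitor
```

Using a different salt per tier keeps the tiers independent of each other, so being selected for the test does not bias which variation is served.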


General scheme for traffic splitting.


Let's consider the different approaches to splitting traffic among content variations:

  1. Even (uniform) distribution: each variation gets an equal share of traffic, and this share does not depend on any conversion metrics (widely used in regular A/B testing)
  2. Manual distribution: the shares of traffic for the served variations are not equal
  3. Automatic or adaptive distribution: the shares of traffic can change during testing. This option is used for conversion optimization mode.



Even (uniform) distribution

In this type of distribution, traffic is split randomly and evenly between the different versions of the page to determine which variation performs best.
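
A minimal sketch of such a random split, using the standard `random.choices` function, is shown here. The `pick_variation` helper is a hypothetical name; passing unequal `weights` would turn the same code into a manual distribution.

```python
import random


def pick_variation(variations, weights=None, rng=random):
    """Pick one variation; equal weights are used when none are given."""
    if weights is None:
        weights = [1.0] * len(variations)
    return rng.choices(variations, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
shown = [pick_variation(["A", "B"], rng=rng) for _ in range(10_000)]
share_a = shown.count("A") / len(shown)
print(round(share_a, 2))  # close to 0.5 for an even split
```

In a real system the picked variation would also have to be stored per visitor, so that the same visitor keeps seeing the same content.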

Adaptive distribution

The system starts by distributing traffic evenly among the variants.

Progressively, the system gives more weight to the variant with the best conversion rate.
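
One way to obtain this behavior is Thompson sampling over Beta posteriors. The text does not say which algorithm the system uses, so the choice of technique, the `thompson_pick` name, and the Beta(1, 1) prior are all assumptions made for this sketch.

```python
import random


def thompson_pick(stats, rng=random):
    """Pick a variation; stats maps variation -> (occurrences, conversions)."""
    best, best_draw = None, -1.0
    for variation, (occurrences, conversions) in stats.items():
        # Beta(1, 1) prior: with no data every variant has the same chance,
        # so the split starts out even.
        draw = rng.betavariate(1 + conversions, 1 + occurrences - conversions)
        if draw > best_draw:
            best, best_draw = variation, draw
    return best


rng = random.Random(7)  # fixed seed only to make the sketch repeatable
stats = {"A": (100, 10), "B": (100, 90)}
picks = [thompson_pick(stats, rng) for _ in range(1000)]
print(picks.count("B") / len(picks))  # B converts far better, so it dominates
```

As evidence accumulates, the posterior of the better variant concentrates on a higher rate and it wins the draw more often, which is what shifts the traffic toward it.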

Segmentation

Segmentation is an approach that differentiates testing depending on the visitor or on other conditions (geolocation, device, time, or something else).

For each segment we can assign a special content distribution. The system will then distribute the variants according to some user traits.

In that case, the marketer should be able to choose between the even distribution (default) and the weighted distribution.
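
As a sketch, a per-segment distribution with an even default could be resolved like this. The `distribution_for` function and the shape of the `overrides` mapping are hypothetical and serve only to illustrate the idea.

```python
def distribution_for(segment, variations, overrides=None):
    """Return {variation: share} for a segment; the shares sum to 1."""
    weights = (overrides or {}).get(segment)
    if weights is None:
        weights = {v: 1.0 for v in variations}  # default: even split
    total = sum(weights.get(v, 0.0) for v in variations)
    if total <= 0:
        raise ValueError("weights must contain at least one positive share")
    return {v: weights.get(v, 0.0) / total for v in variations}


# Hypothetical marketer setting: UK visitors see variation B three times
# as often as variation A; every other segment keeps the even default.
overrides = {"UK traffic": {"A": 1, "B": 3}}
print(distribution_for("UK traffic", ["A", "B"], overrides))
print(distribution_for("USA traffic", ["A", "B"], overrides))
```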

Start and implement an experiment

Publish

Once the test is set up, it must be published so that end users can start experiencing the different variants.

Gathering data

Special definitions

Some definitions are specific to A/B testing:

1. Unique occurrences are variations generated once per visitor. The A/B testing system should remember which variation was displayed to a particular visitor.

2. Unique conversions are goals counted once per visitor, reached after the visitor has seen the testing content.

An A/B testing system mostly operates in the occurrences/conversions space.

Data collection

Every time a variant is served to an end user, the system must collect the following data:
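
The two definitions can be sketched with an in-memory log. The `ExperimentLog` class is a hypothetical illustration; a production system would persist this state (for example in a cookie or a database) rather than keep it in process memory.

```python
class ExperimentLog:
    """Tracks unique occurrences and unique conversions for one test."""

    def __init__(self):
        self.shown = {}         # visitor_id -> variation first displayed
        self.converted = set()  # (visitor_id, goal) pairs already counted

    def record_occurrence(self, visitor_id, variation):
        """Return the variation the visitor must see; the first one wins."""
        return self.shown.setdefault(visitor_id, variation)

    def record_conversion(self, visitor_id, goal):
        """Count a goal once per visitor, and only after an occurrence."""
        if visitor_id not in self.shown:
            return False  # the visitor never saw the testing content
        key = (visitor_id, goal)
        if key in self.converted:
            return False  # repeated conversion, not unique
        self.converted.add(key)
        return True

    def counts(self, variation, goal):
        """Return (unique occurrences, unique conversions) for a variation."""
        occurrences = sum(1 for v in self.shown.values() if v == variation)
        conversions = sum(
            1 for (vid, g) in self.converted
            if g == goal and self.shown[vid] == variation
        )
        return occurrences, conversions


log = ExperimentLog()
log.record_occurrence("v1", "A")
log.record_occurrence("v1", "B")   # ignored: v1 keeps seeing "A"
log.record_conversion("v1", "signup")
log.record_conversion("v1", "signup")  # ignored: already counted
print(log.counts("A", "signup"))  # (1, 1)
```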

User/session data:

  • Timestamps
  • Browser
  • Device
  • Screen Resolution and orientation
  • Geo Location
  • User segment (if applicable)
  • Session number
  • Additional attributes (available from JS environment)
  • Traffic source and referral parameters (Source: Campaign, Direct, Referral, Search)
  • Cookies from third parties
  • Calculated attributes for marketing needs

Test/variation/goal data:

  • Variant displayed (occurrences)
  • Goal reached (conversions)
  • If one of the goals is reached, the above data must be enriched with that specific goal
  • If the goal is linked to custom data (such as product price * quantity), that information must be sent too

Calculating the test winner (stopping criteria, statistical conclusions, and reasoning)
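
A collected record could look like the following sketch. The `ExposureEvent` class, its field names, and the sample values are assumptions chosen to mirror the lists above; they do not describe the payload of any specific tool.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ExposureEvent:
    """One served variant plus user/session context (hypothetical schema)."""
    visitor_id: str
    test_id: str
    variation: str                      # variant displayed (occurrence)
    timestamp: float = field(default_factory=time.time)
    browser: Optional[str] = None
    device: Optional[str] = None
    geo: Optional[str] = None
    segment: Optional[str] = None       # only if segmentation applies
    session_number: Optional[int] = None
    traffic_source: Optional[str] = None
    goal: Optional[str] = None          # filled in when a goal is reached
    goal_value: Optional[float] = None  # custom data, e.g. price * quantity
    extra: dict = field(default_factory=dict)  # additional attributes


event = ExposureEvent("v-1", "homepage-test", "B", device="mobile")
# Enrich the same record once the visitor reaches a goal.
event.goal, event.goal_value = "purchase", 19.99 * 2
payload = json.dumps(asdict(event))
print(payload)
```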


Modern testing tools no longer fix the test sample size and duration in advance. Instead, they use different approaches to determine, on the fly, whether a test is conclusive.

Regular or Frequentist Statistics

This approach uses a test statistic, such as the t-test, to decide whether a difference has been detected at the selected confidence level.

The basic formulas for computing the conversion rate and the critical values for hypothesis testing.
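
As an illustration of such a calculation, here is a two-proportion z-test with a pooled standard error. Note the assumption: the text names a t-criterion, while this sketch swaps in the z-test, a common large-sample choice for comparing conversion rates. The function name and the sample counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both conversion rates are equal."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 both variants share one rate, estimated from the pooled data.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation in the data, nothing to detect
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
# Reject H0 at the 95% confidence level when p_value < 0.05.
```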


Bayesian approach

Bayesian A/B testing employs Bayesian inference methods to give you the probability that A is better (or worse) than B.

The immediate advantage of this method is that we can understand the result intuitively even without a proper statistical training. This means that it’s easier to communicate with business stakeholders.

Another advantage is that you don’t have to worry too much about the test size when you evaluate the result. You can start evaluating the result from the first day (or maybe even the first hour) by reading the probability that one of A and B is better than the other. Of course, it would be better to have enough data, but it’s much better to be able to say, for example, “A is better than B with 60% probability” than “We don’t have enough data yet.” And you can decide at any time whether you want to wait longer.

In Bayesian reasoning, the fundamental goal is to compute a posterior distribution of the conversion rate. The posterior distribution depends on the prior distribution and the likelihood; this fundamental relationship is called Bayes' rule.

The prior represents our beliefs before we have gathered any evidence, and different statistical models can be applied here.



A Bayesian approach to the analysis of A/B tests has many important advantages compared to approaches based on estimating statistical significance.

It can often enable you to draw useful inferences, even where conversion rates and sample sizes are low.

A weak signal, if that is all you have, is enough for some marketing decisions; you can make your own decisions about the level of confidence you need based on the business situation.

The use cases where the Bayesian approach is preferable:

  • conversions too rare to reach significance
  • low traffic
  • optimizing for a smaller segment (mobile traffic or other)

The basic formulas for Bayesian calculations.
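
A minimal sketch of such a calculation uses a Beta prior with a binomial likelihood, which gives a Beta posterior for each conversion rate, and then estimates the probability that B beats A by Monte Carlo sampling. The uniform Beta(1, 1) prior, the function name, and the sample counts are assumptions for illustration.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   draws=20_000, seed=0, prior=(1.0, 1.0)):
    """Estimate P(rate_B > rate_A) from the two Beta posteriors."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    alpha0, beta0 = prior
    wins = 0
    for _ in range(draws):
        # Posterior = Beta(prior successes + conversions,
        #                  prior failures + non-conversions).
        rate_a = rng.betavariate(alpha0 + conv_a, beta0 + n_a - conv_a)
        rate_b = rng.betavariate(alpha0 + conv_b, beta0 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


print(prob_b_beats_a(10, 100, 15, 100))      # a weak but usable signal
print(prob_b_beats_a(100, 1000, 150, 1000))  # the same rates with more data
```

The result reads directly as "B is better than A with this probability", which is the statement the section describes as easy to communicate to business stakeholders.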

Adaptive or Machine Learning based methods

This approach does not involve stopping; it proposes continuous optimization instead of regular or sequential A/B testing.

The following options are available:

  1. Simple adaptive approaches (multi-armed bandits, e-greedy, softmax, UCB1 algorithm)
  2. Advanced adaptive approaches such as Bayesian multi-armed bandits
  3. ML-based methods that learn a model from testing attributes and observed data (a very wide range, from linear regression to deep neural networks)

In any case, the adaptive model selects the best-performing content for a particular visitor or segment of visitors.
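
As one example from the first group, the selection rule of the UCB1 algorithm can be sketched as follows. The function name and the list-based bookkeeping are hypothetical; the rule itself is the standard one: try every variation once, then pick the highest mean conversion rate plus an exploration bonus.

```python
import math


def ucb1_select(pulls, rewards):
    """Return the index of the variation to serve next.

    pulls[i] is how often variation i was shown, rewards[i] is its number
    of conversions.
    """
    # Serve every variation at least once before comparing them.
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm
    total = sum(pulls)
    scores = [
        rewards[i] / pulls[i] + math.sqrt(2 * math.log(total) / pulls[i])
        for i in range(len(pulls))
    ]
    return max(range(len(pulls)), key=scores.__getitem__)


print(ucb1_select([0, 5], [0, 1]))        # 0: not shown yet
print(ucb1_select([100, 100], [10, 20]))  # 1: same exposure, better rate
print(ucb1_select([1000, 10], [100, 1]))  # 1: same rate, far less explored
```

The bonus term shrinks as a variation collects more traffic, so the rule keeps exploring rarely shown variations without ever stopping the optimization.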

Analyze the testing results

The reporting dimensions and measures

As in any analytics system, we need to define the dimensions and measures and create a regular taxonomy.

Dimensions and measures can be observed directly or calculated from others.

1. User-related attributes

2. Web session (visit) attributes

3. Content-related attributes

4. Experiment-related attributes

5. Calculated attributes

6. Measures (observed and calculated)

The best approach is to create a system for real-time "slice and dice" analytics that investigates base measures such as conversion rate in different projections.
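
A "slice" of the conversion rate along one dimension can be sketched with plain dictionaries. The record layout (`variation`, `converted`, plus arbitrary dimension keys) and the `conversion_by` name are assumptions; a real system would more likely run an equivalent query in an analytics store.

```python
from collections import defaultdict


def conversion_by(records, dimension):
    """Return {(variation, dimension value): conversion rate}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [visitors, conversions]
    for record in records:
        key = (record["variation"], record[dimension])
        totals[key][0] += 1
        totals[key][1] += 1 if record["converted"] else 0
    return {key: conv / seen for key, (seen, conv) in totals.items()}


# Hypothetical per-visitor records, one per unique occurrence.
records = [
    {"variation": "A", "device": "mobile", "converted": True},
    {"variation": "A", "device": "mobile", "converted": False},
    {"variation": "B", "device": "mobile", "converted": True},
    {"variation": "B", "device": "desktop", "converted": False},
]
print(conversion_by(records, "device"))
```

Passing a different dimension name (for example a geolocation or traffic source key) gives another projection of the same measure.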

The base reports

At any time during the test, we should be able to display a clear state of the test to the business stakeholders.

There are several groups of reports:

  1. Brief results for the current time (dashboard)
  2. Variations performance report with metric explanations at the current time (cumulative), comparing the performance of the experiment variations
  3. Variations performance report over time (data drill-down by the time dimension)
  4. Segment performance report (+ over time)
  5. Methods (testing) performance report (+ over time)

All reports should display the difference (uplift, gain, or improvement) between the tested entities for the selected metrics, along with the level of confidence of the presented value.

Confidence interval: the confidence interval measures the uncertainty around the improvement.

Statistical significance represents the likelihood that the difference in conversion rates between a given variation and the baseline is not due to chance.

The statistical significance level reflects your risk tolerance and confidence level.
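
As a sketch of how a report could compute the interval around the improvement, here is a normal-approximation (Wald) confidence interval for the absolute difference in conversion rates. The choice of this particular interval, the function name, and the sample counts are assumptions; reporting tools may use other methods.

```python
import math
from statistics import NormalDist


def uplift_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (low, high) bounds for rate_B - rate_A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


low, high = uplift_interval(100, 1000, 150, 1000)
print(f"uplift between {low:.3f} and {high:.3f}")
# An interval that excludes 0 corresponds to a significant difference
# at the chosen confidence level.
```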

The typical report should show at least the following info:

  • The variant (with a link to display it)
  • The number of visitors (and unique visitors)
  • The number of conversions and the conversion rate
  • A confidence indicator
  • An explicit message showing the level of confidence about the success of the test

An example of VWO reporting:


An example of Optimizely reporting:



Reports should have segmenting capabilities to provide analytics projections for specific attributes.


The business users are the only ones who can decide to stop a test and decide whether the winning variant should be applied.


Resolution

If the business stakeholder decides that the test is conclusive enough, they should be able to select the winning variant and promote it.
