Understanding A/B Testing

A/B testing (also known as bucket testing or split-run testing) is a randomized experiment with two variants, A and B. It applies statistical hypothesis testing, specifically two-sample hypothesis testing as used in statistics. A/B testing compares two versions of a single variable, typically by testing subjects' responses to variant A against variant B and determining which of the two variants is more effective.

The Use Case:

Let's assume you are building a model for Google that predicts which ad to show for a search query. An existing model, M1, already performs well. You build a new model, M2, and evaluate it on test data (taken from the previous day's traffic), where it outperforms M1 in every analysis. Good offline results do not guarantee that the new model will perform well in production, so to be sure, we run an A/B test.

Approach:

Split the users into two groups, A and B, say in a 90:10 ratio. Group A is served by the old model (M1) and group B by the new model (M2). We use a 90:10 split so that, if the new model performs poorly in production, only a small share of users is affected. To compare the models, we use:

CTR (Click-Through Rate) = No. of clicks / No. of ads shown

If CTR(M2) > CTR(M1), we can say that the new model is better at predicting which ad to show. We start the test with a 90:10 split. If the new model keeps performing better for a couple of days, we shift the split to, say, 70:30 and then 50:50. Once we are confident that the new model performs well, we roll it out fully.
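The split and the metric described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `salt` value, and the choice of hashing user IDs with SHA-256 are hypothetical and not taken from any real ad-serving system. Hashing is used here so that a given user always lands in the same group.

```python
import hashlib


def assign_group(user_id: str, treatment_share: float = 0.10,
                 salt: str = "ad-model-test") -> str:
    """Return "B" (new model M2) for roughly `treatment_share` of users, else "A" (M1).

    The salt and the hashing scheme are illustrative assumptions.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 8 hex digits of the hash to a position in [0, 1).
    position = int(digest[:8], 16) / 16 ** 8
    return "B" if position < treatment_share else "A"


def ctr(clicks: int, ads_shown: int) -> float:
    """Click-through rate: clicks divided by ads shown (0.0 if nothing was shown)."""
    return clicks / ads_shown if ads_shown else 0.0
```

One consequence of this design is that raising `treatment_share` from 0.10 to 0.30 only moves users from A to B: anyone already in group B stays there, which keeps the ramp from 90:10 to 70:30 consistent for existing users.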

Problem:

The problem with comparing only the CTR of the two models is this. Let's assume the following numbers:

No. of clicks in M1 = 100

No. of ads shown in M1 = 10,000

CTR(M1) = 1%

No. of clicks in M2 = 2

No. of ads shown in M2 = 100

CTR(M2) = 2%

If we look only at the CTR, CTR(M2) is greater than CTR(M1). But M2 has shown only 100 ads, so its CTR rests on very little data and is a much less reliable estimate. A better approach is to use confidence intervals.

The 95% confidence interval for CTR(M1) = [0.8%, 1.2%]

The 95% confidence interval for CTR(M2) = [0.55%, 7%]
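The text does not say how these intervals were computed. One common method for a proportion such as CTR is the Wilson score interval, and it appears to give numbers close to the ones quoted above. The sketch below assumes that method; the function name `wilson_interval` and the default `z = 1.96` (the usual value for a 95% interval) are choices made for this example.

```python
import math
from typing import Tuple


def wilson_interval(clicks: int, ads_shown: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score confidence interval for a click-through rate.

    Assumption: this is one standard method, not necessarily the one
    used to produce the figures in the text.
    """
    if ads_shown <= 0:
        raise ValueError("ads_shown must be positive")
    p = clicks / ads_shown
    denom = 1 + z ** 2 / ads_shown
    center = (p + z ** 2 / (2 * ads_shown)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / ads_shown + z ** 2 / (4 * ads_shown ** 2)
    ) / denom
    return center - half_width, center + half_width


ci_m1 = wilson_interval(100, 10_000)  # M1: 100 clicks out of 10,000 ads
ci_m2 = wilson_interval(2, 100)       # M2: 2 clicks out of 100 ads
```

The interval for M2 comes out far wider than the one for M1 because it is based on 100 impressions instead of 10,000.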

The confidence interval for M2 is very wide, while the confidence interval for M1 is narrow.

The confidence interval of M1 lies entirely inside that of M2, so we cannot say that M2 is performing better even though its CTR is higher. The best scenario is when the confidence intervals of the two models do not overlap and CTR(M2) > CTR(M1). Only then can we say that M2 is performing better and deploy it in production.
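The decision rule in this section can be written as a small check. This is a minimal sketch: the function names are hypothetical, and the intervals are passed in as (lower, upper) pairs using the values quoted in the text.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def m2_clearly_better(ci_m1: Interval, ci_m2: Interval) -> bool:
    """True only if M2's whole interval lies above M1's whole interval.

    This implies both conditions from the text: the intervals do not
    overlap and CTR(M2) > CTR(M1).
    """
    return ci_m2[0] > ci_m1[1]


# Intervals quoted in the text, written as fractions instead of percentages.
ci_m1 = (0.008, 0.012)
ci_m2 = (0.0055, 0.07)
deploy_m2 = m2_clearly_better(ci_m1, ci_m2)  # False: the intervals overlap
```

With these numbers the check says not to deploy yet, which matches the reasoning above: more traffic has to go through M2 before its interval is narrow enough to compare.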

Real World Application:

Google engineers ran their first A/B test in the year 2000 in an attempt to determine the optimum number of results to display on its search engine results page. The first test was unsuccessful due to glitches that resulted from slow loading times. Later A/B testing research would be more advanced, but the foundation and underlying principles generally remain the same, and in 2011, 11 years after Google's first test, Google ran over 7,000 different A/B tests.

Reference: https://en.wikipedia.org/wiki/A/B_testing