Tips for A/B Testing with R

22.11.2017 09:19

Which layout of an advertisement leads to more clicks? Would a different color or position of the purchase button lead to a higher conversion rate? Does a special offer really attract more customers – and which of two phrasings would be better?

For a long time, people have trusted their gut feeling to answer these questions. Today, all of them can be answered by conducting an A/B test. For this purpose, visitors of a website are randomly assigned to one of two groups, between which the target metric (e.g., click-through rate or conversion rate) can then be compared. Due to this randomization, the groups do not systematically differ in any other relevant dimension. This means: If your target metric takes a significantly higher value in one group, you can be quite sure that it is because of your treatment and not because of any other variable.

In comparison to other methods, conducting an A/B test does not require extensive statistical knowledge. Nevertheless, some caveats have to be taken into account.

When making a statistical decision, there are two possible errors (see also table 1): A Type I error means that we observe a significant result although there is no real difference between our groups. A Type II error means that we do not observe a significant result although there is, in fact, a difference. The probability of a Type I error can be controlled and fixed in advance, e.g., at 5%; it is often denoted as α or the significance level. The probability of a Type II error, in contrast, cannot be controlled directly. It decreases with the sample size and the magnitude of the actual effect. When, for example, one of the designs performs far better than the other, the test is more likely to detect the difference than when the difference in the target metric is only small. Therefore, the required sample size can be computed in advance, given α, the desired power (the probability of detecting an effect that really exists) and the minimum effect size you want to be able to detect (statistical power analysis). Knowing the average traffic on the website, you can get a rough idea of how long the test has to run. Setting the rule for the end of the test in advance is often called “fixed-horizon testing”.

Table 1: Overview of possible errors and correct decisions in statistical tests

                                   Effect really exists
                                   No                               Yes
Statistical test       No          True negative                    Type II error (false negative)
is significant         Yes         Type I error (false positive)    True positive
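
The power analysis described above can be sketched with power.prop.test from R's built-in stats package. The baseline rate of 6%, the minimum rate of 8% worth detecting, the power of 80% and the daily traffic of 400 visitors are assumptions chosen purely for illustration, not values from a real test.

```r
# Assumed values for illustration only (not taken from a real test)
baselineRate  <- 0.06  # expected rate in group A
targetRate    <- 0.08  # smallest rate in group B worth detecting
alpha         <- 0.05  # significance level
power         <- 0.80  # 1 - probability of a Type II error
dailyVisitors <- 400   # hypothetical average traffic per day

# Solve for the required sample size per group
powerResult <- power.prop.test(p1 = baselineRate, p2 = targetRate,
                               sig.level = alpha, power = power)

# Round up and translate into a rough test duration
nPerGroup  <- ceiling(powerResult$n)
daysNeeded <- ceiling(2 * nPerGroup / dailyVisitors)

print(nPerGroup)
print(daysNeeded)
```

Smaller minimum effects drive the required sample size up quickly, so it is worth deciding beforehand which difference would actually matter for your business.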

Statistical tests generally provide the p-value which reflects the probability of obtaining the observed result (or an even more extreme one) just by chance, given that there is no effect. If the p-value is smaller than α, the result is denoted as “significant”.

When running an A/B test you may not always want to wait until the end but take a look from time to time to see how the test performs. What if you suddenly observe that your p-value has already fallen below your significance level? Doesn’t that mean that the winner has already been identified and you could stop the test? Although this conclusion is very appealing, it can also be very wrong. The p-value fluctuates strongly during the experiment, and even if the p-value at the end of the fixed horizon is substantially larger than α, it can go below α at some point during the experiment. This is the reason why looking at your p-value several times is a little bit like cheating: it makes your actual probability of a Type I error substantially larger than the α you chose in advance. This is called “α inflation”. At best you only change the color or position of a button although it does not have any impact. At worst, your company provides a special offer which causes costs but no actual gain. The more often you check your p-value during the data collection, the more likely you are to draw wrong conclusions. In short: As attractive as it may seem, don’t stop your A/B test early just because you are observing a significant result. In fact, it can be proven that if you extend your time horizon to infinity, you are guaranteed to get a significant p-value at some point in time.

The following code simulates some data and plots the course of the p-value during the test. (For the first samples which are still very small R returns a warning that the chi square approximation may be incorrect.)


# Load required packages:
library(ggplot2)

# Choose parameters:
pA <- 0.06 # True click through rate for group A
pB <- 0.06 # True click through rate for group B
nA <- 500 # Number of cases for group A
nB <- 500 # Number of cases for group B
alpha <- 0.05 # Significance level

# Simulate data (one random timestamp per visitor within one week):
timeGrid <- seq(as.POSIXct("2016-06-02", tz = "GMT"),
                as.POSIXct("2016-06-09", tz = "GMT"), by = 1)
data <- data.frame(group = factor(rep(c("A", "B"), c(nA, nB))),
                   timestamp = sample(timeGrid, nA + nB),
                   clickedTrue = factor(c(rbinom(n = nA, size = 1, prob = pA),
                                          rbinom(n = nB, size = 1, prob = pB)),
                                        levels = c(0, 1)))

# Order data by timestamp
data <- data[order(data$timestamp), ]

# Compute current p-values after every observation:
pValues <- c()
index <- c()
for (i in 50:nrow(data)) {
  presentData <- table(data$group[1:i], data$clickedTrue[1:i])
  # Only run the test once both groups have been observed
  if (all(rowSums(presentData) > 0)) {
    pValues <- c(pValues, prop.test(presentData)$p.value)
    index <- c(index, i)
  }
}
results <- data.frame(index = index,
                      pValue = pValues)

# Plot the p-values (horizontal line marks the significance level):
ggplot(results, aes(x = index, y = pValue)) +
  geom_line() +
  geom_hline(yintercept = alpha) +
  scale_y_continuous(name = "p-value", limits = c(0, 1)) +
  scale_x_continuous(name = "Observed data points") +
  theme(text = element_text(size = 20))

The figure below visualises a simulation with 500 observations per group and true rates of 6% in both groups, i.e., there is no actual difference. The line shows what you would observe if you looked at the p-value after each new observation. In the end (after 1000 observations), the p-value does the right thing: It is larger than 0.05, which is good as there is, in fact, no difference between the groups. But you can see that the p-value nevertheless crossed the threshold several times. Had you stopped the test at one of these points, you would have drawn the wrong conclusion.
This is of course only one possible result from a random simulation, but it makes the point clear. If you look at the p-value again and again and stop the experiment as soon as p < 0.05, your probability of a Type I error increases and becomes larger than 5%.

Figure: Example of the course of the p-value during a test for two groups with no actual difference
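
To get a feeling for how large the α inflation can become, the simulation can be repeated many times and the share of false positives compared between a single look at the end and stopping at the first significant interim look. The following is only a sketch: the helper simulatePeeking, the schedule of nine looks and the number of 500 repetitions are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical helper: simulate one A/A test (no true difference) and
# report whether it would be called significant with and without peeking
simulatePeeking <- function(nPerGroup = 500, p = 0.06, alpha = 0.05,
                            peeks = seq(100, 500, by = 50)) {
  clicksA <- rbinom(nPerGroup, size = 1, prob = p)
  clicksB <- rbinom(nPerGroup, size = 1, prob = p)
  # p-value at every interim look (same number of visitors in both groups)
  pAtPeek <- sapply(peeks, function(n) {
    suppressWarnings(
      prop.test(c(sum(clicksA[1:n]), sum(clicksB[1:n])), c(n, n))$p.value
    )
  })
  c(peeking = any(pAtPeek < alpha, na.rm = TRUE),   # stop at first p < alpha
    fixed   = isTRUE(tail(pAtPeek, 1) < alpha))     # look only at the end
}

# Repeat the experiment and estimate both false-positive rates
simResults <- replicate(500, simulatePeeking())
rates <- rowMeans(simResults)
print(rates)
```

Because the last look coincides with the planned end of the test, the rate with peeking can never be lower than the fixed-horizon rate; how much higher it is depends on how often you look.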

The following code shows you how to test the difference between two rates in R, e.g., click-through rates or conversion rates. You can apply the code to your own data by replacing the URL to the example data with your file path. To test the difference between two proportions, you can use the function prop.test, which for two groups is equivalent to Pearson’s chi-squared test. For small samples you should use Fisher’s exact test instead. prop.test returns a p-value and a confidence interval for the difference between the two rates. The interpretation of a 95% confidence interval is as follows: If such an analysis were conducted many times, 95% of the resulting confidence intervals would contain the true difference. Afterwards you can also take a look at the fluctuations of the p-value during the test by using the code from above.


# Specify file path:
dataPath <-

# Read data (read_csv comes from the readr package)
library(readr)
data <- read_csv(file = dataPath)

# Inspect structure of the data
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  3 variables:
##  $ group      : chr  "A" "A" "A" "B" ...
##  $ time       : POSIXct, format: "2016-06-02 02:17:53" "2016-06-02 03:03:54" ...
##  $ clickedTrue: int  0 0 1 0 0 0 0 0 0 0 ...

# Change the column names
names(data) <- c("group", "time", "clickedTrue") 

# Change type of group to factor 
data$group <- as.factor(data$group) 

# Change type of click through variable to factor
data$clickedTrue <- as.factor(data$clickedTrue) 
levels(data$clickedTrue) <- c("0", "1")

# Compute frequencies and conduct test for proportions 
# (Frequency table with successes in the first column)
freqTable <- table(data$group, data$clickedTrue)[, c(2,1)] 

# Print frequency table
freqTable
##       1   0
##   A  20 480
##   B  40 460

# Conduct significance test
prop.test(freqTable, conf.level = .95)

##  2-sample test for equality of proportions with continuity
##  correction
## data:  freqTable
## X-squared = 6.4007, df = 1, p-value = 0.01141
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.071334055 -0.008665945
## sample estimates:
## prop 1 prop 2 
##   0.04   0.08
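
For small samples, Fisher’s exact test can be run on the same kind of frequency table with fisher.test from the stats package. The sketch below rebuilds the table printed above by hand so that it runs without the data file; with your own data you would pass freqTable directly.

```r
# Rebuild the frequency table shown above (clicks in the first column)
freqTableExample <- matrix(c(20, 40, 480, 460), nrow = 2,
                           dimnames = list(group = c("A", "B"),
                                           clickedTrue = c("1", "0")))

# Fisher's exact test does not rely on the chi-squared approximation
fisherResult <- fisher.test(freqTableExample)

print(fisherResult$p.value)   # two-sided p-value
print(fisherResult$estimate)  # estimated odds ratio (A vs. B)
print(fisherResult$conf.int)  # confidence interval for the odds ratio
```

Note that fisher.test reports an odds ratio rather than a difference between the two rates, so its confidence interval is not directly comparable to the one returned by prop.test.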

There are some more pitfalls, but most of them can easily be avoided. First, as a counterpart of stopping your test early because of a significant result, you could gather more data after the planned end of the test because the results have not yet become significant. This would likewise lead to an α inflation. A second, similar problem arises when running several tests at once: The probability of a false-positive result is then α for each of the tests, but the overall probability that at least one of the results is a false positive is much larger (about 40% for ten independent tests at α = 5%). So always keep in mind that some of the significant results may have been caused by chance. Third, you can also get into trouble when you reach the required sample size very fast and stop the test after only a few hours. You should always consider that the behavior of the users in this specific time slot might not be representative of the general case. To avoid this, you should plan the duration of the test so that it covers at least 24 hours, or a full week if customers behave differently at the weekend than on a typical workday. A fourth caveat concerns a rather moral issue: When users discover they are part of an experiment and suffer from disadvantages as a result, they might rightly become angry. (This problem will probably not arise due to a different-colored button, but maybe because of different prices or special offers.)
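
How quickly the overall false-positive probability grows with the number of tests can be made concrete with a few lines of R. The sketch assumes independent tests, and the four p-values are made up for illustration. p.adjust from the stats package provides standard corrections such as Bonferroni and Holm.

```r
alpha <- 0.05

# Probability of at least one false positive among k independent tests
nTests <- c(1, 5, 10, 20)
familyWiseError <- 1 - (1 - alpha)^nTests
print(round(familyWiseError, 3))

# Hypothetical p-values from four simultaneous A/B tests
pValues <- c(0.01, 0.04, 0.03, 0.20)

# Adjusted p-values can be compared against alpha as usual
pBonferroni <- p.adjust(pValues, method = "bonferroni")
pHolm       <- p.adjust(pValues, method = "holm")
print(pBonferroni)
print(pHolm)
```

The Holm correction is never more conservative than Bonferroni while still controlling the probability of at least one false positive, which is why it is often the preferred of the two.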

If you are willing to invest some more time, you may want to learn about techniques to avoid α inflation when conducting multiple tests or stopping your test as soon as the p-value crosses a certain threshold. In addition, there are techniques to include previous knowledge in your computations with Bayesian approaches. The latter is especially useful when you have rather small samples, but previous knowledge about the values that your target metric usually takes.
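
As a first impression of the Bayesian approach, the following sketch uses a Beta-Binomial model, which is one common choice and not the only possible one. The click counts are taken from the example table above. The uniform Beta(1, 1) prior is an assumption and is where previous knowledge about typical click-through rates would enter.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Observed data (counts from the example frequency table above)
clicksA <- 20; visitorsA <- 500
clicksB <- 40; visitorsB <- 500

# Assumed prior: Beta(1, 1), i.e. no previous knowledge.
# A more informative prior would use larger values centred on typical rates.
priorAlpha <- 1
priorBeta  <- 1

# Draw from the posterior distributions of both click-through rates
nDraws <- 100000
postA <- rbeta(nDraws, priorAlpha + clicksA, priorBeta + visitorsA - clicksA)
postB <- rbeta(nDraws, priorAlpha + clicksB, priorBeta + visitorsB - clicksB)

# Posterior probability that B has the higher rate, and a 95% credible
# interval for the difference between the two rates
probBBetter <- mean(postB > postA)
credInt <- quantile(postB - postA, probs = c(0.025, 0.975))

print(probBBetter)
print(credInt)
```

In contrast to a p-value, probBBetter can be read directly as the probability that design B is better, given the data and the chosen prior.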
