A/B Testing – Statistical Experiments for the Web



Defining A/B testing

At its most fundamental level, A/B testing just involves creating two different versions of a web page. Sometimes, the changes are major redesigns of the site or the user experience, but usually, the changes are as simple as changing the text on a button. Then, for a short period of time, new visitors are randomly shown one of the two versions of the page. The site tracks their behavior, and the experiment determines whether one version or the other increases the users’ interaction with the site. This may mean more click-through, more purchases, or any other measurable behavior.

This is similar to methods used in other domains under different names. The basic framework, which randomly tests two or more groups simultaneously, is sometimes called a randomized controlled experiment or an online controlled experiment. It’s also referred to as split testing, since the participants are split into two groups.

These are all examples of between-subjects experiment design. Experiments that use these designs all split the participants into two groups. One group, the control group, gets the original environment. The other group, the test group, gets the modified environment that those conducting the experiment are interested in testing.


Experiments of this sort can be single-blind or double-blind. In single-blind experiments, the subjects don’t know which group they belong to. In double-blind experiments, those conducting the experiments also don’t know which group the subjects they’re interacting with belong to. This safeguards the experiments against biases that can be introduced by participants being aware of which group they belong to. For example, participants could get more engaged if they believe they’re in the test group because this is newer in some way. Or, an experimenter could treat a subject differently in a subtle way because of the group that they belong to.

As the computer is the one that directly conducts the experiment, and because those visiting your website aren’t aware of which group they belong to, website A/B testing is generally an example of a double-blind experiment.

Of course, this is an argument for only conducting the test on new visitors. Otherwise, a returning user might recognize that the design has changed, which would skew the experiment. For example, users may be more likely to click on a new button when they recognize that the button is, in fact, new. If they are new to the site as a whole, however, the button itself may not stand out enough to warrant extra attention.

In some cases, an experiment can test more than two variants of a site. This divides the test subjects into more groups, so more subjects need to be available to compensate. Otherwise, the experiment’s statistical validity may be in jeopardy: if each group doesn’t have enough subjects, and therefore observations, the test has a larger error rate, and results will need to be more extreme to be significant.

In general, though, you’ll want to have as many subjects as you reasonably can. Of course, this is always a trade-off. Getting 500 or 1000 subjects may take a while, given the typical traffic of many websites, but you still need to take action within a reasonable amount of time and put the results of the experiment into effect. So we’ll talk later about how to determine the number of subjects that you actually need to get a certain level of significance.

Another wrinkle is that you’ll want to know as soon as possible whether one option is clearly better, so that you can begin to profit from it early. Framed as a multi-armed bandit problem, this is the tension between exploration and exploitation: between exploring the problem space and exploiting the best resource you’ve found in the experiment so far. We won’t get into this further, but it is a factor to stay aware of as you perform A/B tests in the future.

Because of the power and simplicity of A/B testing, it’s being widely used in a variety of domains. For example, marketing and advertising make extensive use of it. Also, it has become a powerful way to test and improve measurable interactions between your website and those who visit it online.

The primary requirement is that the interaction be fairly limited and very measurable. Interestingness would not make a good metric; the click-through rate or the number of pages visited would. Because of this, A/B tests often validate changes in the placement or the text of buttons that call for action from the users. For example, a test might compare the performance of Click for more! against Learn more now!. Another test may check whether a button placed in the upper-right section of the page increases sales versus one placed in the center.

These changes are all incremental, and you probably don’t want to break a large site redesign into pieces and test all of them individually. In a larger redesign, several changes may work together and reinforce each other. Testing them incrementally and only applying the ones that increase some metric can result in a design that’s not aesthetically pleasing, is difficult to maintain, and costs you users in the long run. In these cases, A/B testing is not recommended.

Some other things that are regularly tested in A/B tests include the following parts of a web page:

  • The wording, size, and placement of a call-to-action button

  • The headline and product description

  • The length, layout, and fields in a form

  • The overall layout and style of the website, tested as one larger change rather than broken down into pieces

  • The pricing and promotional offers of products

  • The images on the landing page

  • The amount of text on a page

Now that we have an understanding of what A/B testing is and what it can do for us, let’s see what it will take to set up and perform an A/B test.

Conducting an A/B test

In creating an A/B test, we need to decide several things, and then we need to put our plan into action. We’ll walk through those decisions here and create a simple set of web pages that will test the aspects of design that we are interested in changing, based upon the behavior of the user.

Before we start building stuff, though, we need to think through our experiment and what we’ll need to build.

Planning the experiment

For this article, we’re going to pretend that we have a website for selling widgets (or rather, looking at the website Widgets!).

The web page in this screenshot is the control page. Currently, we’re getting 24 percent click-through on it from the Learn more! button.

We’re interested in the text of the button. If it read Order now! instead of Learn more!, it might generate more click-through. (Of course, actually explaining what the product is and what problems it solves might be more effective, but one can’t have everything.) This will be the test page, and we’re hoping that we can increase the click-through rate to 29 percent (an absolute increase of five percentage points).

Now that we have two versions of the page to experiment with, we can frame the experiment statistically and figure out how many subjects we’ll need for each version of the page in order to achieve a statistically meaningful increase in the click-through rate on that button.

Framing the statistics

First, we need to frame our experiment in terms of the null-hypothesis test. In this case, the null hypothesis would look something like this:

Changing the button copy from Learn more! to Order now! would not improve the click-through rate.

Remember, this is the statement that we’re hoping to disprove (or fail to disprove) in the course of this experiment.

Now we need to think about the sample size, which needs to be fixed in advance. To find it, we’ll use a rule of thumb derived from the standard error formula, solved for the number of observations needed for about a 95 percent confidence level. This gets us in the ballpark of how large our sample should be:

n = 16σ² / δ²

In this, δ is the minimum effect to detect and σ² is the sample variance. If we are testing for something like an increase in the click-through rate, the variance is σ² = p(1 − p), where p is the initial click-through rate with the control page.

So for this experiment, the variance will be 0.24 × (1 − 0.24), or 0.1824. This makes the sample size for each variant 16 × (0.1824 / 0.05²), or almost 1,170.

The code to compute this in Clojure is fairly simple:

(defn get-target-sample [rate min-effect]
  (let [v (* rate (- 1.0 rate))]
    (* 16.0 (/ v (* min-effect min-effect)))))

Running the code from the prompt gives us the response that we expect:

user=> (get-target-sample 0.24 0.05)
1167.36

Part of the reason to fix the number of participants in advance is that monitoring the progress of the experiment and stopping it prematurely can invalidate the results of the test: it increases the risk of false positives, where the experiment claims to have disproved the null hypothesis when it really hasn’t.

This seems counterintuitive, doesn’t it? Once we have significant results, we should be able to stop the test. Let’s work through it.

Let’s say that in actuality, there’s no difference between the control page and the test page. That is, both sets of copy for the button get approximately the same click-through rate. If we’re attempting to get p ≤ 0.05, then it means that the test will return a false positive five percent of the time. It will incorrectly say that there is a significant difference between the click-through rates of the two buttons five percent of the time.
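To get a feel for how repeated peeking inflates this error rate, we can treat each interim check as an independent five-percent chance of a false positive. That independence assumption is a simplification (the checks share data), but it illustrates the direction of the effect:

```clojure
;; Probability of at least one false positive across n interim checks,
;; assuming (unrealistically) that each check is an independent test
;; at the given alpha level.
(defn peeking-false-positive-rate [alpha n-checks]
  (- 1.0 (Math/pow (- 1.0 alpha) n-checks)))

;; Checking three times raises the effective false-positive rate
;; from 5 percent to roughly 14 percent.
(peeking-false-positive-rate 0.05 3)
;; => 0.142625
```

The more often you peek and allow yourself to stop, the further the real error rate drifts above the nominal five percent.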

Let’s say that we’re running the test and planning to get 3,000 subjects. We end up checking the results of every 1,000 participants. Let’s break down what might happen:

Run     | A  | B   | C   | D   | E   | F   | G   | H
--------|----|-----|-----|-----|-----|-----|-----|----
1000    | No | No  | No  | No  | Yes | Yes | Yes | Yes
2000    | No | No  | Yes | Yes | No  | Yes | No  | Yes
3000    | No | Yes | No  | Yes | No  | No  | Yes | Yes
Final   | No | Yes | No  | Yes | No  | No  | Yes | Yes
Stopped | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Let’s read this table. Each lettered column represents a scenario for how the significance of the results may change over the run of the test. The rows represent the number of observations that have been made. The row labeled Final represents the experiment’s true finishing result, and the row labeled Stopped represents the result if the experiment is stopped as soon as a significant result is seen.

The final results show us that out of eight different scenarios, the final result would be significant in four cases (B, D, G, and H). However, if the experiment is stopped prematurely, then it will be significant in seven cases (all but A). The test could drastically over-generate false positives.

In fact, most statistical tests assume that the sample size is fixed before the test is run.

It’s exciting to get good results, so we’ll design our system so that we can’t easily stop it prematurely. We’ll just take that temptation away.

With this in mind, let’s consider how we can implement this test.

Building the experiment

There are several options for actually implementing the A/B test. We’ll consider a few of them and weigh their pros and cons. Ultimately, the option that works best for you depends on your circumstances. However, we’ll pick one for this article and use it to implement our test.

Looking at options to build the site

The first way to implement A/B testing is to use a server-side implementation. In this case, all of the processing and tracking is handled on the server, and visitors’ actions would be tracked using GET or POST parameters on the URL for the resource that the experiment is attempting to drive traffic towards.

The steps for this process would go something like the following ones:

  1. A new user visits the site and requests the page that contains the button or copy that is being tested.

  2. The server recognizes that this is a new user and assigns the user a tracking number.

  3. It assigns the user to one of the test groups.

  4. It adds a row in a database that contains the tracking number and the test group that the user is part of.

  5. It returns the page to the user with the copy, image, or design that is reflective of the control or test group.

  6. The user views the returned page and decides whether to click on the button or link or not.

  7. If the server receives a request for the button’s or link’s target, it updates the user’s row in the tracking table to show us that the interaction was a success, that is, that the user did a click-through or made a purchase.

This way of handling it keeps everything on the server, so it allows more control and configuration over exactly how you want to conduct your experiment.
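The server-side flow above can be sketched with a few Clojure functions. This is a minimal illustration under stated assumptions, not production code: the function names are invented for this sketch, and an in-memory atom stands in for the database tracking table.

```clojure
(ns ab-test.server
  (:import [java.util UUID]))

;; In-memory stand-in for the tracking table:
;; tracking-number -> {:group ..., :success ...}
(def observations (atom {}))

(defn assign-group []
  ;; Step 3: randomly assign the user to the control or test group.
  (rand-nth [:control :test]))

(defn start-observation!
  "Steps 2-4: give a new visitor a tracking number, assign a group,
  and record the observation."
  []
  (let [tracking-id (str (UUID/randomUUID))
        group       (assign-group)]
    (swap! observations assoc tracking-id {:group group :success false})
    {:tracking-id tracking-id :group group}))

(defn button-copy
  "Step 5: the copy rendered for the user's group."
  [group]
  (if (= group :control) "Learn more!" "Order now!"))

(defn record-success!
  "Step 7: the server received a request for the button's target,
  so mark the interaction as a success."
  [tracking-id]
  (swap! observations assoc-in [tracking-id :success] true))
```

A real implementation would persist the observations in a database and carry the tracking number in a cookie or URL parameter between requests.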

A second way of implementing this would be to do everything using JavaScript (or ClojureScript, https://github.com/clojure/clojurescript). In this scenario, the code on the page itself randomly decides whether the user belongs to the control or the test group, and it notifies the server that a new observation in the experiment is beginning. It then updates the page with the appropriate copy or image. Most of the rest of the interaction is the same as in the previous scenario. However, the complete steps are as follows:

  1. A new user visits the site and requests the page that contains the button or copy being tested.

  2. The server inserts some JavaScript to handle the A/B test into the page.

  3. As the page is being rendered, the JavaScript library generates a new tracking number for the user.

  4. It assigns the user to one of the test groups.

  5. It renders that page for the group that the user belongs to, which is either the control group or the test group.

  6. It notifies the server of the user’s tracking number and the group.

  7. The server takes this notification and adds a row for the observation in the database.

  8. The JavaScript in the browser tracks the user’s next move either by directly notifying the server using an AJAX call or indirectly using a GET parameter in the URL for the next page.

  9. The server receives the notification whichever way it’s sent and updates the row in the database.

The downside of this is that having JavaScript take care of rendering the experiment might take slightly longer and may throw off the experiment. It’s also slightly more complicated, because there are more parts that have to communicate. However, the benefit is that you can create a JavaScript library, easily throw a small script tag into the page, and immediately have a new A/B experiment running.
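The browser side of this flow can be sketched in ClojureScript. Again, this is only an illustrative sketch: the element id, endpoint path, and function names are assumptions, not a real library’s API.

```clojure
(ns ab-test.client)

;; Steps 3-4: generate a tracking number and pick a group in the browser.
(defn new-observation []
  {:tracking-id (str (random-uuid))
   :group       (rand-nth [:control :test])})

;; Step 5: rewrite the page for the chosen group
;; (assumes the call-to-action button has the id "cta-button").
(defn render! [{:keys [group]}]
  (let [button (.getElementById js/document "cta-button")]
    (set! (.-textContent button)
          (if (= group :control) "Learn more!" "Order now!"))))

;; Steps 6 and 8: notify the server (assumed endpoint) of the new
;; observation, and later of a successful click-through.
(defn notify-server! [path {:keys [tracking-id group]}]
  (js/fetch path
            (clj->js {:method  "POST"
                      :headers {"Content-Type" "application/edn"}
                      :body    (pr-str {:tracking-id tracking-id
                                        :group       group})})))
```

Because the group assignment happens after the page starts rendering, a sketch like this should set the button text as early as possible to avoid a visible flicker between the two variants.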

In reality, though, you’ll probably just use a service that handles this and more for you. However, it still makes sense to understand what such services provide, and that’s what this article tries to do: by helping you understand how to perform an A/B test yourself, it should help you make better use of A/B testing vendors and services.
