A/B testing is another fancy name for controlled experimentation. Controlled experimentation is an age-old method, but the name “A/B split testing” is relatively new. In A/B testing, the “control” and “variation” websites receive equal traffic.
Even though there is a lot of information about A/B tests, many people still do them wrong and test the wrong variables.
What exactly is an A/B test?
A/B testing is a type of experiment in which two or more versions (A and B) are pitted against each other to see which performs better.
“Split tests” are what researchers use to determine whether new drugs are effective. Most research experiments are “split tests,” with a hypothesis, a control, a variation, and a statistically calculated result.
In a straightforward A/B test, for example, the original page and a variant would each receive an equal share of website visitors.
The key distinction for conversion optimization is the unpredictable nature of internet traffic. In a laboratory setting, it is much easier to control all outside factors. There are ways to work around this online, but you can never run a fully controlled test.
In addition, testing new medications calls for an almost unrivaled degree of precision; people’s lives are at risk. That is why researchers extend the exploration period: they cannot afford a Type I error (also called a false positive).
Conducted online, A/B testing has to weigh the business’s overall objectives: scientific rigor vs. business goals, exploration vs. exploitation. Because of this, we look at results differently and draw different conclusions than researchers working in a lab.
There is no limit to the number of variations in an A/B test. Tests with more than two versions are commonly called A/B/n tests. You can test as many versions as you want, as long as you have enough traffic. Here’s an example of an A/B/C/D test and how traffic is split between versions:
In an A/B/n test, traffic is divided equally between the control page and other page variations.
A/B/n tests are an excellent way to test additional variants of the same hypothesis. But because traffic is spread across more pages, they require more visitors.
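To make the traffic split concrete, here is a minimal sketch, in Python with hypothetical names, of how a testing tool might assign visitors evenly and deterministically across any number of variants:

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str, variants: list) -> str:
    """Deterministically bucket a visitor into one variant.

    Hashing (experiment + visitor id) gives each variant an equal share
    of traffic, and the same visitor always sees the same variant.
    """
    digest = hashlib.md5(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# An A/B/C/D test: each version receives roughly 25% of traffic.
print(assign_variant("visitor-42", "homepage-hero", ["A", "B", "C", "D"]))
```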
When should you run a multivariate test?
Suppose you want to understand how changes interact. In a multi-change A/B test, interaction effects can hide the real winner: a strong new headline can go unnoticed because a new hero photo draws attention to a different area of the page. If you want to find out quickly which elements on your page have an impact, run an MVT that toggles each element in and out.
Even though they are the most common, A/B tests are only one kind of online experiment. Multivariate and bandit tests are also options.
What’s the Difference Between A/B Testing, Multivariate Testing, and Bandit Algorithms?
A/B/n tests are controlled experiments that compare one or more modified pages against a baseline (the control). The results compare the conversion rates of the versions, each based on a single change.
Multivariate testing runs multiple versions of a page to determine which page elements have the most significant influence. In other words, multivariate tests are similar to A/B/n tests in that they compare the original against variations; the difference is that each variation combines a different set of design elements.
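To see why multivariate tests demand so much traffic, consider a small sketch (the page elements below are hypothetical) of how a full-factorial MVT multiplies into versions:

```python
from itertools import product

# Three page elements, each with a control and one variant.
headlines = ["Current headline", "Benefit-driven headline"]
hero_images = ["Current photo", "Product-in-use photo"]
cta_labels = ["Sign up", "Start free trial"]

# A full-factorial MVT tests every combination: 2 x 2 x 2 = 8 versions,
# so the same traffic is split eight ways instead of two.
for i, combo in enumerate(product(headlines, hero_images, cta_labels), 1):
    print(f"Version {i}: {combo}")
```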
Each method has its own impact and use case, and you can get the most out of your website by using both. How to do it:
- Use A/B testing to identify which layouts perform best.
- Use multivariate testing to polish layouts and make sure all elements interact smoothly with one another.
Before you can even think about doing multivariate testing, the page you want to look at needs to get a lot of traffic. On the other hand, if you get enough traffic, you should use both types of testing as part of your optimization plan.
Most organizations give A/B testing higher priority because, on average, it tests bigger changes with potentially bigger effects, and it is easier to run. As Peep Laja (founder of CXL & Speero) has remarked, “Most top agencies that I’ve talked to about this perform around 10 A/B tests for each MVT.”
Bandit algorithms are A/B/n tests whose traffic allocation is updated in real time based on how well each variant performs.
In its most basic form, a bandit algorithm starts by directing traffic to two or more distinct pages: the original and the variant(s). The algorithm then automatically and continuously shifts traffic toward whichever variant is “winning.” In the end, it fully exploits the optimal choice.
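For illustration, here is a minimal epsilon-greedy sketch, one of the simplest bandit strategies; the running totals are invented, and production tools typically use more sophisticated approaches such as Thompson sampling:

```python
import random

def choose_variant(conversions, visitors, epsilon=0.1):
    """Pick which variant the next visitor should see.

    With probability epsilon, explore a random variant; otherwise,
    exploit the variant with the best observed conversion rate. Traffic
    therefore drifts toward the winner as evidence accumulates.
    """
    if random.random() < epsilon:
        return random.randrange(len(visitors))
    rates = [c / v if v else 0.0 for c, v in zip(conversions, visitors)]
    return rates.index(max(rates))

# Running totals so far for the original (A) and one variant (B).
conversions, visitors = [30, 45], [1000, 1000]
print("Show variant", "AB"[choose_variant(conversions, visitors)])
```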
Bandit testing has several advantages. One is that it reduces “regret”: the lost conversion opportunity that comes from continuing to send traffic to a variant that turns out to be worse. A figure from Google explains this clearly.
Both A/B/n tests and bandits serve a specific function. In general, bandits are well suited for the following:
- Headlines and limited-duration campaigns
- Scaling automation
- Targeting
- Blending optimization and attribution.
No matter what kind of test you are running, a structured process is essential: it raises your odds of running more tests, winning more tests, and lifting conversion rates.
How can you improve your A/B testing outcomes?
Ignore any blog post that advises you to run “99 A/B tests to do right now to improve conversion rates.” You can make better use of your time and your traffic. Increase your revenue by implementing a process.
74% of optimizers who use a systematic approach say their work has led to increased revenue. Those who don’t remain in what author Craig Sullivan calls the “Trough of Disillusionment.” (That is, unless their findings are riddled with false positives, a topic covered in a later section.)
A winning structure may be broken down as follows:
- Research
- Prioritization
- Experimenting
- Analyzing and practicing what was learned
1. Research: Obtaining Data-Driven Insights
Before you can start optimizing, you need to understand what your users are doing and why they are doing it. However, before you worry about things like optimization and testing, you should first ensure that your high-level plan is sound and then work your way down. Consider the following in this order:
- Establish your company’s goals and priorities.
- Determine the objectives of your website.
- Determine which metrics will serve as your Key Performance Indicators.
- Establish your desired measures of success.
Once you’ve decided where you want to go, you can gather the information you need to get there. To accomplish this, I suggest the following framework.
The following is an overview of the structured procedure that I’ve been practicing for the last 5+ years:
- Heuristic analysis
- Technical analysis
- Web analytics analysis
- Mouse-tracking analysis
- Qualitative survey research
- User testing
- Copy testing
Heuristic analysis comes closest to describing “best practices.” Even after significant time spent experimenting, an optimizer finds some things impossible to predict with certainty. What you can do is locate potential areas of opportunity. In the words of Craig Sullivan:
“My experience in observing and fixing things: These patterns do make me a better diagnostician, but they don’t function as truths—they guide and inform my work, but they don’t provide guarantees.”
Craig Sullivan in the CXL blog
Humility is essential, and structure helps here too. When doing heuristic analysis, we evaluate each page against the factors listed below:
- Relevancy
- Clarity
- Value
- Friction
- Distraction
Technical analysis is an often-overlooked area. Bugs are a significant barrier to conversion; they are conversion killers. Your user experience and site flow might be flawless, but does everything work to the same high standard across all browsers and devices? Almost certainly not.
These bugs are low-hanging fruit that can bring in a lot of money. To begin:
- Run tests across several browsers and devices.
- Perform a site speed analysis.
Web analytics: The examination of web analytics comes next. Check that everything is recording correctly. (You’d be amazed how many analytics setups are broken.)
Google Analytics and other analytics installations are a whole subject on their own, beyond what this section can cover.
Mouse-tracking analysis covers click maps, scroll maps, heat maps, form analytics, and user session replays. Be careful not to get lost in the attractive visualizations; make sure this step contributes to the broader goals you have set.
Qualitative research answers the “why” that quantitative analysis leaves unanswered. Many people assume qualitative research is “easier” or “softer” than quantitative research, but it should be just as rigorous, and it can deliver insights just as important.
When conducting qualitative research, you can employ methods such as:
- On-site surveys
- Customer polls
- Customer interviews and focus groups
The next step is user testing. The principle is straightforward: have real people use and engage with your website while you observe, and ask them to think out loud as they do. Pay close attention to what they say and what they struggle with.
Copy testing shows you how your actual target audience understands your content: what they find clear or confusing, and which arguments they care about or ignore.
After conducting this conversion research, you will have a lot of data. The next step is to prioritize what to test.
2. Prioritization: Which A/B test hypotheses should be prioritized?
Many frameworks are available to help you prioritize your A/B tests, and you could even devise your own. Below is the method Craig Sullivan suggests for prioritizing.
After going through the research steps above, you will have discovered problems, some major and some minor. Place each item in one of the following five buckets:
- Test – This bucket is where you put things you will test.
- Instrument – This may entail adding, fixing, improving, or refining the handling of tags and events in analytics.
- Hypothesize – Items land here when you’ve found a page, widget, or process that isn’t working well but has no obvious solution.
- Fix it now – The bucket for no-brainers. Just get it done.
- Investigate – Items here require further questions or deeper digging before you can act.
Give each issue a rating from one to five stars, with one representing a minor problem and five a very urgent one. When assigning a score, two factors matter more than any others:
- Ease of implementation, taking into account time, complexity, and risk. There will be instances when the data will direct you to construct a feature that will take several months to complete. Do not begin with that.
- Opportunity. Issues are scored on a more subjective scale depending on how significant of a change or lift they may cause.
Make a spreadsheet including all of your information. You will have a testing plan that is ranked in order of importance.
To reduce subjectivity, the folks at CXL developed their own prioritization framework, the PXL model. It forces you to bring data to the table.
PXL framework
Grab a copy of this spreadsheet template for your use right here. To make it your own, go to the “File” menu and select “Make a Copy.”
Instead of letting you speculate about impact, this framework asks a series of questions about each proposed change:
- Is the change visible above the fold? More people notice changes that appear above the fold, so those changes are more likely to have an effect.
- Is the change noticeable within five seconds? Show the control to a group of people, then the variant(s). Can they tell the difference after five seconds? If not, the change is likely to have less impact.
- Does it add or remove anything? Bigger changes, such as removing distractions or adding key information, tend to have a bigger impact.
- Does the test run on a high-traffic page? Changing a high-traffic page yields a greater return on the investment.
You will need data on a wide variety of prospective test variables to rank your hypotheses. Weekly discussions with your team and cross-functional teams can help you prioritize tests based on data rather than opinion:
- Does it address an issue uncovered through user testing?
- Does it address an issue identified through qualitative feedback (surveys, polls, interviews)?
- Is there evidence to back up the notion, such as eye tracking, heat maps, or mouse tracking?
- Does it make use of the insights discovered through digital analytics?
We also bounded ease of implementation by bracketing answers based on estimated implementation time. Ideally, a test developer participates in the prioritization discussions.
Grading PXL
PXL uses a binary scale: for most variables, you enter either a one or a zero (unless noted otherwise).
Some variables deserve extra weight based on relevance: how noticeable the change is, whether something is added or removed, and ease of implementation. For these, the scoring shifts; for the noticeability-of-change variable, for example, you enter either a two or a zero.
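As a rough illustration of how such a scoresheet adds up, here is a sketch; the variable names and weights are hypothetical stand-ins, and the real PXL spreadsheet defines its own:

```python
# Hypothetical PXL-style scoring: most variables are binary (0 or 1),
# while a few weighted variables count 2 or 0, as described above.
WEIGHTED = {"noticeable_in_5s", "adds_or_removes_element"}

def pxl_score(answers: dict) -> int:
    return sum((2 if var in WEIGHTED else 1) if yes else 0
               for var, yes in answers.items())

hypothesis = {
    "above_the_fold": True,
    "noticeable_in_5s": True,          # weighted: counts double
    "adds_or_removes_element": False,  # weighted: counts double
    "high_traffic_page": True,
    "backed_by_user_testing": True,
    "backed_by_analytics": False,
}
print(pxl_score(hypothesis))  # higher-scoring hypotheses get tested first
```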
Customizability
The PXL model was built on the philosophy that you can, and should, adjust the variables to what matters for your company.
For instance, if you’re working with a team focused on user experience or branding, hypotheses may need to conform to brand requirements. Include that as a variable.
Perhaps you work for a company that relies heavily on search engine optimization (SEO) to drive customer acquisition, and your funding depends on consistent SEO. Adding a category such as “doesn’t interfere with SEO” can change the score of several headline or copy tests.
Every organization operates under different assumptions. Personalizing the template to reflect them will make your optimization process run more smoothly. Whatever structure you choose, make sure it is systematic and understood by the whole team as well as any stakeholders.
3. Experimentation: How long should an A/B test run?
First of all: do not stop a test the moment it reaches statistical significance. Stopping early is probably the most common mistake new optimizers make.
If you stop testing as soon as significance is reached, you will discover that most lifts fail to translate into increased revenue (which, after all, is the purpose). In reality, the “lifts” never existed.
Consider what happened when one thousand A/A tests (two identical pages tested against each other) were run:
- 771 out of 1,000 tests reached 90% significance at some point.
- 531 out of 1,000 tests reached 95% significance at some point.
If you stop your test the moment it hits significance, you risk false positives and ignore threats to external validity, such as seasonality.
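You can demonstrate the peeking problem with a quick simulation sketch: both “variants” below are identical (a true A/A test), yet checking a z-test after every batch of visitors still declares plenty of “winners”:

```python
import random

def aa_test_with_peeking(n_visitors=20000, cr=0.03, batch=500, z_crit=1.96):
    """Return True if an A/A test ever looks 'significant' while peeking."""
    conv, n = [0, 0], [0, 0]
    for _ in range(n_visitors // batch):
        for arm in (0, 1):  # both arms have the SAME true conversion rate
            conv[arm] += sum(random.random() < cr for _ in range(batch // 2))
            n[arm] += batch // 2
        pooled = (conv[0] + conv[1]) / (n[0] + n[1])
        se = (pooled * (1 - pooled) * (1 / n[0] + 1 / n[1])) ** 0.5
        if se and abs(conv[0] / n[0] - conv[1] / n[1]) / se > z_crit:
            return True  # peeked, saw "significance," stopped the test
    return False

hits = sum(aa_test_with_peeking() for _ in range(200))
print(f"{hits}/200 identical-page tests reached 95% significance at some point")
```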
Instead, determine a sample size in advance and run the experiment for a minimum of two complete business cycles.
How is the size of the sample determined in advance? There are several excellent tools available. With the help of Evan Miller’s tool, here is how you would compute the size of your sample:
In this instance, we told the tool that our baseline conversion rate is 3% and that we want to detect a relative uplift of at least 10%. The tool says we need 51,486 visitors per variation before we can even look at statistical significance.
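If you’d rather compute the sample size in code, here is a sketch using statsmodels; its formula differs slightly from Evan Miller’s tool, so expect a close but not identical figure:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03                      # current conversion rate: 3%
uplift = 0.10                        # minimum detectable effect: 10% relative
effect = proportion_effectsize(baseline, baseline * (1 + uplift))

n_per_variation = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,                      # 95% significance level
    power=0.80,                      # 80% statistical power
    alternative="two-sided",
)
print(f"~{n_per_variation:,.0f} visitors per variation")
```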
In addition to the significance level, there is also statistical power. Greater power makes Type II errors (“false negatives”) less likely; put another way, it increases the likelihood that you will detect an effect when one actually exists.
For practical purposes, know that 80% power is the benchmark for A/B testing tools. Reaching it requires a large sample size, a large effect size, or a longer-running test.
There is no magic number for sample size. Plenty of blog posts use mystical figures as stopping points, such as 1,000 visitors or 100 conversions. But math isn’t magic. Math is math, and the problem we’re solving here is more difficult than such simple heuristics allow. To paraphrase Andrew Anderson of Malwarebytes’ excellent explanation:
It is never about how many conversions. It’s about having enough data to validate based on representative samples and expected behavior.
One hundred conversions are possible in only the most remote cases and with an incredibly high delta in behavior, but only if other requirements like behavior over time, consistency, and normal distribution occur. Even then, there is a high chance of a Type I error, a false positive.
Andrew Anderson from Malwarebytes
We need a sample that represents the whole. How do we get one? Run the test across two full business cycles to account for the following external factors:
- Day of the week: daily traffic can vary considerably depending on the weekday.
- Traffic sources (unless you want to personalize the experience for a specific source).
- Your blog post and newsletter publishing schedule.
- Returning visitors: people may look at your site, think the purchase over, and come back to buy ten days later.
- External events. Paydays that fall in the middle of the month might impact spending, for instance.
Be cautious with relatively small sample sizes. The internet is rife with case studies built on sloppy math; look closely at the numbers and you’ll find publishers declaring winning variants on the basis of 12 to 22 conversions per 100 visits.
Once everything is in place, resist looking at the test results early, and don’t let your boss peek either. It’s tempting to predict the outcome early by “identifying a pattern” (you can’t). Many test results simply regress to the mean.
Regression to the mean
During the first few days of a test, the results can swing wildly for no clear reason. As the test runs longer, the numbers settle down toward their true values. Look at the ecommerce example below:
- For the first couple of days, Blue (variant #3) pulls into a significant lead, earning almost $16 per visitor compared to $12.50 for the control. Many people would end the test here, and they would be wrong.
- After seven days, Blue is still in the lead, and the margin of victory is considerable.
- After 14 days: Orange (#4) is winning!
- After 21 days, Orange has maintained its lead!
- Nothing changed in the end.
If the test had been stopped before four weeks, the conclusion would have been wrong.
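A quick simulation sketch makes the point: the two variants below have the same true revenue per visitor (about $14), yet their early cumulative averages diverge by chance before regressing to the mean (the traffic and order values are invented):

```python
import random

random.seed(7)
revenue, visitors = [0.0, 0.0], [0, 0]

for day in range(1, 29):
    for arm in (0, 1):                 # identical true behavior in both arms
        for _ in range(100):           # 100 visitors per variant per day
            if random.random() < 0.03: # 3% of visitors buy...
                revenue[arm] += random.uniform(100, 833)  # ...varying amounts
            visitors[arm] += 1
    if day in (2, 7, 14, 21, 28):
        a, b = (revenue[i] / visitors[i] for i in (0, 1))
        print(f"Day {day:2}: A=${a:.2f} per visitor, B=${b:.2f} per visitor")
```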
The phenomenon known as the novelty effect is also relevant: the unfamiliarity of a change (a bigger blue button, for example) draws extra attention to the variant. As the novelty wears off over time, the boost fades.
It’s just one of the numerous challenges that come with doing A/B testing.
Can numerous A/B tests be performed at the same time?
You want to increase the pace of your testing so you can run more tests (this is known as high-tempo testing). But is it possible to run many A/B tests simultaneously? Will it improve your prospects for growth, or will it contaminate your data?
Some professionals recommend against running numerous tests simultaneously; others believe it’s fine. In most circumstances, running many tests at once will be fine. Severe interactions are rare.
The advantages of testing volume will generally exceed the noise in your data and the odd false positive, provided that you are not testing critical items (for example, anything that affects your business model or the firm’s future).
If there is a good chance that different tests may affect each other, run fewer tests at once and give each one more time to finish. This will give you more accurate results.
How do you set up an A/B test?
Once you have a prioritized list of test ideas, you can start forming hypotheses and running experiments. A hypothesis is a statement that explains why you think something is happening. A strong hypothesis is:
- Testable. Because a hypothesis is measurable, it can be examined and evaluated.
- Aimed at solving a conversion problem. Split testing the hypothesis should address a specific conversion issue.
- A source of market insights. With a well-articulated hypothesis, your split-testing findings teach you about your customers whether the test “wins” or “loses.”
Craig Sullivan’s hypothesis kit makes the approach more accessible. The simple version:
- Since we saw (data and feedback),
- We anticipate that (change) will result in (impact).
- This is going to be measured using (data metric).
And the advanced version:
- Because we saw (qualitative and quantitative data),
- We anticipate that (change) for (population) will result in (impact[s]).
- We expect to see (data metric changes) over a period of (X business cycles).
A/B testing tool selection
The most exciting part is when you finally get to the point where you can choose a tool.
Even though this is the first item many people think about, it’s not the most significant. A solid understanding of strategy and statistics comes first.
It is crucial to be aware of a few key distinctions. Testing tools can be divided into two broad categories: those that operate on the client side and those that operate on the server side.
Server-side tools render the test variations on the web server; nothing is modified in the visitor’s browser to show the randomly assigned version. Client-side tools send the same page to everyone and use JavaScript in the visitor’s browser to change how the original or the variation looks.
Optimizely, VWO, and Adobe Target are some examples of client-side testing tools that optimizers can use. Conductrics can do both functions, while SiteSpect offers a server-side proxy solution.
What does all of this mean for you? Client-side tools get you up and running faster, which helps if you want to save time upfront, especially if your team is small or short on development resources. Server-side testing demands more development resources, but it’s often more robust.
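To get a feel for the server-side approach, here is a framework-agnostic sketch; the handler, template names, and logging are hypothetical stand-ins:

```python
import hashlib

TEMPLATES = {"control": "pricing_a.html", "variant": "pricing_b.html"}

def render_pricing_page(visitor_id: str) -> str:
    """Assign the variant on the server and send back the finished page.

    The browser receives ordinary HTML; no JavaScript rewriting happens
    on the client, which is what makes this "server-side" testing.
    """
    bucket = int(hashlib.md5(visitor_id.encode()).hexdigest(), 16) % 2
    variant = "control" if bucket == 0 else "variant"
    log_exposure(visitor_id, variant)    # record who saw which version
    return render_template(TEMPLATES[variant])

def log_exposure(visitor_id: str, variant: str) -> None:
    print(f"{visitor_id} -> {variant}")  # stand-in for real event logging

def render_template(name: str) -> str:
    return f"<html><!-- rendered from {name} --></html>"  # stub renderer

print(render_pricing_page("visitor-42"))
```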
Even though setting up tests can be very different depending on the tool you use, it is usually as easy as signing up for your preferred tool and following the instructions they give, such as adding a bit of JavaScript to your website.
Beyond that, you will need to define your goals (so the tool knows when a conversion has happened). Your testing tool will then track the percentage of visitors who convert for each variant.
When setting up A/B tests, knowledge of HTML, CSS, and JavaScript/jQuery is beneficial, along with the design and copywriting skills needed to build variations. Some tools let you use a visual editor instead, but visual editors limit what you can change and give you less control.
4. Analysis: How to examine A/B test results
Alright. You prepared thoroughly, set the test up properly, and let it run its course. Now you move on to the analysis, which is more challenging than a quick look at the graph your testing tool generates.
One thing you should always do is analyze your test results in Google Analytics as well. It not only improves your analysis but also gives you greater confidence in the conclusions you draw and the decisions you make.
Your testing tool may not be recording data accurately. Without an alternative source for your test data, you can’t know whether to believe it, so establish a variety of data sources.
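As one example of such a cross-check, here is a sketch of a two-proportion z-test on made-up results using statsmodels:

```python
from statsmodels.stats.proportion import proportions_ztest

# Illustrative numbers: conversions and visitors for control vs. variant.
conversions = [1580, 1720]
visitors = [51486, 51486]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant at the 95% level")
else:
    print("No significant difference; don't call a winner yet")
```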
What if there is no discernible difference between the variants? Take your time, and don’t rush to conclusions. First, be aware of two things:
Your hypothesis may have been correct even though this particular implementation of it failed.
- Say your qualitative research indicates that people are worried about security and safety. How many ways are there to make visitors feel secure on the website? Unlimited.
- Testing is an iterative game. If you think you were on to something, try a few more iterations.
Even if there was no difference overall, the variation might still have won in one or two segments.
- Suppose the variation lifted returning visitors and mobile visitors but hurt new visitors and desktop users. These groups can balance each other out, giving the impression of “no difference.”
- To investigate that possibility, analyze your test across all crucial segments.
Segmenting A/B test data
When it comes to learning from A/B testing, segmentation is essential. Even if B comes in behind A in the overall rankings, it could beat A in some categories (organic, Facebook, mobile, etc.).
There are many segments you can investigate. The following are some of the options Optimizely lists:
- Browser type;
- Traffic source;
- Device: mobile versus desktop;
- Logged-in versus logged-out visitors;
- PPC/SEM campaign;
- Geographical region (city, state/province, country);
- First-time buyers versus repeat customers;
- Power users versus occasional visitors;
- Men versus women;
- Age range;
- New versus existing leads;
- Plan types or loyalty-program membership tiers;
- Current subscribers, prospective customers, and former customers;
- Roles (if your site serves, for instance, both buyers and sellers)
At the very least, if you have a sufficient sample size, examine the following segments:
- Desktop versus tablet and mobile;
- First-time versus repeat visitors;
- Traffic that arrives directly on the page versus traffic that comes from other pages on the site
Check that you have a sufficient sample in each segment. Do the math ahead of time, and be wary if a segment has fewer than 250–350 conversions per variation.
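Here is a small sketch of that sanity check, flagging segments whose conversion counts fall below the threshold (the segment data is illustrative):

```python
# Per-segment totals: (visitors, conversions) for control and variant.
segments = {
    "desktop":   {"control": (30000, 960), "variant": (30000, 955)},
    "mobile":    {"control": (15000, 380), "variant": (15000, 450)},
    "returning": {"control": (6000, 210),  "variant": (6000, 260)},
}

MIN_CONVERSIONS = 250   # be wary below roughly 250-350 per variation

for name, data in segments.items():
    (cn, cc), (vn, vc) = data["control"], data["variant"]
    lift = (vc / vn) / (cc / cn) - 1
    trusted = cc >= MIN_CONVERSIONS and vc >= MIN_CONVERSIONS
    note = "" if trusted else "  (sample too small; don't trust this yet)"
    print(f"{name:9} lift: {lift:+.1%}{note}")
```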
If your treatment worked for a specific user group, it’s time to develop a more personalized approach for those customers.
How should you store data from previous A/B tests?
A/B testing isn’t only about lifts, wins, losses, and testing random shit. Learning from statistically reliable A/B tests serves the larger aims of growth and optimization; as Matt Gershoff put it, optimization is about “collecting knowledge to inform decisions.”
The smartest businesses keep track of test results and organize their testing plans coherently. A structured approach to optimization produces faster growth and is less likely to get stuck at a local maximum.
Now we come to the tricky part: no single strategy for organizing your knowledge management beats all others. Some businesses use Excel and Trello; others use more sophisticated tools built in-house or bought from third parties.
If it’s of any use, here are three tools that leading firms developed expressly for managing conversion optimization projects:
- Iridion (Now part of KonversionsKraft)
- Effective Experiments;
- Projects by Growth Hackers
I personally use Airtable for all my projects.
Keep communication lines open, not only between departments but also with executives. A/B test findings are not always self-evident to the average person; visualizations help.
Tools for A/B testing
There are many different tools available for doing experiments online. The following are examples of some of the most widely used A/B testing tools:
- Optimizely
- VWO
- Adobe Target
- Maxymiser
- Conductrics
- Google Optimize
A/B testing calculators:
- AB Test Calculator by CXL
- A/B Split Test Significance Calculator by VWO
- A/B Split and Multivariate Test Duration Calculator by VWO
- Evan Miller’s Sample Size Calculator.
Conclusion
A/B testing is an extremely helpful resource for anyone making decisions online.
Arm yourself with some basic knowledge and put in the effort, and you can lessen the impact of many of the challenges that most newbie optimizers confront.
If you take the time to truly dig into the material provided here, you will be one step ahead of 90 percent of the people running tests. That is an excellent position to be in if you believe in A/B testing’s potential to drive a sustained increase in sales.
Knowledge is the limiting factor, and it is overcome only through experience and repeated practice.
So get testing!