Tag Archives: a/b testing

A/B Testing: Why do different sample size calculators and testing platforms produce different estimates of statistical significance?

A/B testing is a powerful way to increase conversion (e.g., 638% more leads, 78% more conversion on a product page, etc.).

Its strength lies in its predictive ability. When you implement the alternate version suggested by the test, your conversion funnel actually performs the way the test indicated that it would.

To help determine that, you want to ensure you’re running valid tests. And before you decide to implement related changes, you want to ensure your test is conclusive and not just a result of random chance. One important element of a conclusive test is that the results show a statistically significant difference between the control and the treatment.

Many platforms will include something like a “statistical significance status” with your results to help you determine this. There are also several sample size calculators available online, and different calculators may suggest you need different sample sizes for your test.

But what do those numbers really mean? We’ll explore that topic in this MarketingExperiments article.

A word of caution for marketing and advertising creatives: This article includes several paragraphs that talk about statistics in a mathy way — and even contains a mathematical equation (in case these may pose a trigger risk for you). Even so, we’ve done our best to use them only where they serve to clarify rather than complicate.

Why does statistical significance matter?

To set the stage for talking about sample size and statistical significance, it’s worth mentioning a few words about the nature and purpose of testing (aka inferential experimentation) and the nomenclature we’ll use.

We test in order to infer some important characteristics about a whole population by observing a small subset of members from the population called a “Sample.”

MECLABS metatheory dubs a test that successfully accomplishes this purpose a “Useful” test.

The Usefulness (predictiveness) of a test is affected by two key features: “Validity” and “Conclusiveness.”

Statistical significance is one factor that helps to determine if a test is useful. A useful test is one that can be trusted to accurately reflect how the “system” will perform under real-world conditions.

Having an insufficient sample size presents a validity threat known as Sample Distortion Effect. This is a danger because if you don’t get a large enough sample size, any apparent performance differences may have been due to random variation and not true insights into your customers’ behavior. This could give you false confidence that a landing page change that you tested will improve your results if you implement it, when it actually won’t.

“Seemingly unlikely things DO sometimes happen, purely ‘by coincidence’ (aka due to random variation). Statistical methods help us to distinguish between valuable insights and worthless superstitions,” said Bob Kemper, Executive Director, Infrastructure Support Services at MECLABS Institute.

“By our very nature, humans are instinctively programmed to seek out and recognize patterns: think ‘Hmm, did you notice that the last five people who ate those purplish berries down by the river died the next day?’” he said.

A conclusive test is a valid test (there are other validity threats in addition to sample distortion effect) that has reached a desired Level of Confidence, or LoC (95% is the most commonly used standard).

In practice, at 95% LoC, the 95% confidence interval for the difference between control and treatment rates of the key performance indicator (KPI) does not include zero.

A simple way to think of this is that a conclusive test means you are 95% confident the treatment will perform at least as well as the control on the primary KPI. So the performance you’ll actually get, once the treatment is in production for all traffic, should fall somewhere inside that confidence interval. Determining level of confidence requires some math.
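To make this concrete, here is a minimal sketch (in Python) of how a confidence interval for the difference between two conversion rates can be computed with the normal approximation. The function name and the conversion counts are hypothetical, and testing platforms may use different methods under the hood.

```python
from statistics import NormalDist

def diff_confidence_interval(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Confidence interval for the difference in conversion rates
    (treatment minus control), using the normal approximation."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = ((p_c * (1 - p_c)) / n_c + (p_t * (1 - p_t)) / n_t) ** 0.5
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

# Hypothetical counts: 400 conversions out of 10,000 visitors for the control,
# 460 out of 10,000 for the treatment.
low, high = diff_confidence_interval(400, 10_000, 460, 10_000)
print(f"95% CI for the difference: [{low:.4f}, {high:.4f}]")
print("Conclusive at 95% LoC" if low > 0 or high < 0 else "Not conclusive yet")
```

If the whole interval sits above zero, the treatment is the likely winner; if it sits below zero, the control is; if it straddles zero, the test is not yet conclusive at that Level of Confidence.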

Why do different testing platforms and related tools offer such disparate estimates of required sample size? 

One of MECLABS Institute’s Research Partners who is president of an internet company recently asked our analysts about this topic. His team found a sample size calculator tool online from a reputable company and noticed how different its estimate of minimum sample size was compared to the internal tool MECLABS analysts use when working with Research Partners (MECLABS is the parent research organization of MarketingExperiments).

The simple answer is that the two tools approach the estimation problem using different assumptions and statistical models, much the way there are several competing models for predicting the path of hurricanes and tropical storms.

Living in Jacksonville, Florida, an area that is often under hurricane threats, I can tell you there’s been much debate over which among the several competing models is most accurate (and now there’s even a newer, Next Gen model). Similarly, there is debate in the optimization testing world about which statistical models are best.

The goal of this article isn’t to take sides, just to give you a closer look at why different tools produce different estimates. It’s not that the math is “wrong” in any of them; they simply employ different approaches.

“While the underlying philosophies supporting each differ, and they approach empirical inference in subtly different ways, both can be used profitably in marketing experimentation,” said Danitza Dragovic, Digital Optimization Specialist at MECLABS Institute.

In this case, in seeking to understand the business implications of test duration and confidence in results, it was understandably confusing for our Research Partner to see different sample size calculations depending on the tool used. It wasn’t clear that some calculators treat a predetermined sample size as fundamental to the test, while other platforms ultimately determine test results irrespective of any predetermined sample size, using prior probabilities assigned by the platform, and offer sample size calculators simply as a planning tool.

Let’s take a closer look at each …

Classical statistics 

The MECLABS Test Protocol employs a group of statistical methods based on the “Z-test,” arising from “classical statistics” principles that adopt a Frequentist approach, which makes predictions using only data from the current experiment.

With this method, recent traffic and performance levels are used to compute a single fixed minimum sample size before launching the test.  Status checks are made to detect any potential test setup or instrumentation problems, but LoC (level of confidence) is not computed until the test has reached the pre-established minimum sample size.

While it has been the most commonly used approach in scientific and academic experimental research for the last century, this classical approach is now meeting theoretical and practical competition from tools that use (or incorporate) a different statistical school of thought based upon the principles of Bayesian probability theory. Though Bayesian theory is far from new (Thomas Bayes proposed its foundations more than 250 years ago), its practical application to real-time optimization research required computational speed and capacity that has only recently become available.

Breaking Tradition: Toward optimization breakthroughs

“Among the criticisms of the traditional frequentist approach has been its counterintuitive ‘negative inference’ approach and thought process, accompanied by a correspondingly ‘backwards’ nomenclature. For instance, you don’t ‘prove your hypothesis’ (like normal people), but instead you ‘fail to reject your Null hypothesis’ — I mean, who talks (or thinks) like that?” Kemper said.

He continued, “While Bayesian probability is not without its own weird lexical contrivances (Can you say ‘posterior predictive’?), its inferential frame of reference is more consistent with the way most people naturally think, like assigning the ’probability of a hypothesis being True’ based on your past experience with such things. For a purist Frequentist, it’s impolite (indeed sacrilegious) to go into a test with a preconceived ‘favorite’ or ‘preferred answer.’ One must simply objectively conduct the test and ‘see what the data says.’ As a consequence, the statement of the findings from a typical Bayesian test — i.e., a Bayesian inference — is much more satisfying to a non-specialist in science or statistics than is an equivalent traditional/frequentist one.”

Hybrid approaches

Some platforms use a sequential likelihood ratio test that combines a Frequentist approach with a Bayesian approach. The adjective “sequential” refers to the approach’s continual recalculation of the minimum sample size for sufficiency as new data arrives, with the goal of minimizing the likelihood of a false positive arising from stopping data collection too soon.

Although an online test estimator using this method may give a rough sample size, this method was specifically designed to avoid having to rely on a predetermined sample size, or predetermined minimum effect size. Instead, the test is monitored, and the tool indicates at what point you can be confident in the results.

In many cases, this approach may result in shorter tests due to unexpectedly high effect sizes. But when tools employ proprietary methodologies, the way that minimum sample size is ultimately determined may be opaque to the marketer.

Considerations for each of these approaches

Classical “static” approaches

Classical statistical tests, such as Z-tests, are the de facto standard across a broad spectrum of industries and disciplines, including academia. They arise from the concepts of normal distribution (think bell curve) and probability theory described by mathematicians Abraham de Moivre and Carl Friedrich Gauss in the 18th and 19th centuries. (Normal distribution is also known as Gaussian distribution.) Z-tests are commonly used in medical and social science research.

They require you to estimate the minimum detectable effect size before launching the test and then refrain from “peeking at” Level of Confidence until the corresponding minimum sample size is reached. For example, the MECLABS Sample Size Estimation Tool used with Research Partners requires that our analysts make pre-test estimates of:

  • The projected success rate — for example, conversion rate, clickthrough rate (CTR), etc.
  • The minimum relative difference you wish to detect — how big a difference is needed to make the test worth conducting? The greater this “effect size,” the fewer samples are needed to confidently assert that there is, in fact, an actual difference between the treatments. Of course, the smaller the design’s “minimum detectable difference,” the harder it is to achieve that threshold.
  • The statistical significance level — this is the probability of accidentally concluding there is a difference due to sampling error when really there is no difference (aka Type-I error). MECLABS recommends a five percent statistical significance level, which equates to a 95% desired Level of Confidence (LoC).
  • The arrival rate in terms of total arrivals per day — this would be your total estimated traffic level if you’re testing landing pages. “For example, if the element being tested is a page in your ecommerce lower funnel (shopping cart), then the ‘arrival rate’ would be the total number of visitors who click the ‘My Cart’ or ‘Buy Now’ button, entering the shopping cart section of the sales funnel and who will experience either the control or an experimental treatment of your test,” Kemper said.
  • The number of primary treatments — for example, this would be two if you’re running an A/B test with a control and one experimental treatment.

Typically, analysts draw upon a forensic data analysis conducted at the outset combined with test results measured throughout the Research Partnership to arrive at these inputs.

“Dynamic” approaches 

Dynamic, or “adaptive” sampling approaches, such as the sequential likelihood ratio test, are a more recent development and tend to incorporate methods beyond those recognized by classical statistics.

In part, these methods weren’t introduced sooner due to technical limitations. Because adaptive sampling employs frequent computational reassessment of sample size sufficiency and may even adjust the balance of incoming traffic among treatments, these methods were impractical until they could be hosted on machines with the computing capacity to keep up.

One potential benefit can be shorter test duration. “Under certain circumstances (for example, when actual treatment performance is very different from test-design assumptions), tests may be able to be significantly foreshortened, especially when actual treatment effects are very large,” Kemper said.

This is where prior data is so important to this approach. The model can shorten test duration specifically because it takes prior data into account. An attendant limitation is that it can be difficult to identify what prior data is used and exactly how statistical significance is calculated. This doesn’t necessarily make the math any less sound or valid; it just makes it somewhat less transparent. And the quality and applicability of the priors can be critical to the accuracy of the outcome.

As Georgi Z. Georgiev explains in Issues with Current Bayesian Approaches to A/B Testing in Conversion Rate Optimization, “An end user would be left to wonder: what prior exactly is used in the calculations? Does it concentrate probability mass around a certain point? How informative exactly is it and what weight does it have over the observed data from a particular test? How robust with regards to the data and the resulting posterior is it? Without answers to these and other questions an end user might have a hard time interpreting results.”

As with other things unique to a specific platform, it also impinges on the portability of the data, as Georgiev explains:

A practitioner who wants to do that [compare results of different tests run on different platforms] will find himself in a situation where it cannot really be done, since a test ran on one platform and ended with a given value of a statistic of interest cannot be compared to another test with the same value of a statistic of interest ran on another platform, due to the different priors involved. This makes sharing of knowledge between practitioners of such platforms significantly more difficult, if not impossible since the priors might not be known to the user.

Interpreting MECLABS (classical approach) test duration estimates 

At MECLABS, the estimated minimum required sample size for most experiments conducted with Research Partners is calculated using classical statistics. For example, the formula for computing the number of samples needed for two proportions that are evenly split (uneven splits use a different and slightly more complicated formula) is derived from the relationship between the minimum detectable difference and its standard error:

δ = z · √( 2 · p(1 − p) / n )

Solving for n yields:

n = 2 · z² · p(1 − p) / δ²

Variables:

  • n: the minimum number of samples required per treatment
  • z: the Z statistic value corresponding with the desired Level of Confidence
  • p: the pooled success proportion — a value between 0 and 1 (i.e., of clicks, conversions, etc.)
  • δ: the difference of success proportions among the treatments

This formula is used for tests that have an even split among treatments.

Once “samples per treatment” (n) has been calculated, it is multiplied by the number of primary treatments being tested to estimate the minimum number of total samples required to detect the specified amount of “treatment effect” (performance lift) with at least the specified Level of Confidence, presuming the selection of test subjects is random.

The estimated test duration, typically expressed in days, is then calculated by dividing the required total sample size by the expected average traffic level, expressed as visitors per day arriving at the test.
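As an illustration of how these pieces fit together, here is a minimal sketch in Python that follows the even-split formula and the duration calculation described above. It is not the MECLABS Sample Size Estimation Tool itself: the function name and example inputs are made up, and because it uses only a confidence level (with no separate statistical-power term), calculators that also account for power will return larger numbers.

```python
import math
from statistics import NormalDist

def estimate_sample_size_and_duration(
    baseline_rate,          # projected success rate of the control, e.g., 0.04 for 4%
    min_relative_diff,      # minimum relative difference to detect, e.g., 0.10 for 10%
    arrivals_per_day,       # total arrivals per day entering the test
    num_treatments=2,       # control plus experimental treatments
    confidence=0.95,        # desired Level of Confidence
):
    treatment_rate = baseline_rate * (1 + min_relative_diff)
    pooled = (baseline_rate + treatment_rate) / 2        # pooled success proportion p
    delta = treatment_rate - baseline_rate               # minimum detectable difference
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # two-sided z for the chosen LoC

    n_per_treatment = math.ceil(2 * z**2 * pooled * (1 - pooled) / delta**2)
    total_samples = n_per_treatment * num_treatments
    duration_days = math.ceil(total_samples / arrivals_per_day)
    return n_per_treatment, total_samples, duration_days

# Example: 4% baseline conversion rate, detect a 10% relative lift,
# 2,000 arrivals per day, one control and one treatment, 95% LoC.
print(estimate_sample_size_and_duration(0.04, 0.10, 2000))
```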

Finding your way 

“As a marketer using experimentation to optimize your organization’s sales performance, you will find your own style and your own way to your destination,” Kemper said.

“Like travel, the path you choose depends on a variety of factors, including your skills, your priorities and your budget. Getting over the mountains, you might choose to climb, bike, drive or fly; and there are products and service providers who can assist you with each,” he advised.

Understanding sampling method and minimum required sample size will help you to choose the best path for your organization. This article is intended to provide a starting point. Take a look at the links to related articles below for further research on sample sizes in particular and testing in general.

Related Resources

17 charts and tools have helped capture more than $500 million in (carefully measured) test wins

MECLABS Institute Online Testing on-demand certification course

Marketing Optimization: How To Determine The Proper Sample Size

A/B Testing: Working With A Very Small Sample Size Is Difficult, But Not Impossible

A/B Testing: Split Tests Are Meaningless Without The Proper Sample Size

Two Factors that Affect the Validity of Your Test Estimation

Frequentist A/B test (good basic overview by Ethen Liu)

Bayesian vs Frequentist A/B Testing – What’s the Difference? (by Alex Birkett on ConversionXL)

Thinking about A/B Testing for Your Client? Read This First. (by Emīls Vēveris on Shopify)

On the scalability of statistical procedures: why the p-value bashers just don’t get it. (by Jeff Leek on SimplyStats)

Bayesian vs Frequentist Statistics (by Leonid Pekelis on Optimizely Blog)

Statistics for the Internet Age: The Story Behind Optimizely’s New Stats Engine (by Leonid Pekelis on Optimizely Blog)

Issues with Current Bayesian Approaches to A/B Testing in Conversion Rate Optimization (by Georgi Z. Georgiev on Analytics-Toolkit.com)

 


A/B Testing Prioritization: The surprising ROI impact of test order

I want everything. And I want it now.

I’m sure you do, too.

But let me tell you about my marketing department. Resources aren’t infinite. I can’t do everything right away. I need to focus myself and my team on the right things.

Unless you found a genie in a bottle and wished for an infinite marketing budget (right after you wished for unlimited wishes, natch), I’m guessing you’re in the same boat.

When it comes to your conversion rate optimization program, it means running the most impactful tests. As Stephen Walsh said when he wrote about 19 possible A/B tests for your website on Neil Patel’s blog, “testing every random aspect of your website can often be counter-productive.”

Of course, you probably already know that. What may surprise you is this …

It’s not enough to run the right tests; you will get a higher ROI if you run them in the right order

To help you discover the optimal testing sequence for your marketing department, we’ve created the free MECLABS Institute Test Planning Scenario Tool (MECLABS is the parent research organization of MarketingExperiments).

Let’s look at a few example scenarios.

Scenario #1: Level of effort and level of impact

Tests will have different levels of effort to run. For example, it’s easier to make a simple copy change to a headline than to change a shopping cart.

This level of effort (LOE) sometimes correlates to the level of impact the test will have to your bottom line. For example, a radical redesign might be a higher LOE to launch, but it will also likely produce a higher lift than a simple, small change.

So how does the order in which you run a high-effort, high-return test and a low-effort, low-return test affect results? Again, we’re not saying choose one test over another. We’re simply talking about timing. To the test planning scenario tool …

Test 1 (Low LOE, low level of impact)

  • Business impact — 15% more revenue than the control
  • Build Time — 2 weeks

Test 2 (High LOE, high level of impact)

  • Business impact — 47% more revenue than the control
  • Build Time — 6 weeks

Let’s look at the revenue impact over a six-month period. According to the test planning tool, if the control is generating $30,000 in revenue per month, running a test where the treatment has a low LOE and a low level of impact (Test 1) first will generate $22,800 more revenue than running a test where the treatment has a high LOE and a high level of impact (Test 2) first.

Scenario #2: An even larger discrepancy in the level of impact

It can be hard to predict the exact level of business impact. So what if the business impact differential between the higher LOE test is even greater than in Scenario #1, and both treatments perform even better than they did in Scenario #1? How would test sequence affect results in that case?

Let’s run the numbers in the Test Planning Scenario Tool.

Test 1 (Low LOE, low level of impact)

  • Business impact — 25% more revenue than the control
  • Build Time — 2 weeks

Test 2 (High LOE, high level of impact)

  • Business impact — 125% more revenue than the control
  • Build Time — 6 weeks

According to the test planning tool, if the control is generating $30,000 in revenue per month, running Test 1 (low LOE, low level of impact) first will generate $45,000 more revenue than running Test 2 (high LOE, high level of impact) first.

Again, same tests (over a six-month period), just a different order. And you gain $45,000 more in revenue.

“It is particularly interesting to see the benefits of running the lower LOE and lower impact test first so that its benefits could be reaped throughout the duration of the longer development schedule on the higher LOE test. The financial impact difference — landing in the tens of thousands of dollars — may be particularly shocking to some readers,” said Rebecca Strally, Director, Optimization and Design, MECLABS Institute.

Scenario #3: Fewer development resources

In the above two examples, the tests were able to be developed simultaneously. What if the tests cannot be developed simultaneously (they must be developed sequentially), and each one can’t be developed until the previous test has been implemented? Perhaps this is because of your organization’s development methodology (Agile vs. Waterfall, etc.), or there is simply a limit on your development resources. (Your developers likely have many other projects besides your tests.)

Let’s look at that scenario, this time with three tests.

Test 1 (Low LOE, low level of impact)

  • Business impact — 10% more revenue than the control
  • Build Time — 2 weeks

Test 2 (High LOE, high level of impact)

  • Business impact — 360% more revenue than the control
  • Build Time — 6 weeks

Test 3 (Medium LOE, medium level of impact)

  • Business impact — 70% more revenue than the control
  • Build Time — 3 weeks

In this scenario, the two highest-performing sequences were Test 2, then Test 1, then Test 3; and Test 2, then Test 3, then Test 1. The lowest-performing sequence was Test 3, then Test 1, then Test 2. The difference was $894,000 more revenue from using one of the highest-performing test sequences versus the lowest-performing test sequence.

“If development for tests could not take place simultaneously, there would be a bigger discrepancy in overall revenue from different test sequences,” Strally said.

“Running a higher LOE test first suddenly has a much larger financial payoff. This is notable because once the largest impact has been achieved, it doesn’t matter in what order the smaller LOE and impact tests are run, the final dollar amounts are the same. Development limitations (although I’ve rarely seen them this extreme in the real world) created a situation where whichever test went first had a much longer opportunity to impact the final financial numbers. The added front time certainly helped to push running the highest LOE and impact test first to the front of the financial pack,” she added.

The Next Scenario Is Up To You: Now forecast your own most profitable test sequences

You likely don’t have the exact perfect information we provided in the scenarios. We’ve provided model scenarios above, but the real world can be trickier. After all, as Nobel Prize-winning physicist Niels Bohr said, “Prediction is very difficult, especially if it’s about the future.”

“We rarely have this level of information about the possible financial impact of a test prior to development and launch when working to optimize conversion for MECLABS Research Partners. At best, the team often only has a general guess as to the level of impact expected, and it’s rarely translated into a dollar amount,” Strally said.

That’s why we’re providing the Test Planning Scenario Tool as a free, instant download. It’s easy to run a few different scenarios in the tool based on different levels of projected results and see how the test order can affect overall revenue. You can then use the visual charts and numbers created by the tool to make the case to your team, clients and business leaders about what order you should run your company’s tests.

Don’t put your tests on autopilot

Of course, things don’t always go according to plan. This tool is just a start. To have a successful conversion optimization practice, you have to actively monitor your tests and advocate for the results because there are a number of additional items that could impact an optimal testing sequence.

“There’s also the reality of testing which is not represented in these very clean charts. For example, things like validity threats popping up midtest and causing a longer run time, treatments not being possible to implement, and Research Partners requesting changes to winning treatments after the results are in, all take place regularly and would greatly shift the timing and financial implications of any testing sequence,” Strally said.

“In reality though, the number one risk to a preplanned DOE (design of experiments) in my experience is an unplanned result. I don’t mean the control winning when we thought the treatment would outperform. I mean a test coming back a winner in the main KPI (key performance indicator) with an unexpected customer insight result, or an insignificant result coming back with odd customer behavior data. This type of result often creates a longer analysis period and the need to go back to the drawing board to develop a test that will answer a question we didn’t even know we needed to ask. We are often highly invested in getting these answers because of their long-term positive impact potential and will pause all other work — lowering financial impact — to get these questions answered to our satisfaction,” she said.

Related Resources

MECLABS Institute Online Testing on-demand certification course

Offline and Online Optimization: Cabela’s shares tactics from 51 years of offline testing, 7 years of digital testing

Landing Page Testing: Designing And Prioritizing Experiments

Email Optimization: How To Prioritize Your A/B Testing


In Conversion Optimization, The Loser Takes It All

Most of us at some point in our lives have experienced that creeping, irrational fear of failure, of being an imposter in our chosen profession or of being deemed “a Loser” for not getting something right the first time. That fear is especially familiar to marketers who work in A/B testing and conversion optimization.

We are constantly tasked with creating new, better experiences for our company or client and in turn the customers they serve. Yet unlike many business ventures or fire-and-forget ad agency work, we then willingly set out to definitively prove that our new version is better than the old, thus throwing ourselves upon the dual fates of customer decision making and statistical significance.

And that’s when the sense of failure begins to creep in, when you have to present a losing test to well-meaning clients or peers who were so convinced that this was a winner, a surefire hit. The initial illusion they had — that you knew all the right answers — so clinically shattered by that negative percentage sign in front of your results.

Yet of course herein lies the mistake: both of the client and peer who understandably need quick, short-term results, and of the marketer whose bravado says they can always get it right the first time.

A/B testing and conversion optimization, like the scientific method these disciplines apply to marketing, is merely a process to get you to the right answer, and to view it as the answer itself is to mistake the map for the territory.

I was reminded of this the other day when listening to one of my favorite science podcasts, “The Skeptics’ Guide to the Universe,” hosted by Dr. Steven Novella, which ends each week with a relevant quote. That week they quoted Brazilian-born, British, Nobel Prize-winning zoologist Sir Peter B. Medawar (1915-1987) from his 1979 book “Advice to a Young Scientist.” In it he stated, “All experimentation is criticism. If an experiment does not hold out the possibility of causing one to revise one’s views, it is hard to see why it should be done at all.”

This quote captures for me a lot of the truisms I’ve learnt in my experience as a conversion optimization marketer, and it also addresses a lot of the confusion among MECLABS Institute Research Partners and colleagues who are less familiar with the nature and process of conversion optimization.

Here are four points to keep in mind if you choose to take a scientific approach to your marketing:

1. If you truly knew what the best customer experience was, then you wouldn’t test

I have previously been asked, after presenting a thoroughly researched outline of planned testing, whether we knew a shortcut we could take to get to a big success, even though the methodical process to learning we had just outlined was greatly appreciated.

Now, this is a fully understandable sentiment, especially in the business world where time is money and everyone needs to meet their targets yesterday. That said, the question does fundamentally miss the value of conversion optimization testing, if not the value of the scientific method itself. Remember, this method of inquiry has allowed us — through experimentation and the repeated failure of educated, but ultimately false, hypotheses — to finally develop the correct hypothesis and understanding of the available facts. As a result, we are able to cure disease, put humans on the moon and develop better-converting landing pages.

In the same vein, as marketers we can do in-depth data and customer research to get us closer to identifying the correct conversion problems in a marketing funnel and to work out strong hypotheses about what the best solutions are, but ultimately we can’t know the true answer until we test it.

A genuine scientific experiment should be trying to prove itself wrong as much as it is trying to prove itself right. It is only by testing and ruling out our false hypotheses that we as marketers can confirm the true hypothesis, the one that represents the correct interpretation of the available data and understanding of our customers, and achieve the big success we seek for our clients and customers.

2. If you know the answer, just implement it

This particularly applies to broken elements in your marketing or conversion funnel.

An example of this from my own recent experience with a client was when we noticed in our initial forensic conversion analysis of their site that the design of their cart made it almost impossible to convert on a small mobile or desktop screen if you had more than two products in your cart.

Looking at the data and the results from our own user testing, we could see that this was clearly broken and not just an underperformance. So we just recommended that they fix it, which they did.

We were then able to move on and optimize the now-functioning cart and lower funnel through testing, rather than wasting everyone’s time with a test that was a foregone conclusion.

3. If you see no compelling reason why a potential test would change customer behavior, then don’t do it

When creating the hypothesis (the supposition that can be supported or refuted via the outcome of your test), make sure it is a hypothesis based upon an interpretation of available evidence and a theory about your customer.

Running the test should teach you something about both your interpretation of the data and the empathetic understanding you think you have of your customer.

If running the test will do neither, then it is unlikely to be impactful and probably not worth running.

4. Make sure that the changes you make are big enough and loud enough to impact customer behavior

You might have data to support the changes in your treatment and a well-thought-out customer theory, but if the changes you make are implemented in a way that customers won’t notice them, then you are unlikely to elicit the change you expect to see and have no possibility of learning something.

Failure is a feature, not a bug

So next time you are feeling like a loser, when you are trying to explain why your conversion optimization test lost:

  • Remind your audience that educated failure is an intentional part of the process.
  • Focus on what you learnt about your customer and how you have improved upon your initial understanding of the data.
  • Explain how you helped the client avoid implementing the initial “winning idea” that, it turns out, wasn’t such a winner — and all the money this saved them.

Remember, like all scientific testing, conversion optimization might be slow, methodical and paved with losing tests, but it is ultimately the only guaranteed way to build repeatable, iterative, transferable success across a business.

Related Resources:

Optimizing Headlines & Subject Lines

Consumer Reports Value Proposition Test: What You Can Learn From A 29% Drop In Clickthrough

MarketingExperiments Research Journal (Q1 2011) — See “Landing Page Optimization: Identifying friction to increase conversion and win a Nobel Prize” starting on page 106


Conversion Optimization Testing: Validity threats from running multiple tests at the same time

A/B testing is popular among marketers and businesses because it gives you a way to determine what really works between two (or more) options.

However, truly extracting value from your testing program requires more than simply throwing some headlines or images into a website testing tool. There are ways you can undermine your test results that the tool itself can’t prevent.

It will still spit out results for you. And you’ll think they’re accurate.

These are called validity threats. In other words, they threaten the ability of your test to give you information that accurately reflects what is really happening with your customer. Instead, you’re seeing skewed data from not running the test in a scientifically sound manner.

In the MECLABS Institute Online Testing certification course, we cover validity threats like history effect, selection effect, instrumentation effect and sampling distortion effect. In this article, we’ll zoom in on one example of a selection effect that might cause a validity threat and thus misinterpretation of results — running multiple tests at the same time — which increases the likelihood of a false positive.

Interaction Effect — different variations in the tests can influence each other and thus skew the data

The goal of an experiment is to isolate a scenario that accurately reflects how the customer experiences your sales and marketing path. If you’re running two tests at the same time, the first test could influence how they experience the second test and therefore their likelihood to convert.

This is a psychological phenomenon known as priming. If we talk about the color yellow and then I ask you to mention a fruit, you’re more likely to answer banana. But if we talk about red and I ask you to mention a fruit, you’re more likely to answer apple. 

Another way the interaction effect can threaten validity is through a selection effect. In other words, the way you advertise near the beginning of the funnel impacts the type and motivations of the customers you’re bringing through your funnel.

Taylor Bartlinski, Senior Manager, Data Analytics, MECLABS Institute, provides this example:

“We run an SEO test where a treatment that uses the word ‘cheap’ has a higher clickthrough rate than the control, which uses the word ‘trustworthy.’ At the same time, we run a landing page test where the treatment also uses the word ‘cheap’ and the control uses ‘trustworthy.’  The treatments in both tests with the ‘cheap’ language work very well together to create a higher conversion rate, and the controls in each test using the ‘trustworthy’ language work together just as well.  Because of this, the landing page test is inconclusive, so we keep the control. Thus, the SEO ad with ‘cheap’ language is implemented and the landing page with ‘trustworthy’ language is kept, resulting in a lower conversion rate due to the lack of continuity in the messaging.”

Running multiple tests and hoping for little to no validity threat

The level of risk depends on the size of the change and the amount of interaction. However, that can be difficult to gauge before, and even after, the tests are run.

“Some people believe (that) unless you suspect extreme interactions and huge overlap between tests, this is going to be OK. But it is difficult to know to what degree you can suspect extreme interactions. We have seen very small changes have very big impacts on sites,” Bartlinski says.

Another example Bartlinski provides is one where there is little interaction between tests. For example, testing PPC landing pages that do not interact with organic landing pages that are part of another test — or testing separate things in mobile and desktop at the same time. “This lowers the risk, but there still may be overlap. It’s still an issue if a percentage gets into both tests; not ideal if we want to isolate findings and be fully confident in customer learnings,” Bartlinski said.

How to overcome the interaction effect when testing at the speed of business

In a perfect scientific experiment, multiple tests would not be run simultaneously. However, science often has the luxury of moving at the speed of academia. In addition, many scientific experiments are seeking to discover knowledge that can have life or death implications.

If you’re reading this article, you likely don’t have the luxury of taking as much time with your tests. You need results — and quick. You also are dealing with business risk, and not the high stakes of, for example, human life or death.

There is a way to run simultaneous tests while limiting validity threats — running multiple tests on (or leading to) the same webpage but splitting traffic so people do not see different variations at the same time.

“Running mutually exclusive tests will eliminate the above validity threats and will allow us to accurately determine which variations truly work best together,” Bartlinski said.

There is a downside though. It will slow down testing since an adequate sample size is needed for each test. If you don’t have a lot of traffic, it may end up taking the same amount of time as running tests one after another.
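As a rough illustration of what “mutually exclusive” means in practice, here is a minimal sketch in Python of deterministic, hash-based assignment that places each visitor into exactly one of the concurrently running tests. The function name and test names are hypothetical; commercial platforms implement this kind of bucketing in their own ways.

```python
import hashlib

def assign_exclusive_test(visitor_id: str, tests: list) -> str:
    """Deterministically place each visitor into exactly one running test,
    so nobody is exposed to variations from two tests at the same time."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(tests)  # stable bucket for this visitor
    return tests[bucket]

# Hypothetical setup: two tests that touch the same funnel.
running_tests = ["seo_ad_language_test", "landing_page_language_test"]
print(assign_exclusive_test("visitor-1234", running_tests))  # same answer every time
```

Because each test now sees only a fraction of total traffic, this is also exactly why the approach slows testing down, as noted above.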

What’s the big idea?

Another important factor to consider is that the results from grouping the tests should lead to a new understanding of the customer — or what’s the point of running the test?

Bartlinski explains, “Grouping tests makes sense if tests measure the same goal (e.g., reservations), they’re in the same flow (e.g., same page/funnel), and you plan to run them for the same duration.”

The messaging should be parallel as well so you get a lesson. Pointing a treatment ad that focuses on cost to a treatment landing page that focuses on luxury, and then pointing a treatment ad that focuses on luxury to a landing page that focuses on cost, will not teach you much about your customer’s motivations.

If you’re running multiple tests on different parts of the funnel and aligning them, you should think of each flow as a test of a certain assumption about the customer as part of your overall hypothesis.

It is similar to a radical redesign. Much like testing multiple steps of the funnel can cause an interaction effect, testing multiple elements on a single landing page or in a single email can cause an attribution issue. Which change caused the result we see?

Bartlinski provides this example, “On the same landing page, we run a test where both the call-to-action (CTA) and the headline have been changed in the treatment. The treatment wins, but is it because of the CTA change or the headline? It is possible that the increase comes exclusively from the headline, while the new CTA is actually harming the clickthrough rate. If we tested the headline in isolation, we would be able to determine whether the combination of the new headline and old CTA actually has the best clickthrough, and we are potentially missing out on an even bigger increase.”

While running single-factorial A/B tests is the best way to isolate variables and determine with certainty which change caused a result, if you’re testing at the speed of business you don’t have that luxury. You need results and you need them now!

However, if you align several changes in a single treatment around a common theme that represents something you’re trying to learn about the customer (aka radical redesign), you can get a lift while still attaining a customer discovery. And then, in follow-up single-factorial A/B tests, narrow down which variables had the biggest impact on the customer.

Another cause of attribution issues is running multiple tests on different parts of a landing page because you assume they don’t interact. Perhaps you run a test on two different ways to display locations on a map in the upper left corner of the page. Then a few days later, while that test is still running, you launch a second test on the same page, but in the lower right corner, on how star ratings are displayed in the results.

You could assume these two changes won’t have an effect on each other. However, the variables haven’t been isolated from the tests, and they might influence each other. Again, small changes can have big effects. The speed of your testing might necessitate testing like this; just know the risk involved in terms of skewed results.

To avoid that risk, you could run multivariate tests or mutually exclusive tests which would essentially match each combination of multiple variables together into a separate treatment. Again, the “cost” would be that it would take longer for the test to reach a statistically significant sample size since the traffic is split among more treatments.
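To see why that “cost” grows quickly, here is a minimal sketch in Python that enumerates a full-factorial set of treatments from two hypothetical factors; the factor names are made up for illustration.

```python
from itertools import product

# Hypothetical factors being changed on the same page at the same time.
headlines = ["control_headline", "new_headline"]
ctas = ["control_cta", "new_cta"]

# A multivariate (full-factorial) design makes every combination its own
# treatment, so interactions can be measured instead of assumed away.
treatments = list(product(headlines, ctas))
for combo in treatments:
    print(combo)

# Four treatments instead of two: traffic is split four ways, so each
# treatment takes longer to reach an adequate sample size.
print(len(treatments))
```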

Test strategically

The big takeaway here is — you can’t simply trust a split testing tool to give you accurate results. And it’s not necessarily the tool’s fault. It’s yours. The tool can’t possibly know ways you are threatening the validity of your results outside that individual split test.

If you take a hypothesis-driven approach to your testing, you can test fast AND smart, getting a result that accurately reflects the real-world situation while discovering more about your customer.

You might also like:

Online Testing certification course — Learn a proven methodology for executing effective and valid experiments

Optimization Testing Tested: Validity threats beyond sample size

Validity Threats: 3 tips for online testing during a promotion (if you can’t avoid it)

B2B Email Testing: Validity threats cause Ferguson to miss out on lift from Black Friday test

Validity Threats: How we could have missed a 31% increase in conversions


Call Center Optimization: How a nonprofit increased donation rate 29% with call center testing

If you’ve read MarketingExperiments for any length of time, you know that most of our marketing experiments occur online because we view the web as a living laboratory.

However, if your goal is to learn more about your customers so you can practice customer-first marketing and improve business results, don’t overlook other areas of customer experimentation as well.

To wit, this article is about a MECLABS Institute Research Partner who engaged in call center testing.

Overall Research Partnership Objective

Since the Research Partner was a nonprofit, the objective of the overall partnership focused on donations. Specifically, to increase the total amount of donations (number and size) given by both current and prospective members.

While MECLABS engaged with the nonprofit in digital experimentation as well (for example, on the donation form), the telephone was a key channel for this nonprofit to garner donations.

Call Script Test: Initial Analysis

After analyzing the nonprofit’s call scripts, the MECLABS research analysts identified several opportunities for optimization. For the first test, they focused on the call script’s failure to establish rapport with the caller and its mention of only a $20-per-month donation, which mentally created a ceiling for the donation amount.

Based on that analysis, the team formulated a test. The team wanted to see if they could increase overall conversion rate by establishing rapport early in the call. The previous script jumped in with the assumption of a donation before connecting with the caller.

Control Versus Treatment

In digital A/B testing, traffic is split between a control and treatment. For example, 50% of traffic to a landing page is randomly selected to go to the control. And the other 50% is randomly selected to go to the treatment that includes the optimized element or elements: optimized headline, design, etc. Marketers then compare performance to see if the tested variable (e.g., the headline) had an impact on performance.

In this case, the Research Partner had two call centers. To run this test, we provided optimized call scripts to one call center and left the other call center as the control.

We made three key changes in the treatment with the following goals in mind:

  • Establish greater rapport at the beginning of the call: The control goes right into asking for a donation – “How may I assist you in giving today?” However, the treatment asked for the caller’s name and expressed gratitude for the previous giving.
  • Leverage choice framing by recommending $20/month, $40/month, or more: The control only mentioned the $20/month option. The addition of options allows potential donors to make a choice and not have only one option thrust upon them.
  • Include an additional one-time cause-related donation for both monthly givers and other appropriate calls: The control did not ask for a one-time additional donation. The ongoing donation supported the nonprofit’s overall mission; however, the one-time donation provided another opportunity for donors to give by tying specifically into a real-time pressing matter that the nonprofit’s leaders were focused on. If they declined to give more per month for financial reasons, they were not asked about the one-time donation.

To calibrate the treatment before the experimentation began, a MECLABS researcher flew to the call center site to train the callers and pretest the treatment script.

While the overall hypothesis stayed the same, after four hours of pretesting, the callers reconvened to make minor tweaks to the wording based on this pretest. It was important to preserve key components of the hypothesis; however, the callers could make small tweaks to preserve their own language.

The treatment was used on a large enough sample size — in this case, 19,655 calls — to detect a statistically valid difference between the control and the treatment.

Results

The treatment script increased the donation rate from 14.32% to 18.47% at a 99% Level of Confidence for a 29% relative increase in the donation rate.
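For readers who want to see how a result like this is checked, here is a minimal sketch in Python of a two-proportion z-test on the reported rates. The exact split of the 19,655 calls between the two call centers was not published, so an even split is assumed purely for illustration, and the MECLABS calculation may differ in its details.

```python
from statistics import NormalDist

# Reported donation rates; an even split of the 19,655 calls is assumed here.
n_control = n_treatment = 19_655 // 2
p_control, p_treatment = 0.1432, 0.1847

pooled = (p_control * n_control + p_treatment * n_treatment) / (n_control + n_treatment)
standard_error = (pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment)) ** 0.5
z = (p_treatment - p_control) / standard_error
level_of_confidence = 2 * NormalDist().cdf(abs(z)) - 1  # two-sided

print(f"Relative lift: {p_treatment / p_control - 1:.1%}")  # roughly 29%
print(f"Exceeds 99% Level of Confidence: {level_of_confidence > 0.99}")
```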

Customer Insights

The benefits of experimentation go beyond the incremental increase in revenue from this specific test. By running the experiment in a rigorously scientific fashion — accounting for validity threats and formulating a hypothesis — marketers can build a robust customer theory that helps them create more effective customer-first marketing.

In this case, the “customers” were donors. After analyzing the data in this experiment, the team discovered three customer insights:

  • Building rapport on the front end of the script generated a greater openness with donors and made them more likely to consider donating.
  • Asking for a one-time additional donation was aligned with the degree of motivation for many of the callers. The script realized a 90% increase in one-time gifts.
  • Discovering there was an overlooked customer motivation — to make one-time donations, not only ongoing donations sought by the organization. Part of the reason may be due to the fact that the ideal donors were in an older demographic, which made it difficult for them to commit at a long-term macro level and much easier to commit at a one-time micro level. (Also, it gave the nonprofit an opportunity to tap into not only the overall motivation of contributing to the organization’s mission but contributing to a specific timely issue as well.)

The experimentation allowed the calling team to look at their role in a new way. Many had been handling these donors’ calls for several years, even decades, and there was an initial resistance to the script. But once they saw the results, they were more eager to do future testing.

Can You Improve Call Center Performance?

Any call center script is merely a series of assumptions. Whether your organization is nonprofit or for-profit, B2B or B2C, you must ask a fundamental question — what assumptions are we making about the person on the other end of the line with our call scripts?

And the next step is — how can we learn more about that person to draft call center scripts with a customer-first marketing approach that will ultimately improve conversion?

You can follow Daniel Burstein, Senior Director, Content & Marketing, MarketingExperiments, on Twitter @DanielBurstein.

You Might Also Like

Lead Nurturing: Why good call scripts are built on storytelling

Online Ads for Inbound Calls: 5 Tactics to get customers to pick up the phone

B2B Lead Generation: 300% ROI from email and teleprospecting combo to house list

Learn more about MECLABS Research Partnerships


7 Lessons for Testing with Limited Data and Resources

It feels like there’s never enough time, money or headcount to do marketing testing and optimization “right.” The BIG WIN in our marketing experimentation, whether it is conversions, revenue, leads, etc., never seems to come quick enough. I’ve been there with you. However, after running 33+ tests in 18 months with our team, I can testify there are simple lessons and tools to help you optimize your marketing campaigns.

Here is a list of 7 simple, effective lessons and tools to leverage in your marketing optimization.

1. Determine optimal testing sequence based on projected impact, level of importance and level of effort

Marketers are continually asking, “How do I know what to test first?” I recommend prioritizing tests with higher potential revenue impact first. Remember to factor in the time for key phases of testing (build time, run time, analysis and decision time, implementation time), as well as the organization’s ability to develop and implement tests simultaneously.

Sample Test Sequence Calculation Sheet

2. Save development time and money by creating wireframes and prototypes first

Get organizational buy-in using landing page, email, etc. prototypes before spending time and resources on development. Axure is my personal wireframe tool of choice because of the drag and drop nature of the pre-built widgets, the ability to lay out the page with pixel perfect accuracy, and the flexibility of the prototyping functions.

Wireframe [in Axure RP], then develop your landing pages, saving time and money on revisions

3. Determine minimum sample size, test duration and level of confidence

Before testing, determine your necessary sample size, estimated test duration and number of treatments to achieve test results you can feel confident in. Save yourself the frustration and time of conducting inconclusive tests that do not achieve a desired level of confidence. Here’s a simple, free tool to help you get started.

Free Testing Tool from MECLABS

4. Get qualitative, crowd-sourced feedback before launching

Leverage User Testing to better understand consumer preferences and avoid the marketer’s blind spot.

5. Content test emails for increased clickthrough rates

Increase open rates with subject, from-field, and time/day testing; increase clickthrough rates with content testing. Test up to eight versions of each email for increased performance. I am personally a fan of Mailchimp Pro and have driven a lot of sales through this easy to operate ESP.

6. Results are good, repeatable methods are better

A single test that increases conversion rates, sales and revenue is worth celebrating. However, developing customer insights that allow you to apply what you learned to future marketing campaigns will result in a much larger cumulative impact on the business. What was the impact from the test, AND more importantly, what did we learn?

Summarize results AND customer insights after each test

7. Bring your team with you

Organizational transformation toward a culture of testing and optimization only occurs when others believe in it with you. The most practical education I received to increase my marketing performance was from the world’s first (and only) graduate education program focused specifically on how to help marketers increase conversion. Having three people from our organization in the program changed how we talk about marketing. We moved from making decisions based on our intuition to testing our hypotheses to improve performance. 

Sample lecture from Graduate Certificate Program with Dr. Flint McGlaughlin


The Hypothesis and the Modern-Day Marketer

On the surface, the words “hypothesis” and “marketing” seem like they would never be in the same paragraph, let alone the same sentence. Hypotheses are for scientists with fancy lab coats on, right? I totally understand this perspective because, unless A/B testing and CRO (conversion rate optimization) are part of your company’s culture, you may ask yourself, “Where would I ever use a hypothesis in my daily activities?” To this question, I would answer “EVERYWHERE.”

By everywhere, I don’t just mean for your marketing collateral but also for any change within your company. For example, I oversee the operations of our Research Partnerships, and when making a team structure change earlier this year, I created a hypothesis that over the next few months I will prove or disprove. Here’s a modified version of this hypothesis:

IF we centralize partnership launches BY creating a dedicated Research Partner launch team and by building out processes that systematically focus on the most critical business objectives from the start, we WILL increase efficiency and effectiveness (and ultimately successful Partnerships) BECAUSE our Research Partners place high value on efforts that achieve their most critical business objectives first.

Can your offers and messaging be optimized?

The American Marketing Association defines marketing as “the activity, set of institutions, and processes for creating, communicating, delivering, and exchanging offerings that have value for customers, clients, partners, and society at large.”

So, if you are reading this blog right now and your job function reflects the above definition even in the slightest, can you answer one question for me: Do you have 100% confidence that each offer and message you create and deliver to your customers is the best it could possibly be?

If you answered yes, then our country desperately needs you. Please apply here: https://www.usa.gov/government-jobs.

If you answered no, then you are like most of us (and Socrates, by the way, i.e., “I know that I know nothing”), and you should have hypothesis creation operationalized into your process. After all, why wouldn’t you want to test message A versus message B to a small audience before sending an email out to your entire list?

How to create a marketing hypothesis

So first, let’s discuss what a hypothesis is and how to create one.

While there are several useful definitions of the word hypothesis, in Session 05: Developing an Effective Hypothesis of the University of Florida graduate course MMC 5422, Flint McGlaughlin proposes the following definition as a useful starting point in behavioral testing: “A hypothesis is a supposition or proposed statement made on the basis of limited evidence that can be supported or refuted and is used as a starting point for further investigation.”

Now that we know what we are looking for, we need a tool to help us get there. In Session 04: Crafting a Powerful Research Question of the same course, McGlaughlin reveals that this tool is the MECLABS Discovery Triad, a conceptual thinking tool that leads to the creation of an effective hypothesis; the “h” in the center of the triad represents the hypothesis.

Before creating a hypothesis, the scientists at MECLABS Institute use this Discovery Triad to complete the following steps for all of our Research Partners:

  1. We uncover the business objective (or business question) driving the effort. Typically, we find two patterns regarding the business objective. First, it is broader in scope than a research question that would be suitable for an experiment. Second, this objective takes the form of a question, which typically starts with the interrogative “How.” For example, “How do I get more leads?” “How do I drive more traffic?” or “How do I increase clickthrough rate (CTR)?” My business objective in the examples above was, “How do I create a more valuable research partnership from the perspective of the research partner?”
  2. Now that we are focused on an objective, we ask a series of “What” questions. For example, “What is happening on this webpage?” or “Where are visitors to page A going if they do not make it to page B?” Essentially, we are looking to understand what the data can tell us about the behavior of the prospective customer. This series of “What” questions should encompass both quantitative questions (e.g., on-page clicks, next-page visits) and qualitative questions (e.g., What is the customer’s experience on this page?).
  3. We ask a question which starts with the interrogative “Why.” A “Why” question enables us to make a series of educated guesses as to why the customer is behaving in a certain way. For example, “Why are 75% of visitors not clicking the ‘Continue to Checkout’ button?” “Why are 20% of shoppers not adding the blue widget to their cart?” or “Why are only 5% of visitors starting the free trial from this page?” To answer “Why” questions, the research scientists at MECLABS apply the patented Conversion Heuristic to the page:
    1. What can we remove, add or change to minimize perceived cost?
    2. What can we remove, add or change to intensify perceived value?
    3. What can we remove, add or change to leverage motivation?

  4. We ask a second, more refined, “How” question (research question) that identifies the best testing objective. For example, if your business question was, “How do we sell more blue widgets?” and during the “What” stage, you analyzed your entire funnel, discovering that the biggest performance leak is on your blue widget product page, then your Research Question could be something like, “How do we increase CTR from the blue widget product page to the shopping cart checkout page?” (A short sketch of that funnel math follows this list.)
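
To make “finding the biggest performance leak” concrete, here is a minimal sketch with made-up step counts (the blue widget funnel above is hypothetical): compute the conversion rate from each funnel step to the next and look for the largest drop-off.

```python
# Hypothetical funnel counts -- swap in your own analytics numbers.
funnel = [
    ("Category page", 50_000),
    ("Blue widget product page", 18_000),
    ("Shopping cart", 2_100),
    ("Checkout complete", 1_450),
]

# Conversion rate from each step to the next; the lowest rate is the biggest leak.
for (step, visitors), (next_step, next_visitors) in zip(funnel, funnel[1:]):
    print(f"{step} -> {next_step}: {next_visitors / visitors:.1%}")
```

In this made-up funnel, the product page to cart step converts worst, which is exactly what would point the research question at the blue widget product page.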

Essentially, a powerful Research Question focuses your broader Business Question around a specific subject of research. After all, there are many ways to sell more blue widgets but only a handful of possible ways to sell more blue widgets from the product page.

The four components of a hypothesis

With a powerful Research Question created, you are now ready to develop a series of hypotheses that will help you discover how to express your offer to achieve more of your desired customer behavior using the MECLABS Four-step Hypothesis Development Framework:

  1. The IF statement. This is your summary description of the “mental lever” your primary change will pull. In fact, the mental lever is usually a cognitive shift you are trying to achieve in the mind of your prospective customers. For example, “IF we emphasize the savings they receive” or “IF we bring clarity to the product page.”
  2. The BY statement. This statement lists the variable(s) or set of variables (variable cluster) you are testing. This statement typically involves the words “add, remove or change.” For example, “BY removing unnecessary calls-to-action (CTAs)” or “BY adding a relevant testimonial and removing the video player.” (Tip: This statement should not contain detailed design requirements. That next level of precision occurs when you develop treatments or articulate your hypothesis.)
  3. The WILL statement. This should be the easiest statement to compose because it is the desired result you hope to achieve from the changes you are proposing. For example, “We WILL increase clickthrough rate,” or “We WILL increase the number of video views.” (Tip: This statement should tightly align with the Test Question.)
  4. The BECAUSE statement. While last in order of appearance, this statement is the most critical because it is what connects your work most deeply to your customer. By that I mean, the metric identified in your WILL statement either increased or decreased because the change you made resonated or did not resonate in the mind of your customer. For example, “BECAUSE prospective customers were still searching for the best deal, and every distraction made them think that there’s a better deal still out there,” or “BECAUSE prospective customers clearly see the savings they receive.” (Tip: Your BECAUSE statement should be centered around a single customer insight that, through testing, adds to your broader customer theory.)

So, if you put all this together, you have:

IF we bring clarity to the product page BY removing unnecessary CTAs, we WILL increase clickthrough rate BECAUSE prospective customers were still searching for the best deal, and every distraction made them think that there’s a better deal still out there.
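
If you keep a running test log, it can help to capture those four statements as structured fields rather than free text, so the BECAUSE insight doesn’t get lost after the test. Here is a minimal sketch in Python; it is purely illustrative (not a MECLABS tool), and the class and field names are my own.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """The four components of a test hypothesis, kept as separate fields."""
    if_statement: str       # the mental lever the primary change will pull
    by_statement: str       # the variable(s) being added, removed or changed
    will_statement: str     # the expected, measurable result
    because_statement: str  # the customer insight that explains the result

    def __str__(self) -> str:
        return (f"IF {self.if_statement} BY {self.by_statement}, "
                f"we WILL {self.will_statement} "
                f"BECAUSE {self.because_statement}.")

h = Hypothesis(
    if_statement="we bring clarity to the product page",
    by_statement="removing unnecessary CTAs",
    will_statement="increase clickthrough rate",
    because_statement=("prospective customers were still searching for the best deal, "
                       "and every distraction made them think there's a better deal still out there"),
)
print(h)  # prints the assembled hypothesis statement shown above
```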

Just like I used the Discovery Triad and Four-Step Hypothesis Development Frameworks for a team structure change, these processes can be used for any type of collateral you are “creating, communicating, delivering.” Even if you are not A/B testing and are just making changes or updates, it’s a valuable exercise to go through to ensure you are not constantly throwing spaghetti at the wall but, rather, systematically thinking through your changes and how they impact the customer.

You might also like

Online Marketing Tests: How Could You Be So Sure?

The world’s first (and only) graduate education program focused specifically on how to help marketers increase conversion

Optimizing Forms: How To Increase The Perceived Value For Your Customers

Download the Executive Series: The Web as a Living Laboratory


5 A/B Tests to Try in Game Apps

We talk a lot about A/B testing here on MarketingExperiments. What we don’t usually talk about is A/B testing for the mobile web…especially testing within mobile apps.

I thought we should change that. As I was scouring the web looking for mobile A/B tests, I found this two-year-old video by Amazon.

Apparently, Amazon Web Services (AWS) at one point had an A/B testing feature that has since been shut down.

While the feature was still available, however, one app developer used it extensively and shared their experiences in a promotional video for the feature. The developers were the team behind the game Air Patriots: Russell Caroll was the Senior Producer for the game, and Julio Gorge was the Game Development Engineer. The game is a kind of aerial take on the classic tower defense genre.

[Screenshot: Air Patriots]

Now granted, this was a promotional video, but the content still speaks for itself. These guys had (and, by the looks of it, still have) a fairly successful mobile app, and they ran some successful tests. It’s a great starting place for what you can test in your own mobile app.

By the way, while Amazon has shut down its A/B testing feature, there are a lot of other tools for testing mobile apps that will accomplish the same things the developers talk about in the video.

Test #1: What is the impact of ads on customer experience? (1:34)

The first thing the team tested was the impact ads have on their customers. They wanted to make sure the ads did not harm the customer experience, so they tested a single ad in the main menu near the bottom of the screen.

[Screenshot: Air Patriots main menu ad]

They found that the ads didn’t affect customer retention. This meant that they could insert ads and generate more revenue without hurting their customers.

Test #2: Will in-game ad placement affect customer retention? (2:56)

In the second test, the team put ads in the game screen.

[Screenshot: Air Patriots in-game ad]

In both the first and second tests, the ads had a little “X” that customers could tap to (in theory) dismiss the ads. When they tapped it, a pop-up appeared telling customers they could eliminate ads with any purchase in the game’s store.

In this test, there was again no impact on customer retention, but there was a statistically significant increase in revenue.
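
The video doesn’t share the underlying numbers, but here is a minimal sketch, with made-up counts, of the kind of check that sits behind a claim like “no impact on retention”: a two-sided test on the difference between the control and treatment rates. This is a generic illustration, not the analysis Amazon’s tool actually ran.

```python
from math import erf, sqrt

def two_proportion_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two rates (e.g., day-7 retention)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)      # pooled rate under "no difference"
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-tailed p-value
    return p_b - p_a, p_value

# Hypothetical numbers: 10,000 players per arm, count of retained players in each arm.
diff, p = two_proportion_test(successes_a=4_120, n_a=10_000, successes_b=4_310, n_b=10_000)
print(f"difference: {diff:+.1%}, p-value: {p:.4f}, significant at 95% LoC: {p < 0.05}")
```

A metric like revenue per player is not a simple rate, so it would typically be checked with a different test (for example, a t-test on per-player revenue) rather than the proportion test above.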

Test #3: Simple game-circle icon test (4:20)

In this test, the team wanted to know whether an icon linking to the game-circle (Amazon’s game stats and leaderboards portal) would improve performance.

[Screenshot: Air Patriots game-circle icon]

It’s not clear which icon won, or even why this particular test was useful for the team, but they did get a favorable result, and the lesson they wanted to drive home was that simple changes like icons can make a difference. We’ve, of course, found that to be the case in a large number of our tests on MarketingExperiments.

Test #4: Does game difficulty affect revenue? (4:58)

In this fourth test, Caroll made a mistake: He accidentally changed the game difficulty, making it about 10% harder. As a result, every metric that was important to the team tanked.

[Screenshot: Air Patriots metrics]

The team, of course, fixed it as fast as possible, but it gave them an idea.

What would happen to revenue if they made the game easier?

So they ran a test that had five treatments: the control and four difficulty levels that were easier than the control.
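
Amazon’s feature handled the mechanics for them, but if you were wiring up a multi-treatment test yourself, a common approach is deterministic bucketing: hash each player ID so every player lands in the same treatment in every session. A minimal sketch (the treatment labels and experiment name here are hypothetical):

```python
import hashlib

TREATMENTS = ["control", "easier_1", "easier_2", "easier_3", "easier_4"]  # hypothetical labels

def assign_treatment(player_id: str, experiment: str = "difficulty_test") -> str:
    """Deterministically bucket a player into one of the treatments.

    Hashing (experiment name + player ID) keeps assignment stable across
    sessions and independent across different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{player_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(TREATMENTS)
    return TREATMENTS[bucket]

print(assign_treatment("player-12345"))  # the same player always gets the same variant
```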

It turned out that the easiest difficulty performed the best. By making the game easier, they got players playing 20% longer, and revenue went up 20%.

Test #5: When is the best time to send push notifications to re-engage inactive players? (7:43)

The team then tested a push notification that offered inactive players an incentive to pick the game back up.

[Screenshot: Air Patriots push notification]

They wanted to know when the best time to send the notification would be, so they tested a few different send times and found that the best time was 3 days after the last game play.

They also found that sending the notification 7 days after the last game play negatively impacted their performance metrics.

With these five tests (and probably a few more that happened off the record), the team was able to develop a great app for their customers and steadily increase their revenue. At the end of the video, Caroll gives a few key takeaways for marketers who are A/B testing their apps.

You might also like:

Email Research Chart: Email opens trends on mobile devices in 2015

Mobile Marketing Chart: Amount of revenue from the mobile channel, by merchant type

3 A/B Testing Case Studies from Smart Brand-Side Marketers

Expert Interview: How Humana uses Voice of Customer data and Creates “Super Tests” to drive customer engagement

Mike Loveridge is the Head of Digital Conversion Optimization at Humana. In his position, he has conducted more than 300 tests and achieved a 70%+ win rate. No matter how you slice it, he is an expert in the field of conversion rate optimization.

Back in February, Courtney Eckerle, Managing Editor, MarketingSherpa, sat down with Loveridge at the MarketingSherpa Summit in Las Vegas to talk about Humana’s testing and optimization practices.

In this short interview, Loveridge discusses the benefits of Voice of Customer data, and how Humana creates “Super Tests” by testing the component parts of webpages before doing a radical redesign.

You might also like:

Humana on the power of iterative testing

Homepage Optimization: Tips to ensure site banners maximize clickthrough and conversion

Conversion Rate Optimization: Building to the Ultimate Yes