The Virtue of Experimentation

Shivam Dutt Sharma
15 min read · Jun 12, 2022
Experimentation — Floor Dialogue

You are a business executive responsible for making operational decisions for your area. Whether the decision concerns lead targeting, service prioritization, or customer-product recommendations, you are ultimately trying to improve the user experience, service levels, revenue (conversions), and costs.

To achieve this, you constantly need to create actions. While you may develop good judgment over time about which actions work well in which situations, in today’s data-driven, complex business environment you will want to back your actions with pre-determined, data-driven insights. These insights take the form of hypotheses in the world of Experimentation: a hypothesis is essentially a testable statement that lets you confirm or disprove your gut-driven expectations about an action. The way to test it is to run the action on a smaller sample of the population and for a shorter duration, thereby reducing the associated risks (by not running an unproven action on a large population) as well as controlling the costs.

Test & Learn

When experimentation is done repeatedly before implementation, across all actions and across multiple business units, teams, and operations areas, it leads to a cultural shift known as ‘Test & Learn’.

The shift towards Test & Learn

What is Test & Learn?

Test & Learn is a cultural shift whereby every action, before it is implemented at full scope, is first run at a smaller scale as an ‘Experiment’ to learn from. It ushers in a data-driven methodology for deciding among feasible options, rather than gut-feel, judgment-driven decisions. The learnings from the Experimentation step are later used for the full-scale implementation of the actions.

Although Test & Learn is optional for any action recommendation, it is fast becoming the norm for problems that involve decisions under uncertainty.

What is an Experiment?

An experiment is essentially the testing of a hypothesis, making sure that the result generated is statistically valid and also impactful enough to achieve the Experimenter’s desired goal; based on this, the recommendation with respect to the hypothesis is made. An experiment is defined from the action under consideration. Each experiment has different versions / variations / treatments of the underlying action, which are then tested over the course of the Experiment.

For example:
An experiment could be designed by a bank to finalize the primary mode of communication for a campaign around the launch of a new credit card. The bank first allocates a sample population (a limited number of customer accounts) for the experiment. The experiment could have multiple variants, with each variant getting an equal share of accounts:

Variant 01 could be sending emails to customers with certain demographics. The users targeted under Variant 01 then become Challenger Group 1 / Experiment Group 1.

Variant 02 could be sending SMS to customers with some other demographics. The users targeted under Variant 02 then become Challenger Group 2 / Experiment Group 2.

Variant 03 could be performing no action at all (i.e. not communicating about the new credit card; these customers learn about the launch on their own through internet search or news coverage). The users targeted under Variant 03 then become the Control Group.

At the end of the Experiment, the Experimenter tracks whether customers responded / took action (which here means buying the newly launched credit card) more in the case of emails or SMS, or perhaps whether the customers who received no communication at all (the Control group) bought the new card the most.

This way the bank gets its winning variant.

The analytics will of course establish the winning variant based on which variant contributed the maximum credit card sales. However, it remains very important to also check whether the results of the Experiment are statistically significant (which in statistical terms means checking whether the Null Hypothesis is rejected or not). It is imperative for an Experimenter to establish that the results of the Experiment are not just by chance and that there exists an integral relationship between the treatments and the target variable (metric). For statistical significance, there is a suite of tests that can be performed in Python, e.g. Chi-square, Fisher’s Exact, Student’s t-test, etc.

There is generally a set of decision criteria for deciding which of these tests to execute, e.g. the size of the sample population, the distribution of the metric values (whether discrete or continuous), etc. All of this is explained later in the article.
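To make this concrete, here is a minimal sketch of one of the simpler significance tests in this family, a two-proportion z-test, comparing an email variant against the control group from the bank example. The conversion counts are made up for illustration, not from any real campaign:

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided p-value
    return z, p_value

# Hypothetical counts: 120/1000 conversions for email vs. 80/1000 for control
z, p = two_proportion_z_test(120, 1000, 80, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
# A p-value below 0.05 would reject the null hypothesis of equal conversion rates
```

A p-value above the chosen threshold would mean the apparent lift could plausibly be natural variation, and the "winning" variant should not be declared a winner.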

What is Attribution?

Attribution refers to measuring the impact of parallelly executed experiment variants (including the control variant) in terms of the change in the target variable contributed by each variant in its respective environment. The winning variant is then chosen for the full-scale roll-out.

If a customer is part of multiple experiments at once, the organization needs an attribution model that assigns credit appropriately to each experiment.
However, as an industry practice, it is recommended not to include one customer in multiple experiments at all.
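If overlap cannot be avoided, the simplest conceivable attribution model splits the observed change evenly across the overlapping experiments. This is a deliberately naive sketch (a hypothetical baseline, not a recommended design; real attribution models weight experiments by exposure, recency, or causal estimates):

```python
def equal_credit(total_uplift, experiments):
    """Split an observed uplift evenly across the experiments a customer
    was simultaneously exposed to (naive equal-credit attribution)."""
    share = total_uplift / len(experiments)
    return {exp: share for exp in experiments}

# Hypothetical: a customer sat in three experiments while uplift of 90 was observed
print(equal_credit(90.0, ["email_campaign", "discount_test", "ux_change"]))
# {'email_campaign': 30.0, 'discount_test': 30.0, 'ux_change': 30.0}
```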

Positioning of Experiments in any Business-driven Environment

As executives observe the key metrics of their businesses to track overall performance, they are motivated to perform a Root Cause Analysis (RCA) of the underlying problems observed.

At the end of every RCA cycle, executives will have actionable insights from the cohorts of accounts generated by the underlying ML algorithms. At this point, they will create actions based on their expert judgment and run them over a stipulated period of time. These actions are applied to all the customers in those cohorts, and thus would likely impact tens of thousands of them. This is a risky proposition: if the action goes wrong, it may adversely affect a large population. To avoid this, and also to get data-backed evidence of an action’s efficacy (in addition to human experience), we have introduced the Experimentation module. With it, users can create time-bound Experiments before the final recommendation of actions, to test the effectiveness of those actions.

The Action — Experiment Relationship

There is an innate relationship between an Experiment and its underlying action. Experiments are optional to create; as a result, users can go ahead and recommend actions directly once they have actionable recommendations from the Root Cause Analysis step. However, if an Experiment is created, there must be an underlying action.

The Experimenters aim to recommend the most significant variant as the potential action at the end of every Experiment.

The diagram above represents the experimentation concept with an example from the banking domain, where collections of loan repayments for automotive loans are under purview, with a view to improving the efficacy and efficiency of the collections processes.

These experiments would be run over a period of time (possibly shorter than the original action’s period) to provide statistical evidence of the potential action’s efficacy.

What does the Anatomy of an Experiment typically look like?

Experiment Name : A unique name that the user gives to the particular Experiment he/she is creating.

Experiment Description : A description of the particular Experiment he/she is creating.

Experiment Type : Generally, across the industry, organizations have adopted mainly three types of Experimentation methods:

  • Test vs Control
  • Pre/Post Analysis
  • A/B Testing.

Key Metric : Every experiment has a metric associated with it. In most cases this will be the primary metric on the node; however, the user has the flexibility to select any other metric as well. All the variants under an Experiment are expected to show different metric values. At least, that is how an ideal Experiment behaves: every variant exhibits a metric progression different from the others, based on which the Experimenter can decide which variant to recommend as the final action at the end of the Experiment.

Experiment Duration : This is the duration for which an Experiment will run. By universal standards, an Experiment should be run before any action is recommended. The typical duration of an Experiment can be anything, as there is no constraint; however, the Experiment needs to complete before the Action starts. That said, the Experimentation step is optional within Action Management. An action can be recommended without conducting any experiments, should the user be confident about the potential impact it would have on the business.

Experiment Audience : The audience for an Experiment is the group of people who will be the subjects (customers under study) of the Experiment. Normally, actions get recommended for the entire cohort of customers. With such a large number of customers in the picture, the situation can become precarious for the Experimenters if the action does not work out well. Since there is cost and time involved with every potential action, Experimenters find it safer to run Experiments on a sample population, learn from them, and then implement the actions at full scale once the results from the Experiments confirm their hypotheses.

The user is expected to provide a percentage (%) input, and the system automatically calculates the exact sample audience size. Users can also take note of the minimum recommended sample size, which the system suggests while an Experiment is being created. The minimum sample size is calculated from attributes like Effect Size, Power, Ratio, Alpha (significance level), etc.
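As an illustration, the minimum per-group sample size for a two-sample comparison can be approximated from Effect Size, Power, Ratio, and Alpha with the standard normal-approximation formula. This is a sketch of one common approach, not necessarily the exact formula the system uses:

```python
import math
from statistics import NormalDist

def min_sample_size(effect_size, alpha=0.05, power=0.80, ratio=1.0):
    """Per-group sample size to detect a standardized effect (Cohen's d)
    with a two-sided test, using the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g. 0.84 for power = 0.80
    n = (1 + 1 / ratio) * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)

print(min_sample_size(effect_size=0.2))  # a "small" effect needs roughly 393 per group
```

Note how the required sample size grows quadratically as the effect to be detected shrinks, which is why subtle uplifts demand large audiences.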

Experiment Goal : It’s important to associate a goal with every Experiment so that the Experimenter can easily gauge its effectiveness. While an Experiment is in progress, it also becomes possible to show how much any treatment is leading or lagging the stipulated goal. Experimenters can plan to increase or decrease the value of the Experiment metric, and that is the input they provide on the front end.

Experiment Types and Variants : This is where the Experimenters will create the different treatments (test variants) under an Experiment. Depending upon the type of the Experiment, the nature and number of variants will differ.

  • If the Experiment type is “Test vs Control”, there will always be a default Control variant (group), reserved as a no-treatment variant and used only to compare the performance of the actual treatments / variants against a non-treated group. In a “Test vs Control” Experiment, an Experimenter can create any number of test variants.
  • If the Experiment type is “A/B Testing”, an Experimenter cannot reserve a Control group; however, all other behavior is the same as that of a Test vs Control Experiment. The performance of the different test variants is compared against each other at the end of the Experiment.
  • If the Experiment type is “Pre/Post Analysis”, an Experimenter creates a single treatment, the performance of which is compared to the performance over the same duration preceding the Experiment (the pre-Experiment period).
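In practice, splitting the sample audience equally across variants (as in the bank example earlier) can be as simple as shuffling the accounts and dealing them out round-robin. A minimal sketch, with illustrative variant names and audience size:

```python
import random

def assign_variants(customer_ids, variants, seed=2022):
    """Shuffle the sample audience and deal it out round-robin so that
    each variant gets an (almost) equal share of accounts."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    ids = list(customer_ids)
    rng.shuffle(ids)
    groups = {v: [] for v in variants}
    for i, cid in enumerate(ids):
        groups[variants[i % len(variants)]].append(cid)
    return groups

groups = assign_variants(range(900), ["email", "sms", "control"])
print({v: len(members) for v, members in groups.items()})  # 300 accounts each
```

Randomizing before the split matters: it prevents systematic differences (e.g. sign-up date ordering of account IDs) from leaking into any one variant.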

The Diversification of Experiments

Test Vs Control | A/B Testing

  • Also called Split Testing, it originated from randomized controlled trials in statistics. It deals with breaking the test population into parallel groups.
  • Mostly used by companies to test new UX features, product versions, and algorithms, helping them decide whether or not to go ahead with a new product feature or action.
  • It requires a primary metric around which experiments are run and their performance is measured.
  • It includes a proper Experiment Design (Power Analysis) and there’s an important calculation of Sample size, Test Duration, etc.
  • Some statistical tests used under A/B: t-test, Z-test, Chi-squared test, etc.

Pre/Post Analysis

  • Deals with taking the entire test population and applying the experiment across the time horizon.

Ex: Month 1: no action or action version 1 (e.g. a 10% discount) vs. Month 2: action version 2 (a 15% discount) are compared to decide whether to continue with the action for longer durations.

  • Typically used in scenarios where parallel grouping is undesired, multiple treatments / variants are unavailable or undesired, and the decision to be made is whether to permanently shift from an older regime to a new one
  • It requires a primary metric around which experiments are run and their performance is measured.
  • It includes a proper Experiment Design (Power Analysis) and there’s an important calculation of Sample size, Test Duration, etc.
  • Some statistical tests used under Pre/Post: t-test, Z-test, Chi-squared test, etc.
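Under the discount example above, a Pre/Post comparison reduces to comparing the metric’s mean across the two periods. A minimal large-sample sketch with synthetic numbers (a t-test would be preferable for small samples; the normal approximation is used here to stay within the standard library):

```python
from statistics import NormalDist, mean, stdev

def pre_post_z_test(pre, post):
    """Two-sided large-sample z-test comparing the post-period mean
    to the pre-period mean (unpaired, normal approximation)."""
    se = (stdev(pre) ** 2 / len(pre) + stdev(post) ** 2 / len(post)) ** 0.5
    z = (mean(post) - mean(pre)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Synthetic daily conversions: Month 1 (10% discount) vs. Month 2 (15% discount)
month1 = [100 + (day % 7) for day in range(30)]   # weekly seasonality, lower base
month2 = [104 + (day % 7) for day in range(30)]   # same seasonality, higher base
z, p = pre_post_z_test(month1, month2)
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p suggests the uplift is not chance
```

The longitudinal weakness of Pre/Post shows up here too: if anything else changed between the months (seasonality, a competitor launch), the test cannot separate it from the treatment.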

Key Differences between A/B Testing & Pre/Post Analysis

  • In the case of A/B, the comparison is cross-sectional, i.e. done across parallel groups running at the same time
  • In the case of Pre/Post, the comparison is longitudinal, i.e. sequential (pre-period vs. post-period)

Statistical Significance behind Experiments

Whenever a user creates an Experiment, there is always a statistical inference associated with its results. The inferences established from the Experiment help determine the level of statistical significance associated with its result.

In the case of Test vs Control:

  • The Test Variant yielding the best value for the associated metric compared to the Control Group is labeled Most Significant, provided the change in the metric of interest is not attributable to natural variation alone but is a statistically significant difference, as shown by the p-value comparison
  • Any Test Variant is Significant if it yields a better value for the associated metric than the Control Group, and the difference is proven statistically significant as above
  • Any Test Variant is Not Significant if it yields an inferior value for the associated metric compared to the Control Group, or if it yields a better value whose change is proven statistically insignificant

In the case of A/B Testing:

  • The Test Variant yielding the best value for the associated metric compared to the other test variants is labeled Most Significant, provided it is also proven statistically significant.

In the case of Pre/Post Analysis:

  • If the treatment (the single test variant) yields a better metric value than the pre-period and the difference is proven statistically significant, it is labeled Significant
  • If the treatment yields an inferior metric value compared to the pre-period, or a marginally better value that is not proven statistically significantly different, it is labeled Not Significant

However, since any Experiment can have multiple treatments (test variants) under it, it also becomes important to prove that a statistically significant difference exists between those test variants.

The concept of statistical significance is central to planning, executing and evaluating experiments of all types: A/B, Test vs Control and Pre/Post Analysis.

Whenever any user creates an Experiment, it is imperative to formulate a hypothesis around it first. It helps in making sure that the interpretation of the results is correct.

In the Experiments world, users create treatments and compare them against a control group (in the case of Test vs Control), against a previous period (in the case of Pre/Post Analysis), or against each other (in the case of A/B Tests). It therefore becomes important to prove statistically, at the end of every Experiment, that the results are not due to chance, i.e. that there exists a genuine relationship between the target variable (the primary metric of the Experiment) and the test variants. To establish such a methodology for statistical significance, we have adopted Null / Alternative Hypothesis testing.

The null hypothesis is a baseline assumption that there is no relationship between two data sets, in our case the target variable (primary metric) and the variant data. When a statistical hypothesis test is run, the results either disprove (reject) the null hypothesis or fail to disprove it.

Depending upon certain criteria, the system decides which test to perform in order to establish statistical significance. The criteria are simply:

  • Whether the metric is Discrete or Continuous
  • Whether the data set is large or small (basically, whether it exceeds the minimum recommended size)

Once these conditions are evaluated, the system follows a Decision Tree and executes the relevant statistical test in Python.
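A toy version of such a Decision Tree might look like the following. The thresholds and test names here are illustrative assumptions that mirror the two criteria above, not the product’s actual logic:

```python
def choose_test(metric_is_discrete, sample_size, min_recommended=30):
    """Pick a significance test from the metric type and sample size.
    Illustrative mapping only; real systems weigh more criteria."""
    if metric_is_discrete:
        # Counts / proportions: an exact test is safer for small samples
        return "Chi-square test" if sample_size >= min_recommended else "Fisher's exact test"
    # Continuous metrics: the normal approximation needs a large sample
    return "Z-test" if sample_size >= min_recommended else "Student's t-test"

print(choose_test(metric_is_discrete=True, sample_size=12))    # Fisher's exact test
print(choose_test(metric_is_discrete=False, sample_size=500))  # Z-test
```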

How will an Experiment be evaluated?

Every experiment is concluded, and its results are reported, on purely statistical grounds. The results vary by Experiment type.

Criteria for Test vs Control

  • The variant with the best results as compared to the Control variant will be shown as the Most Significant variant.
  • The variant with performance better than the Control variant will be shown as a Significant variant.
  • The variant with a performance inferior to that of the Control variant will be shown as a Not Significant variant.

Once the statistical test has been performed, based on the above Decision Tree (which decides, from the suite of statistical significance tests such as the Chi-square Test, Fisher’s Exact Test, Z-Test, etc., which test fits that Experiment best), the system provides an Experiment Monitoring view.

Monitoring your Experiments

Experiment Monitoring is important because it provides insights about the Experiments at a macroscopic level. Here, the user gets to know how his/her Experiments are doing at an overall platform level, while also tracking some guardrail metrics closely:

  1. Total Experiments — The number of experiments conducted in the duration the user has selected. The system also provides a breakup of the count for Test vs Control, Pre/Post Analysis & A/B Testing Experiments individually.
  2. Experiments to Actions Ratio — The ratio of total concluded experiments in the selected duration to the count of those that got converted / recommended into potential actions. The system also shows this ratio for Test vs Control, Pre/Post Analysis & A/B Testing individually.
  3. Successful Experiments (Win Ratio) — The number of experiments where the desired goal was achieved by any of the underlying test / treatment variants. The system also shows the total successful experiments separately for Test vs Control, A/B Testing & Pre/Post Analysis at the bottom of the card.
  4. Avg. Experiment Duration — The average duration for which experiments were conducted in the selected duration, also shown individually for Test vs Control, A/B Testing & Pre/Post Analysis.
  5. Testing Velocity — The average number of test variants per experiment over a certain time period, also shown individually per Experiment type.
  6. Avg. Sample Size — The average sample size chosen by the Experimenters across all experiments in the selected duration, also shown individually per Experiment type.

Why do we recommend including Experiments in your Decision Automation journey?

Most product- and service-based companies have now realized that data is their product. Whether, as a service company, you are selling a service, or, as a product company, a product or content, it has become imperative to create genuine and long-lasting value for your customers. One universally accepted way of doing this is by creating and promoting content and products based on actionable insights.

Most data-driven companies are no longer just passively collecting and storing their data; instead, they are actively generating actionable insights by running Experiments.

The secret to recommending the best actions to your customers and clients is consistent testing in the form of Test vs Control, A/B Testing & Pre/Post Analysis, or any other testing methodology that you discover along the way.

Through time- and space-controlled Experiments, Executives can solve their core business pain points and provide data-driven solutions to the end consumers. By building a Test & Learn culture in their organization, Experimenters can continuously test and observe the metrics of interest that impact their business at macro & micro levels.

The Executives can make use of the latest and greatest testing methodologies like Test vs Control, A/B Testing, Pre/Post Analysis, etc., and also adapt to any new testing methodology as the industry grows and experiments evolve.

There is flexibility in the system for Experimenters to choose an ideal sample size for their Experiments and avoid the potential risks of running and observing Experiments on the full customer base (cohorts, in our case).

By virtue of a robust and efficient suite of statistical algorithms like Chi-Square, Fisher’s Exact, Z-Test, etc., Experimenters get accurate and precise results and the freedom to move ahead with only the statistically significant ones.
The Decision Tree established for the Experiments lets Executives have only the most relevant algorithms produce results around the statistical significance of their Experiments, based on criteria such as sample size and the type of metric under the Experiment.

With all these benefits in the loop, Experimentation is, and should be, the ideal solution for any Executive seeking the desired business and strategic results.
