Geo-based AB Testing and Difference-in-Difference Analysis in Instacart Catalog

Introducing the engineering design and data science model behind the Catalog Team’s geo-based AB testing system.

Xiaoding Krause
tech-at-instacart

--

Authors: Xiaoding Krause, Kriston Costa

It is a challenge to AB test a data warehouse ETL improvement when the north star metrics are at the end user level. If we want to experiment with a new data transformation process that produces a new output, any team that builds end user features needs to add indexes to enable the new experience for a subset of users. There would be significant engineering and infrastructure costs associated with maintaining these indexes, making it impossible to scale.

At Instacart, the Catalog Team has developed a geo-based AB testing system that enables different user experiences through minimal changes within the Catalog Team’s ETLs: the outputs of the existing and the new data transformation processes are automatically passed down to end users located in different delivery zones. With this design, we are able to experiment by randomizing delivery zones and measuring user metrics, comparing zones with the treatment experience against control zones through a Difference-in-Difference model. This method saves engineering and computation costs, scales to running dozens of experiments per quarter, and can be automated end to end through daily ETLs.

Why We Designed Zone Experiments

The Catalog Team at Instacart is the data warehouse of the company, where we ingest, extract, and transform billions of pieces of product information through hundreds of ETLs on a daily basis. The team builds a foundational layer of Instacart’s shopping experience, and provides data for Shop, Enterprise, Fulfillment, and Machine Learning teams.

Figure 1: Catalog data is used by multiple teams across Instacart to enable different user features

We want to measure the causal impact of major projects to make strategic decisions; however, implementing Catalog AB testing is not straightforward. The catalog attributes published by the Catalog Team are indexed and retrieved by multiple teams in the company. If we publish a new attribute, such as an ML-processed product name, and want to experiment on user outcomes, work is required from every engineering team in the company to change their indexes to the new attribute in order to enable the two different versions on the user end.

Figure 2: In order to AB test a new product name produced by the Catalog Team, every team that uses this attribute needs to develop a new index (red) in order to enable changes on the user end.

Figure 3: To avoid downstream changes from every team, an upstream change requires duplicating the entire catalog for every experiment we run.

Alternatively, we can develop two versions of the complete catalog attributes, and randomly assign the control and treatment version for the downstream teams to retrieve. In this case, none of the downstream teams need to change their indexes. However, for each new experiment, a duplicate of the entire catalog’s billions of attributes is needed, which requires an exorbitant amount of additional data processing and storage, and is not scalable.

Facing the above infrastructure limitation, the team developed a geo-based approach to randomize variants based on delivery zones. A zone can be considered Instacart’s geographic delivery boundary: users located in the same zone receive deliveries from the same retailer’s physical stores, whereas users in a different zone get deliveries from different stores in that retailer’s network. For each of the physical stores, the differences in inventory and availability of the same products are indexed by inventory area IDs in the Catalog tables. This means that if we have two versions of product information, one from the incumbent process and the other from a new process, we can build temporary additional indexes to have a subset of product information pointing to a different catalog version, before the product information is localized to individual physical stores through inventory area IDs. Thus, stores located in different zones are able to retrieve different versions of the product information, which are shown to users without any additional work downstream. This approach only requires duplicating the specific output in the experiment instead of the entire catalog, and only for the part of the data indexed by the selected inventory area IDs.

Figure 4: Geo-based AB testing only changes the specific piece of information upstream of the catalog ETL process, before product information is localized to the individual physical stores. No work is needed downstream for users to see different versions of catalog product names in the control and treatment zones.

The team also designed the Catalog Holdout with the same methodology. The purpose of building the Holdout is to measure the long-term aggregated impact of the Catalog team and to understand the incrementality of each individual experiment. The Holdout is equivalent to testing multiple new catalog processes at the same time, and for an extended period. In the zones selected for the Holdout, users only see catalog features available at the start of the Holdout period, while in the non-Holdout zones, users can see features built on data from the Catalog’s new data processes.

Figure 5: Catalog Holdout can be implemented using geo-based AB testing by publishing attributes generated by processes only available at the start of the Holdout period to Holdout zones.

Difference-in-Difference Analysis

The design of the Catalog AB testing system builds a shortcut into our data ETLs, but it also comes with complexity in the statistical analysis. First, instead of millions of users to randomize, we only have ~1,000 delivery zones, and they are highly heterogeneous in terms of population, store and product availability, and user preferences. Second, we have to assume that users in control and treatment zones only see their respective version of the catalog; in fact, users can shop in a different zone, for example, when sending gifts to family living in a different state. Therefore, we expect cross contamination and noise in our experiments. Third, when we randomize based on users’ zones, we expect users within the same zone to share certain similarities, and thus a decrease in the power of the experiment. The intraclass correlation of a cluster sampling design has to be corrected for when calculating standard errors in our test. Lastly, we want a method that can be automated through ETLs, so that there is little overhead for running each new experiment.

Difference-in-Difference so far seems to be the best method in terms of functionality and scalability. A Difference-in-Difference model is based on the assumption that, by comparing the longitudinal outcomes of control and treatment, the bias from unobserved variables is canceled out, and the difference in outcome between the two groups would be constant over time if there were no treatment. If this parallel trends assumption holds, then when we add an intervention to the treatment group, any statistically significant deviation of the treatment group from the parallel trend after the intervention is the treatment effect.

Figure 6: An ideal scenario of the Difference-in-Difference model, where control and treatment outcome follow a linear parallel trend over time.

When it comes down to user metrics such as orders and purchases, however, the outcomes are not linear but are subject to seasonality and macroeconomic factors. Therefore, a panel data design using a two-way fixed effects model was used for the analysis, which is equivalent to using dummy variables for each zone and time period to control for zone and time variances. For each outcome metric y_igt of user i in zone g at time t, we can model as follows:
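One standard way to write this specification is shown below; Treat_g and Post_t denote the treatment and post-period dummies (expanded to one dummy per zone and per time period in the two-way fixed effects version), and γ denotes the coefficients on the covariates X_i.

y_{igt} = \beta_0 + \beta_1\,\mathrm{Treat}_g + \beta_2\,\mathrm{Post}_t + \beta_3\,(\mathrm{Treat}_g \times \mathrm{Post}_t) + \gamma\,X_i + \varepsilon_{igt}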

Where the coefficient of the interaction term of the treatment and post-treatment dummies, β3, is the treatment effect of our experiment, X_i is a set of covariates for controlling additional variables and macroeconomic factors that could change the parallel trend, and ε is the regression error.

Geo Randomization

We want to build the capacity to run multiple experiments at the same time, so we aim to run each experiment with the minimum number of zones while still getting enough statistical power. A nearest neighbor matching and greedy algorithm was deployed to sort similar zones into our experiment variants, completed with a test of equivalence between the selected treatment and control to validate the parallel trend.

Geo Randomization: Features were built on a zone/day basis for all the visiting users located in a zone on a given day, including both the outcome user metrics, such as orders and purchased items, and variables for controlling differences in performance across zones, such as tenure of the users, previous purchases of the users, and spend. To match the zones, those features were aggregated at the zone/day level. For example, if we have 20 features and we want to match the performance of zones over a period of 14 days, we will have a total of 280 features for each zone. A Principal Component Analysis was used to extract the top 15 components that explain the variation across zones, and the Euclidean distances between every two zones were calculated.
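A minimal sketch of this feature preparation step, assuming hypothetical table and column names for the daily feature pipeline output, might look like the following:

# Sketch: build a per-zone feature matrix, reduce with PCA, and compute pairwise distances.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist, squareform

# One row per zone per day with ~20 aggregated features
# (orders, purchased_items, avg_user_tenure, avg_previous_purchases, spend, ...).
zone_day = pd.read_csv("zone_day_features.csv")  # hypothetical extract

# Pivot 14 days x 20 features into a single row of 280 features per zone.
wide = zone_day.pivot(index="zone_id", columns="day").fillna(0)

# Standardize, then keep the top 15 principal components.
X = StandardScaler().fit_transform(wide.values)
components = PCA(n_components=15).fit_transform(X)

# Euclidean distance between every two zones in the reduced space.
dist = squareform(pdist(components, metric="euclidean"))
zone_ids = wide.index.to_numpy()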

To match zones to experiment variants, we used a greedy algorithm so that we can scale to scenarios with multiple variants in an experiment. For a given N variants (N=2,3,…), we start from the two zones with the least distance, find the one additional zone that gives the least sum of distances among the three zones (distance between 1 and 2, plus distance between 2 and 3, plus distance between 1 and 3), then find the next zone that gives the least sum of distances among the four zones, and so on until we get N zones. Each one of the N zones is randomly assigned to one variant of our experiment; we then select the next two zones with the least distance that are not yet assigned to variants, and repeat. After this sorting, we have about 1000/N groups of N zones. If we want to have 30 zones in each of our experiment variants, we randomly sample 30 of these groups of N zones.
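To make the grouping step concrete, here is a minimal sketch of the greedy algorithm described above (illustrative; it consumes a pairwise distance matrix like the one from the previous sketch):

import numpy as np

def greedy_groups(dist: np.ndarray, n_variants: int) -> list[list[int]]:
    """Greedily build blocks of n_variants zones, each block minimizing
    the running sum of pairwise distances among its members."""
    remaining = set(range(dist.shape[0]))
    groups = []
    while len(remaining) >= n_variants:
        rem = sorted(remaining)
        sub = dist[np.ix_(rem, rem)].astype(float)
        np.fill_diagonal(sub, np.inf)
        # Seed the block with the closest remaining pair of zones.
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        block = [rem[i], rem[j]]
        # Grow the block one zone at a time with the zone that adds the
        # least total distance to the zones already in the block.
        while len(block) < n_variants:
            candidates = [z for z in remaining if z not in block]
            best = min(candidates, key=lambda z: sum(dist[z, b] for b in block))
            block.append(best)
        groups.append(block)
        remaining -= set(block)
    return groups

# Example with a toy symmetric distance matrix standing in for `dist` above:
rng = np.random.default_rng(42)
toy = rng.random((12, 12))
dist = (toy + toy.T) / 2
np.fill_diagonal(dist, 0.0)

blocks = greedy_groups(dist, n_variants=3)               # ~len(zones)/3 blocks of 3 zones
chosen = rng.choice(len(blocks), size=2, replace=False)  # in practice, e.g. 30 blocks
assignments = [rng.permutation(blocks[b]).tolist() for b in chosen]  # random variant order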

Analysis and Test of Equivalence

We use Difference-in-Difference both for evaluating the performance of the geo randomization and for analyzing experiment results, and we use the statistical package linearmodels in Python, which easily calculates the clustered standard errors in our two-way fixed effects models.
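As a minimal sketch (with hypothetical column names for the outcome and control variables), the model fit looks roughly like this:

# Two-way fixed effects DiD with zone- and day-clustered standard errors (sketch).
import pandas as pd
from linearmodels.panel import PanelOLS

# Hypothetical extract: one row per zone per day with outcome and control metrics.
df = pd.read_csv("experiment_zone_day_metrics.csv", parse_dates=["day"])
panel = df.set_index(["zone_id", "day"]).sort_index()
panel["treat_x_post"] = panel["is_treatment"] * panel["is_post"]

mod = PanelOLS(
    panel["orders"],                                                   # outcome metric
    panel[["treat_x_post", "avg_user_tenure", "avg_previous_purchases"]],
    entity_effects=True,                                               # zone fixed effects
    time_effects=True,                                                 # day fixed effects
)
res = mod.fit(cov_type="clustered", cluster_entity=True, cluster_time=True)
print(res.params["treat_x_post"], res.std_errors["treat_x_post"], res.pvalues["treat_x_post"])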

Clustering of Standard Errors: Below is the covariance matrix of a simple OLS estimator, where X is the feature matrix, and ε is the residual of regression.
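In its simplest (homoskedastic) form, this can be written as:

\hat{V}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}, \qquad \hat{\sigma}^2 = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n - k}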

For a one-way cluster sampling design, the standard error will be inflated by the intraclass correlations inside the clusters. Below is the example of clustering by zone, where X_g is the feature matrix of observations within zone g, ε_g is the residual when regressing within zone g, G is the total number of zones, n is the number of observations, and k is the number of features in X.
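A standard form of this zone-clustered "sandwich" estimator, with the usual finite-sample correction, is:

\hat{V}_g(\hat{\beta}) = \frac{G}{G-1}\cdot\frac{n-1}{n-k}\,(X'X)^{-1}\left(\sum_{g=1}^{G} X_g'\,\hat{\varepsilon}_g\,\hat{\varepsilon}_g'\,X_g\right)(X'X)^{-1}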

With this cluster covariance matrix, we can calculate the two-way clustered covariance matrix by combining the clustered covariance matrices on zone, V_g(β), on time, V_t(β) (each group is one day), and on both zone and time, V_gt(β) (each group is one zone/day).
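The standard combination is:

\hat{V}_{\text{two-way}}(\hat{\beta}) = \hat{V}_g(\hat{\beta}) + \hat{V}_t(\hat{\beta}) - \hat{V}_{gt}(\hat{\beta})

where the zone/day term is subtracted so that the overlap between the two clustering dimensions is not double counted.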

Once we have the clustered covariance matrix for our two-way fixed effect model, we can calculate the power of our experiment with the adjusted standard error from our covariance matrix:
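For a two-sided test at significance level α, this is approximately:

\text{Power} \approx \Phi\!\left(\frac{|\delta|}{\sigma/\sqrt{n}} - z_{1-\alpha/2}\right)

where Φ is the standard normal CDF and z_{1-α/2} is the corresponding critical value; note that σ/√n is simply the clustered standard error of β3.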

Where δ is the difference in means, and σ is the standard error (SE) of the regression coefficient β3 based on the clustered covariance, times the square root of the number of data points n:
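\sigma = \mathrm{SE}(\hat{\beta}_3)\cdot\sqrt{n}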

Test of equivalence: In addition to performing power analysis, before control and treatment zones are finalized, we want to validate our parallel trend assumption, i.e., that besides the effects of zone and time there is no difference between our treatment and control zones, through an AA test. In this test, we build a two-way fixed effects model with our control and treatment zones, and assign a pre and post period as if we had performed AB testing with those zones. A test of equivalence on major metrics is performed based on the model output.

The test of equivalence states that, for a reasonable detection limit I:
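In other words, the confidence interval of β3 must lie entirely within the detection limits:

\mathrm{CI}(\hat{\beta}_3) \subset (-I,\; I)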

Where β3 is the regression coefficient of our interaction term, I is a threshold we select based on the projected effect of our treatment (we want our effect to be greater than the detection limit I), and CI is the confidence interval of coefficient β3.

The purpose of using the test of equivalence is to guard against the Type I errors of a t test: rejecting the null hypothesis when the two groups are in fact equal, which can happen when we compare our matched control and treatment zones. So even if the two-way fixed effects model generates a p value smaller than 0.05 for β3 in the AA test, as long as we can reject the null hypothesis of the test of equivalence (i.e., the hypothesis that the difference exceeds the detection limit I), the selected control and treatment zones are considered to have parallel trends over time.
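As a minimal sketch (assuming a normal-approximation confidence interval; the coefficient, standard error, detection limit, and z value below are placeholders), the equivalence check can be expressed as:

def passes_equivalence(beta3: float, se: float, detection_limit: float,
                       z: float = 1.96) -> bool:
    """True if the confidence interval of beta3 lies entirely within
    (-detection_limit, +detection_limit)."""
    lower, upper = beta3 - z * se, beta3 + z * se
    return -detection_limit < lower and upper < detection_limit

# A small, precisely estimated AA difference (t ≈ 2.5, p < 0.05) can still
# pass equivalence if its confidence interval sits inside the detection limits.
print(passes_equivalence(beta3=0.002, se=0.0008, detection_limit=0.01))  # True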

When we analyze the experiment results, we use the standard t test. If the two-way fixed effects model generates a p value smaller than 0.05 for β3 in the post-experiment analysis, the coefficient β3 is a statistically significant treatment effect.

Automation with Daily ETLs

The end-to-end implementation and analysis of a geo-based AB test, if not automated, requires a few days of manual work from a skilled data scientist. This means we cannot hand over experiments to PMs, operations, or engineers the way we can with user-randomized experiments, for which Instacart has a well-built platform for easily setting up randomization and computing experiment metrics. To reduce this overhead, we built automation based on daily ETLs of metrics and calculations, in combination with a web UI for easy configuration.

Daily pipelines: We built user features on a daily basis for all visiting users and all zones. So whenever there is a need to configure a new experiment, we already have the raw features ready to be queried for matching and randomizing zones into different variants; and when we are ready to launch an experiment, we already have the metrics pre-calculated and ready to be queried for analysis.

Web UI: We built a user friendly web UI where PMs, operations, or other teams without deep knowledge of geo randomization and Difference-in-Difference analysis can configure an experiment by specifying how many variants, zones, and days they want the experiment to have. When they click ‘Submit’, the web UI calls functions to query features from the daily pipeline, match and randomize zones, calculate power, and perform a test of equivalence on the default metrics. At the last step of the experiment setup, the script saves the finalized experiment assignments into a Snowflake table.

Automated metrics tracking: We built a weekly ETL that picks up any new experiment saved in the Snowflake table above, uses the zone variant information in the table to query features and metrics from our daily metrics pipelines, and calculates the experiment results with a two-way fixed effects model. The calculated results are displayed in a Periscope dashboard.

We have currently built a prototype of this automation, with all the ETLs completed and running in Airflow.

Conclusions

With geo-based AB testing, the Catalog team has been able to start experimenting with several key initiatives as well as implementing long-term holdouts. We have successfully built this 0 to 1 capability within the limits of our infrastructure, minimized the engineering and data science overhead of running each experiment, and paved the way for further automation and integration into Instacart’s experimentation platform.
