Case—The RealPro Customer Benefits Program (B): Implementing Covariate Balancing and Difference-in-Differences Analysis
After the RealPro analysis based on the collected data was reviewed, concerns were raised about whether RealPro customers and the control group could be compared directly. Customers who were already spending a lot of money at Real in 2018 were more likely to join the RealPro program in 2019, because they would receive large absolute discounts without changing their shopping behavior. This self-selection would distort a direct comparison between the two groups, as it would be impossible to determine whether a difference in shopping behavior is the result of RealPro membership or was already present before the program started.
The difference-in-differences (DiD) technique can help remove biases in the comparison of the RealPro group and the control group, such as self-selection bias, because it compares the change in each group between 2018 and 2019 rather than the 2019 levels alone. Moreover, covariate balancing methods, which can be used in combination with the DiD methodology, can help reduce the differences between RealPro and control group customers along observed characteristics, referred to as covariates. The aim of covariate balancing is to increase the validity of comparisons between the two groups. Two methods appear particularly promising in this context: propensity score matching and entropy balancing.
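The DiD logic can be illustrated with a minimal sketch. The customer-level revenue figures below are hypothetical and chosen only for illustration (they are not taken from the market test data), and the function name did_estimate is an assumption. The sketch contrasts a naive comparison of 2019 levels with a DiD estimate that subtracts the gap already present in 2018.

```python
from statistics import mean

# Hypothetical customer-level revenue (EUR), for illustration only; not case data.
# Each tuple: (group, revenue in the 2018 window, revenue in the 2019 window)
customers = [
    ("Pro", 900.0, 1100.0),
    ("Pro", 700.0, 850.0),
    ("Control", 500.0, 520.0),
    ("Control", 300.0, 340.0),
]


def did_estimate(rows, treated="Pro", control="Control"):
    """Return (naive 2019 gap, difference-in-differences estimate)."""

    def avg(group, idx):
        return mean(r[idx] for r in rows if r[0] == group)

    pre_t, post_t = avg(treated, 1), avg(treated, 2)
    pre_c, post_c = avg(control, 1), avg(control, 2)

    # Naive comparison: difference in 2019 levels only.
    naive = post_t - post_c
    # DiD: change in the treated group minus change in the control group.
    did = (post_t - pre_t) - (post_c - pre_c)
    return naive, did


naive_gap, did_effect = did_estimate(customers)
print(f"Naive 2019 gap: {naive_gap:.1f}, DiD estimate: {did_effect:.1f}")
```

In this toy example, most of the naive gap stems from the higher 2018 spending of the hypothetical RealPro customers, which is the pattern the self-selection concern describes. The DiD estimate is only credible if both groups would have followed parallel trends without the program.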
Mr. Uphues and Mr. Laenge wondered whether analyzing the market test data with these techniques would give results different from those of a direct comparison between the RealPro group and the control group, and what impact this would have on the overall assessment of the RealPro program.
The data analysis in this case extension is based on one of two available RealPro market test data sets. When running the DiD analysis without prior covariate balancing, the prebalanced data set should be used; it contains 75 thousand transactions from 572 RealPro customers and 572 control group customers. This data set is the same as the one used in the main case. When combining covariate balancing with the DiD analysis, the unbalanced data set should be employed; it contains 83 thousand transactions from 572 RealPro customers and 963 control group customers.
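To give a sense of what covariate balancing does with an unbalanced control group, the following is a simplified sketch of entropy balancing on a single covariate. It is not the full method and not necessarily how the case analysis was run: it balances only the mean of one covariate (2018 revenue per customer), the numbers are hypothetical, and the function name entropy_balance is an assumption. The weights take the exponential-tilting form w_i ∝ exp(λ z_i), and λ is found by Newton's method so that the weighted control mean equals the RealPro mean.

```python
import math


def entropy_balance(control_x, target_mean, tol=1e-10, max_iter=100):
    """Return weights (summing to 1) that match the control mean to target_mean.

    Among all weights that satisfy the mean constraint, these stay closest
    (in the entropy sense) to uniform weights.
    """
    if not min(control_x) < target_mean < max(control_x):
        raise ValueError("target mean must lie strictly inside the control range")

    # Center and scale the covariate for numerical stability.
    scale = max(control_x) - min(control_x)
    z = [(x - target_mean) / scale for x in control_x]

    lam = 0.0  # lam = 0 corresponds to uniform weights
    for _ in range(max_iter):
        raw = [math.exp(lam * zi) for zi in z]
        total = sum(raw)
        w = [r / total for r in raw]
        gap = sum(wi * zi for wi, zi in zip(w, z))  # remaining imbalance
        if abs(gap) < tol:
            break
        var = sum(wi * zi * zi for wi, zi in zip(w, z)) - gap * gap
        lam -= gap / var  # Newton step
    return w


# Hypothetical 2018 revenue per customer (EUR), for illustration only.
pro_2018 = [500.0, 600.0, 700.0]
control_2018 = [200.0, 300.0, 400.0, 500.0, 900.0]

target = sum(pro_2018) / len(pro_2018)
weights = entropy_balance(control_2018, target)
balanced_mean = sum(w * x for w, x in zip(weights, control_2018))
print(f"Target mean: {target:.1f}, weighted control mean: {balanced_mean:.1f}")
```

The resulting weights would then be used when averaging control group outcomes in the DiD comparison. Propensity score matching pursues the same goal differently, by pairing each RealPro customer with control customers who have a similar estimated probability of joining.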
For both the RealPro group and the control group, transaction data are reported from May 1, 2018, through November 30, 2018, and also for the same time period in 2019. This approach allows one to examine changes in the purchasing behavior of RealPro customers after joining the program. The reported test market data are purposely limited to only seven months because, during the program’s first two months (i.e., March and April 2019), many customers were still in the process of joining the program; therefore, including these months in the data set would result in transaction histories of unequal length. Note also that only those RealPro customers who became members before April 1, 2019, are included. This restriction ensures less volatile demand patterns, since the initial analysis established that many customers increased their purchase volume abnormally in their first month of membership but exhibited a more level demand pattern in subsequent months.
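The two observation windows can be expressed as a small helper that assigns each purchase date to the pre-program or post-program period. This is a sketch only; the function name observation_period and the labels "pre" and "post" are assumptions, while the window boundaries are the dates stated above.

```python
from datetime import date

# Observation windows as described in the case text.
PRE_START, PRE_END = date(2018, 5, 1), date(2018, 11, 30)
POST_START, POST_END = date(2019, 5, 1), date(2019, 11, 30)


def observation_period(purchase_date):
    """Return "pre", "post", or None if the date lies outside both windows."""
    if PRE_START <= purchase_date <= PRE_END:
        return "pre"
    if POST_START <= purchase_date <= POST_END:
        return "post"
    return None


# Dates in the data set use the yyyy-mm-dd format, which fromisoformat parses.
example = date.fromisoformat("2019-06-15")
print(observation_period(example))
```

Dates in March and April 2019 fall outside both windows and return None, which mirrors the exclusion of the program's first two months.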
Table 1 describes all features of the two data sets. Each transaction has a unique ID, and multiple features are reported for it. The Customer_Group column indicates whether the customer/household is a RealPro member or part of the control group. To match all transactions to their respective customers/households, the customer ID is reported for each transaction. The store ID is likewise reported, so that transactions can be attributed to each of the seven stores used in the market test. The Date column reports the date of purchase (in yyyy-mm-dd format), and several other date-related columns give the year, month, week, and day of the week. Revenue_Transaction gives the total amount spent by the customer/household on the respective transaction. Although the discount due to high–low promotions has already been deducted from the reported revenue figures, the RealPro discount has not been deducted. The Real team suggests leaving the RealPro discount in place when analyzing revenue or changes in revenue, and deducting it only for calculations of the program's profitability. The Num_Items variable captures the number of items purchased in the transaction. This number is the total count of items bought, not the number of unique products. In addition to the total revenue generated by each transaction, the data set reports the revenue from products discounted under the RealPro program separately from the revenue from products that were price promoted as part of Real's high–low pricing strategy. There is no overlap between these two revenue figures because, in the RealPro program, already promoted products do not receive an additional program discount.
Table 1. Data Set Description
|The unique transaction ID associated with the purchase
|Indicates whether the customer/household has signed up for the RealPro program (“Pro”) or is part of the control group (“Control”)
|The ID associated with the given customer/household
|The ID of the store where the purchase was made
|The date of the purchase (yyyy-mm-dd format)
|The year of the purchase
|The month of the purchase
|The week of the purchase
|The day of the week of the purchase
|The revenue generated from the purchase [a, b]
|The total number of items bought in the purchase
|The revenue generated from the purchase of products discounted as part of RealPro [a]
|The revenue generated from the purchase of products not discounted as part of RealPro [b]
|The revenue generated from the purchase of products that were price promoted as part of the high–low pricing strategy [b]
[a] RealPro price discount has not been deducted.
[b] High–low price discount has been deducted.
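Because the data are reported per transaction while the group comparison is made per customer/household, a typical first step is to aggregate revenue by customer and year. The sketch below shows one way to do this in plain Python. The rows are hypothetical, and the column name Customer_ID is an assumption (the customer ID column is not named in the case text); Customer_Group, Date, and Revenue_Transaction are the names used in the case.

```python
from collections import defaultdict

# Hypothetical transactions, for illustration only; not case data.
# "Customer_ID" is an assumed column name.
transactions = [
    {"Customer_ID": 1, "Customer_Group": "Pro", "Date": "2018-06-02", "Revenue_Transaction": 40.0},
    {"Customer_ID": 1, "Customer_Group": "Pro", "Date": "2018-09-14", "Revenue_Transaction": 25.5},
    {"Customer_ID": 1, "Customer_Group": "Pro", "Date": "2019-07-21", "Revenue_Transaction": 80.0},
    {"Customer_ID": 2, "Customer_Group": "Control", "Date": "2018-05-30", "Revenue_Transaction": 30.0},
    {"Customer_ID": 2, "Customer_Group": "Control", "Date": "2019-10-03", "Revenue_Transaction": 32.0},
]


def revenue_by_customer_year(rows):
    """Sum Revenue_Transaction per (customer, group, year)."""
    totals = defaultdict(float)
    for row in rows:
        year = int(row["Date"][:4])  # yyyy-mm-dd, so the first four characters are the year
        key = (row["Customer_ID"], row["Customer_Group"], year)
        totals[key] += row["Revenue_Transaction"]
    return dict(totals)


customer_revenue = revenue_by_customer_year(transactions)
for key, value in sorted(customer_revenue.items()):
    print(key, value)
```

Revenue_Transaction is used as reported here, that is, without deducting the RealPro discount, in line with the Real team's suggestion for revenue analyses.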