From ATE to Uplift Modeling

The core question

In a randomised controlled trial (RCT) the standard output is the Average Treatment Effect (ATE): one number telling you how much the treatment moved the outcome on average. That is useful for a yes/no policy decision β€” does the drug work? should we roll out this programme?

But many real decisions are not yes/no. They are who: who should receive the email, the discount, the intervention? The ATE cannot answer that. Uplift modelling can.

Note: ATE vs Uplift — one sentence each

ATE: β€œOn average, the treatment increased purchases by 14 percentage points.”

Uplift: β€œFor this specific customer, the treatment is predicted to increase their purchase probability by 40pp β€” email them. For this other customer, it is predicted to decrease it by 10pp β€” do not email them.”


The synthetic dataset

We simulate a retail email campaign. 100 customers are randomly assigned: 50 receive a discount email (treatment), 50 receive nothing (control). The outcome is whether they purchase within 7 days.

The single feature is recency β€” a score from 1 (lapsed customer, hasn’t bought in months) to 5 (very active, bought last week).

library(tidyverse)    # tribble(), tibble(), pmap_dfr(), ggplot2, ...
library(knitr)        # kable()
library(kableExtra)   # kable_styling(), row_spec()

set.seed(42)

# Ground truth purchase probabilities by recency and treatment arm
probs <- tribble(
  ~recency, ~p_treated, ~p_control,
  1,        0.20,       0.00,
  2,        0.50,       0.10,
  3,        0.60,       0.40,
  4,        0.80,       0.80,
  5,        0.90,       1.00
)

# Simulate 10 people per recency group per arm (100 total)
sim_group <- function(rec, p_t, p_c, n = 10) {
  bind_rows(
    tibble(recency = rec, treatment = 1,
           purchased = rbinom(n, 1, p_t)),
    tibble(recency = rec, treatment = 0,
           purchased = rbinom(n, 1, p_c))
  )
}

df <- pmap_dfr(probs, ~ sim_group(..1, ..2, ..3))

glimpse(df)
Rows: 100
Columns: 3
$ recency   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ treatment <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ purchased <int> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

The four customer segments

Uplift modelling recognises that customers fall into four fundamentally different groups with respect to any treatment:

The four uplift segments

| Segment      | Without treatment | With treatment | What to do                    |
|--------------|-------------------|----------------|-------------------------------|
| Persuadable  | Does not buy      | Buys           | Always target                 |
| Sure thing   | Buys              | Buys           | Skip — waste of budget        |
| Lost cause   | Does not buy      | Does not buy   | Skip — nothing works          |
| Sleeping dog | Buys              | Does not buy   | Never target — you hurt them  |

In our data, lapsed and mid-recency customers (recency 1–3) are persuadables, active customers (recency 4) are sure things, and very active customers (recency 5) are sleeping dogs — the email interrupts a purchase they would have made anyway and slightly annoys them.
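The segment labels follow directly from the ground-truth table: a group's true uplift is simply p_treated minus p_control. A minimal sketch, re-stating the probs values from above:

```r
library(tibble)
library(dplyr)

# Ground-truth probabilities (same values as the probs table above)
truth <- tribble(
  ~recency, ~p_treated, ~p_control,
  1,        0.20,       0.00,
  2,        0.50,       0.10,
  3,        0.60,       0.40,
  4,        0.80,       0.80,
  5,        0.90,       1.00
) |>
  mutate(
    true_uplift = p_treated - p_control,
    segment = case_when(
      true_uplift > 0  ~ "persuadable",
      true_uplift < 0  ~ "sleeping dog",
      p_control >= 0.5 ~ "mostly sure things",
      TRUE             ~ "mostly lost causes"
    )
  )

truth
```

Because outcomes here are binary draws rather than deterministic types, the labels are "mostly" statements about each group rather than guarantees for each individual.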


Step 1 β€” The classic ATE

The ATE is a simple difference in means between the two arms.

ate <- mean(df$purchased[df$treatment == 1]) -
       mean(df$purchased[df$treatment == 0])

cat("ATE =", round(ate, 3))
ATE = 0.1

We can also recover it from a linear probability model or logistic regression:

lm_ate <- lm(purchased ~ treatment, data = df)
summary(lm_ate)$coefficients
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept)     0.54 0.06989788 7.725556 9.722408e-12
treatment       0.10 0.09885054 1.011628 3.142070e-01
Warning: What the ATE hides

The ATE is the average of every individual’s treatment effect. It mixes together +40pp persuadables and −10pp sleeping dogs and reports one number: the ground-truth group effects here are +20, +40, +20, 0 and −10pp, which average to +14pp. This is perfectly valid for a population-level policy decision, but useless for individual targeting.


Step 2 β€” The T-learner (baseline uplift model)

The simplest uplift approach trains two separate models β€” one on the treated arm, one on the control arm β€” and subtracts their predictions.

# Model on treated customers only
m_treat <- glm(purchased ~ recency,
               data   = df |> filter(treatment == 1),
               family = binomial)

# Model on control customers only
m_ctrl  <- glm(purchased ~ recency,
               data   = df |> filter(treatment == 0),
               family = binomial)

# Predict for all customers under both scenarios
df <- df |>
  mutate(
    mu1_tlearner = predict(m_treat, newdata = df, type = "response"),
    mu0_tlearner = predict(m_ctrl,  newdata = df, type = "response"),
    uplift_tlearner = mu1_tlearner - mu0_tlearner
  )

# Average uplift score by recency group
df |>
  group_by(recency) |>
  summarise(
    mean_uplift = round(mean(uplift_tlearner), 3),
    .groups = "drop"
  )
# A tibble: 5 × 2
  recency mean_uplift
    <dbl>       <dbl>
1       1       0.131
2       2       0.188
3       3       0.135
4       4       0.042
5       5       0.004
df |>
  group_by(recency) |>
  summarise(
    uplift    = mean(uplift_tlearner),
    .groups   = "drop"
  ) |>
  ggplot(aes(x = factor(recency), y = uplift,
             fill = uplift > 0)) +
  geom_col(width = 0.6) +
  geom_hline(yintercept = 0, linewidth = 0.4) +
  scale_fill_manual(values = c("TRUE" = "#1D9E75", "FALSE" = "#E24B4A"),
                    guide = "none") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title    = "T-learner uplift scores by recency group",
    subtitle = "Green = email helps, Red = email hurts",
    x        = "Recency score (1 = lapsed, 5 = very active)",
    y        = "Predicted uplift (pp)"
  ) +
  theme_minimal(base_size = 13)

Note: Limitation of the T-learner

Each model is optimised to predict its own arm’s outcome, not the difference. The subtraction compounds the noise from both models. In small samples this produces unstable uplift scores even when each individual model fits well.
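To see this instability concretely, here is a small sketch (assuming the same ground-truth probabilities as the simulation above) that re-draws the 100-customer sample under different seeds and records the T-learner's uplift score at recency 3, where the true uplift is +20pp:

```r
library(dplyr)
library(purrr)
library(tibble)

p_t <- c(0.20, 0.50, 0.60, 0.80, 0.90)  # treated purchase probability by recency
p_c <- c(0.00, 0.10, 0.40, 0.80, 1.00)  # control purchase probability by recency

tlearner_at_3 <- function(seed) {
  set.seed(seed)
  sim <- map_dfr(1:5, function(r) tibble(
    recency   = r,
    treatment = rep(c(1, 0), each = 10),
    purchased = c(rbinom(10, 1, p_t[r]), rbinom(10, 1, p_c[r]))
  ))
  m1 <- glm(purchased ~ recency, data = subset(sim, treatment == 1), family = binomial)
  m0 <- glm(purchased ~ recency, data = subset(sim, treatment == 0), family = binomial)
  nd <- data.frame(recency = 3)
  as.numeric(predict(m1, nd, type = "response") - predict(m0, nd, type = "response"))
}

scores <- map_dbl(1:20, tlearner_at_3)
round(range(scores), 2)   # the spread around the true +0.20 is substantial
```

Each arm's GLM can look perfectly reasonable in every run; it is the subtracted score that bounces around.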


Step 3 β€” The Doubly Robust (DR) estimator

The DR estimator uses both an outcome model and a propensity score model, combined in a single correction formula. It is consistent if either model is correctly specified β€” two safety nets instead of one.

The two models

# --- Propensity model: P(T=1 | X) ---
# In a perfect RCT this is always 0.5, but we estimate it anyway
# to show the principle (and because you need this for observational data)
ps_model <- glm(treatment ~ recency,
                data   = df,
                family = binomial)

p_hat <- predict(ps_model, type = "response")

# --- Outcome model: E(Y | X, T) ---
# Note: treatment IS in this model
out_model <- glm(purchased ~ recency + treatment,
                 data   = df,
                 family = binomial)

# Counterfactual predictions
df1 <- df |> mutate(treatment = 1)   # everyone treated
df0 <- df |> mutate(treatment = 0)   # everyone control

mu1 <- predict(out_model, newdata = df1, type = "response")
mu0 <- predict(out_model, newdata = df0, type = "response")

The DR formula

\[ \hat{\tau}_{DR} = \frac{1}{n} \sum_{i=1}^{n} \left[ \underbrace{(\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i))}_{\text{outcome model}} + \underbrace{\frac{T_i(Y_i - \hat{\mu}_1(X_i))}{\hat{p}(X_i)}}_{\text{treated correction}} - \underbrace{\frac{(1-T_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{p}(X_i)}}_{\text{control correction}} \right] \]

T <- df$treatment
Y <- df$purchased

dr_scores <- (mu1 - mu0) +
             (T * (Y - mu1)) / p_hat -
             ((1 - T) * (Y - mu0)) / (1 - p_hat)

ATE_dr <- mean(dr_scores)
cat("DR estimate of ATE =", round(ATE_dr, 3))
DR estimate of ATE = 0.1
Tip: Why the correction terms work
  • If the outcome model is perfect: residuals (Y - mu1) and (Y - mu0) are zero everywhere, so the propensity model drops out completely.
  • If the propensity model is perfect: the reweighting is exact and corrects for any outcome model error in expectation.
  • You only fail if both models are wrong simultaneously.
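A quick self-contained check of the second bullet: synthetic data with a known ATE of +30pp and a known propensity of 0.5, but an intercept-only outcome model that ignores both the covariate and the treatment. The propensity correction alone still recovers the truth. (The data-generating process below is illustrative and separate from the campaign simulation above.)

```r
set.seed(1)
n <- 5000
treatment <- rbinom(n, 1, 0.5)                                # true propensity: 0.5
x         <- runif(n)
purchased <- rbinom(n, 1, 0.2 + 0.3 * treatment + 0.3 * x)    # true ATE = 0.30

# Deliberately broken outcome model: one constant prediction for everyone
mu_bad <- rep(mean(purchased), n)

dr_bad <- (mu_bad - mu_bad) +
          treatment       * (purchased - mu_bad) / 0.5 -
          (1 - treatment) * (purchased - mu_bad) / 0.5

round(mean(dr_bad), 2)   # close to 0.30 despite the useless outcome model
```

With the true propensity of 0.5 plugged in, the correction terms reduce to a reweighted difference in means, so the constant outcome model cancels out entirely.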

What IPW weights are β€” and why they differ

A natural question is whether you could just pass p_hat as weights into the outcome model. No β€” and the direction matters:

# WRONG: weights = p_hat
# This upweights people likely to be treated, amplifying selection bias

# CORRECT: Inverse Probability Weighting (IPW)
# Upweights people who were *unlikely* to end up in the arm they are in
ipw_weights <- ifelse(T == 1,
                      1 / p_hat,
                      1 / (1 - p_hat))

# IPW estimator (singly robust β€” only safe if propensity model is correct)
ipw_model <- lm(purchased ~ treatment,
                data    = df,
                weights = ipw_weights)

ATE_ipw <- coef(ipw_model)["treatment"]
cat("IPW estimate of ATE =", round(ATE_ipw, 3))
IPW estimate of ATE = 0.1
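Two algebraic forms of the IPW estimator are worth distinguishing: the Horvitz-Thompson form divides inverse-weighted sums by the full sample size, while the Hajek form (which the weighted lm above effectively computes) normalises the weights within each arm. A sketch on fresh synthetic data (the data-generating process is illustrative):

```r
set.seed(7)
n <- 2000
T <- rbinom(n, 1, 0.5)
Y <- rbinom(n, 1, 0.3 + 0.2 * T)    # true ATE = 0.20
p <- rep(0.5, n)                    # known RCT propensity

# Horvitz-Thompson: inverse-weighted means over the whole sample
ate_ht    <- mean(T * Y / p) - mean((1 - T) * Y / (1 - p))

# Hajek: weights normalised within each arm (matches the weighted lm)
w         <- ifelse(T == 1, 1 / p, 1 / (1 - p))
ate_hajek <- weighted.mean(Y[T == 1], w[T == 1]) -
             weighted.mean(Y[T == 0], w[T == 0])

round(c(HT = ate_ht, Hajek = ate_hajek), 3)   # both near 0.20; they differ
                                              # slightly when arms are unbalanced
```

In a balanced RCT the two forms nearly coincide; in observational data with estimated propensities the Hajek form is usually preferred because it is less sensitive to extreme weights.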

Method comparison

results <- tribble(
  ~Method,                ~`ATE estimate`, ~`Models needed`,     ~`Robust if outcome model wrong`, ~`Robust if propensity model wrong`,
  "Difference in means",  round(ate, 3),   "None",               "β€”",                              "β€”",
  "T-learner (2x GLM)",   round(mean(df$uplift_tlearner), 3), "2 outcome models", "No",  "β€”",
  "IPW",                  round(ATE_ipw, 3),"1 propensity model","Yes",                            "No",
  "Doubly Robust",        round(ATE_dr, 3), "1 outcome + 1 propensity", "Yes",                    "Yes"
)

results |>
  kable(caption = "Comparison of treatment effect estimators") |>
  kable_styling(bootstrap_options = c("striped", "hover")) |>
  row_spec(4, bold = TRUE, background = "#EAF3DE")
Comparison of treatment effect estimators

| Method              | ATE estimate | Models needed            | Robust if outcome model wrong | Robust if propensity model wrong |
|---------------------|--------------|--------------------------|-------------------------------|----------------------------------|
| Difference in means | 0.1          | None                     | —                             | —                                |
| T-learner (2x GLM)  | 0.1          | 2 outcome models         | No                            | —                                |
| IPW                 | 0.1          | 1 propensity model       | Yes                           | No                               |
| Doubly Robust       | 0.1          | 1 outcome + 1 propensity | Yes                           | Yes                              |

Targeting simulation β€” why uplift scores matter

If you can only email 20 customers (40% of your list), which 20 do you pick?

set.seed(42)
# Add uplift scores from T-learner (used for targeting)
df <- df |>
  arrange(desc(uplift_tlearner)) |>
  mutate(rank = row_number())

# Strategy 1: random targeting
random_targets    <- df |> sample_n(20)

# Strategy 2: highest raw response rate (naive ML mistake)
df_response_rank  <- df |> arrange(desc(mu1_tlearner))
response_targets  <- df_response_rank |> slice_head(n = 20)

# Strategy 3: highest uplift score
uplift_targets    <- df |> slice_head(n = 20)

# Incremental purchases = purchases in targeted group - what they would have
# bought without treatment (counterfactual from control model)
incremental <- function(targets) {
  sum(targets$purchased) -
    sum(predict(m_ctrl, newdata = targets, type = "response"))
}

tibble(
  Strategy                   = c("Random", "Highest response rate", "Highest uplift score"),
  `Incremental purchases`    = round(c(
    incremental(random_targets),
    incremental(response_targets),
    incremental(uplift_targets)
  ), 1)
) |>
  kable(caption = "Incremental purchases from 20 emails β€” three strategies") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  row_spec(3, bold = TRUE, background = "#EAF3DE")
Incremental purchases from 20 emails — three strategies

| Strategy              | Incremental purchases |
|-----------------------|-----------------------|
| Random                | 3.4                   |
| Highest response rate | 1.0                   |
| Highest uplift score  | 3.8                   |

Summary

|                         | Classic RCT / ATE              | Uplift modelling          |
|-------------------------|--------------------------------|---------------------------|
| Question answered       | Does treatment work on average? | For whom does it work?   |
| Output                  | One number                     | A score per individual    |
| Use case                | Policy decisions               | Targeting decisions       |
| Baseline method         | Difference in means            | T-learner (two GLMs)      |
| Production method       | Regression / IV                | Causal forest (grf)       |
| Formally robust variant | —                              | Doubly Robust estimator   |
Important: The key insight

The ATE is the average of every individual’s uplift score. Uplift modelling decomposes that average back into its parts, letting you identify persuadables (email them), sure things (skip — waste of budget), and sleeping dogs (never email — you actively hurt them).


Further reading