Propensity Scores Balancing Downstream Causal Measurement

EDUCBA
Reviewed by Ravi Rathore

Propensity Score Matching

The Need for Causal Inference in Observational Data

In data science, determining whether an intervention caused a change in an outcome goes beyond correlation; this is the essence of causal inference. Observational data, unlike randomized experiments, often suffer from confounding, where certain variables affect both treatment assignment and outcomes. For instance, in marketing, loyal customers may be more likely to receive a promotion and more likely to purchase regardless. Comparing their purchase rates naively to those of other customers would overstate the promotion’s impact. To draw valid causal conclusions, data scientists need methods to adjust for these confounders and approximate a fair comparison between groups. This is where propensity scores become a crucial tool.


What is a Propensity Score?

A propensity score is the probability that a unit (e.g., a customer) receives the treatment, given its observed characteristics. Formally, if W indicates treatment and X represents the covariates, the propensity score for an individual with features X = x is:


e(x) = P(W = 1 | X = x)

Typically, propensity scores are estimated using logistic regression or other classification methods to predict treatment assignment from covariates. The key intuition is that two individuals with the same propensity score have similar probabilities of receiving treatment, given their observed characteristics. Therefore, researchers can more credibly attribute differences in outcomes between them to the treatment itself.
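
As a minimal sketch of this estimation step (assuming a pandas DataFrame df with a 0/1 treatment column and numeric covariates; the column names are illustrative, not taken from the original analysis, and the channel variable is assumed to be one-hot encoded):

from sklearn.linear_model import LogisticRegression

# df is a pandas DataFrame; 'treatment' is 0/1, covariate columns are numeric
covariate_cols = ["recency", "history", "newbie", "channel_web", "channel_phone"]

ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(df[covariate_cols], df["treatment"])

# e(x) = P(W = 1 | X = x): predicted probability of receiving the treatment
df["propensity_score"] = ps_model.predict_proba(df[covariate_cols])[:, 1]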

Example Scenario: Email Campaign Effect on Purchases

Imagine an online retailer ran a promotional email campaign to encourage purchases. Unlike a randomized experiment, the marketing team targeted specific customers, for instance, those who had shopped recently or spent more in the past, while others did not receive the email. Our goal is to estimate the causal effect of the email on purchase behavior: did the email actually increase the purchase rate, or were other factors at play?

Observational Challenges

This estimation is challenging because the treated customers (email recipients) differ systematically from the control group (non-recipients). Recent shoppers and high spenders are more likely to be targeted, indicating that pre-existing differences, such as recency of purchase and spending history, act as confounders. Ignoring these can bias the campaign’s estimated effect.

Dataset Overview

Suppose we have data on a cohort of customers, including:

  • Treatment: Whether they received the promo email (Yes/No)
  • Outcome: Whether they purchased in the subsequent period
  • Covariates:
    • Recency: Months since last purchase
    • History: Total spending in the past year
    • Newbie: Whether the customer is new (first purchase within the past year)
    • Channel: Primary channel of past purchases (e.g., web, phone)

In our dataset:

  • 35% of customers were treated (received the email)
  • 65% were controls (did not receive the email)
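
In code, these group shares and the covariate means in the next table can be checked with simple aggregations (continuing the hypothetical df from above, with 'purchased' as the outcome column):

# Share of treated vs. control customers
print(df["treatment"].value_counts(normalize=True))

# Mean covariate values by group, as in the comparison table below
print(df.groupby("treatment")[["recency", "history", "newbie"]].mean())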

Covariate Differences Between Groups

Covariate | Treated (Email) | Control (No Email)
Recency (months since last purchase) | 4.4 | 6.7
History (past-year spending, $) | 765 | 490
Newbie (% new customers) | 9.8% | 20.7%

As expected, the treated group consists of more active and valuable customers:

  • They purchased more recently (4.4 months vs. 6.7 months)
  • They spent more in the past year ($765 vs. $490)
  • They are less likely to be new customers (9.8% vs. 20.7%)

These differences indicate substantial confounding.

Naive Outcome Comparison

If we compute the purchase rate without adjustment:

  • Treated customers: 10.6% purchased
  • Control customers: 4.0% purchased

This gives a raw difference of ~6.6 percentage points, suggesting a large effect. However, much of this difference may be due to pre-existing characteristics: treated customers were inherently more likely to buy because they were more engaged to begin with. Without adjusting for confounders, we risk overestimating the true effect of the email campaign.
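
The naive comparison is just an unadjusted difference in group means; a short sketch with the same hypothetical columns:

naive = df.groupby("treatment")["purchased"].mean()
naive_diff = naive[1] - naive[0]
print(f"Naive difference: {100 * naive_diff:.1f} percentage points")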

Propensity Score Matching to Adjust for Confounders

Propensity score matching (PSM) is a common method for adjusting for confounders in observational studies and approximating a causal comparison. The key idea is to pair each treated customer with one or more untreated customers with similar propensity scores, i.e., similar probabilities of receiving the treatment given observed covariates. This process creates a matched sample in which the distribution of covariates is more balanced across the treated and control groups.

Matching on the propensity score balances multiple covariates at once, making the treatment and control groups comparable. Essentially, propensity score matching simulates a randomized experiment after the fact, ensuring that the treated and control groups are similar with respect to observed characteristics.

How Matching Works in the Email Campaign Example

For the email campaign scenario:

  1. Propensity scores were estimated using logistic regression. The model predicted the probability of receiving the email based on customer covariates:
    • Recency (months since last purchase)
    • History (past year spending)
    • Newbie status (new customer or not)
    • Channel (primary purchase channel)
  2. Customers who recently purchased or spent more in the past had higher propensity scores, indicating a higher likelihood of being targeted by the campaign. In contrast, infrequent or low-spending customers had lower propensity scores.
  3. The analysis employed one-to-one nearest-neighbor matching, pairing each treated customer with the control customer whose propensity score was closest. Customers without suitable matches (for example, very high-spending individuals with no comparable controls) were excluded, so most treated customers were matched to a subset of the control group, as sketched in the code below.
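
The following is a simplified, greedy sketch of one-to-one nearest-neighbor matching without replacement on the estimated propensity score (the caliper value and the greedy strategy are illustrative choices, not the exact procedure behind the numbers below):

caliper = 0.05                          # maximum allowed propensity-score distance
treated = df[df["treatment"] == 1]
controls = df[df["treatment"] == 0].copy()

pairs = {}                              # treated index -> matched control index
for t_idx, t_score in treated["propensity_score"].items():
    if controls.empty:
        break
    dist = (controls["propensity_score"] - t_score).abs()
    best = dist.idxmin()
    if dist[best] <= caliper:
        pairs[t_idx] = best
        controls = controls.drop(best)  # matching without replacement

# Matched sample: matched treated units plus their matched controls
matched = df.loc[list(pairs.keys()) + list(pairs.values())]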

Figure: Distribution of propensity scores for customers who received the promotional email (treated) vs. those who did not (control).

Checking Covariate Balance

After matching, it is important to verify that the treated and control groups are more balanced with respect to observed covariates. The following table summarizes covariate values before and after matching:

Covariate | Treated (Before) | Control (Before) | Treated (Matched) | Control (Matched)
Recency (months) | 4.4 | 6.7 | 4.4 | 4.9
History ($) | 765 | 490 | 765 | 587
Newbie (%) | 9.8% | 20.7% | 9.8% | 8.7%

Interpretation

  • Recency: The average recency for controls decreased from 6.7 months to 4.9 months after matching, closer to the treated group value of 4.4 months.
  • Newbie Percentage: The gap between groups virtually disappeared, with both treated and matched controls around 9%.
  • History (Past Spend): The treated group still had slightly higher spending ($765 vs. $587), indicating residual imbalance, likely due to limited overlap: only a few control customers spent as much as the highest-spending treated customers.

Overall, propensity score matching substantially improved covariate balance, making the treated and control groups more comparable and reducing bias in the estimated effect of the email campaign.
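
Balance is commonly quantified with standardized mean differences (SMDs), where values below roughly 0.1 are usually considered acceptable. A rough sketch, reusing the hypothetical df and matched objects from above:

def smd(data, col):
    # Standardized mean difference between treated and control for one covariate
    t = data.loc[data["treatment"] == 1, col]
    c = data.loc[data["treatment"] == 0, col]
    pooled_sd = ((t.var() + c.var()) / 2) ** 0.5
    return (t.mean() - c.mean()) / pooled_sd

for col in ["recency", "history", "newbie"]:
    print(col, "before:", round(smd(df, col), 3), "after:", round(smd(matched, col), 3))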

Figure: Covariate comparison before and after propensity score matching.

Estimating the Treatment Effect

Using the matched sample of comparable customers, analysts can estimate the treatment effect more credibly. In this example:

  • Purchase rate among matched treated customers: 10.6%
  • Purchase rate among matched control customers: 5.6%

This results in an estimated lift of approximately 5 percentage points attributable to the email campaign. This effect is lower than the naive 6.6-point difference observed before matching because confounding factors inflated the earlier result.

After matching, the analysis compares each emailed customer to a similar customer who did not receive the email. Therefore, the difference in outcomes is more likely due to the email itself rather than pre-existing purchase propensities. Propensity score matching reduces bias in the estimate by controlling for observed confounders. The analysis indicates that the promotional email had a positive effect on purchases, though the impact is more modest than a raw comparison would suggest.
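
On the matched sample, the estimated effect (here, the average treatment effect on the treated, ATT) is again a simple difference in purchase rates, now between comparable groups:

rates = matched.groupby("treatment")["purchased"].mean()
att = rates[1] - rates[0]
print(f"Matched treated rate: {rates[1]:.1%}, matched control rate: {rates[0]:.1%}")
print(f"Estimated lift (ATT): {100 * att:.1f} percentage points")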

Limitations of Naive Propensity Score Methods

While propensity score methods are a powerful tool for estimating causal effects in observational studies, they come with important limitations and assumptions.

#1. Only Adjust for Observed Confounders

Propensity score methods adjust only for observed variables included in the model. Any unmeasured factors that influence both treatment and outcome, such as customer satisfaction or offline purchasing behavior, can still bias the results. This means correlation does not imply causation unless the analysis accounts for all major confounders.

#2. Overlap (Common Support) Requirement

For these methods to work reliably, each treated unit should have a comparable control unit with a similar propensity score, and vice versa. If some treated customers are unique (for example, very high propensity with no similar controls), analysts cannot reliably estimate their effect. During matching, analysts may exclude unique units, which limits the results to matchable customers. Lack of overlap can also increase uncertainty in other propensity score approaches that use weighting.
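
A quick overlap check is to compare the propensity-score ranges of the two groups and count treated units that fall outside the region of common support; a sketch using the hypothetical columns above:

t_scores = df.loc[df["treatment"] == 1, "propensity_score"]
c_scores = df.loc[df["treatment"] == 0, "propensity_score"]

# Region of common support: propensity values observed in both groups
low = max(t_scores.min(), c_scores.min())
high = min(t_scores.max(), c_scores.max())
outside = ((t_scores < low) | (t_scores > high)).sum()
print(f"Common support: [{low:.3f}, {high:.3f}]; treated units outside it: {outside}")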

#3. Sensitivity to Model Specification

Propensity scores depend on correctly modeling the relationship between covariates and treatment. Leaving out an important variable or misrepresenting a relationship (e.g., treating a non-linear effect as linear) can leave residual confounding. Choosing the appropriate covariates requires care; too few can introduce bias, too many can add noise or worsen balance. Checking balance diagnostics is essential.

#4. Not a Guarantee of Unbiased Estimates

Propensity scores aim to emulate a randomized experiment, but this is possible only under the assumption of “no unmeasured confounders” (conditional ignorability). Even after matching, hidden bias may persist. Therefore, interpret the results as associational estimates rather than definitive proof of causality, and conduct sensitivity analyses whenever possible.

Key Takeaway:
Propensity score methods reduce bias from observed confounders but cannot fully eliminate it. Be cautious when making causal claims from observational data.

Final Thoughts

Propensity score matching is a practical technique for estimating causal effects in observational studies. It adjusts for known confounders and improves covariate balance, providing a more credible estimate than a naive comparison. Although it accounts only for observable factors and has limitations, it underscores the importance of controlling for confounders in causal analysis. Propensity scores are one tool among many in causal inference. Methods like difference-in-differences, instrumental variables, and regression discontinuity can complement or address their limitations. Advanced approaches, such as doubly robust estimators and Bayesian methods, further refine estimates. Thoughtful use of propensity scores enables more reliable inference about causal effects from observational data.

Author Bio: Dharmateja Priyadarshi Uddandarao


Dharmateja Priyadarshi Uddandarao is an accomplished data scientist and statistician, recognized for integrating advanced statistical methods with practical economic applications. He has led projects at leading technology companies, including Capital One and Amazon, applying complex analytical techniques to address real-world challenges. He currently serves as a Senior Data Scientist–Statistician at Amazon.

Recommended Articles

We hope this guide to propensity scores enhances your understanding of causal inference. Explore these recommended articles for more insights and practical strategies.

  1. Data Lineage
  2. Data Virtualization
  3. Propensity Modeling
  4. Dynamic Bayesian Hierarchical Models