Propensity Score Methods: PSM, IPTW, and AIPW

Methodological Foundations & Advanced HTA Practice

Author

Xiaoge Zhang, PhD

Published

June 1, 2026

Bridging the Gap Between Observational Real-World Data and Causal Claims in HTA

1 Introduction: The Confounding Challenge & Self-Selection Bias

In observational studies, treatment assignment (\(D\)) is rarely random. Clinicians frequently prescribe therapies based on patient characteristics (\(X\)), such as baseline disease severity or age, which are also directly linked to the clinical outcome (\(Y\)). This non-random assignment leads to Selection Bias (Confounding by Indication).

For instance, in oncology appraisals, a newly launched therapy is often reserved for patients who have failed multiple lines of prior therapy or have a poor performance status. If we directly compare the survival outcomes of patients receiving the new therapy against those on standard of care (SOC), the naive treatment effect will be severely biased downward. The new drug’s efficacy is masked by the fact that its recipient group had a much worse prognosis at baseline.

1.1 The Dimension Reduction Problem

To estimate a causal effect, we must control for the baseline covariates \(X\). If we attempt to match patients on multiple characteristics simultaneously (e.g., matching a treated patient with a control patient who has the exact same age, gender, biomarkers, and comorbidities), we quickly encounter the Curse of Dimensionality. As the number of covariates increases, the probability of finding an exact match in a finite sample approaches zero.

Propensity Score Methods solve this problem by collapsing the high-dimensional covariate space \(X\) into a one-dimensional summary metric: the Propensity Score (PS).

2 Propensity Score Matching (PSM)

2.1 Core Logic & The S-Curve

The Propensity Score \(e(X)\) is defined as the conditional probability of receiving treatment given a vector of observed baseline covariates \(X\): \[e(X) = P(D = 1 \mid X)\]

Because probability is bounded between 0 (never treated) and 1 (always treated), a standard linear model (\(D = \beta X + \epsilon\)) is inappropriate as it can predict values outside this range. We therefore model the propensity score using a Logistic Regression (Logit) model: \[P(D=1 \mid X) = \frac{1}{1 + e^{-z}}\] where \(z\) is the linear combination of the confounders: \[z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\]

Taking the natural logarithm of the odds (the Logit transformation) yields a linear representation: \[\ln\left(\frac{P(D=1 \mid X)}{1 - P(D=1 \mid X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k\]

By matching treated and control patients who have the same (or very similar) propensity scores, we neutralize the baseline covariates. Within a matched block, the distribution of covariates \(X\) is balanced between the treated and control groups, effectively simulating a Pseudo-Randomized Controlled Trial (Pseudo-RCT).

2.2 Matching Constraints & ATT Estimation

Caliper Width: When matching, we do not require exact propensity score matches. Instead, we specify a tolerance zone or caliper (typically \(0.2\) times the standard deviation of the logit of the propensity score). If a treated patient’s propensity score is \(0.65\), we only match them with a control patient whose score falls within the caliper range.
Common Support: Propensity score matching is only valid within the region of Common Support—the overlap in propensity score distributions between the treated and control groups. Patients lying outside this overlap (e.g., treated patients with extremely high propensity scores who have no comparable controls) must be discarded.
ATT Estimation: By starting with the treated patients and searching for control twins, PSM estimates the Average Treatment Effect on the Treated (ATT) by default: \[ATT = E[Y^{(1)} - Y^{(0)} \mid D=1]\] In HTA appraisals, the ATT is highly relevant because payers want to know the therapeutic benefit specifically for the patient population that clinicians would target in real-world practice.

3 Inverse Probability Treatment Weighting (IPTW)

While PSM discards unmatched patients, Inverse Probability Treatment Weighting (IPTW) retains the entire sample. It balances the groups by assigning each patient a weight (\(W\)) inversely proportional to the probability of receiving the treatment they actually received.

3.1 The Weighting Mechanism

The Horvitz-Thompson weights for estimating the Average Treatment Effect (ATE) across the entire population are defined as: \[W_i = \frac{D_i}{e(X_i)} + \frac{1 - D_i}{1 - e(X_i)}\]

Treated Patients (\(D_i = 1\)): Weighted by \(1/e(X_i)\). A treated patient with a low probability of treatment (rare) represents a larger, unobserved segment of similar patients and is weighted up.
Control Patients (\(D_i = 0\)): Weighted by \(1/(1-e(X_i))\).

This weighting process creates a Weighted Pseudo-Population where the baseline covariate distributions in both groups are identical, allowing us to estimate the ATE: \[ATE = E[Y^{(1)} - Y^{(0)}]\]

3.2 Weighted Least Squares (WLS) Algebra

In practice, the treatment effect is estimated using Weighted Least Squares (WLS). The objective is to minimize the weighted sum of squared residuals: \[\sum_{i=1}^n W_i (Y_i - \hat{Y}_i)^2\]

In matrix notation, the normal equation for the regression coefficients is modified from the standard OLS \((X'X)^{-1}X'Y\) to: \[\hat{\beta}_{WLS} = (X'WX)^{-1}X'WY\]

3.3 Statistical Inference & Robust Standard Errors

A common misconception in observational studies is that the standard error of the weighted treatment effect can be computed by treating the weighted groups as two independent samples (e.g., using the simple addition of independent normal variances: \(SE = \sqrt{Var(\bar{Y}_1) + Var(\bar{Y}_0)}\)). In HTA practice, this naive approach is strictly rejected for three reasons:

Non-Independence: The treated and control groups are not independent samples drawn from separate populations. They are subgroups of the same observational cohort, meaning their baseline covariates and potential outcomes share a joint distribution.
Variance Inflation: Weighting introduces unequal weights (\(W_i \neq 1\)), which violates the homoscedasticity assumption of standard OLS. Failing to adjust for this leads to severely underestimated standard errors and inflated Type I error rates.
Estimation Uncertainty of Propensity Scores: The propensity scores \(e(X)\) are not true population parameters; they are estimated from the data. This estimation process introduces auxiliary uncertainty that must propagate into the final treatment effect variance.

Technical Math: M-Estimation & The Sandwich Variance Estimator

To obtain asymptotically correct standard errors, modern causal inference relies on the M-Estimation (Empirical Sandwich Covariance) framework. We stack the estimating equations for the treatment assignment model (propensity score) and the outcome regression model together as a joint system:

\[\sum_{i=1}^n \Psi_i(\theta) = 0\]

where \(\theta = [\beta, \alpha]^T\) is the combined parameter vector containing the propensity model coefficients \(\beta\) (logistic regression) and the outcome model coefficients \(\alpha\) (WLS regression).

The stacked estimating equations \(\Psi_i(\theta)\) for individual \(i\) are: \[\Psi_i(\theta) = \begin{bmatrix} X_i \left( D_i - \text{logit}^{-1}(X_i \beta) \right) \\ W_i(\beta) \cdot Z_i \left( Y_i - Z_i \alpha \right) \end{bmatrix}\]

where \(Z_i = [1, D_i]^T\) is the covariate vector of the outcome model, and \(W_i(\beta) = \frac{D_i}{\text{logit}^{-1}(X_i \beta)} + \frac{1 - D_i}{1 - \text{logit}^{-1}(X_i \beta)}\) are the estimated weights.

By Taylor expansion of the joint estimating equations, the asymptotic variance-covariance matrix of \(\hat{\theta}\) is given by the Sandwich Formula:

\[V(\hat{\theta}) = A(\hat{\theta})^{-1} B(\hat{\theta}) A(\hat{\theta})^{-T}\]

The “Bread” Matrix \(A(\hat{\theta})\): The expected Jacobian matrix of the estimating equations: \[A(\hat{\theta}) = - \frac{1}{n} \sum_{i=1}^n \frac{\partial \Psi_i(\theta)}{\partial \theta}\] The off-diagonal blocks of \(A(\hat{\theta})\) capture the dependence of the outcome estimating equations on the propensity score parameters \(\beta\). This explicitly accounts for the uncertainty introduced by estimating the weights.
The “Meat” Matrix \(B(\hat{\theta})\): The empirical variance of the estimating equations: \[B(\hat{\theta}) = \frac{1}{n} \sum_{i=1}^n \Psi_i(\theta) \Psi_i(\theta)^T\] which allows for heteroscedasticity and clustering.

Once the robust variance of the treatment effect coefficient \(\hat{\alpha}_1\) (the ATE estimate) is extracted from the diagonal of \(V(\hat{\theta})\) as \(SE_{\text{robust}}\), the 95% Confidence Interval is constructed using the standard Wald formulation: \[CI_{95\%} = \hat{\alpha}_1 \pm 1.96 \cdot SE_{\text{robust}}\]

This is the exact methodology executed by lm_weightit under the hood, ensuring that the confidence intervals presented in the portfolio are methodologically rigorous and defensible.

3.4 Effective Sample Size (ESS)

Because weighting inflates the influence of certain individuals, the statistical precision of the weighted sample is lower than the nominal sample size. We quantify this information loss using Kish’s Effective Sample Size (ESS): \[ESS = \frac{\left(\sum_{i=1}^n W_i\right)^2}{\sum_{i=1}^n W_i^2}\]

If all weights are uniform (\(W_i=1\)), the ESS equals the actual sample size \(N\). If a few patients have extremely large weights, the denominator increases, and the ESS collapses, signaling that the treatment effect estimate is dominated by a tiny fraction of the population.

4 Augmented Inverse Probability Weighting (AIPW)

Augmented IPTW (AIPW) is a Doubly Robust (DR) estimator that resolves the primary vulnerability of both standard regression models and propensity score weighting. It achieves this by combining an outcome regression (OR) model with a treatment assignment (propensity score) model.

4.1 The “Belt and Suspenders” Philosophy

To construct the AIPW estimator, we fit two distinct models on the observed data:

The Treatment Model (Propensity Score Model, \(g(X)\)): Predicts the probability of receiving treatment (\(D\)) given baseline covariates (\(X\)). Concretely, using a logistic model: \[g(X) = P(D = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{Age} + \beta_2 \cdot \text{ECOG})}}\]
The Outcome Model (\(m_d(X)\)): Predicts the clinical outcome (\(Y\)) given the treatment assignment (\(D\)) and baseline covariates (\(X\)).
- What is the Outcome \(Y\)? In our simulated oncology scenario, \(Y\) is the 1-year survival indicator—a binary clinical endpoint (\(Y_i = 1\) if patient \(i\) is alive at 1 year, and \(Y_i = 0\) if deceased).
- Concretely, what does the Outcome Model look like? Because \(Y\) is binary, we model the conditional expectation \(E[Y \mid D, X] = P(Y = 1 \mid D, X)\) using a logistic regression model: \[E[Y \mid D, X] = \frac{1}{1 + e^{-(\alpha_0 + \alpha_1 D + \alpha_2 \cdot \text{Age} + \alpha_3 \cdot \text{ECOG})}}\] where \(\alpha_1\) is the conditional log-odds ratio associated with treatment, and \(\alpha_2, \alpha_3\) adjust for baseline confounding.
- How are the counterfactual predictions generated? Once the outcome model is fitted and its coefficients \(\hat{\alpha}\) are estimated, we generate two predicted values for every patient \(i\) in the sample by plugging in the treatment indicator (\(d = 1\) and \(d = 0\)) while keeping their baseline covariates (\(X_i\)) unchanged:
  - Predicted survival under treatment (\(d = 1\)): \[m_1(X_i) = \hat{P}(Y_i^{(1)} = 1 \mid X_i) = \frac{1}{1 + e^{-(\hat{\alpha}_0 + \hat{\alpha}_1 \cdot 1 + \hat{\alpha}_2 \cdot \text{Age}_i + \hat{\alpha}_3 \cdot \text{ECOG}_i)}}\]
  - Predicted survival under control (\(d = 0\)): \[m_0(X_i) = \hat{P}(Y_i^{(0)} = 1 \mid X_i) = \frac{1}{1 + e^{-(\hat{\alpha}_0 + \hat{\alpha}_1 \cdot 0 + \hat{\alpha}_2 \cdot \text{Age}_i + \hat{\alpha}_3 \cdot \text{ECOG}_i)}}\]

The AIPW estimator aggregates these predictions alongside the propensity score weights. The counterfactual mean under treatment (\(d = 1\)) is: \[\hat{Y}_{AIPW}^{(1)} = \frac{1}{n} \sum_{i=1}^n \left[ m_1(X_i) + \frac{D_i (Y_i - m_1(X_i))}{g(X_i)} \right]\]

Similarly, the counterfactual mean under control (\(d = 0\)) is: \[\hat{Y}_{AIPW}^{(0)} = \frac{1}{n} \sum_{i=1}^n \left[ m_0(X_i) + \frac{(1 - D_i) (Y_i - m_0(X_i))}{1 - g(X_i)} \right]\]

The doubly robust Average Treatment Effect (ATE) estimator is the difference: \[\hat{ATE}_{AIPW} = \hat{Y}_{AIPW}^{(1)} - \hat{Y}_{AIPW}^{(0)}\]

4.2 Structural Anatomy of the Estimator

Let’s dissect the mathematical structure of \(\hat{Y}_{AIPW}^{(1)}\) to understand its components:

\[\hat{Y}_{AIPW}^{(1)} = \frac{1}{n} \sum_{i=1}^n \Big[ \underbrace{m_1(X_i)}_{\text{Component A}} + \underbrace{\frac{D_i (Y_i - m_1(X_i))}{g(X_i)}}_{\text{Component B}} \Big]\]

Component A: Base Prediction (G-Computation) This represents the outcome predicted by the outcome model for patient \(i\) under the treatment condition, based on their baseline covariates. If the outcome model is correct, the average of this term across the entire sample is an unbiased estimate of the counterfactual mean.
Component B: Weighted Residual Correction (Bias Adjuster) This term scales the residual error \((Y_i - m_1(X_i))\) of the outcome model for treated patients using the inverse of the propensity score.
- If the outcome model is incorrect, this residual is non-zero. The term captures this systematic prediction bias and adds it back to Component A, correcting the estimate.
- If the propensity score model is correct, the expectation of this correction term exactly offsets any bias in the outcome model.
- Additionally, this component acts as a control variate (a variance reduction technique). By centering the outcome \(Y_i\) around its conditional mean \(m_1(X_i)\), we minimize the variance inflation caused by extreme weights in standard IPTW.

4.3 Step-by-Step Proof of Double Robustness

AIPW is doubly robust because it yields an unbiased estimate if either the propensity score model \(g(X)\) or the outcome regression model \(m_1(X)\) is correctly specified.

Step-by-Step Mathematical Proof of Double Robustness

We prove this using the Law of Iterated Expectations (LIE): \[E[Z] = E_X \big[ E[Z \mid X] \big]\]

4.3.1 Assumptions Required:

Consistency: \(Y_i = D_i Y_i^{(1)} + (1 - D_i) Y_i^{(0)}\). For treated patients (\(D_i = 1\)), the observed outcome is the potential outcome under treatment: \(Y_i = Y_i^{(1)}\).
Conditional Independence (Unconfoundedness): Causal potential outcomes are independent of treatment assignment given baseline covariates: \(Y^{(1)}, Y^{(0)} \perp D \mid X\).

Let \(\hat{g}(X)\) and \(\hat{m}_1(X)\) be our estimated models. The expectation of the AIPW estimator under treatment is: \[E\left[ \hat{Y}_{AIPW}^{(1)} \right] = E_X \left[ E\left[ \hat{m}_1(X_i) + \frac{D_i (Y_i - \hat{m}_1(X_i))}{\hat{g}(X_i)} \;\middle|\; X_i \right] \right]\]

4.3.2 Case 1: Propensity Model is Correct (\(\hat{g}(X) = g(X)\)), Outcome Model is Misspecified (\(\hat{m}_1(X) \neq m_1(X)\))

If the propensity score model is correct, we have \(E[D_i \mid X_i] = \hat{g}(X_i) = g(X_i)\).

Let’s evaluate the conditional expectation of the estimator’s terms given \(X_i\): \[E\left[ \hat{m}_1(X_i) + \frac{D_i (Y_i - \hat{m}_1(X_i))}{\hat{g}(X_i)} \;\middle|\; X_i \right] = \hat{m}_1(X_i) + \frac{E\left[ D_i (Y_i - \hat{m}_1(X_i)) \mid X_i \right]}{g(X_i)}\]

Using consistency (\(D_i Y_i = D_i Y_i^{(1)}\)) and conditioning: \[E\left[ D_i (Y_i - \hat{m}_1(X_i)) \mid X_i \right] = E\left[ D_i Y_i^{(1)} - D_i \hat{m}_1(X_i) \mid X_i \right]\]

By the Conditional Independence assumption (\(Y^{(1)} \perp D \mid X\)): \[E\left[ D_i Y_i^{(1)} \mid X_i \right] = E\left[ D_i \mid X_i \right] \cdot E\left[ Y_i^{(1)} \mid X_i \right] = g(X_i) E\left[ Y_i^{(1)} \mid X_i \right]\]

Similarly, since \(\hat{m}_1(X_i)\) is a deterministic function of \(X_i\): \[E\left[ D_i \hat{m}_1(X_i) \mid X_i \right] = E\left[ D_i \mid X_i \right] \cdot \hat{m}_1(X_i) = g(X_i) \hat{m}_1(X_i)\]

Substitute these expectations back into the numerator: \[E\left[ D_i (Y_i - \hat{m}_1(X_i)) \mid X_i \right] = g(X_i) E\left[ Y_i^{(1)} \mid X_i \right] - g(X_i) \hat{m}_1(X_i) = g(X_i) \left( E\left[ Y_i^{(1)} \mid X_i \right] - \hat{m}_1(X_i) \right)\]

Now substitute this back into the estimator: \[\hat{m}_1(X_i) + \frac{g(X_i) \left( E\left[ Y_i^{(1)} \mid X_i \right] - \hat{m}_1(X_i) \right)}{g(X_i)} = \hat{m}_1(X_i) + E\left[ Y_i^{(1)} \mid X_i \right] - \hat{m}_1(X_i) = E\left[ Y_i^{(1)} \mid X_i \right]\]

Taking the outer expectation over the covariate distribution \(X\): \[E\left[ \hat{Y}_{AIPW}^{(1)} \right] = E_X \Big[ E\left[ Y_i^{(1)} \mid X_i \right] \Big] = E\left[ Y_i^{(1)} \right]\]

Result: The estimate is unbiased. The incorrect predictions of the outcome model (\(\hat{m}_1(X_i)\)) are algebraically canceled out by the weighted residual term.

4.3.3 Case 2: Outcome Model is Correct (\(\hat{m}_1(X) = m_1(X)\)), Propensity Model is Misspecified (\(\hat{g}(X) \neq g(X)\))

If the outcome model is correctly specified, we have \(\hat{m}_1(X_i) = m_1(X_i) = E\left[ Y_i^{(1)} \mid X_i \right]\).

Let’s evaluate the conditional expectation of the residual correction term numerator given \(X_i\): \[E\left[ D_i (Y_i - \hat{m}_1(X_i)) \mid X_i \right] = E\left[ D_i (Y_i^{(1)} - E\left[ Y_i^{(1)} \mid X_i \right]) \mid X_i \right]\]

Conditioning further on treatment assignment \(D_i\) using the Law of Iterated Expectations: \[E\left[ D_i (Y_i^{(1)} - E\left[ Y_i^{(1)} \mid X_i \right]) \mid X_i \right] = E_{D \mid X} \left[ E\left[ D_i (Y_i^{(1)} - E\left[ Y_i^{(1)} \mid X_i \right]) \;\middle|\; X_i, D_i \right] \right]\]

Since \(D_i\) is binary, this expectation only has a non-zero value when \(D_i = 1\): \[P(D_i = 1 \mid X_i) \cdot E\left[ 1 \cdot (Y_i^{(1)} - E\left[ Y_i^{(1)} \mid X_i \right]) \;\middle|\; X_i, D_i = 1 \right]\]

By the Conditional Independence assumption (\(Y^{(1)} \perp D \mid X\)): \[E\left[ Y_i^{(1)} - E\left[ Y_i^{(1)} \mid X_i \right] \;\middle|\; X_i, D_i = 1 \right] = E\left[ Y_i^{(1)} \mid X_i \right] - E\left[ Y_i^{(1)} \mid X_i \right] = 0\]

Since the conditional expectation of the numerator is \(0\): \[E\left[ \frac{D_i (Y_i - \hat{m}_1(X_i))}{\hat{g}(X_i)} \;\middle|\; X_i \right] = 0\]

Therefore, the expectation of the estimator simplifies to: \[E\left[ \hat{Y}_{AIPW}^{(1)} \mid X_i \right] = E\left[ \hat{m}_1(X_i) \mid X_i \right] = E\left[ Y_i^{(1)} \mid X_i \right]\]

Taking the outer expectation over the covariate distribution \(X\): \[E\left[ \hat{Y}_{AIPW}^{(1)} \right] = E_X \Big[ E\left[ Y_i^{(1)} \mid X_i \right] \Big] = E\left[ Y_i^{(1)} \right]\]

Result: The estimate is unbiased. Since the outcome model is correct, the expected residual error is \(0\). The incorrect propensity weights (\(\hat{g}(X_i)\)) are multiplied by zero and fail to introduce any bias.

4.4 Comparison: Why AIPW is Preferred in HTA

Feature	Standard IPTW	AIPW (Augmented IPTW)
Estimator Type	Single-robust (Propensity score only)	Doubly-robust (Propensity score + Outcome regression)
Model Misspecification	Severe bias if PS model is wrong	Unbiased if either model is correct
Variance (Efficiency)	High variance due to extreme weights	Low variance (uses outcome predictions as control variates)
HTA / EAG Acceptance	Moderately accepted; heavily criticized for extreme weights	Highly preferred; provides minimum variance bounds (semiparametric efficiency)

5 R Implementation Pipeline

This pipeline simulates a real-world dataset (\(N=500\)) with a severe confounding by indication trap: high-risk, elderly patients with poor performance status (ECOG=1) are preferentially prescribed the new therapy (Drug X), masking its true survival benefit.

R Code

library(WeightIt)
library(cobalt)
library(survey)
library(AIPW)
library(SuperLearner)

set.seed(2026)
n <- 500

# 1. Simulate Confounders
age <- rnorm(n, mean = 65, sd = 10)
ecog <- rbinom(n, 1, 0.4) # 1 = Poor performance status

# 2. Simulate Treatment Assignment (Confounding by Indication)
# Elderly and severe patients are more likely to receive Drug X (treatment = 1)
ps_true <- plogis(-2 + 0.05 * age + 1.2 * ecog)
treatment <- rbinom(n, 1, ps_true)

# 3. Simulate Survival Outcome (1-Year Survival)
# The treatment coefficient is +0.8 on the log-odds (logit) scale.
# Across the covariate distribution, this corresponds to a true marginal Risk Difference (ATE) of +15.5%.
y_prob <- plogis(1 - 0.03 * age - 1.0 * ecog + 0.8 * treatment)
survival_1yr <- rbinom(n, 1, y_prob)

df <- data.frame(age, ecog, treatment, survival_1yr)

# Calculate Naive (Unadjusted) Estimate
naive_treated <- mean(df$survival_1yr[df$treatment == 1])
naive_control <- mean(df$survival_1yr[df$treatment == 0])
naive_ate <- naive_treated - naive_control

Methodological Note: Log-Odds vs. Risk Difference (ATE)

In observational binary outcome simulations (like 1-year survival), the treatment effect is defined on the log-odds ratio scale (+0.8 in the regression formula) rather than as a direct additive percentage. Because the logit function is non-linear, a log-odds coefficient of 0.8 does not translate to a constant risk difference for all patients.

To determine the true marginal Risk Difference (ATE) for the population, we calculate the difference in expected probabilities under treatment and control states for each individual, and then average them over the baseline covariate distribution: \[ATE = \frac{1}{n}\sum_{i=1}^n \left[ \text{logit}^{-1}(1 - 0.03 \cdot \text{Age}_i - 1.0 \cdot \text{ECOG}_i + 0.8) - \text{logit}^{-1}(1 - 0.03 \cdot \text{Age}_i - 1.0 \cdot \text{ECOG}_i) \right]\]

For this simulated cohort, the true population ATE is exactly +15.5%.

5.1 Baseline Diagnostics (Original Data)

Without adjustment, patients receiving Drug X are significantly older and have worse ECOG scores, creating a naive survival comparison that underestimates the treatment effect:

R Code

# Create baseline characteristics table
n_trt <- sum(df$treatment == 1)
n_ctl <- sum(df$treatment == 0)

mean_age_trt <- mean(df$age[df$treatment == 1])
sd_age_trt <- sd(df$age[df$treatment == 1])
mean_age_ctl <- mean(df$age[df$treatment == 0])
sd_age_ctl <- sd(df$age[df$treatment == 0])

ecog1_trt <- sum(df$ecog[df$treatment == 1] == 1)
ecog1_ctl <- sum(df$ecog[df$treatment == 0] == 1)

surv_trt <- sum(df$survival_1yr[df$treatment == 1] == 1)
surv_ctl <- sum(df$survival_1yr[df$treatment == 0] == 1)

baseline_tbl <- data.frame(
  "Baseline Characteristic" = c(
    "Sample Size, N (%)", 
    "Age (years), Mean (SD)", 
    "ECOG Performance Status = 1, N (%)", 
    "1-Year Survival Rate, N (%)"
  ),
  "Treated (Drug X)" = c(
    sprintf("%d (100%%)", n_trt),
    sprintf("%.1f (%.1f)", mean_age_trt, sd_age_trt),
    sprintf("%d (%.1f%%)", ecog1_trt, 100 * ecog1_trt / n_trt),
    sprintf("%d (%.1f%%)", surv_trt, 100 * surv_trt / n_trt)
  ),
  "Control (Standard of Care)" = c(
    sprintf("%d (100%%)", n_ctl),
    sprintf("%.1f (%.1f)", mean_age_ctl, sd_age_ctl),
    sprintf("%d (%.1f%%)", ecog1_ctl, 100 * ecog1_ctl / n_ctl),
    sprintf("%d (%.1f%%)", surv_ctl, 100 * surv_ctl / n_ctl)
  ),
  "Difference" = c(
    "",
    sprintf("%+.2f", mean_age_trt - mean_age_ctl),
    sprintf("%+.1f%%", 100 * (ecog1_trt / n_trt - ecog1_ctl / n_ctl)),
    sprintf("%+.1f%%", 100 * (surv_trt / n_trt - surv_ctl / n_ctl))
  ),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

knitr::kable(baseline_tbl, caption = "Baseline Covariates and Naive Survival Outcomes (Unadjusted)", align = "lrrr")

Baseline Covariates and Naive Survival Outcomes (Unadjusted)
Baseline Characteristic	Treated (Drug X)	Control (Standard of Care)	Difference
Sample Size, N (%)	416 (100%)	84 (100%)
Age (years), Mean (SD)	66.0 (10.1)	62.5 (10.5)	+3.49
ECOG Performance Status = 1, N (%)	177 (42.5%)	18 (21.4%)	+21.1%
1-Year Survival Rate, N (%)	163 (39.2%)	23 (27.4%)	+11.8%

5.2 IPTW Balancing & Diagnostics

We apply WeightIt to calculate propensity weights and evaluate baseline balance.

R Code

# Estimate weights
W <- weightit(treatment ~ age + ecog,
              data = df,
              method = "ps",
              estimand = "ATE")

# Extract Effective Sample Size (ESS)
ess_summary <- summary(W)
ess_df <- as.data.frame(ess_summary$effective.sample.size)
ess_table <- data.frame(
  "Sample Group" = row.names(ess_df),
  "Control Group (Standard of Care)" = round(ess_df$Control, 2),
  "Treated Group (Drug X)" = round(ess_df$Treated, 2),
  "Total Sample Size" = round(ess_df$Control + ess_df$Treated, 2),
  check.names = FALSE
)
knitr::kable(ess_table, caption = "Effective Sample Size (ESS) Comparison", align = "lrrr")

Effective Sample Size (ESS) Comparison
Sample Group	Control Group (Standard of Care)	Treated Group (Drug X)	Total Sample Size
Unweighted	84.00	416.00	500.00
Weighted	69.45	412.18	481.63

5.3 Covariate Balance Audit (Love Plot)

A standardized mean difference (SMD) below \(0.1\) indicates that the baseline confounding has been successfully balanced between groups.

R Code

love.plot(W,
          thresholds = c(m = 0.1),
          binary = "std",
          abs = TRUE,
          shapes = c("circle filled", "triangle filled"),
          colors = c("#e53e3e", "#3182ce"),
          sample.names = c("Original", "Weighted"))

5.4 Treatment Effect Estimation (IPTW & AIPW)

We fit a weighted regression model using lm_weightit to calculate the IPTW estimate with robust standard errors, followed by doubly robust estimation via AIPW.

R Code

# 1. IPTW Weighted Linear Regression
fit_iptw <- lm_weightit(survival_1yr ~ treatment,
                        data = df,
                        weightit = W)

iptw_summary <- summary(fit_iptw)
iptw_est <- iptw_summary$coefficients["treatment", "Estimate"]
iptw_se <- iptw_summary$coefficients["treatment", "Std. Error"]

# Extract and format IPTW regression results for display
iptw_coefs <- iptw_summary$coefficients
iptw_cis <- confint(fit_iptw)

iptw_tbl <- data.frame(
  Parameter = c("Intercept (Control Survival)", "Treatment Effect (ATE)"),
  Estimate = round(iptw_coefs[, "Estimate"], 4),
  "Std. Error" = round(iptw_coefs[, "Std. Error"], 4),
  "z value" = round(iptw_coefs[, "z value"], 2),
  "p-value" = format.pval(iptw_coefs[, "Pr(>|z|)"], eps = 0.001, digits = 3),
  "95% LCL" = round(iptw_cis[, 1], 4),
  "95% UCL" = round(iptw_cis[, 2], 4),
  check.names = FALSE,
  row.names = NULL
)

# 2. AIPW Estimation
sl_lib_aipw <- c("SL.glm", "SL.mean")

aipw_obj <- AIPW$new(
  Y = df$survival_1yr,
  A = df$treatment,
  W = df[, c("age", "ecog")],
  Q.SL.library = sl_lib_aipw,
  g.SL.library = sl_lib_aipw
)

# Run double robust estimation and suppress console messages/output
suppressMessages(aipw_obj$fit())
invisible(capture.output(aipw_obj$summary()))


# Extract and format AIPW results
aipw_res <- as.data.frame(aipw_obj$result)
aipw_est <- aipw_res["Risk Difference", "Estimate"]
aipw_se <- aipw_res["Risk Difference", "SE"]

# Create a clean data frame for AIPW outputs
aipw_tbl <- data.frame(
  Parameter = row.names(aipw_res),
  Estimate = round(aipw_res$Estimate, 4),
  "Std. Error" = round(aipw_res$SE, 4),
  "95% LCL" = round(aipw_res$`95% LCL`, 4),
  "95% UCL" = round(aipw_res$`95% UCL`, 4),
  "Sample Size (N)" = as.integer(aipw_res$N),
  check.names = FALSE
)

# Output both tables
knitr::kable(iptw_tbl, caption = "Weighted Linear Regression Estimates via IPTW (Robust Standard Errors)", align = "lrrrrrr")

Weighted Linear Regression Estimates via IPTW (Robust Standard Errors)
Parameter	Estimate	Std. Error	z value	p-value	95% LCL	95% UCL
Intercept (Control Survival)	0.2395	0.0479	5.00	< 0.001	0.1456	0.3333
Treatment Effect (ATE)	0.1606	0.0534	3.01	0.00262	0.0560	0.2652

R Code

knitr::kable(aipw_tbl, caption = "Doubly Robust Estimates via Augmented Inverse Probability Weighting (AIPW)", align = "lrrrrr")

Doubly Robust Estimates via Augmented Inverse Probability Weighting (AIPW)
Parameter	Estimate	Std. Error	95% LCL	95% UCL	Sample Size (N)
Risk of Exposure	0.3999	0.0242	0.3526	0.4473	416
Risk of Control	0.2294	0.0487	0.1340	0.3248	84
Risk Difference	0.1705	0.0541	0.0644	0.2766	500
Risk Ratio	1.7433	0.2195	1.1337	2.6807	500
Odds Ratio	2.2387	0.2914	1.2645	3.9635	500

Methodological Note: Epidemiological “Risk” Terminology

In biostatistics and epidemiological software packages (such as AIPW), the term “Risk” is the standard mathematical synonym for the probability of the outcome occurring, regardless of whether the event is clinically positive (e.g., survival) or negative (e.g., mortality).

In this simulation:

Risk of Exposure refers to the predicted probability of survival under the treated state (Drug X): \(P(Y^{(1)} = 1)\).
Risk of Control refers to the predicted probability of survival under the untreated state (Standard of Care): \(P(Y^{(0)} = 1)\).
Risk Difference is the absolute difference in survival probabilities, representing the marginal Average Treatment Effect (ATE): \(P(Y^{(1)} = 1) - P(Y^{(0)} = 1)\).
Risk Ratio is the Relative Risk (RR) of survival under treatment vs. control: \(P(Y^{(1)} = 1) / P(Y^{(0)} = 1)\).

5.5 Evidence Synthesis Comparison

The unadjusted naive estimate is biased downward, while the adjusted IPTW and doubly robust AIPW estimates successfully recover the true treatment effect.

Estimator	ATE Estimate	Standard Error	Audit Status
Naive (Unadjusted)	0.118	N/A	Biased due to selection
IPTW (Weighting Only)	0.1606	0.0534	Confounding adjusted
AIPW (Doubly Robust)	0.1705	0.0541	Double robust / minimum variance

6 Advanced HTA Considerations & EAG Defense

6.1 Boundary Explosions (\(PS \to 0\) or \(1\))

In observational datasets, physician prescribing preferences can result in near-deterministic treatment assignments for certain patient subgroups. This leads to propensity scores that approach \(0\) or \(1\).

The Hazard: For a treated patient with \(PS = 0.01\), the weight is \(W = 100\). This single outlier’s outcome is multiplied a hundredfold in the pseudo-population, destabilizing the clinical outcome model.
The Defense:
1. Stabilized Weights: We scale the weights by the marginal probability of treatment \(P(D)\) in the sample: \[W_{stabilized} = D \cdot \frac{P(D=1)}{PS} + (1 - D) \cdot \frac{P(D=0)}{1 - PS}\] This scales down extreme weights globally.
2. Weight Trimming: If weights remain extreme after stabilization, we apply asymmetric trimming (e.g., trimming the top and bottom 1% of the weight distribution). While trimming introduces a small amount of bias by changing the target population, it significantly reduces variance and stabilizes survival extrapolations, which is generally accepted by NICE EAGs.

6.2 Methodological Trade-offs

Characteristic	Propensity Score Matching (PSM)	Inverse Probability Weighting (IPTW)	Augmented IPTW (AIPW)
Sample Retention	Discards unmatched patients (sample loss)	Retains entire sample via weighting	Retains entire sample via weighting
Causal Parameter	ATT (Effect on the Treated)	ATE (Effect on the Population)	ATE (Effect on the Population)
Sensitivity to Model Misspecification	High (Relies entirely on PS model correctness)	High (Relies entirely on PS model correctness)	Low (Doubly Robust: unbiased if either PS or outcome model is correct)
Efficiency (Variance)	Lower (due to sample size reduction)	Higher (uses all data, but can be skewed by extreme weights)	Maximum (residual correction minimizes variance)
Destabilization Risk	Low (extreme patients excluded naturally)	High (due to extreme weights from poor overlap)	Moderate (stabilized by outcome model residuals)