Mock Case — Multilevel Network Meta-Regression (ML-NMR)
Population-Adjusted Indirect Comparison for a NICE Appraisal: Quizartinib vs Midostaurin in AML
Author
Xiaoge Zhang, PhD
Published
June 1, 2026
NoteMock Case Disclosure
This is a methodological proof-of-concept. Data are fully synthetic, generated from a calibrated Data Generating Process (DGP) with known ground-truth parameters. The clinical scenario (Quizartinib vs Midostaurin in AML) mirrors the structural constraints of NICE TA1013, but all patient-level numbers are simulated. The final section documents the parallel with the real TA1013 submission.
1 Executive Summary
Simulated Treatment Comparison (STC) suffers from a mathematical fatal flaw when dealing with non-linear clinical endpoints: Ecological Bias. Because \(f(E[X]) \neq E[f(X)]\) for a non-linear link function, plugging aggregate means into a regression model yields biased population-level predictions.
Multilevel Network Meta-Regression (ML-NMR) was specifically designed to solve this problem. Instead of plugging in aggregate means, ML-NMR performs numerical integration over the covariate distribution of the target population via Sobol sequences — correctly computing \(E[f(X)]\) without ecological bias.
This case study demonstrates the full ML-NMR pipeline:
Simulate a realistic IPD + AgD network with a known effect-modifier (age)
Formulate the Joint Likelihood linking both data levels
Diagnose and resolve MCMC convergence failures
Perform counterfactual projection into the target population
Validate parameter recovery against ground truth
Compare structural decisions against the real TA1013 submission
2 Decision Problem & Clinical Background
In acute myeloid leukaemia (AML) with FLT3-ITD mutations, two tyrosine kinase inhibitors (TKIs) compete for market access:
Quizartinib (our drug): IPD available from a company-sponsored trial
Midostaurin (competitor): Only aggregate data (AgD) available from published literature
The HTA challenge is that the two trial populations differ substantially in baseline age — the primary effect modifier for TKI efficacy in AML. A naive indirect comparison would be confounded by this population mismatch. ML-NMR addresses this by jointly modelling both data levels with shared biological parameters, allowing treatment effects to be standardised to any target population.
3 Mock Scenario: Anchored PAIC
The network contains three treatments (Quizartinib, Midostaurin, Placebo):
Trial 1 (IPD available): Quizartinib vs Placebo — younger population (mean age 45)
Trial 2 (AgD only): Midostaurin vs Placebo — older population (mean age 65)
Endpoint: Binary Complete Remission (CR) at induction — modelled via logistic regression
Effect modifier: Age — treatment efficacy declines with age (negative interaction)
\(\beta_{\text{trt}}\), \(\beta_{\text{age:trt}}\): Shared parameters — the mathematical bridges that allow information to flow from the IPD trial to the AgD trial
4.2 IPD Likelihood
For the 600 patients in Trial 1, exact ages are observed, so the likelihood is straightforward Bernoulli:
Since MCMC cannot solve this analytically, ML-NMR approximates using 64-point Sobol sequences — quasi-random “virtual avatars” sampled from the AgD age distribution:
Because both likelihoods share the same \(\beta\), the model finds the equilibrium that simultaneously explains individual-level CR patterns (IPD) and aggregate CR counts (AgD).
5 MCMC Convergence Audit: Three Phases
Running Bayesian evidence synthesis on sparse networks often produces algorithmic failures before convergence is achieved. The following three-phase narrative documents the diagnostic process — a critical skill in real HTA submissions.
Diagnosis: Attempting to fit a complex interaction surface on a minimal IPD sample creates a near-flat likelihood. The MCMC sampler wanders without finding a unimodal posterior — a signal that the data cannot support the complexity of this model.
Resolution: Increase IPD sample size and simplify the interaction structure until the likelihood surface becomes identifiable.
5.2 Phase 2 — Extrapolation Trap (The No-Overlap Crisis)
Symptom: R-hat ~ 11.0; complete chain bifurcation across all parameters.
Diagnosis: An Extrapolation Gap. If the IPD population (e.g., mean age 25) and the AgD population (mean age 70) have zero covariate overlap, the Sobol avatars must extrapolate across a 45-year data vacuum. The non-linear logit mapping amplifies minor slope uncertainties into catastrophic integration errors. The chains bifurcate because the model has two equally plausible but mutually incompatible explanations for the data.
Resolution: Realign the IPD population to establish statistical overlap with the AgD population (e.g., shift mean age from 25 to 45–55). This converts the task from high-risk extrapolation to robust interpolation.
5.3 Phase 3 — Population Alignment & Recovery
Outcome: After establishing common support (Trial 1 mean age 45, Trial 2 mean age 65 — overlapping tails), chains mix perfectly and R-hat converges to 1.00.
ImportantKey Diagnostic Principle
MCMC divergences in NMA are almost always a symptom of geometric incompatibility in the data — not a model specification error. Restoring covariate overlap resolves R-hat failures more reliably than tuning MCMC hyperparameters. Technical stability is ultimately a function of clinical plausibility.
When adjusting for two covariates simultaneously (e.g., Age and WBC count), the Sobol integration space expands to a 2D plane. A critical challenge arises: published AgD trials rarely report the correlation (\(\rho\)) between covariates.
Correlation Borrowing solves this by computing \(\rho\) from the IPD cohort and injecting it into the AgD integration nodes via Cholesky Decomposition:
Extract the empirical covariance matrix from IPD: \(\Sigma = \begin{pmatrix} \sigma^2_{\text{age}} & \rho\sigma_{\text{age}}\sigma_{\text{wbc}} \\ \rho\sigma_{\text{age}}\sigma_{\text{wbc}} & \sigma^2_{\text{wbc}} \end{pmatrix}\)
Apply \(L\) to the standard Sobol grid to stretch the independent 2D grid into an elliptical “probability cloud” that respects biological covariance
This ensures the virtual avatars represent clinically plausible patient profiles — not mathematically impossible combinations (e.g., age 20 with WBC typical of a 70-year-old).
Code
set.seed(2026)# Simulate IPD with two covariatesn_ipd2 <-200ipd_2d <-data.frame(study ="Trial_1",trt =sample(c("Placebo", "Quizartinib"), n_ipd2, replace =TRUE),age =rnorm(n_ipd2, mean =55, sd =10),wbc =rlnorm(n_ipd2, meanlog =3.5, sdlog =0.5)) |>mutate(class =if_else(trt =="Placebo", "Placebo", "Active"),logit_p =1.0+2.5* (trt =="Quizartinib") -0.05* age -0.1* wbc,outcome =rbinom(n_ipd2, 1, plogis(logit_p)) )# Borrow correlation from IPDrho <-cor(ipd_2d$age, ipd_2d$wbc)cat(sprintf("Empirical Age-WBC correlation (borrowed from IPD): rho = %.3f\n", rho))# AgD with 2D covariate summariesagd_2d <-data.frame(study ="Trial_2",trt =c("Placebo", "Midostaurin"),class =c("Placebo", "Active"),r =c(10, 25), n =c(50, 50),age_mean =70, age_sd =5,wbc_mean =50, wbc_sd =15) |>mutate(wbc_shape = wbc_mean^2/ wbc_sd^2,wbc_rate = wbc_mean / wbc_sd^2)net_2d <-combine_network(set_ipd(ipd_2d, study = study, trt = trt, r = outcome, trt_class = class),set_agd_arm(agd_2d, study = study, trt = trt, r = r, n = n, trt_class = class),trt_ref ="Placebo")# Add 2D Sobol integration with Cholesky-based correlationnet_2d <-add_integration(net_2d,age =distr(qnorm, mean = age_mean, sd = age_sd),wbc =distr(qgamma, shape = wbc_shape, rate = wbc_rate),cor =matrix(c(1, rho, rho, 1), 2, 2), # Cholesky applied internallyn_int =64)cat("2D Sobol integration configured with Cholesky correlation borrowing.\n")cat(sprintf("Integration nodes: 64 per AgD arm | Covariates: age + wbc\n"))
Empirical Age-WBC correlation (borrowed from IPD): rho = 0.029
2D Sobol integration configured with Cholesky correlation borrowing.
Integration nodes: 64 per AgD arm | Covariates: age + wbc
7 R Implementation: Network Construction & Model Fitting
Code
# Build the primary (1D) network for the main analysisnet <-combine_network(set_ipd(ipd_data,study = study, trt = trt, r = y, trt_class = class),set_agd_arm(agd_data,study = study, trt = trt, r = r, n = n, trt_class = class),trt_ref ="Placebo")# Add 64-point Sobol integration for Age in the AgD trialnet <-add_integration(net,age =distr(qnorm, mean = age_mean, sd = age_sd),n_int =64)fit_mlnmr <-nma(net,1trt_effects ="fixed",2regression =~ age * .trt,3class_interactions ="common",prior_intercept =normal(scale =10),prior_trt =normal(scale =10),prior_reg =normal(scale =10),iter =2000,warmup =1000,chains =2,refresh =0)fit_summary <-as.data.frame(rstan::summary(fit_mlnmr$stanfit)$summary)rows_to_keep <-grep("^(d\\[|beta|mu\\[)", rownames(fit_summary))clean_summary <- fit_summary[rows_to_keep, c("mean", "sd", "2.5%", "50%", "97.5%", "n_eff", "Rhat")]knitr::kable(round(clean_summary, 3),caption ="Posterior summary: key model parameters")
1
Fixed Effects — appropriate for sparse 2-trial networks where Random Effects would be over-parameterised and unlikely to converge.
2
Age × treatment interaction — the core regression term that captures effect modification.
3
Common interaction — assumes the age-efficacy decay is shared across TKI treatments (Exchangeability assumption; see Auditor Q&A).
Posterior summary: key model parameters
mean
sd
2.5%
50%
97.5%
n_eff
Rhat
mu[Trial_1]
-1.445
0.209
-1.863
-1.442
-1.024
600.283
1.001
mu[Trial_2]
-2.857
0.330
-3.540
-2.840
-2.240
882.668
0.999
beta[age]
-0.042
0.013
-0.067
-0.042
-0.017
611.812
1.001
beta[age:.trtclassActive]
-0.055
0.022
-0.100
-0.055
-0.011
808.334
1.001
d[Midostaurin]
2.669
0.384
1.929
2.660
3.479
851.215
0.999
d[Quizartinib]
1.973
0.276
1.431
1.971
2.513
645.117
1.002
8 Counterfactual Projection
The shared parameters \(\beta\) allow projection of competitor drug efficacy into any target population baseline.
Code
samples <-as.data.frame(fit_mlnmr)beta_age_val <-mean(samples$`beta[age]`)d_quiz_val <-mean(samples$`d[Quizartinib]`)d_mido_val <-mean(samples$`d[Midostaurin]`)# Shared interaction (class-level)if ("beta[age:.trtclassActive]"%in%names(samples)) { beta_int_val <-mean(samples$`beta[age:.trtclassActive]`)} else { beta_int_val <-NA}mu_cols <-sort(grep("^mu\\[", names(samples), value =TRUE))mu_1_val <-mean(samples[[mu_cols[1]]])mu_2_val <-mean(samples[[mu_cols[2]]])# multinma centers at overall network mean agec_point <- (mean(ipd_data$age) *nrow(ipd_data) +65*nrow(agd_data) *150) / (nrow(ipd_data) +nrow(agd_data) *150)# Adjust estimates back to Age = 45 referenceadj_mu1 <- mu_1_val + beta_age_val * (45- c_point)adj_d_mido <- d_mido_val + beta_int_val * (45- c_point)adj_d_quiz <- d_quiz_val + beta_int_val * (45- c_point)# Counterfactual: Midostaurin in a 35-year-old Trial 1 patienteta_35 <- adj_mu1 + beta_age_val * (35-45) + adj_d_mido + beta_int_val * (35-45)p_35 <-plogis(eta_35)cat(sprintf("Predicted CR probability for Midostaurin in a 35-year-old (Trial 1 baseline): %.1f%%\n", p_35 *100))cat("This is the counterfactual answer ML-NMR was designed to produce.\n")
Predicted CR probability for Midostaurin in a 35-year-old (Trial 1 baseline): 94.7%
This is the counterfactual answer ML-NMR was designed to produce.
Population-aligned relative effects at Age = 45 (Trial 1 target population)
Comparison
Mean (log-OR)
SD
2.5%
97.5%
Rhat
ESS
Mido vs Pbo
3.231
0.530
2.190
4.303
1
763.683
Quiz vs Pbo
2.536
0.222
2.099
2.972
1
1593.333
Quiz vs Mido
-0.696
0.540
-1.790
0.328
1
654.473
9 Parameter Recovery
A key advantage of synthetic DGP is that we can verify the model against known ground truth:
Parameter recovery: estimated vs ground truth (DGP). Note: mu_2 and d_Midostaurin show collinearity-driven bias — see callout below.
Parameter
Estimated (at network center)
Adjusted (at Age 45)
True Value (Age 45)
mu_1 (Trial 1 baseline)
-1.445
-1.152
-1.00
mu_2 (Trial 2 baseline)
-2.857
-2.564
-1.50
beta_age (age main effect)
-0.042
-0.042
-0.05
d_Quizartinib (treatment effect)
1.973
2.363
2.50
d_Midostaurin (treatment effect)
2.669
3.059
2.20
beta_age:trt (shared interaction)
-0.055
-0.055
-0.04
WarningCollinearity in AgD: The Ecological Fallacy Residual
Even with a large AgD sample, the model may fail to accurately decompose Trial 2 baseline risk (\(\mu_2\)) from the Midostaurin treatment effect (\(d_{\text{Mido}}\)). This is because Trial 2 provides only two macro data points (aggregate CR rate for Placebo and Midostaurin), making the two parameters statistically entangled. Their sum is well-recovered, but the individual decomposition is not — a classic consequence of the Ecological Fallacy operating through collinearity.
10 Methodological Boundaries: ML-NMR vs MAIC
Dimension
MAIC
ML-NMR
Network complexity
Restricted to 1-to-1 comparisons
Natively supports multi-study networks
Sample size
ESS shrinkage — can be severe
No sample loss; uses full N
Population overlap
Requires good overlap; ESS collapses otherwise
Handles poor overlap via regression extrapolation
Model assumptions
Semi-parametric; no outcome model required
Fully parametric; outcome model must be specified
Ecological bias
None (re-weights individuals directly)
None (resolves via numerical integration)
Decision rule: Use MAIC for a single pairwise comparison with good overlap and minimal parametric assumptions. Use ML-NMR for multi-trial networks, poor overlap, or non-linear outcomes where STC would produce biased estimates.
11 Auditor’s Stress Test: Strategic Q&A
Q1: Why 64 Sobol points specifically?
64 quasi-random Sobol points provide an optimal balance between computational cost and numerical precision — equivalent in accuracy to approximately 2,000 random Monte Carlo points. The Sobol sequence achieves this efficiency through low-discrepancy sampling: points are space-filling rather than randomly clustered, ensuring every region of the age distribution (including the tails) is adequately represented. Increasing to 128 points yields marginal gains in a 1D integration; the payoff grows more in 2D+ settings.
Q2: Why assume common interactions (\(\beta_{\text{age:trt}}\)) across different drugs?
This is a Mechanism of Action (MoA) argument grounded in the Exchangeability assumption of NICE DSU TSD 18. Both Quizartinib and Midostaurin are FLT3 TKIs acting on the same molecular pathway. We assume the biological decay of TKI efficacy with age follows a similar pharmacodynamic trajectory across the class — not because we can prove it, but because the data from a 2-trial sparse network cannot identify separate interaction terms without severe collinearity. This assumption must be declared transparently in the submission and challenged in sensitivity analysis.
Q3: Why Fixed Effects rather than Random Effects?
In a sparse 2-trial network, Random Effects introduces a between-study heterogeneity parameter (\(\tau^2\)) that cannot be estimated reliably — the model is statistically over-parameterised. MCMC chains typically fail to converge for \(\tau\). The Fixed Effects choice is therefore not a simplifying preference but the only computationally stable option, and should be pre-registered in the statistical analysis plan.
12 Case Study vs NICE TA1013 (Quizartinib)
This mock case deliberately mirrors the structural decisions made in the real Daiichi Sankyo submission for TA1013. The parallel demonstrates that the methodological constraints here are not pedagogical simplifications — they are the actual constraints of state-of-the-art ML-NMR in live HTA submissions.
class_interactions = "common" — forced by sparse data
Collinearity bias
Midostaurin effect overestimated
Quizartinib effect overestimated — EAG flag
Committee outcome
—
Results “lacked face validity” (EAG conclusion)
The TA1013 EAG concluded that population adjustments within the ML-NMR were biased toward overestimating the benefit of Quizartinib — an outcome our synthetic collinearity analysis predicts and explains mechanistically.