Mastering Evidence Synthesis through a Minimal Mock Case
1 Executive Summary: The Anatomy of a Network
Network Meta-Analysis (NMA) is the gold standard for evidence synthesis in Health Technology Assessment (HTA). It extends traditional pair-wise meta-analysis by allowing the simultaneous comparison of multiple treatments, even those that have never been compared head-to-head in a clinical trial.
This guide is designed to take you from a conceptual understanding to a mastery of Standard Bayesian NMA. We use a Minimal Mock Case involving three treatments (A, B, and C) forming a closed loop. This structure allows us to explore the critical concepts of Direct Evidence, Indirect Evidence, and Network Consistency.
TipLearning Strategy
Focus on the transition from the data likelihood to the network constraints. Understanding how indirect paths are calculated is the key to answering advanced interview questions.
2 Theoretical Framework: The Calculus of Consistency
In a Bayesian NMA, we model the data at the study level and connect the treatment effects through a common reference treatment.
2.1 The Likelihood (Binomial-Logit Model)
For a binary outcome (e.g., responders \(r_{jk}\) out of \(n_{jk}\) patients in study \(j\) for treatment \(k\)):
For example, we can have: \[r_{1,base/control} \sim \text{Binomial}(p_{1b}, n_{1b})\]\[r_{1,trt} \sim \text{Binomial}(p_{1,trt}, n_{1,trt})\]
We use the logit link function to map probabilities to the real line:
\(\mu_1\): The baseline log-odds of response in study \(j\) (for the control treatment in that study).
\(\delta_{1,trt}\): The log-odds ratio of treatment \(k\) compared to the study’s control treatment. Note that \(\delta_{1,b} = 0\).
2.2 The Consistency Assumption (The Core Constraint)
The defining feature of NMA is the assumption of consistency. We assume that the relative effect of B vs C can be derived from the relative effects of A vs B and A vs C:
\[d_{BC} = d_{AC} - d_{AB}\]
Where \(d_{xy}\) is the pooled relative effect (log-odds ratio) of treatment \(y\) vs treatment \(x\).
2.3 Fixed Effects (FE) vs. Random Effects (RE)
Fixed Effects Model: Assumes that the true relative effect of treatment \(k\) vs \(b\) is constant across all studies: \(\delta_{jk} = d_{bk}\).
Random Effects Model: Assumes that the true relative effects vary across studies following a normal distribution: \(\delta_{jk} \sim \text{Normal}(d_{bk}, \tau^2)\), where \(\tau^2\) is the between-study heterogeneity variance.
ImportantJudgment Logic for FE vs. RE (HTA Audit Focus)
Deciding between a Fixed Effects and a Random Effects model is a critical decision in NMA. Here is the standard judgment logic:
Clinical Plausibility: Are the trials sufficiently similar in terms of patient populations (e.g., age, baseline severity), treatment protocols, and outcome definitions? If there are obvious differences, an RE model is clinically more plausible.
Statistical Heterogeneity: - Check the estimated between-study standard deviation (\(\tau\)). Is it large relative to the treatment effects? - Look at the \(I^2\) statistic from pairwise meta-analyses.
Model Fit (DIC): Compare the Deviance Information Criterion (DIC) between FE and RE models. A smaller DIC indicates a better fit. A difference of 3-5 points is usually considered meaningful. If DIC is similar, the simpler FE model is often preferred (principle of parsimony).
Network Sparsity: If the network is sparse (very few studies per comparison), the RE model may struggle to estimate \(\tau\) accurately, leading to wide credible intervals. In such cases, an FE model or an RE model with an informative prior for \(\tau\) may be necessary.
3 The Minimal Mock Case: A vs B vs C
Treatment A: Standard of Care (SoC) - Reference
Treatment B: New Drug 1
Treatment C: New Drug 2
To make this concrete, let’s construct a minimal network with 3 treatments and 3 studies forming a complete loop.
3.1 The Mock Data
Study
Arm 1 (Control)
Arm 2 (Active)
\(n_1\)
\(r_1\)
\(n_2\)
\(r_2\)
Observed OR
Study 1
A
B
100
20
100
40
2.67
Study 2
A
C
100
20
100
50
4.00
Study 3
B
C
100
30
100
40
1.55
Where:
\(n_1, n_2\): Total number of patients (sample size) in Arm 1 and Arm 2 respectively.
\(r_1, r_2\): Number of responders (events) in Arm 1 and Arm 2 respectively.
Observed OR (Odds Ratio) is calculated as: \(OR = \frac{r_2 / (n_2 - r_2)}{r_1 / (n_1 - r_1)}\).
For example, in Study 1: \(OR = \frac{40 / (100 - 40)}{20 / (100 - 20)} \approx 2.67\).
3.2 The Calculus of the Loop
Let’s calculate the log-odds ratios (LOR) for each study:
Study 1 (A vs B): \(\ln(OR_{AB}) = \ln(2.67) \approx 0.982\)
Study 2 (A vs C): \(\ln(OR_{AC}) = \ln(4.00) \approx 1.386\)
Study 3 (B vs C): \(\ln(OR_{BC}) = \ln(1.55) \approx 0.438\)
The Indirect Check: According to the consistency assumption, the indirect estimate of B vs C should be: \[\ln(OR_{BC, \text{ind}}) = \ln(OR_{AC}) - \ln(OR_{AB}) = 1.386 - 0.982 = 0.404\]
This difference is very small, indicating a consistent network.
4 MCMC Forensic Audit: How Bayes Actually Solves the Network
In this section, we move away from simple arithmetic and look at how the Bayesian MCMC engine actually processes our mock data to find the global solution.
4.1 Step 1: Setting up the Likelihood and Parameters
For our 3 studies, the model views the data as follows. Let \(r_{jk}\) be the number of events in study \(j\), arm \(k\), and \(n_{jk}\) be the total patients.
The formal mathematical likelihood \(L\) for the entire network is the product of the probability mass functions (PMF) of the independent Binomial distributions for all studies and arms:
Now let’s look at the specific setup for our 3 studies:
Study
Arm
Treatment
Distribution
Linear Predictor (\(\text{logit}(p)\))
Study 1
1
A (SoC)
\(r_{11} \sim \text{Binomial}(p_{11}, 100)\)
\(\mu_1\)
2
B
\(r_{12} \sim \text{Binomial}(p_{12}, 100)\)
\(\mu_1 + d_{AB}\)
Study 2
1
A (SoC)
\(r_{21} \sim \text{Binomial}(p_{21}, 100)\)
\(\mu_2\)
2
C
\(r_{22} \sim \text{Binomial}(p_{22}, 100)\)
\(\mu_2 + d_{AC}\)
Study 3
1
B
\(r_{31} \sim \text{Binomial}(p_{31}, 100)\)
\(\mu_3\)
2
C
\(r_{32} \sim \text{Binomial}(p_{32}, 100)\)
\(\mu_3 + d_{BC}\)
The Consistency Constraint (The Golden Rule): At all times, the model enforces: \(d_{BC} = d_{AC} - d_{AB}\). This means we only have 5 independent parameters to estimate: \(\mu_1, \mu_2, \mu_3, d_{AB}, d_{AC}\). All study data are pooled simultaneously to find the best values for these 5 parameters.
4.2 Step 2: Assigning Priors (The Bayesian Part)
Since we have no prior knowledge in this mock case, we use vague (non-informative) priors:
Study baselines: \(\mu_1, \mu_2, \mu_3 \sim \text{Normal}(0, 1000)\)
In real HTA submissions, prior choice is heavily scrutinized:
Priors for Treatment Effects (\(d\)): Typically use weakly informative priors like \(N(0, 100^2)\) rather than extremely flat priors like \(N(0, 1000^2)\), to prevent the sampler from exploring unrealistically large effect sizes while still letting the data dominate.
Priors for Heterogeneity (\(\tau\)) (in RE models): This is the most sensitive choice!
Uniform Prior: e.g., \(\tau \sim \text{Uniform}(0, 2)\) or \(\text{Uniform}(0, 5)\). Easy to implement but can place too much weight on large values if data is sparse.
Half-Normal Prior: e.g., \(\tau \sim \text{Half-Normal}(0, 1)\). Pulls \(\tau\) towards zero, acting as a shrinkage prior.
Informative Priors: When the network is sparse, it is best practice to use empirical informative priors based on historical meta-analyses (e.g., Turner et al., 2012 for binary outcomes). This provides a realistic estimate of \(\tau\) even when the current data has little information about it.
How Priors and Likelihood Connect in MCMC:
According to Bayes’ Theorem, the Posterior (the target distribution MCMC tries to map) is proportional to the Likelihood multiplied by the Prior: \[\text{Posterior} \propto \text{Likelihood} \times \text{Prior}\]
NoteIn our minimal mock case
Flat Priors: Because we use a huge variance (1000), the prior distribution is extremely flat. The prior probability of guessing \(d_{AB} = 0.982\) is almost identical to guessing \(d_{AB} = 10.0\).
Likelihood Dominance: Therefore, when the MCMC engine calculates the score of a guess, the Prior part contributes almost nothing. The score is dominated by the Likelihood (how well the parameters fit the 100-patient data in the table).
Guiding the Walk: When MCMC decides whether to move to a new guess, it compares the ratio of the scores. Since the priors are flat, the decision is almost entirely based on whether the new guess makes the observed data more likely than the old guess.
NoteWhat exactly are we multiplying in a single MCMC step?
It is NOT a distribution.
It is NOT a random draw from \(\text{Normal}(0, 1000)\).
It is a specific number (probability density). In each MCMC step, the engine has a specific guess for the parameters (e.g., \(d_{AB} = 0.982\)). The “Prior” in the formula is the function value of the probability density function (PDF) of \(\text{Normal}(0, 1000)\) at the point \(0.982\). Similarly, the “Likelihood” is the probability of seeing the data given this guess (a single number). MCMC multiplies these two numbers to score the current guess.
4.3 The Iterative Algorithm and Posterier.
Proposal: The engine guesses values for the parameters.
Suppose it guesses: \(\mu_1=-1.386, \mu_2=-1.386, \mu_3=-0.847, d_{AB}=0.982, d_{AC}=1.386\).
Derivation: It calculates \(d_{BC} = 1.386 - 0.982 = 0.404\) (forced by consistency).
Likelihood Calculation: It calculates the probability of seeing our mock data given these guesses:
For Study 1: \(p_{11} = \text{logit}^{-1}(-1.386) = 0.20\). \(r_{11}=20\) fits perfectly!
For Study 3: \(p_{31} = \text{logit}^{-1}(-0.847) = 0.30\). \(r_{31}=30\) fits perfectly!
Acceptance: If the guessed values make the data highly probable, the engine accepts them and records them. If not, it might reject them.
The engine repeats this thousands of times (e.g., 50,000 iterations).
After 50,000 steps, we don’t get single numbers. We get thousands of recorded values for \(d_{AB}, d_{AC}\), and the derived \(d_{BC}\). This is the simulated Posterier distribution.
The median of the recorded values for \(d_{AB}\) becomes our pooled estimate.
The 2.5th and 97.5th percentiles become our 95% Credible Interval (CrI).
4.4 Model Implementation: BUGS/JAGS & Stan
Here we present the implementation of our Fixed Effects NMA model in both the traditional BUGS/JAGS language and the modern Stan language, along with the R code to call them.
4.4.1 BUGS/JAGS Implementation
Here is the actual code used to run this simulation in JAGS:
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
OR_AB 2.829 0.8028 0.004635 0.013432
OR_AC 4.122 1.3435 0.007757 0.024485
OR_BC 1.510 0.4724 0.002727 0.006338
d_AB 1.001 0.2771 0.001600 0.004612
d_AC 1.367 0.3118 0.001800 0.005662
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
OR_AB 1.5907 2.2541 2.719 3.286 4.713
OR_AC 2.1750 3.1735 3.908 4.807 7.395
OR_BC 0.7973 1.1737 1.438 1.764 2.641
d_AB 0.4642 0.8128 1.000 1.190 1.550
d_AC 0.7770 1.1548 1.363 1.570 2.001
4.4.2 Stan Implementation
The exact same functionality can be implemented in Stan. Notice that Stan uses standard deviation instead of precision in its normal distribution, and uses vectorized operations.
NoteSTAN nma_model.stan
data {int<lower=1> J; // Number of studiesint r[J, 2]; // Eventsint n[J, 2]; // Total patients}parameters {real mu[J]; // Study baselinesreal d_AB; // Effect of B vs Areal d_AC; // Effect of C vs A}transformed parameters {real delta[J, 2];// Study 1: A vs B delta[1, 1] = 0; delta[1, 2] = d_AB;// Study 2: A vs C delta[2, 1] = 0; delta[2, 2] = d_AC;// Study 3: B vs C delta[3, 1] = 0; delta[3, 2] = d_AC - d_AB; // Consistency constraint}model {1 mu ~ normal(0, 100); d_AB ~ normal(0, 100); d_AC ~ normal(0, 100);for (j in1:J) {for (k in1:2) {2 r[j, k] ~ binomial_logit(n[j, k], mu[j] + delta[j, k]); } }}generated quantities {real OR_AB = exp(d_AB);real OR_AC = exp(d_AC);real OR_BC = exp(d_AC - d_AB);}
1
Priors (Stan uses standard deviation, so 100 means variance 10000)
2
Likelihood
To run this model in R, you would use the rstan package:
Inference for Stan model: anon_model.
3 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=3000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
d_AB 1.00 0.01 0.28 0.45 0.82 1.00 1.19 1.55 1359 1
d_AC 1.36 0.01 0.30 0.78 1.16 1.36 1.57 1.97 1433 1
OR_AB 2.83 0.02 0.80 1.56 2.26 2.72 3.28 4.70 1331 1
OR_AC 4.10 0.03 1.28 2.17 3.19 3.90 4.82 7.14 1475 1
OR_BC 1.50 0.01 0.47 0.80 1.17 1.44 1.76 2.61 1863 1
Samples were drawn using NUTS(diag_e) at Sat May 9 22:48:47 2026.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
NoteComparison: BUGS vs JAGS vs Stan
While the code above is written in the BUGS/JAGS language, it is helpful to understand the differences between the three major MCMC engines used in HEOR:
BUGS (WinBUGS / OpenBUGS): The pioneer. It uses Gibbs sampling. It is the oldest and can be slow or unstable for complex models.
JAGS (Just Another Gibbs Sampler): A C++ reimplementation of BUGS. It uses the same language syntax but is cross-platform and more stable. It is currently the most widely used engine for traditional NMA (e.g., in NICE DSU technical support documents).
Stan: The modern standard. It uses Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS). Instead of random walking, it uses the gradient of the log-posterior to “glide” efficiently through the parameter space. It is much faster and more reliable for complex, high-dimensional models (like ML-NMR), though it requires compiling the model to C++ first.
NoteBUGS/JAGS: Software or Language?
BUGS and JAGS are both standalone software (computational engines) and modeling languages. - BUGS (WinBUGS/OpenBUGS) has its own language syntax and originally had a GUI. - JAGS is a command-line program with no GUI. In modern HEOR practice, they are almost always used as “plugins” called from R (via packages like rjags or R2jags), rather than run standalone.
4.5 Convergence Diagnostics
When auditing an NMA, always verify the following MCMC diagnostics to ensure the posterior simulation has converged:
R-hat (Gelman-Rubin Diagnostic): Must be \(< 1.05\) (or strictly \(< 1.01\)) for all parameters. It compares between-chain and within-chain variance. Values close to 1 indicate convergence.
Trace Plots: Chains should look like “fat hairy caterpillars,” showing good mixing and no trends.
Auto-correlation: Should drop to zero quickly as the lag increases. High autocorrelation means the sampler is moving slowly and requires more iterations or thinning.
Effective Sample Size (ESS): Should be sufficiently large (typically \(> 400\), ideally \(> 1000\)). For calculating tail probabilities (like ranking probabilities or SUCRA), a larger ESS is required to ensure stability.
If the model does not converge, potential solutions include:
Increasing the number of burn-in and iteration steps.
Checking for data entry errors or extremely sparse data (zero cells).
Using more informative priors to stabilize the model.
Reparameterizing the model (e.g., non-centered parameterization in Stan).
5 Results & Interpretation: The League Table
After running the MCMC, we combine the direct and indirect evidence to get the final pooled estimates.
5.1 The League Table (Median OR and 95% CrI)
Treatment A
2.67 (1.5, 4.8)
4.00 (2.2, 7.3)
0.37 (0.2, 0.7)
Treatment B
1.50 (0.8, 2.7)
0.25 (0.1, 0.5)
0.67 (0.4, 1.2)
Treatment C
How to read: The upper-right triangle shows the OR of the column treatment vs the row treatment. The lower-left triangle shows the inverse.
5.2 Probability of Being Best
Bayesian NMA allows us to calculate the probability that each treatment is the most effective.
Treatment C: ~75%
Treatment B: ~24%
Treatment A: ~1%
6 Strategic Q&A: Interview Prep
ImportantWhat is the difference between a Consistency Model and an Unrelated Mean Effects (UME) model?
A consistency model forces all loops to close (e.g., \(d_{BC} = d_{AC} - d_{AB}\)). A UME model relaxes this assumption and estimates all relative effects independently. We compare the two to check for network inconsistency.
ImportantHow do you handle a disconnected network in Standard NMA?
Standard NMA cannot handle disconnected networks. You must either bridge the gap with a common comparator (if possible) or use population adjustment methods like MAIC or ML-NMR if you have IPD for at least one trial.
7 Strategic Conclusion
Mastering Standard Bayesian NMA requires understanding the interplay between direct and indirect evidence. This minimal case demonstrates that when data forms a loop, consistency must be verified.